Merge ADO latest main into github (#17)
* Merged PR 6148: Services that create pipelines

* `OnemlProcessorsPipelineOperationsServices.COLLECTION_TO_DICT` returns a pipeline with an in_collection and an output that exposes the collection as a dictionary.
* `OnemlProcessorsPipelineOperationsServices.DICT_TO_COLLECTION` returns a pipeline with a dictionary input and an out_collection exposing the entries of the input dictionary as collection entries.
* `OnemlProcessorsPipelineOperationsServices.DUPLICATE_PIPELINE` takes a pipeline and returns a pipeline with multiple copies of it.
* `OnemlProcessorsPipelineOperationsServices.EXPOSE_GIVEN_OUTPUTS` takes data (dict of outputs, dict of dict of out collections) and creates a pipeline that exposes that data as output.
* `OnemlProcessorsPipelineOperationsServices.EXPOSE_PIPELINE_AS_OUTPUT` takes a pipeline, returns an identical pipeline except it has an additional output exposing the given pipeline.
* `OnemlProcessorsPipelineOperationsServices.LOAD_INPUTS_SAVE_OUTPUTS` takes a pipeline with an in_collection called `inputs_to_load` and an out_collection called `outputs_to_save` and returns a pipeline that loads the requested inputs from uris, passes them to the original pipeline, then saves the requested outputs to uris.
* `OnemlHabitatsPipelineOperationsServices.PUBLISH_OUTPUTS_AS_DATASET` takes a pipeline that loads and saves to/from uris and returns a pipeline that loads from datasets and publishes a dataset.

* Merged PR 6177: allow from oneml.habitats.immunocli import OnemlHabitatsCliDiContainer b/c cl...

allow from oneml.habitats.immunocli import OnemlHabitatsCliDiContainer b/c cli di containers should be exposed

* Merged PR 6178: fix service method and test that the service is added

* Merged PR 6184: fixes to https://immunomics.visualstudio.com/Immunomics/_git/oneml/pullrequest/6148

* Merged PR 6187: Generic pipelines with two arguments only

* Merged PR 6193: Minor fix regarding instantiation order of namedcollection.

* Fix.
* Tests.

* Merged PR 6206: test load_inputs_save_outputs when pipeline already has input_uris and output...

test load_inputs_save_outputs when pipeline already has input_uris and output_uris

* Merged PR 6225: fix services in oneml.habitats.pipeline_operations._datasets

fix services in oneml.habitats.pipeline_operations._datasets

* Merged PR 6232: remove furl from oneml.habitats.pipeline_operations b/c it lower-cases datase...

remove furl from oneml.habitats.pipeline_operations b/c it lower-cases dataset name in ampds://

* Merged PR 6234: fix register io for Manifest on abfss

* Merged PR 6236: fix register io for Manifest on abfss

* Merged PR 6239: fixed ComputeNodeBasedOutputUriProcessor introducing uris with empty path segments

* Merged PR 6231: Support pipeline providers in YAML

* Refactors plugins into every component.
* Adds hydra registry for pipeline providers.
* Adds two_diamond tests building a pipeline in YAML using a python pipeline provider.

* Merged PR 6255: expose BLOB_READ_USING_LOCAL_CACHE_FACTORY and BLOB_WRITE_USING_LOCAL_CACHE_F...

expose BLOB_READ_USING_LOCAL_CACHE_FACTORY and BLOB_WRITE_USING_LOCAL_CACHE_FACTORY for use by other packages

* Merged PR 6256: Add ability to duplicate pipelines in YAML.

* Adds `DuplicatePipelineConf` to be able to generate duplicate pipelines in YAML.
* Adds three diamond as test case.

* Merged PR 6250: oneml.processors.registry of pipeline providers

* Merged PR 6279: 📝 (docs) Update documentation.

📝 (docs) Update documentation.

* Merged PR 6286: fix container name in DatasetBlobStoreBaseLocationService

DatasetBlobStoreBaseLocationService provides the blob locations for datasets published by oneml.
It should read these from an installation-level configuration, but at the moment they are hard coded.
The container name used for production and non-production datasets outside notebooks is wrong; this PR hopefully fixes it.

* Merged PR 6288: Fix dataset blob location

* Merged PR 6293: fix get_relative_path when base_uri ends with /

fix get_relative_path when base_uri ends with /

* Merged PR 6298: 📝 Update docs.

📝 Update docs.

* Merged PR 6311: Service inheritance test

add test that verifies mypy screams when a service id is associated with a service that is a super-type of the declared service type

* Merged PR 6310: 🚸 Optional inputs when combining pipelines can be dropped and add tests.

🚸 Refactor pipeline input validation for optional inputs and add tests.

* Merged PR 6312: ✨ Add pipeline drop_inputs and drop_outputs and tests

✨ Add pipeline drop_inputs and drop_outputs methods

* Merged PR 6322: test publish_outputs_as_dataset when there are no input_uris

test publish_outputs_as_dataset when there are no input_uris

* Merged PR 6331: Add support for symbolic links in uris

Datasets published by oneml pipelines have a manifest.json at their root that maps output names to relative paths within the dataset.

Pipeline builders can read from uris using `OnemlProcessorsIoServices.READ_FROM_URI_PIPELINE_BUILDER`, which itself uses `ReadFromUriProcessor`, which takes a uri and returns the read object.

This PR adds the following semantics to uris:
If a uri has a fragment (i.e. something that follows `#`), then it is assumed that, after removing the fragment, the uri points to a json file.  The fragment is assumed to be a dot-separated hierarchical key into the json.  The value associated with the key should be either a relative path from the directory of the json, or an absolute uri.

Examples:
if `file:///path1/path2/index.json` holds:
```
    links:
        rel: path3/array.npy
        abs: ampds://mydataset/
        another_abs: ampds://mydataset/manifest.json#entry_uris.container1?namespace=mynamespace
```
and `ampds://mydataset/manifest.json` holds:
```
    entry_uris:
        container1: containers/container1
```

Then:
* `file:///path1/path2/index.json#links.rel` becomes `file:///path1/path2/path3/array.npy`
* `file:///path1/path2/index.json#links.abs` becomes `ampds://mydataset/`
* `file:///path1/path2/index.json#links.another_abs` becomes `ampds://mydataset/containers/container1?namespace=mynamespace`

The semantics are implemented in `ReadFromUriProcessor` and therefore applicable only to pipelines built directly or indirectly using `OnemlProcessorsIoServices.READ_FROM_URI_PIPELINE_BUILDER`.
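
A minimal sketch of how this resolution could work (illustrative only, not the actual `ReadFromUriProcessor` implementation; `read_json` is an assumed helper that fetches a uri and parses it as JSON):

```python
import posixpath
from collections.abc import Callable
from typing import Any
from urllib.parse import urldefrag, urlparse


def resolve_link_uri(uri: str, read_json: Callable[[str], Any]) -> str:
    """Resolve a uri whose fragment is a dot-separated key into a JSON link file."""
    base, fragment = urldefrag(uri)
    if not fragment:
        return uri
    # In the examples above, a query string may trail the fragment; keep it aside.
    fragment, _, query = fragment.partition("?")
    value = read_json(base)
    for key in fragment.split("."):  # walk the hierarchical key
        value = value[key]
    if urlparse(value).scheme:
        # Absolute uri: it may itself be a link, so resolve recursively.
        resolved = resolve_link_uri(value, read_json)
    else:
        # Relative path, interpreted from the directory holding the JSON file.
        resolved = posixpath.join(posixpath.dirname(base), value)
    return resolved + (f"?{query}" if query else "")
```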

* Fix mypy and tests after merge.

---------

Co-authored-by: elonp <[email protected]>
Co-authored-by: jzazo <[email protected]>
3 people authored Jan 4, 2024
1 parent a6e6817 commit c4cc4dc
Showing 148 changed files with 6,917 additions and 2,253 deletions.
7 changes: 4 additions & 3 deletions docs/user_documentation/faq.md
@@ -5,9 +5,10 @@
### What is OneML?

OneML is a framework to design and run pipelines for ML Research.
Users organize their code writing classes which include ML functions and specify inputs and
Users organize their code by writing classes which encode ML functions and declare inputs and
outputs.
Users design pipelines linking these functions together and can run them.
Users construct pipelines connecting these inputs and outputs into a DAG.
Pipelines can be run locally or in a cluster.


### Why is OneML different from other ML pipeline tools?
@@ -41,5 +42,5 @@ functionality, and where ML concepts are built in to facilitate validity and ana

OneML provides a framework to connect code organized in classes, abstracts away the complexities of
running distributed across multiple nodes, and facilitates organizing data around train and
evaluation splits, supporting cross-validation, hyperaparameter optimizatino, and other pipeline
evaluation splits, supporting cross-validation, hyperparameter optimization, and other pipeline
wrappers that are commonly used.
17 changes: 12 additions & 5 deletions docs/user_documentation/index.md
@@ -5,15 +5,22 @@

- **DAG**: Directed acyclic graph. We use the term as a synonym for *pipeline*.

- **Processor:** A class object with a *process* method. Implments on operation taking inputs and producing outputs. The signatures of the processor's constructor and process method define the inputs and outputs of the
- **Processor:** A class object with a *process* method. Implements an operation taking inputs and
producing outputs. The signatures of the processor's constructor and process method define the
inputs and outputs of the pipeline's *node*. A hypothetical sketch follows this list.

- **Node**: Vertex, or compute point, of a *pipeline*.
Nodes hold references to *processor* classes, input & output parameters, constructor arguments and compute requirements.
During orchestration, *processors* are instantiated and run from nodes encoded in the *pipeline*.
Nodes hold references to *processor* classes, input & output parameters, constructor arguments and
compute requirements. During orchestration, *processors* are instantiated and run from nodes
encoded in the *pipeline*.

- **Dependency:** Directed edge from one compute *node* output to another compute *node* input. The output of one *processor* feeds into the input of the next *processor*.
- **Port**: Input or output of a *node*. *Output ports* are connected to *input ports*, forming
a dependency, where the output of one *processor* is given as input to the receiving *processor*.

- **Dependency:** Directed edge from a *node's* output port to another *node's* input port.
The output of one *processor* feeds into the input of the next *processor*.
During orchestration, outputs are passed as inputs to the depending nodes.

- **Task:** A single node pipeline, created by wrapping over a Processor.
- **Task:** A single-node pipeline, created by wrapping a *processor* class.

- **TrainAndEval:** A classical ML pipeline, taking training data and eval data, learning a model using the training data, and using it to provide predictions on both training and eval data. OneML provides tools to help build and manipulate TrainAndEval pipelines, including automatically persisting a fitted eval pipeline that can take other eval data and provide predictions.
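
A minimal, hypothetical processor sketch illustrating the definitions above; the class, its parameters, and the convention of returning a `NamedTuple` whose fields name the outputs are illustrative assumptions, not the library's exact API:

```python
from typing import NamedTuple


class StandardizeOutput(NamedTuple):
    Z: list[float]  # one field per output of the node


class Standardize:
    """Toy processor: constructor and `process` signatures declare the node's parameters."""

    def __init__(self, shift: float, scale: float) -> None:
        self._shift = shift
        self._scale = scale

    def process(self, X: list[float]) -> StandardizeOutput:
        # `X` is an input of the node; the return annotation exposes a single output `Z`.
        return StandardizeOutput(Z=[(x - self._shift) / self._scale for x in X])
```
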
171 changes: 76 additions & 95 deletions docs/user_documentation/tutorial_advanced.md
@@ -5,36 +5,32 @@ and build your own meta-pipelines.

#### Table of Contents

- [Operations with Parameter Entries and Collections](#operations-with-parameter-entries-and-collections)
- [`InEntry` & `OutEntry`](#inentry--outentry)
- [Operations with InPort/OutPort and Inputs/Outputs](#operations-with-inportoutport-and-inputsoutputs)
- [`InPort` & `OutPort`](#inport--outport)
- [Left and right shift operations](#left-and-right-shift-operations)
- [Merge `InEntry`](#merge-inentry)
- [Merge `OutEntry`](#merge-outentry)
- [`Inputs` & `Outputs`](#inputs--outputs)
- [Merge `InPort`](#merge-inport)
- [Merge `OutPort`](#merge-outport)
- [`Inputs` & `Outputs`](#inputs--outputs)
- [Left and right shift operations](#left-and-right-shift-operations-1)
- [Merge operations](#merge-operations)
- [Subtract operations](#subtract-operations)
- [`InCollections` & `OutCollections`](#incollections--outcollections)
- [Left and right shift operations](#left-and-right-shift-operations-2)
- [Merge operations](#merge-operations-1)
- [Subtract operations](#subtract-operations-1)
- [`Pipeline`](#pipeline)
- [Left and right shift operations](#left-and-right-shift-operations-3)
- [Left and right shift operations](#left-and-right-shift-operations-2)
- [Rename inputs and outputs](#rename-inputs-and-outputs)
- [Decorate pipelines](#decorate-pipelines)
- [Pipelines Definition](#pipelines-definition)
- [Pipeline Definition](#pipelines-definition)
- [Meta-Pipelines](#meta-pipelines)
- [`Estimator` Example](#estimator-example)
- [Services](#services)

## Operations with Parameter Entries and Collections
## Operations with InPort/OutPort and Inputs/Outputs

Consider the `standardization`, `logistic_regression` and `stz_lr` examples from the
[intermediate tutorial](tutorial_intermediate.md).

![stz_lr](figures/stz_lr.png){ width="300" }

### `InEntry` & `OutEntry`
### `InPort` & `OutPort`

#### Left and right shift operations

@@ -48,16 +44,17 @@
or

```python
standardization.out_collections.Z.train >> logistic_regression.in_collections.X.train
logistic_regression.in_collections.X.eval << standardization.out_collections.Z.eval
standardization.outputs.Z.train >> logistic_regression.inputs.X.train
logistic_regression.inputs.X.eval << standardization.outputs.Z.eval
```

These operators return a dependency wrapped in a tuple (for compatibility with collections).
These operators return the matching dependencies wrapped in a
[`collections.abc.Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) object.
They can be passed as part of dependencies for combining pipelines.
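
For example, a sketch of collecting such dependencies and handing them to `PipelineBuilder.combine`; the `pipelines` and `name` keyword names are assumptions here, while the `dependencies` shape follows the pattern used later in this tutorial:

```python
deps = (
    *(stz_eval.inputs.mean << stz_train.outputs.mean),
    *(stz_eval.inputs.scale << stz_train.outputs.scale),
)
standardization = PipelineBuilder.combine(
    pipelines=[stz_train, stz_eval],  # assumed keyword name
    name="standardization",  # assumed keyword name
    dependencies=(deps,),
)
```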

#### Merge `InEntry`
#### Merge `InPort`

You can merge two `InEntry` into a single one using `|` between entries, which will bind the inputs
You can merge two `InPort` into a single one using `|` between entries, which will bind the inputs
together, exposing a single entry.
For example,

@@ -89,9 +86,9 @@ mapping the `probs` from `logistic_regression` to both `r1` and `r2`.

![broadcast](figures/broadcast.png){ width="300" }

#### Merge `OutEntry`
#### Merge `OutPort`

If you merge `OutEntry`s with `|` operator you will indicate that the outputs are to be
If you merge `OutPort`s with the `|` operator you indicate that the outputs are to be
concatenated.
For example,

@@ -123,22 +120,37 @@ summary.inputs.accuracies << reports.outputs.acc
```

We combine `r1` and `r2` into a single pipeline and we expose `acc` as single output, which is the
concatenation of `acc` from `r1` and `r2`.
concatenation of `acc` from `r1` and `r2` into a tuple.
Therefore, when used as a dependency assignment, a concatenation operation will be performed
gathering the `acc` from `r1` and `r2` into a single `acc` before passing it to `summary`.

Order is preserved in all merge operations.

![concatenate](figures/concatenate.png){ width="300" }

### `Inputs` & `Outputs` and `InCollection` & `OutCollection`
### `Inputs` & `Outputs` and `InPorts[T]` & `OutPorts[T]`

`InPorts[T]` and `OutPorts[T]` are extensions of `namedcollection[T]`, which is a data structure
that implements a nested dictionary with dot notation access.
Additionally, these types implement operations to create dependencies between them.
They are [generic classes](https://mypy.readthedocs.io/en/stable/generics.html),
where `T` is the type of the values stored in the collection.

`Inputs` and `Outputs` are aliases of `InPorts[Any]` and `OutPorts[Any]`, respectively, and
operations between them are identical, i.e.,

```python
Inputs = InPorts[Any]
Outputs = OutPorts[Any]
```

`InCollection` and `OutCollection` are aliases of `Inputs` and `Outputs`, respectively, and
operations between them are identical.
When annotating inputs or outputs of a pipeline, you can extend `Inputs` and `Outputs` to specify
the type of the values stored in the collection. See the
[annotations section](tutorial_intermediate.md#pipeline-annotations) for further details.

#### Left and right shift operations

Similar to entry assignments, you can create dependencies using the left and right shift operators.
Similar to port assignments, you can create dependencies using the left and right shift operators.

```python
stz_eval.inputs << stz_train.outputs
@@ -151,25 +163,41 @@
stz_eval.inputs.scale << stz_train.outputs.scale
```

or alternatively, operating with collections:
or alternatively, operating with variable collections:

```python
logistic_regression.in_collections.X << standardization.out_collections.Z
logistic_regression.inputs.X << standardization.outputs.Z
```

which is equivalent to:

```python
logistic_regression.in_collections.X.train << standardization.out_collections.Z.train
logistic_regression.in_collections.X.eval << standardization.out_collections.Z.eval
logistic_regression.inputs.X.train << standardization.outputs.Z.train
logistic_regression.inputs.X.eval << standardization.outputs.Z.eval
```

> :bulb: **Info:**
The set of names of the two collections needs to be identical.
Entries will be matched by name to create dependencies, e.g.,
`Z.train` with `X.train` and `Z.eval` with `X.eval`, respectively, in the above example.

The operation returns a tuple of dependencies created.
The operation returns a
[`collections.abc.Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)
object of dependencies created.

#### Rename operations

You can rename entries of `Inputs` and `Outputs` objects with `rename_inputs` and `rename_outputs`.

```python
stz_train = stz_train.rename_inputs({"X": "X.train"})
stz_eval = stz_eval.rename_inputs({"X": "X.eval"})
stz_train.inputs.X.train # InPort object
stz_eval.inputs.X.eval # InPort object
```

The syntax is `{"<old_name>": "<new_name>"}`, passed as the method's single argument.
The methods will return a new `Inputs` or `Outputs` object with renamed entries.

#### Merge operations

@@ -179,9 +207,9 @@ For example,
```python
stz_train = stz_train.rename_inputs({"X": "X.train"})
stz_eval = stz_eval.rename_inputs({"X": "X.eval"})
X_collection = stz_train.in_collections.X | stz_eval.in_collections.X
X_collection.train # OutEntry objects
X_collection.eval
X = stz_train.inputs.X | stz_eval.inputs.X
X.train # InPort object
X.eval # InPort object
```

The above example will create a single collection with two entries, `X.train` and `X.eval`.
@@ -190,64 +218,19 @@ If the collections share the same entry names, entries will be merged together a

This behavior is the same for `Outputs` objects.
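
A mirrored sketch for outputs, assuming `stz_train` and `stz_eval` each expose an output named `Z`, as in the combined `standardization` example:

```python
stz_train = stz_train.rename_outputs({"Z": "Z.train"})
stz_eval = stz_eval.rename_outputs({"Z": "Z.eval"})
Z = stz_train.outputs.Z | stz_eval.outputs.Z
Z.train # OutPort object
Z.eval # OutPort object
```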

The behavior for creating collections via `rename_inputs` or `rename_outputs` is explained in this
[section](##rename-inputs-and-outputs) below.

#### Subtract operations

You can subtract variable names or entry parameters:
You can subtract variable names from `Inputs` and `Outputs` objects:

```python
new_inputs = standardization.in_collections - ("X",)
new_inputs = standardization.in_collections - (standardization.in_collections.X.train,)
new_inputs_no_x = standardization.inputs - ("X",)
new_inputs_no_x_train = standardization.inputs - ("X.train",)
```

The syntax requires subtracting an `Iterable` (like `tuple`, `list`, `set`, etc.).
The syntax requires subtracting an `Iterable[str]`
(like `tuple[str, ...]`, `list[str]`, `collections.abc.Set[str]`, etc.).
If what you are trying to subtract does not exist, no error will be issued.
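
For instance, a quick sketch reusing `standardization` from above; the second name does not exist and is silently ignored:

```python
remaining = standardization.inputs - ("X.train", "not_an_input")  # no error for the unknown name
```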

### `InCollections` & `OutCollections`

#### Left and right shift operations

In the same spirit as with other types, one can do left and right shift operations on
`InCollections` and `OutCollections` types.

Shared variables between pipelines will be associated together:

```python
logistic_regression.out_collections >> logistic_regression.in_collections
```

which is equivalent to

```pythyon
logistic_regression.out_collections.X >> logistic_regression.in_collections.X
logistic_regression.out_collections.Y >> logistic_regression.in_collections.Y
```

Using the left / right operator in the wrong direction will raise an error.

#### Merge operations

Similar to `Inputs` and `Outputs`, one can merge `InCollections` and `OutCollections`.
Collections and entries will be merged together.

#### Subtract operations

You can subtract variables, single parameters, or collection of parameters:

```python
new_inputs = standardization.in_collections - ("X",)
new_inputs = standardization.in_collections - (standardization.in_collections.X,)
new_inputs = standardization.in_collections - (standardization.in_collections.X.train,)
```

All of the above are equivalent for `X` a collection of length one.
Otherwise, the whole collection will be subtracted if you pass `X` or
`standardization.in_collections.X`.

Note that the syntax requires subtracting an `Iterable` (like `tuple`, `list`, `set`, etc.).
If what you are trying to subtract does not exist, no error will be issued.

### `Pipeline`

@@ -257,8 +240,7 @@ The IO attributes above are slightly redundant in the above example, so one can
directly operating with pipelines.

Left / right shifting with pipeline objects will create the dependencies between the `inputs` and
`outputs`, `in_collections` and `out_collections`, of the pipelines in the direction of the shift,
for equally named collections.
`outputs` of the pipelines in the direction of the shift, for equally named collections.

Here is an example:

@@ -292,17 +274,17 @@ For example, if we want to rename the `X` input of `stz_train` to `features`:

```python
stz_train = stz_train.rename_inputs({"X": "features"})
stz_train.inputs.features # InEntry object
stz_train.inputs.features # InPort object
```

You can transform single entries into collections, or entries from collections into entries via dot
notation:

```python
standardization = standardization.rename_inputs({"X.train": "X_train", "X.eval": "X_eval"})
standardization.inputs.X_train # InEntry objects
standardization.inputs.X_train # InPort objects
standardization.inputs.X_eval
# standardization.in_collections.X.train # raises error
# standardization.inputs.X.train # raises error
```

If the new name of an entry already exists, or repeats, a merge operation will be performed.
@@ -317,8 +299,10 @@ reports = PipelineBuilder.combine(
)
reports.in_collection.acc # Inputs collection with two entries
reports.rename_inputs({"acc.r1": "acc", "acc.r2": "acc"}) # rename / merge operation
reports.inputs.acc # InEntry object with two entries merged together
reports.inputs.acc # InPort object with two entries merged together (broadcast)
```
See [`InPort` merge section](#merge-inport) for details on broadcast
and the [`OutPort` merge section](#merge-outport) for details on concatenation operations.

#### Decorate pipelines

@@ -359,9 +343,6 @@ A `Pipeline` is a (frozen) dataclass with the following attributes:
distinguish pipelines when combining.
- `inputs` (`oneml.processors.Inputs`): exposure of `Inputs` of a pipeline.
- `outputs` (`oneml.processors.Outputs`): exposure of `Outputs` of a pipeline.
- `in_collections` (`oneml.processors.InCollections`): exposure of `InCollections` of a pipeline.
- `out_collections` (`oneml.processors.OutCollections`): exposure of `OutCollections` of a
pipeline.

`Pipeline`s should not be instantiated directly.
Instead, `Pipeline`s should be created via `Task`s, combining other `Pipeline`s, or other
@@ -400,9 +381,9 @@ class Estimator(Pipeline):
# decorate shared dependencies to match newly decorated train and eval pipelines
dependencies = (dp.decorate("eval", "train") for dp in chain.from_iterable(dependencies))

# merge the `outputs` and `out_collections` of train and eval pipelines
# merge the `outputs` of train and eval pipelines
outputs: UserOutput = dict(train_pl.outputs | eval_pl.outputs)
outputs |= dict(train_pl.out_collections | eval_pl.out_collections)
outputs |= dict(train_pl.outputs | eval_pl.outputs)

# combine all ingredients into a new pipeline
p = PipelineBuilder.combine(
Expand All @@ -412,7 +393,7 @@ class Estimator(Pipeline):
outputs=outputs,
dependencies=(tuple(dependencies),),
)
super().__init__(name, p._dag, p.inputs, p.outputs, p.in_collections, p.out_collections)
super().__init__(name, p._dag, p.inputs, p.outputs)
```

A few clarifications: