Merge ADO latest main into github (#17)
* Merged PR 6148: Services that create pipelines

* `OnemlProcessorsPipelineOperationsServices.COLLECTION_TO_DICT` returns a pipeline with an in_collection and an output that exposes the collection as a dictionary.
* `OnemlProcessorsPipelineOperationsServices.DICT_TO_COLLECTION` returns a pipeline with a dictionary input and an out_collection exposing the entries of the input dictionary as collection entries.
* `OnemlProcessorsPipelineOperationsServices.DUPLICATE_PIPELINE` takes a pipeline and returns a pipeline with multiple copies of it.
* `OnemlProcessorsPipelineOperationsServices.EXPOSE_GIVEN_OUTPUTS` takes data (dict of outputs, dict of dict of out collections) and creates a pipeline that exposes that data as output.
* `OnemlProcessorsPipelineOperationsServices.EXPOSE_PIPELINE_AS_OUTPUT` takes a pipeline, returns an identical pipeline except it has an additional output exposing the given pipeline.
* `OnemlProcessorsPipelineOperationsServices.LOAD_INPUTS_SAVE_OUTPUTS` takes a pipeline with an in_collection called `inputs_to_load` and an out_collection called `outputs_to_save` and returns a pipeline that loads the requested inputs from uris, passes them to the original pipeline, then saves the requested outputs to uris.
* `OnemlHabitatsPipelineOperationsServices.PUBLISH_OUTPUTS_AS_DATASET` takes a pipeline that loads and saves to/from uris and returns a pipeline that loads from datasets and publishes a dataset.

* Merged PR 6177: allow from oneml.habitats.immunocli import OnemlHabitatsCliDiContainer b/c cl...

allow from oneml.habitats.immunocli import OnemlHabitatsCliDiContainer b/c cli di containers should be exposed

* Merged PR 6178: fix service method and test that the service is added

* Merged PR 6184: fixes to https://immunomics.visualstudio.com/Immunomics/_git/oneml/pullrequest/6148

* Merged PR 6187: Generic pipelines with two arguments only

* Merged PR 6193: Minor fix regarding instantiation order of namedcollection.

* Fix.
* Tests.

* Merged PR 6206: test load_inputs_save_outputs when pipeline already has input_uris and output...

test load_inputs_save_outputs when pipeline already has input_uris and output_uris

* Merged PR 6225: fix services in oneml.habitats.pipeline_operations._datasets

fix services in oneml.habitats.pipeline_operations._datasets

* Merged PR 6232: remove furl from oneml.habitats.pipeline_operations b/c it lower-cases datase...

remove furl from oneml.habitats.pipeline_operations b/c it lower-cases dataset name in ampds://

* Merged PR 6234: fix register io for Manifest on abfss

* Merged PR 6236: fix register io for Manifest on abfss

* Merged PR 6239: fixed ComputeNodeBasedOutputUriProcessor introducing uris with empty path segments

* Merged PR 6231: Support pipeline providers in YAML

* Refactors plugins into every component.
* Adds hydra registry for pipeline providers.
* Adds two_diamond tests building a pipeline in YAML using a python pipeline provider.

* Merged PR 6255: expose BLOB_READ_USING_LOCAL_CACHE_FACTORY and BLOB_WRITE_USING_LOCAL_CACHE_F...

expose BLOB_READ_USING_LOCAL_CACHE_FACTORY and BLOB_WRITE_USING_LOCAL_CACHE_FACTORY for use by other packages

* Merged PR 6256: Add ability to duplicate pipelines in YAML.

* Adds `DuplicatePipelineConf` to be able to generate duplicate pipelines in YAML.
* Adds three diamond as test case.

* Merged PR 6250: oneml.processors.registry of pipeline providers

* Merged PR 6279: 📝 (docs) Update documentation.

📝 (docs) Update documentation.

* Merged PR 6286: fix container name in DatasetBlobStoreBaseLocationService

DatasetBlobStoreBaseLocationService provides the blob locations for datasets published by oneml.
It should read these from an installation-level configuration, but at the moment they are hard coded.
The container name used for production and non-production datasets outside notebooks is wrong; this PR hopefully fixes it.

* Merged PR 6288: Fix dataset blob location

* Merged PR 6293: fix get_relative_path when base_uri ends with /

fix get_relative_path when base_uri ends with /

* Merged PR 6298: 📝 Update docs.

📝 Update docs.

* Merged PR 6311: Service inheritance test

add test that verifies mypy screams when a service id is associated with a service that is a super-type of the declared service type

* Merged PR 6310: 🚸 Optional inputs when combining pipelines can be dropped and add tests.

🚸 Refactor pipeline input validation for optional inputs and add tests.

* Merged PR 6312: ✨ Add pipeline drop_inputs and drop_outputs and tests

✨ Add pipeline drop_inputs and drop_outputs methods

* Merged PR 6322: test publish_outputs_as_dataset when there are no input_uris

test publish_outputs_as_dataset when there are no input_uris

* Merged PR 6331: Add support for symbolic links in uris

Datasets published by oneml pipelines have a manifest.json at their root that maps output names to relative paths within the dataset.

Pipeline builders can read from uris using `OnemlProcessorsIoServices.READ_FROM_URI_PIPELINE_BUILDER`, which itself uses `ReadFromUriProcessor`, which takes a uri and returns the read object.

This PR adds the following semantics to uris:
If a uri has a fragment (i.e. something that follows `#`), then it is assumed that, after removing the fragment, the uri points to a json file.  The fragment is assumed to be a dot-separated hierarchical key into the json.  The value associated with the key should be either a relative path from the directory of the json, or an absolute uri.

Examples:
if `file:///path1/path2/index.json` holds:
```
    links:
        rel: path3/array.npy
        abs: ampds://mydataset/
        another_abs: ampds://mydataset/manifest.json#entry_uris.container1?namespace=mynamespace
```
and `ampds://mydataset/manifest.json` holds:
```
    entry_uris:
        container1: containers/container1
```

Then:
* `file:///path1/path2/index.json#links.rel` becomes `file:///path1/path2/path3/array.npy`
* `file:///path1/path2/index.json#links.abs` becomes `ampds://mydataset/`
* `file:///path1/path2/index.json#links.another_abs` becomes `ampds://mydataset/containers/container1?namespace=mynamespace`

The semantics are implemented in `ReadFromUriProcessor` and therefore applicable only to pipelines built directly or indirectly using `OnemlProcessorsIoServices.READ_FROM_URI_PIPELINE_BUILDER`.
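
A minimal sketch of how this resolution could work (illustrative only, not the actual `ReadFromUriProcessor` implementation; `read_json` is an assumed helper that fetches a uri and parses it as JSON):

```python
import posixpath
from collections.abc import Callable
from typing import Any
from urllib.parse import urldefrag, urlparse


def resolve_link_uri(uri: str, read_json: Callable[[str], Any]) -> str:
    """Resolve a uri whose fragment is a dot-separated key into a JSON link file."""
    base, fragment = urldefrag(uri)
    if not fragment:
        return uri
    # In the examples above, a query string may trail the fragment; keep it aside.
    fragment, _, query = fragment.partition("?")
    value = read_json(base)
    for key in fragment.split("."):  # walk the hierarchical key
        value = value[key]
    if urlparse(value).scheme:
        # Absolute uri: it may itself be a link, so resolve recursively.
        resolved = resolve_link_uri(value, read_json)
    else:
        # Relative path, interpreted from the directory holding the JSON file.
        resolved = posixpath.join(posixpath.dirname(base), value)
    return resolved + (f"?{query}" if query else "")
```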

* Fix mypy and tests after merge.

---------

Co-authored-by: elonp <[email protected]>
Co-authored-by: jzazo <[email protected]>
3 people authored Jan 4, 2024
1 parent a6e6817 commit c4cc4dc
Showing 148 changed files with 6,917 additions and 2,253 deletions.
7 changes: 4 additions & 3 deletions docs/user_documentation/faq.md
@@ -5,9 +5,10 @@
### What is OneML?

OneML is a framework to design and run pipelines for ML Research.
Users organize their code writing classes which include ML functions and specify inputs and
Users organize their code by writing classes which encode ML functions and declare inputs and
outputs.
Users design pipelines linking these functions together and can run them.
Users construct pipelines connecting these inputs and outputs into a DAG.
Pipelines can be run locally or in a cluster.


### Why is OneML different from other ML pipeline tools?
@@ -41,5 +42,5 @@ functionality, and where ML concepts are built in to facilitate validity and ana

OneML provides a framework to connect code organized in classes, abstracts away the complexities of
running distributed across multiple nodes, and facilitates organizing data around train and
evaluation splits, supporting cross-validation, hyperaparameter optimizatino, and other pipeline
evaluation splits, supporting cross-validation, hyperparameter optimization, and other pipeline
wrappers that are commonly used.
17 changes: 12 additions & 5 deletions docs/user_documentation/index.md
@@ -5,15 +5,22 @@

- **DAG**: Directed acyclic graph. We use the term as a synonym for *pipeline*.

- **Processor:** A class object with a *process* method. Implments on operation taking inputs and producing outputs. The signatures of the processor's constructor and process method define the inputs and outputs of the
- **Processor:** A class object with a *process* method. Implements an operation taking inputs and
producing outputs. The signatures of the processor's constructor and process method define the
inputs and outputs of the pipeline's *node*. A hypothetical sketch follows this list.

- **Node**: Vertex, or compute point, of a *pipeline*.
Nodes hold references to *processor* classes, input & output parameters, constructor arguments and compute requirements.
During orchestration, *processors* are instantiated and run from nodes encoded in the *pipeline*.
Nodes hold references to *processor* classes, input & output parameters, constructor arguments and
compute requirements. During orchestration, *processors* are instantiated and run from nodes
encoded in the *pipeline*.

- **Dependency:** Directed edge from one compute *node* output to another compute *node* input. The output of one *processor* feeds into the input of the next *processor*.
- **Port**: Input or output of a *node*. *Output ports* are connected to *input ports*, forming
a dependency, where the output of one *processor* is given as input to the receiving *processor*.

- **Dependency:** Directed edge from a *node's* output port to another *node's* input port.
The output of one *processor* feeds into the input of the next *processor*.
During orchestration, outputs are passed as inputs to the depending nodes.

- **Task:** A single node pipeline, created by wrapping over a Processor.
- **Task:** A single-node pipeline, created by wrapping a *processor* class.

- **TrainAndEval:** A classical ML pipeline, taking training data and eval data, learning a model using the training data, and using it to provide predictions on both training and eval data. OneML provides tools to help build and manipulate TrainAndEval pipelines, including automatically persisting a fitted eval pipeline that can take other eval data and provide predictions.
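
A minimal, hypothetical processor sketch illustrating the definitions above; the class, its parameters, and the convention of returning a `NamedTuple` whose fields name the outputs are illustrative assumptions, not the library's exact API:

```python
from typing import NamedTuple


class StandardizeOutput(NamedTuple):
    Z: list[float]  # one field per output of the node


class Standardize:
    """Toy processor: constructor and `process` signatures declare the node's parameters."""

    def __init__(self, shift: float, scale: float) -> None:
        self._shift = shift
        self._scale = scale

    def process(self, X: list[float]) -> StandardizeOutput:
        # `X` is an input of the node; the return annotation exposes a single output `Z`.
        return StandardizeOutput(Z=[(x - self._shift) / self._scale for x in X])
```
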
171 changes: 76 additions & 95 deletions docs/user_documentation/tutorial_advanced.md
@@ -5,36 +5,32 @@ and build your own meta-pipelines.

#### Table of Contents

- [Operations with Parameter Entries and Collections](#operations-with-parameter-entries-and-collections)
- [`InEntry` & `OutEntry`](#inentry--outentry)
- [Operations with InPort/OutPort and Inputs/Outputs](#operations-with-inportoutport-and-inputsoutputs)
- [`InPort` & `OutPort`](#inport--outport)
- [Left and right shift operations](#left-and-right-shift-operations)
- [Merge `InEntry`](#merge-inentry)
- [Merge `OutEntry`](#merge-outentry)
- [`Inputs` & `Outputs`](#inputs--outputs)
- [Merge `InPort`](#merge-inport)
- [Merge `OutPort`](#merge-outport)
- [`Inputs` & `Outputs`](#inputs--outputs)
- [Left and right shift operations](#left-and-right-shift-operations-1)
- [Merge operations](#merge-operations)
- [Subtract operations](#subtract-operations)
- [`InCollections` & `OutCollections`](#incollections--outcollections)
- [Left and right shift operations](#left-and-right-shift-operations-2)
- [Merge operations](#merge-operations-1)
- [Subtract operations](#subtract-operations-1)
- [`Pipeline`](#pipeline)
- [Left and right shift operations](#left-and-right-shift-operations-3)
- [Left and right shift operations](#left-and-right-shift-operations-2)
- [Rename inputs and outputs](#rename-inputs-and-outputs)
- [Decorate pipelines](#decorate-pipelines)
- [Pipelines Definition](#pipelines-definition)
- [Pipeline Definition](#pipelines-definition)
- [Meta-Pipelines](#meta-pipelines)
- [`Estimator` Example](#estimator-example)
- [Services](#services)

## Operations with Parameter Entries and Collections
## Operations with InPort/OutPort and Inputs/Outputs

Consider the `standardization`, `logistic_regression` and `stz_lr` examples from the
[intermediate tutorial](tutorial_intermediate.md).

![stz_lr](figures/stz_lr.png){ width="300" }

### `InEntry` & `OutEntry`
### `InPort` & `OutPort`

#### Left and right shift operations

@@ -48,16 +44,17 @@
or

```python
standardization.out_collections.Z.train >> logistic_regression.in_collections.X.train
logistic_regression.in_collections.X.eval << standardization.out_collections.Z.eval
standardization.outputs.Z.train >> logistic_regression.inputs.X.train
logistic_regression.inputs.X.eval << standardization.outputs.Z.eval
```

These operators return a dependency wrapped in a tuple (for compatibility with collections).
These operators return the matching dependencies wrapped in a
[`collections.abc.Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) object.
They can be passed as part of dependencies for combining pipelines.
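
For example, a sketch of collecting such dependencies and handing them to `PipelineBuilder.combine`; the `pipelines` and `name` keyword names are assumptions here, while the `dependencies` shape follows the pattern used later in this tutorial:

```python
deps = (
    *(stz_eval.inputs.mean << stz_train.outputs.mean),
    *(stz_eval.inputs.scale << stz_train.outputs.scale),
)
standardization = PipelineBuilder.combine(
    pipelines=[stz_train, stz_eval],  # assumed keyword name
    name="standardization",  # assumed keyword name
    dependencies=(deps,),
)
```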

#### Merge `InEntry`
#### Merge `InPort`

You can merge two `InEntry` into a single one using `|` between entries, which will bind the inputs
You can merge two `InPort` into a single one using `|` between entries, which will bind the inputs
together, exposing a single entry.
For example,

@@ -89,9 +86,9 @@ mapping the `probs` from `logistic_regression` to both `r1` and `r2`.

![broadcast](figures/broadcast.png){ width="300" }

#### Merge `OutEntry`
#### Merge `OutPort`

If you merge `OutEntry`s with `|` operator you will indicate that the outputs are to be
If you merge `OutPort`s with the `|` operator you indicate that the outputs are to be
concatenated.
For example,

@@ -123,22 +120,37 @@ summary.inputs.accuracies << reports.outputs.acc
```

We combine `r1` and `r2` into a single pipeline and we expose `acc` as single output, which is the
concatenation of `acc` from `r1` and `r2`.
concatenation of `acc` from `r1` and `r2` into a tuple.
Therefore, when used as a dependency assignment, a concatenation operation will be performed
gathering the `acc` from `r1` and `r2` into a single `acc` before passing it to `summary`.

Order is preserved in all merge operations.

![concatenate](figures/concatenate.png){ width="300" }

### `Inputs` & `Outputs` and `InCollection` & `OutCollection`
### `Inputs` & `Outputs` and `InPorts[T]` & `OutPorts[T]`

`InPorts[T]` and `OutPorts[T]` are extensions of `namedcollection[T]`, which is a data structure
that implements a nested dictionary with dot notation access.
Additionally, these types implement operations to create dependencies between them.
They are [generic classes](https://mypy.readthedocs.io/en/stable/generics.html),
where `T` is the type of the values stored in the collection.

`Inputs` and `Outputs` are aliases of `InPorts[Any]` and `OutPorts[Any]`, respectively, and
operations between them are identical, i.e.,

```python
Inputs = InPorts[Any]
Outputs = OutPorts[Any]
```

`InCollection` and `OutCollection` are aliases of `Inputs` and `Outputs`, respectively, and
operations between them are identical.
When annotating inputs or outputs of a pipeline, you can extend `Inputs` and `Outputs` to specify
the type of the values stored in the collection. See the
[annotations section](tutorial_intermediate.md#pipeline-annotations) for further details.

#### Left and right shift operations

Similar to entry assignments, you can create dependencies using the left and right shift operators.
Similar to port assignments, you can create dependencies using the left and right shift operators.

```python
stz_eval.inputs << stz_train.outputs
@@ -151,25 +163,41 @@
stz_eval.inputs.scale << stz_train.outputs.scale
```

or alternatively, operating with collections:
or alternatively, operating with variable collections:

```python
logistic_regression.in_collections.X << standardization.out_collections.Z
logistic_regression.inputs.X << standardization.outputs.Z
```

which is equivalent to:

```python
logistic_regression.in_collections.X.train << standardization.out_collections.Z.train
logistic_regression.in_collections.X.eval << standardization.out_collections.Z.eval
logistic_regression.inputs.X.train << standardization.outputs.Z.train
logistic_regression.inputs.X.eval << standardization.outputs.Z.eval
```

> :bulb: **Info:**
The set of names of the two collections needs to be identical.
Entries will be matched by name to create dependencies, e.g.,
`Z.train` with `X.train` and `Z.eval` with `X.eval`, respectively, in the above example.

The operation returns a tuple of dependencies created.
The operation returns a
[`collections.abc.Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)
object of dependencies created.

#### Rename operations

You can rename entries of `Inputs` and `Outputs` objects with `rename_inputs` and `rename_outputs`.

```python
stz_train = stz_train.rename_inputs({"X": "X.train"})
stz_eval = stz_eval.rename_inputs({"X": "X.eval"})
stz_train.inputs.X.train # InPort object
stz_eval.inputs.X.eval # InPort object
```

The syntax is `{"<old_name>": "<new_name>"}`, passed as the method's single argument.
The methods will return a new `Inputs` or `Outputs` object with renamed entries.

#### Merge operations

@@ -179,9 +207,9 @@ For example,
```python
stz_train = stz_train.rename_inputs({"X": "X.train"})
stz_eval = stz_eval.rename_inputs({"X": "X.eval"})
X_collection = stz_train.in_collections.X | stz_eval.in_collections.X
X_collection.train # OutEntry objects
X_collection.eval
X = stz_train.inputs.X | stz_eval.inputs.X
X.train # InPort object
X.eval # InPort object
```

The above example will create a single collection with two entries, `X.train` and `X.eval`.
@@ -190,64 +218,19 @@ If the collections share the same entry names, entries will be merged together a

This behavior is the same for `Outputs` objects.
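
A mirrored sketch for outputs, assuming `stz_train` and `stz_eval` each expose an output named `Z`, as in the combined `standardization` example:

```python
stz_train = stz_train.rename_outputs({"Z": "Z.train"})
stz_eval = stz_eval.rename_outputs({"Z": "Z.eval"})
Z = stz_train.outputs.Z | stz_eval.outputs.Z
Z.train # OutPort object
Z.eval # OutPort object
```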

The behavior for creating collections via `rename_inputs` or `rename_outputs` is explained in this
[section](##rename-inputs-and-outputs) below.

#### Subtract operations

You can subtract variable names or entry parameters:
You can subtract variable names from `Inputs` and `Outputs` objects:

```python
new_inputs = standardization.in_collections - ("X",)
new_inputs = standardization.in_collections - (standardization.in_collections.X.train,)
new_inputs_no_x = standardization.inputs - ("X",)
new_inputs_no_x_train = standardization.inputs - ("X.train",)
```

The syntax requires subtracting an `Iterable` (like `tuple`, `list`, `set`, etc.).
The syntax requires subtracting an `Iterable[str]`
(like `tuple[str, ...]`, `list[str]`, `collections.abc.Set[str]`, etc.).
If what you are trying to subtract does not exist, no error will be issued.
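
For instance, a quick sketch reusing `standardization` from above; the second name does not exist and is silently ignored:

```python
remaining = standardization.inputs - ("X.train", "not_an_input")  # no error for the unknown name
```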

### `InCollections` & `OutCollections`

#### Left and right shift operations

In the same spirit as with other types, one can do left and right shift operations on
`InCollections` and `OutCollections` types.

Shared variables between pipelines will be associated together:

```python
logistic_regression.out_collections >> logistic_regression.in_collections
```

which is equivalent to

```pythyon
logistic_regression.out_collections.X >> logistic_regression.in_collections.X
logistic_regression.out_collections.Y >> logistic_regression.in_collections.Y
```

Using the left / right operator in the wrong direction will raise an error.

#### Merge operations

Similar to `Inputs` and `Outputs`, one can merge `InCollections` and `OutCollections`.
Collections and entries will be merged together.

#### Subtract operations

You can subtract variables, single parameters, or collection of parameters:

```python
new_inputs = standardization.in_collections - ("X",)
new_inputs = standardization.in_collections - (standardization.in_collections.X,)
new_inputs = standardization.in_collections - (standardization.in_collections.X.train,)
```

All of the above are equivalent for `X` a collection of length one.
Otherwise, the whole collection will be subtracted if you pass `X` or
`standardization.in_collections.X`.

Note that the syntax requires subtracting an `Iterable` (like `tuple`, `list`, `set`, etc.).
If what you are trying to subtract does not exist, no error will be issued.

### `Pipeline`

@@ -257,8 +240,7 @@ The IO attributes above are slightly redundant in the above example, so one can
directly operating with pipelines.

Left / right shifting with pipeline objects will create the dependencies between the `inputs` and
`outputs`, `in_collections` and `out_collections`, of the pipelines in the direction of the shift,
for equally named collections.
`outputs` of the pipelines in the direction of the shift, for equally named collections.

Here is an example:

@@ -292,17 +274,17 @@ For example, if we want to rename the `X` input of `stz_train` to `features`:

```python
stz_train = stz_train.rename_inputs({"X": "features"})
stz_train.inputs.features # InEntry object
stz_train.inputs.features # InPort object
```

You can transform single entries into collections, or entries from collections into entries via dot
notation:

```python
standardization = standardization.rename_inputs({"X.train": "X_train", "X.eval": "X_eval"})
standardization.inputs.X_train # InEntry objects
standardization.inputs.X_train # InPort objects
standardization.inputs.X_eval
# standardization.in_collections.X.train # raises error
# standardization.inputs.X.train # raises error
```

If the new name of an entry already exists, or repeats, a merge operation will be performed.
@@ -317,8 +299,10 @@ reports = PipelineBuilder.combine(
)
reports.in_collection.acc # Inputs collection with two entries
reports.rename_inputs({"acc.r1": "acc", "acc.r2": "acc"}) # rename / merge operation
reports.inputs.acc # InEntry object with two entries merged together
reports.inputs.acc # InPort object with two entries merged together (broadcast)
```
See [`InPort` merge section](#merge-inport) for details on broadcast
and the [`OutPort` merge section](#merge-outport) for details on concatenation operations.

#### Decorate pipelines

@@ -359,9 +343,6 @@ A `Pipeline` is a (frozen) dataclass with the following attributes:
distinguish pipelines when combining.
- `inputs` (`oneml.processors.Inputs`): exposure of `Inputs` of a pipeline.
- `outputs` (`oneml.processors.Outputs`): exposure of `Outputs` of a pipeline.
- `in_collections` (`oneml.processors.InCollections`): exposure of `InCollections` of a pipeline.
- `out_collections` (`oneml.processors.OutCollections`): exposure of `OutCollections` of a
pipeline.

`Pipeline`s should not be instantiated directly.
Instead, `Pipeline`s should be created via `Task`s, combining other `Pipeline`s, or other
@@ -400,9 +381,9 @@ class Estimator(Pipeline):
# decorate shared dependencies to match newly decorated train and eval pipelines
dependencies = (dp.decorate("eval", "train") for dp in chain.from_iterable(dependencies))

# merge the `outputs` and `out_collections` of train and eval pipelines
# merge the `outputs` of train and eval pipelines
outputs: UserOutput = dict(train_pl.outputs | eval_pl.outputs)
outputs |= dict(train_pl.out_collections | eval_pl.out_collections)
outputs |= dict(train_pl.outputs | eval_pl.outputs)

# combine all ingredients into a new pipeline
p = PipelineBuilder.combine(
Expand All @@ -412,7 +393,7 @@ class Estimator(Pipeline):
outputs=outputs,
dependencies=(tuple(dependencies),),
)
super().__init__(name, p._dag, p.inputs, p.outputs, p.in_collections, p.out_collections)
super().__init__(name, p._dag, p.inputs, p.outputs)
```

A few clarifications: