Refactor/data_transformers #1409

Merged
merged 36 commits into unit8co:master on Mar 7, 2023

Conversation

@mabilton (Contributor) commented on Dec 4, 2022

Fixes #1407.

Summary

This PR makes two major changes to how data transformers work:

  1. Fixed parameters and fitted parameters (if any) are now 'automatically' passed to ts_transform, ts_inverse_transform, and ts_fit by default, without the user having to re-implement _transform_iterator, _inverse_transform_iterator, or _fit_iterator. To accommodate this, each of these ts_* methods now accepts a params (dictionary) argument, where params['fixed'] stores the fixed parameters of the transformation (defined to be all those attributes defined in the child-most class before calling super().__init__) and params['fitted'] stores the fitted parameters (i.e. what ts_fit returned).
  2. component_mask keyword arguments will be automatically applied to timeseries inputs given to ts_* and automatically 'unapplied' to timeseries outputs returned by these methods, which means that users don't have to worry about 'manually' dealing with these arguments inside of their implemented ts_* methods. If the user does not wish for component_masks to be automatically applied, they may specify mask_components=False when calling super().__init__; this will cause any component_mask keyword argument to be passed via kwargs to the called method (i.e. current behaviour).

To see how these changes can help simplify the work involved in implementing a new transformation, compare the current implementation of BoxCox with the BoxCox implementation in this PR; a minimal sketch of the new interface is also included below.
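As a rough sketch of what the new interface enables (illustrative only: the ScaleByFactor class is hypothetical, the @staticmethod usage mirrors the refactored BoxCox, and exact signatures should be checked against the refactored base classes):

    import numpy as np
    from darts import TimeSeries
    from darts.dataprocessing.transformers import BaseDataTransformer

    class ScaleByFactor(BaseDataTransformer):
        # Hypothetical transformer: multiplies all (unmasked) values by a factor.
        def __init__(self, factor: float = 2.0):
            # Attributes defined *before* `super().__init__()` are collected
            # into `params['fixed']` and handed to `ts_transform` automatically.
            self._factor = factor
            # With the default `mask_components=True`, any `component_mask`
            # passed to `transform` is applied/unapplied for us.
            super().__init__()

        @staticmethod
        def ts_transform(series: TimeSeries, params: dict) -> TimeSeries:
            return series * params["fixed"]["_factor"]

    series = TimeSeries.from_values(np.arange(6, dtype=float).reshape(3, 2))
    doubled = ScaleByFactor(factor=2.0).transform(series)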

Other Information

Some other minor changes that come with this PR:

  1. I split up the _reshape_in method into two new methods (apply_component_mask and stack_samples), as well as _reshape_out into two new methods (unapply_component_mask and unstack_samples). There are three reasons for this change:
    • _reshape_in and _reshape_out were responsible for applying two distinct changes to the data: masking component columns and stacking the samples of each component along a single axis. From a user interaction and maintainability perspective, I think it's much cleaner to have these two pieces of functionality separated from one another. The names _reshape_in and _reshape_out are, in my opinion, also a bit vague.
    • In the original implementation of _reshape_in/_reshape_out, the 'stacking step' was performed using a for loop; I've changed this so that only np.swapaxes and np.reshape operations are used, which should theoretically speed things up a bit (see the sketch after this list).
    • I've made these new methods public, so that it's explicitly clear to new users looking to write their own transformations that it's "okay" to call these functions.
  2. I've refactored all of the currently implemented transformations so that they conform with these changes, with the exception of StaticCovariatesTransformer, which directly overrides the fit, inverse_transform and transform methods anyway. Similarly, I also had to make some minor adjustments to existing tests.
  3. I've implemented new tests for the refactored BaseDataTransformer, FittableDataTransformer, and InvertibleDataTransformer classes.
  4. Some transformations, such as BoxCox, allow for different fixed parameter values to be distributed over different parallel jobs. To facilitate this, I've added a parallel_params argument to the *Transformer classes, which allows the user to specify which parameters should take different values for different parallel jobs.
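To illustrate the vectorised stacking from point 1, here is a NumPy-only sketch of the idea (not darts' verbatim implementation): samples are moved adjacent to the time axis with np.swapaxes, then collapsed with np.reshape; the inverse simply undoes both steps.

    import numpy as np

    def stack_samples(vals: np.ndarray) -> np.ndarray:
        # (n_timesteps, n_components, n_samples) -> (n_timesteps * n_samples, n_components)
        n_t, n_c, n_s = vals.shape
        return np.swapaxes(vals, 1, 2).reshape(n_t * n_s, n_c)

    def unstack_samples(vals: np.ndarray, n_t: int, n_c: int, n_s: int) -> np.ndarray:
        # Inverse of `stack_samples`: recovers (n_timesteps, n_components, n_samples).
        return np.swapaxes(vals.reshape(n_t, n_s, n_c), 1, 2)

    vals = np.random.rand(10, 2, 5)
    assert np.array_equal(unstack_samples(stack_samples(vals), 10, 2, 5), vals)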

There are two drawbacks to what I've done here:

  1. Some of these changes are slightly breaking; that being said, I doubt many users currently maintain private code bases with custom-implemented darts transformations (although I could be wrong).
  2. One 'gotcha' with this proposed interface is that the user must call super().__init__ only after initialising the fixed parameters of their transformation. For example, the following will allow the user to access '_my_param' in params['fixed']:
    class MyTransform(BaseDataTransformer):
        def __init__(self):
            # Correct: define fixed parameter *before* calling `super().__init__`
            self._my_param = 1
            super().__init__()

    whereas the following will not:

    class MyTransform(BaseDataTransformer):
        def __init__(self):
            # Incorrect: define fixed parameter *after* calling `super().__init__`
            super().__init__()
            self._my_param = 1

Any thoughts/comments on these changes are more than welcome.

Cheers,
Matt.

@codecov-commenter commented on Dec 4, 2022

Codecov Report

Patch coverage: 98.04% and no project coverage change

Comparison is base (d2ba591) 94.05% compared to head (93dc878) 94.06%.


Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1409   +/-   ##
=======================================
  Coverage   94.05%   94.06%           
=======================================
  Files         125      125           
  Lines       11185    11248   +63     
=======================================
+ Hits        10520    10580   +60     
- Misses        665      668    +3     
Impacted Files Coverage Δ
darts/dataprocessing/transformers/boxcox.py 95.91% <90.90%> (-4.09%) ⬇️
...taprocessing/transformers/base_data_transformer.py 96.55% <96.20%> (-0.23%) ⬇️
...sing/transformers/static_covariates_transformer.py 99.30% <98.95%> (+0.85%) ⬆️
darts/dataprocessing/transformers/diff.py 100.00% <100.00%> (ø)
...ocessing/transformers/fittable_data_transformer.py 98.36% <100.00%> (+1.69%) ⬆️
...essing/transformers/invertible_data_transformer.py 97.14% <100.00%> (+0.98%) ⬆️
darts/dataprocessing/transformers/mappers.py 100.00% <100.00%> (ø)
...taprocessing/transformers/missing_values_filler.py 100.00% <100.00%> (ø)
...arts/dataprocessing/transformers/reconciliation.py 99.04% <100.00%> (-0.04%) ⬇️
darts/dataprocessing/transformers/scaler.py 97.29% <100.00%> (-0.27%) ⬇️
... and 8 more


☔ View full report at Codecov.

@mabilton (Contributor, Author) commented on Dec 5, 2022

So I see that the following line in cell [40] within the 00-quickstart.ipynb notebook is throwing an error:

# scale back:
pred_air = scaler.inverse_transform(pred_air)

From what I can see, scaler is fitted to two different timeseries in cell [31]:

scaler = Scaler()
train_air_scaled, train_milk_scaled = scaler.fit_transform([train_air, train_milk])

What's the intended behaviour for a data transformation which is trained on two series but is given only a single series to inverse transform? I was under the impression that an error should be thrown in such cases.

@mabilton (Contributor, Author) commented on Dec 14, 2022

Hi there.

This is just a quick update from me: I've fixed the *Transformer classes so that they correctly handle being passed fewer timeseries than they were trained on; all checks are now passing as a result. I've also added explicit tests for this behaviour.
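Concretely, the previously failing quickstart call should now behave as expected (a sketch, reusing the objects quoted in my earlier comment):

    # `scaler` was fitted on [train_air, train_milk]; passing back only the
    # first series now inverse-transforms it using the parameters fitted
    # to `train_air`.
    pred_air = scaler.inverse_transform(pred_air)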

Once again, any comments or suggestions on what I've done here are more than welcome :)

Cheers,
Matt.

@hrzn (Contributor) commented on Dec 16, 2022

Thanks a lot @mabilton! We just need a bit more time to get to it and review :)

@madtoinou (Collaborator) left a comment

LGTM; I left some minor comments on details.

Thank you for this PR, which considerably simplifies the API of the data transformers and will certainly resolve many questions/issues related to them!

@mabilton (Contributor, Author) commented

Hey @madtoinou - thanks for all the useful feedback. Just letting you know that I'm pretty busy in my personal life at the moment, so it'll probably take a day or two to implement your suggestions. Apologies for the delay.

Cheers,
Matt.

@hrzn (Contributor) left a comment

That looks really good to me. We can almost merge it as such IMO :) Only got a few pretty minor comments. Nice job @mabilton! And sorry for the super late review...

@mabilton (Contributor, Author) commented on Mar 4, 2023

Hey @hrzn , @madtoinou.

I've (finally) found some time to implement your suggestions in addition to the following changes:

  1. I've added more detail to the docstrings of the transformer classes/methods.
  2. I've added a 'global_fit' option to the FittableDataTransformer. When global_fit = True, the entire Sequence of TimeSeries passed to the fit method will be passed to ts_fit, thereby allowing the user to fit a single set of parameters over all of the provided (disjoint) TimeSeries, as opposed to independently fitting a different set of parameters to each TimeSeries (a brief usage sketch follows this list). At this point in time, only the BoxCox, Scaler, and StaticCovariatesTransformer (can) use this option. Indeed, my motivation for introducing this global_fit option was so I could refactor the StaticCovariatesTransformer, which I'll talk about now.
  3. In the last iteration of this PR, the StaticCovariatesTransformer was basically left untouched and was implemented in a way that did not utilise the refactored fit, transform, and inverse_transform methods. Since fitting the StaticCovariatesTransformer requires first collating the static covariates of all of the provided TimeSeries, I first had to introduce a 'global fitting' option before refactoring StaticCovariatesTransformer (as mentioned in point 2). With that done, I could then refactor StaticCovariatesTransformer so that only a ts_transform, ts_inverse_transform, and a ts_fit method needed to be implemented (as opposed to directly overriding the transform, inverse_transform, and fit methods).
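A brief usage sketch of the new option (assuming series_a and series_b are existing TimeSeries objects):

    from darts.dataprocessing.transformers import Scaler

    # global_fit=True: fit one set of scaling parameters across *all* series,
    # rather than one set per series (the default, global_fit=False).
    scaler = Scaler(global_fit=True)
    scaled_a, scaled_b = scaler.fit_transform([series_a, series_b])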

As an aside, it appears that statsforecast==1.5.0 (which was only released a few days ago) introduces some dependency conflicts, which caused some tests to fail. It looks like this is a problem on the statsforecast side, so to get around it for the time being, I've specified statsforecast>=1.4,<1.5 in requirements/core.txt.

Hopefully that all makes sense, and please let me know what you guys think. Thanks in advance for any help.

Cheers,
Matt.

@hrzn (Contributor) left a comment

LGTM, nice job @mabilton!

@madtoinou madtoinou merged commit 8972d2e into unit8co:master Mar 7, 2023
@madtoinou madtoinou mentioned this pull request Mar 7, 2023
alexcolpitts96 pushed a commit to alexcolpitts96/darts that referenced this pull request May 31, 2023
* Refactored data transformers classes.

* Fixed failing data transformer tests.

* Fixed minor bug in `test_diff.py` - `~ bool_var` should be `not bool_var`.

* Added missing `params` arg to pipeline test mock method.

* Added automatic `component_mask`ing of inputs/outputs.

* Added tests for data transformer classes.

* Fixed bug when fewer timeseries specified than training timeseries.

* Updated tests to check for 'fewer inputs than training series' behaviour.

* Added `global_fit` option to `FittableDataTransformer`.

* Refactored `StaticCovariatesTransformer`.

* Added `global_fit` option to `BoxCox` and `Scaler` transforms.

* Removed `test_window_transformer_iterator` test, since `_transformer_iterator` method unused.

* Removed redundant `_*_iterator` methods of data transformers.

* Added more data transformer documentation + made `component_mask` argument explicit.

* `copy=False` in `apply_component_mask`.

* Removed documentation references to `_*_iterators`.

* Specified `statsforecast>=1.4,<1.5` to avoid dependency conflict.

---------

Co-authored-by: Julien Herzen <[email protected]>
Co-authored-by: madtoinou <[email protected]>
Successfully merging this pull request may close these issues:

Automatic _*_iterators and component_masking for Data Transformers (#1407)

4 participants