Refactor/data_transformers #1409
Conversation
Codecov Report

@@           Coverage Diff            @@
##           master    #1409    +/-   ##
=========================================
  Coverage   94.05%   94.06%
=========================================
  Files         125      125
  Lines       11185    11248    +63
=========================================
+ Hits        10520    10580    +60
- Misses        665      668     +3

View full report at Codecov.
I see that the `# scale back` cell contains the following line: `pred_air = scaler.inverse_transform(pred_air)`. From what I can see, the scaler was fitted with `scaler = Scaler()` followed by `train_air_scaled, train_milk_scaled = scaler.fit_transform([train_air, train_milk])`. What's the intended behaviour for a data transformation that is trained on two series but given only a single series to inverse transform? I was under the impression that an error should be thrown in such cases.
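For reference, the situation can be reproduced with something like the following sketch (the dataset loading and variable names here are assumptions based on the snippet above, not the exact notebook code):

```python
from darts.datasets import AirPassengersDataset, MonthlyMilkDataset
from darts.dataprocessing.transformers import Scaler

# Two independent series (assumed datasets; the notebook may use different ones).
train_air = AirPassengersDataset().load()
train_milk = MonthlyMilkDataset().load()

# Fitting on a list of two series produces two sets of fitted parameters.
scaler = Scaler()
train_air_scaled, train_milk_scaled = scaler.fit_transform([train_air, train_milk])

# Only a single series is later passed back for inverse transformation:
# which of the two fitted parameter sets should be used here, or should this raise?
pred_air = scaler.inverse_transform(train_air_scaled)
```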
Hi there. This is just a quick update from me: I've fixed the issue raised above. Once again, any comments or suggestions on what I've done here are more than welcome : ). Cheers,
Matt.
Thanks a lot @mabilton! We just need a bit more time to get to it and review :)
LGTM, I wrote some minor comments about details.
Thank you for this PR, which considerably simplifies the API of the data transformers and will certainly resolve many questions/issues related to them!
Hey @madtoinou - thanks for all the useful feedback. Just letting you know that I'm pretty busy in my personal life at the moment, so it'll probably take a day or two to implement your suggestions. Apologies for the delay. Cheers,
Matt.
That looks really good to me. We can almost merge it as such IMO :) Only got a few pretty minor comments. Nice job @mabilton! And sorry for the super late review...
Hey @hrzn, @madtoinou. I've (finally) found some time to implement your suggestions, along with a few other changes:
Hopefully that all makes sense, and please let me know what you guys think. Thanks in advance for any help. Cheers,
Matt.
LGTM, nice job @mabilton!
* Refactored data transformers classes.
* Fixed failing data transformer tests.
* Fixed minor bug in `test_diff.py` - `~ bool_var` should be `not bool_var`.
* Added missing `params` arg to pipeline test mock method.
* Added automatic `component_mask`ing of inputs/outputs.
* Added tests for data transformer classes.
* Fixed bug when fewer timeseries specified than training timeseries.
* Updated tests to check for 'fewer inputs than training series' behaviour.
* Added `global_fit` option to `FittableDataTransformer`.
* Refactored `StaticCovariatesTransformer`.
* Added `global_fit` option to `BoxCox` and `Scaler` transforms.
* Removed `test_window_transformer_iterator` test, since `_transformer_iterator` method unused.
* Removed redundant `_*_iterator` methods of data transformers.
* Added more data transformer documentation + made `component_mask` argument explicit.
* `copy=False` in `apply_component_mask`.
* Removed documentation references to `_*_iterators`.
* Specified `statsforecast>=1.4,<1.5` to avoid dependency conflict.

---------

Co-authored-by: Julien Herzen <[email protected]>
Co-authored-by: madtoinou <[email protected]>
Fixes #1407.
Summary
This PR makes two major changes to how data transformers work:

1. The fixed and fitted parameters of a transformation are now automatically passed to `ts_transform`, `ts_inverse_transform`, and `ts_fit` by default, without the user having to re-implement `_transform_iterator`, `_inverse_transform_iterator`, or `_fit_iterator`. To accommodate this, each of these `ts_*` methods now accepts a `params` (dictionary) argument, where `params['fixed']` stores the fixed parameters of the transformation (defined to be all those attributes defined in the child-most class before calling `super().__init__`) and `params['fitted']` stores the fitted parameters (i.e. what `ts_fit` returned).
2. `component_mask` keyword arguments will be automatically applied to the timeseries inputs given to `ts_*` and automatically 'unapplied' to the timeseries outputs returned by these methods, which means that users don't have to worry about 'manually' dealing with these arguments inside their implemented `ts_*` methods. If the user does not wish for `component_mask`s to be automatically applied, they may specify `mask_components=False` when calling `super().__init__`; this will cause any `component_mask` keyword argument to be passed via `kwargs` to the called method (i.e. the current behaviour). (A sketch of what a custom transformer looks like under this new interface is given below, after the `BoxCox` comparison.)
To see how these changes can help simplify the work involved in implementing a new transformation, compare the current implementation of `BoxCox` with the `BoxCox` implementation in this PR.
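To give a rough idea of the interface described above, here is a minimal sketch of what a custom fittable transformer might look like under this PR. The class name, its `_offset` parameter, and the exact method signatures are assumptions for illustration, not the actual `darts` code:

```python
from darts.dataprocessing.transformers import FittableDataTransformer


class MeanShifter(FittableDataTransformer):
    """Hypothetical transformer that de-means a series and adds a fixed offset."""

    def __init__(self, offset=0.0):
        # Attributes defined *before* super().__init__() become the fixed
        # parameters of the transformation, exposed as params['fixed']['_offset'].
        self._offset = offset
        # mask_components=True (assumed default) means any component_mask kwarg
        # is applied/unapplied automatically around the ts_* methods.
        super().__init__(name="MeanShifter", mask_components=True)

    @staticmethod
    def ts_fit(series, params):
        # Whatever is returned here becomes params['fitted'] for this series.
        return series.values().mean()

    @staticmethod
    def ts_transform(series, params):
        mean = params["fitted"]
        offset = params["fixed"]["_offset"]
        return series - mean + offset
```

Note that no `_fit_iterator` or `_transform_iterator` needs to be written here; the base class takes care of routing `params` to each call.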
Other Information

Some other minor changes that come with this PR:
* Split the `_reshape_in` method into two new methods (`apply_component_mask` and `stack_samples`), as well as `_reshape_out` into two new methods (`unapply_component_mask` and `unstack_samples`). There are two reasons for this change:
  * `_reshape_in` and `_reshape_out` were responsible for applying two distinctly different changes to the data: masking component columns and stacking the samples of each component along a single axis. From a user interaction and maintainability perspective, I think it's much cleaner to have these two pieces of functionality separated from one another. The names `_reshape_in` and `_reshape_out` are, in my opinion, also a bit vague.
  * In the old `_reshape_in`/`_reshape_out`, the 'stacking step' was performed using a `for` loop; I've changed this so that only `np.swapaxes` and `np.reshape` operations are used, which theoretically should speed things up a bit (see the sketch after this list).
* Refactored the `StaticCovariatesTransformer`, which directly overrides the `fit`, `inverse_transform`, and `transform` methods anyways. Similarly, I also had to make some minor adjustments to existing tests.
* Added more documentation to the `BaseDataTransformer`, `FittableDataTransformer`, and `InvertibleDataTransformer` classes.
* Some transformations, such as `BoxCox`, allow for different fixed parameter values to be distributed over different parallel jobs. To facilitate this, I've added a `parallel_params` argument to the `*Transformer` classes, which allows the user to specify which parameters should take different values for different parallel jobs.
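As an illustration of the `np.swapaxes`/`np.reshape` stacking mentioned in the first bullet above, here's a toy example (not the actual `darts` code; it assumes the `(n_timesteps, n_components, n_samples)` layout of `TimeSeries` values):

```python
import numpy as np

# Toy array in (n_timesteps, n_components, n_samples) layout.
vals = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# Loop-based stacking: flatten each component's samples over time, one at a time.
stacked_loop = np.stack(
    [vals[:, i, :].ravel() for i in range(vals.shape[1])], axis=1
)

# Vectorised equivalent using only swapaxes + reshape.
stacked_vec = np.swapaxes(vals, 1, 2).reshape(-1, vals.shape[1])

assert np.array_equal(stacked_loop, stacked_vec)  # same (time * samples, components) result
```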
There are two drawbacks to what I've done here:

* These changes could, in principle, break existing custom transformer implementations; that said, I don't think they affect any of the built-in `darts` transformations (although I could be wrong).
* Users must now remember to call `super().__init__` after initialising the fixed parameters of their transformation. For example, the following will allow the user to access `'_my_param'` in `params['fixed']`:
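(A rough sketch of such a definition; the base class used here and the `name` argument are assumptions for illustration.)

```python
from darts.dataprocessing.transformers import BaseDataTransformer


class MyTransformer(BaseDataTransformer):  # hypothetical example class
    def __init__(self, my_param):
        # '_my_param' is set *before* super().__init__() is called, so it is
        # picked up as a fixed parameter, available as params['fixed']['_my_param'].
        self._my_param = my_param
        super().__init__(name="MyTransformer")

    @staticmethod
    def ts_transform(series, params):
        # The fixed parameter defined in __init__ can be accessed here:
        return series * params["fixed"]["_my_param"]
```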
Any thoughts/comments on these changes are more than welcome.
Cheers,
Matt.