Refactorised tabularisation + Jupyter notebook w/ experiments. #1399
Conversation
Codecov Report: Base 93.97% // Head 94.06% // Increases project coverage by +0.08%.
Additional details and impacted files:
```
@@            Coverage Diff             @@
##           master    #1399      +/-   ##
==========================================
+ Coverage   93.97%   94.06%   +0.08%
==========================================
  Files         122      122
  Lines       10744    10960     +216
==========================================
+ Hits        10097    10309     +212
- Misses        647      651       +4
```
☔ View full report at Codecov.
Thank you for this submission. We will be looking at it shortly and will get back to you as soon as possible.
Hey @eliane-maalouf, thanks for that : )
I think this refactoring could be a nice improvement to the tabularization function.
My main propositions/comments would be the following:
- I would recommend keeping the calls to the tabularization code under the function name `_create_lagged_data`, and allowing the caller to specify whether the call is being made from `fit()` or from `predict()` (see the sketch after this list). IMO this is less risky in terms of breaking code that depends on this function (at the very least, the existing unit tests should still pass).
- Currently, only `fit()` uses the tabularization code being refactored; `predict()` redoes a similar process (without calling `_create_lagged_data`) and additionally allows previous predictions to be added when the prediction horizon `n > output_chunk_length`. It might be nice to unify the feature generation across the different functions - it would indeed make the code easier to maintain. As for the `is_training` argument: it is currently used in `shap_explainer.py`, which also needs to build features for inference.
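A minimal sketch of the kind of single entry point with an `is_training` switch being suggested here (the array-based signature and names are assumptions for illustration, not Darts' actual API):

```python
import numpy as np

def _create_lagged_data(vals: np.ndarray, lags, is_training: bool = True):
    # `vals` is a toy 1D series; `lags` are negative offsets, e.g. [-2, -1].
    max_lag = -min(lags)
    t = np.arange(max_lag, len(vals))         # label times with full lag history
    X = vals[t[:, None] + np.asarray(lags)]   # lagged feature matrix
    if not is_training:
        return X                              # predict(): skip building labels
    y = vals[t]                               # fit(): labels aligned with X rows
    return X, y

vals = np.arange(6.0)
X_train, y_train = _create_lagged_data(vals, lags=[-2, -1])            # from fit()
X_pred = _create_lagged_data(vals, lags=[-2, -1], is_training=False)   # from predict()
```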
Hey @eliane-maalouf - thanks for your insightful comments : ). Just a few comments/questions from me: …
Hello @mabilton, to follow up on this PR: …
Hi @mabilton and thanks a lot for this proposal! As far as I can tell it looks good and could be included in an upcoming release. We will need a bit more time to review thoroughly :) A few points already: …
Thanks again!
Hi @eliane-maalouf, @hrzn - just a quick update from me. I've managed to implement the 'sliding window' method for equal-frequency series that I mentioned in my initial PR; as expected, it tends to be 2-3 times faster than the 'time intersection' method when all of the specified series are of the same frequency (see the sketch just after this comment). In implementing this, I also refactored what I previously had. Notably: …
In its current form, …
Hopefully what I've written here makes sense - let me know if you have any questions. Once again, thanks for all your help : ). Cheers,
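For readers unfamiliar with the trick, a rough sketch of what a 'moving window' tabularization can look like for a single equal-frequency series, using `numpy.lib.stride_tricks.sliding_window_view` (toy names and shapes, not the PR's actual code):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def moving_window_features(vals: np.ndarray, lags):
    """Build (X, y) from a 1D series; `lags` are negative offsets."""
    max_lag = -min(lags)
    # Each window holds the `max_lag` values preceding one label time;
    # vals[:-1] excludes the final point from the features so that every
    # window pairs with a label inside the series.
    windows = sliding_window_view(vals[:-1], window_shape=max_lag)
    # Column j of `windows` corresponds to lag j - max_lag:
    X = windows[:, [max_lag + lag for lag in lags]]
    y = vals[max_lag:]
    return X, y

X, y = moving_window_features(np.arange(10.0), lags=[-3, -1])
# First sample predicts vals[3] from vals[0] (lag -3) and vals[2] (lag -1):
assert (X[0] == [0.0, 2.0]).all() and y[0] == 3.0
```

Because `sliding_window_view` returns a strided view, no per-lag copies are made until the final column selection.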
@mabilton thanks for the update, I will have a look next week.
Thanks a lot @mabilton ! This looks really promising. Please bear with us (we are slow to review atm - busy preparing v0.23). I'm hopeful we can add your refactoring and improvements into v0.24.
Hey @hrzn - no worries! All of this can definitely wait until after Christmas and the New Year; in any case, I'm currently in the process of writing up tests for what I have (and fixing bugs that I discover while doing this), so it's probably for the best that you hold off on reviewing what I have for the time being. Thanks for all your hard work @hrzn and @eliane-maalouf - have a Merry Christmas and Happy New Year : ). Cheers,
Thank you @mabilton , happy new year to you too!
Thanks @mabilton for this work. I have mainly gone through the static covariates part and added some comments.
```python
    *self.extreme_lags,
    past_covariates=past_covariates,
    future_covariates=future_covariates,
    max_samples_per_ts=max_samples_per_ts,
)
```
Wouldn't calling `_add_static_covariates()` be better done inside the preceding for loop? In my opinion, this would help simplify the internal logic of `_add_static_covariates()`: if it only takes a specific target (from the input sequence) and its specific features, then one can avoid looping twice over the series inside `_add_static_covariates()`, if I understood the implementation correctly. WDYT?
In the current `_add_static_covariates()`, my assumption was that the function would receive all the features and would have to compute back everything it needs in terms of length and width of features, since I was expecting a change in the `_create_lagged_data()` outputs - but this might not be relevant anymore with the changes you made.
It would still be necessary, though, to go through all the series in the input sequence once beforehand to collect the static covariates information from all of them.
Definitely - in my opinion, the static covariates would ideally be added inside of `create_lagged_data` after each 'block' has been formed, but that would probably be a bit clumsy to implement at the moment, since the process of computing the static covariates requires the `n_features_in_` attribute of the `RegressionModel` object. Perhaps something to think about for a future PR? (A rough sketch of the per-block idea follows below.)
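To make the discussion concrete, here is a rough sketch of appending one series' static covariates onto its feature 'block' (hypothetical helper and shapes; the real `_add_static_covariates` additionally has to account for `n_features_in_`):

```python
import numpy as np

def add_static_covariates_to_block(X_block: np.ndarray, static_covs: np.ndarray) -> np.ndarray:
    # Tile the single row of static covariate values so that every sample
    # (row) of this series' feature block carries the same extra columns.
    tiled = np.tile(static_covs.reshape(1, -1), (X_block.shape[0], 1))
    return np.hstack([X_block, tiled])

X_block = np.ones((4, 3))            # 4 samples x 3 lagged features
static_covs = np.array([0.5, 2.0])   # this series' static covariates
assert add_static_covariates_to_block(X_block, static_covs).shape == (4, 5)
```

Calling something like this once per series inside the loop would avoid a second pass over the sequence, at the cost of needing the final feature width up front.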
Concerning your previous question about a test to check the order of the static covariates, I think it would be a good idea to make sure that the implementation is working as expected in this regard.
Hey @hrzn, @eliane-maalouf - thank you both for all your comments. I'll try to work through them ASAP.
Hey @hrzn, @eliane-maalouf - just letting you both know that I've read over all your very useful comments and that I'm in the process of implementing the suggested changes. In particular, I've had to make a few adjustments to the tabularization and testing code to account for the case where …
…trings; test `Sequence[TimeSeries]` inputs and stochastic inputs.
Hey @hrzn, @eliane-maalouf. So (I think) I've managed to address pretty much all of your comments in my latest push - if I've missed something, please let me know and apologies in advance. There are two outstanding issues I haven't really addressed yet: …
Finally, after mulling over it some more, I think that the 'moving window' method can actually be used for series of different frequencies by correctly selecting the … Once again, any and all comments are welcome, and thanks in advance for any help - apologies for the delay in making these amendments. Cheers,
Looks like the …
Hi @mabilton and many thanks for revising your PR. Will check your updates very soon.
…columns are explicitly checked.
Hey @hrzn, thanks for the update + bug fix. I've finished amending one of the static covariates tests so that the static covariate values appended to the feature matrix are explicitly checked (as opposed to just checking the shape of the feature matrix returned by …).
To perhaps clarify what I mentioned in my previous comment about appending the static covariate columns as soon as each feature matrix block is constructed, it might pay to have a look at …
Cheers,
Great work @mabilton!
Only one small comment left, regarding the docstring about preventing `lag=0` for past covariates.
Regarding your comment about the alternative way of adding the static covariates to the features array, as far as I can tell it makes sense 👍 I would suggest we wait for another PR and close this one first though.
Many thanks again, great stuff!
darts/utils/data/tabularization.py (Outdated)
The `lags` specified for the `target_series` must all be less than or equal to -1 (i.e. one can't use the value of the target series at time `t` to predict the target series at the same time `t`). Conversely, the values in `lags_past_covariates` and/or `lags_future_covariates` must be less than or equal to 0 (i.e. we *are* able to use the value of the past/future covariates at time `t` to predict the target series at the same time `t`).
I think this part of the docstring still requires adaptation, right?
Hey @hrzn, thanks for your kind feedback. Good spot with the docstring - I've fixed that now. Thanks to both you and @eliane-maalouf for your help throughout this PR.
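As an aside for readers, the lag convention from the docstring excerpt above can be illustrated with made-up numbers:

```python
import numpy as np

target = np.array([10.0, 11.0, 12.0, 13.0])
cov = np.array([0.1, 0.2, 0.3, 0.4])

t = 2                      # we want to predict target[t]
target_lags = [-2, -1]     # must be <= -1: target[t] itself is off-limits
cov_lags = [-1, 0]         # may include 0: cov[t] is allowed as a feature

features = [target[t + lag] for lag in target_lags] \
         + [cov[t + lag] for lag in cov_lags]
# features == [10.0, 11.0, 0.2, 0.3]; the label is target[2] == 12.0
```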
Great thanks @mabilton :) merging now 🚀 |
Addresses #1308 - Speedup creation of lagged data.
Summary
Hi there.
I've started some preliminary work on refactoring `/darts/utils/data/tabularization.py`. Here's a basic outline of the changes I've made:
- I've split the `_create_lagged_data` function into two: one which creates the `X` and `y` arrays for training/validation (`create_lagged_features_and_labels`), and another that creates the `X` array for predicting (`create_lagged_features`). The rationale behind this change is twofold: … the `X` array, since the `y` array is not constructed at all (note that in the current implementation of `_create_lagged_data`, `y` is still assembled even when `is_training=False`).
- … `nan` columns; I instead first compute the intersection of all the dates found in each series, and then lag the index of these common dates to construct the features/labels (see the sketch just after this list). The notable benefit of this approach is that it completely eliminates the `for` loops over the `lag` values.
- Unlike `_create_lagged_data`, I've made the functions I've implemented explicitly public, in case users wish to implement their own algorithms which rely on assembling lagged values.
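A rough sketch of the 'time intersection' idea described in the list above, for two toy pandas series (illustrative only; the actual implementation lives in `darts/utils/data/tabularization.py` and handles far more cases):

```python
import numpy as np
import pandas as pd

target = pd.Series(np.arange(10.0), index=pd.date_range("2000-01-01", periods=10))
cov = pd.Series(np.arange(8.0), index=pd.date_range("2000-01-03", periods=8))
lags, lags_cov = [-2, -1], [-1, 0]

# 1. Label times shared by all series, trimmed so every lag has history:
max_lag = -min(lags + lags_cov)
shared = target.index.intersection(cov.index)[max_lag:]

# 2. Lag the *positions* of the shared dates; fancy indexing builds all
#    lagged columns at once, with no Python loop over the lag values:
pos_t = target.index.get_indexer(shared)
pos_c = cov.index.get_indexer(shared)
X = np.hstack([
    target.to_numpy()[pos_t[:, None] + np.asarray(lags)],      # target lags
    cov.to_numpy()[pos_c[:, None] + np.asarray(lags_cov)],     # covariate lags
])
y = target.to_numpy()[pos_t]
```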
Other Information
From my very informal experiments, I've observed a ~10-fold speed-up on 'simple' problems which involve a couple of lag values, and a ~40-fold speed-up on 'larger' problems which involve many lag values (i.e. more than 10). If you'd like to run these experiments yourself, feel free to check out the `tabularization_experiments.ipynb` notebook. For reference, these benchmarks were performed on a ~4 year old Dell XPS 15 laptop.
In saying all this, what I've done is still very much a work in progress. Notably:
- I've tested my implementation in `tabularization_experiments.ipynb` over a very large number of input parameter combinations (10k+ combos), so I'm pretty confident my implementation here is correct.
- I don't fully understand the behaviour of `_create_lagged_data` when `is_training=False`. To see what I mean by this, please refer to the 'Understanding Behaviour of `_create_lagged_data` when `is_training=False`' section of `tabularization_experiments.ipynb`. For this example, why is `X = [15., 32., 61.]` and `Ts = [6]`? In particular, the `61` value contributed by `future_series` to `X` is only `-1` lags away from the `Ts = 6` value, not `-3` lags away?
- `_create_lagged_data` has an outer `for` loop over all of the specified `target_series`. What's the precise reason for this? Is there any assumption we can make about the shapes of the timeseries it's iterating over? In particular, would it be possible to first concatenate all the specified timeseries together, create the lagged variables, and then `np.split` the result at the end? (A toy illustration of this follows after this list.)
- If `past_series`, `future_series`, and `target_series` are all sampled at the same frequency, using something like `numpy.lib.stride_tricks.sliding_window_view` would probably be even faster than what I've written here, so that might be worth looking at.
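On the `np.split` question above, a toy illustration of the current per-series loop and the bookkeeping a concatenate-then-split approach would need (assumed shapes, not Darts code):

```python
import numpy as np

series_list = [np.arange(5.0), np.arange(10.0, 17.0)]   # two toy target series
lags = np.asarray([-2, -1])
max_lag = 2

# Today: one lagged 'block' is built per series in an outer loop.
blocks, sizes = [], []
for vals in series_list:
    t = np.arange(max_lag, len(vals))
    blocks.append(vals[t[:, None] + lags])
    sizes.append(len(t))

# The open question: could the lagging be done once on a concatenated
# array? The catch is that lag windows must not straddle two series, so
# the split points below would have to be tracked either way.
X_all = np.vstack(blocks)
X_per_series = np.split(X_all, np.cumsum(sizes)[:-1])
assert [b.shape for b in X_per_series] == [(3, 2), (5, 2)]
```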
Any comments/feedback on what I've done here would be very welcome - thanks in advance for any help.
Cheers,
Matt.