Refactorised tabularisation + Jupyter notebook w/ experiments. #1399
Conversation
Codecov Report: Base 93.97% // Head 94.06% // Increases project coverage by +0.08%.
Additional details and impacted files:
```
@@            Coverage Diff             @@
##           master    #1399      +/-   ##
==========================================
+ Coverage   93.97%   94.06%   +0.08%
==========================================
  Files         122      122
  Lines       10744    10960     +216
==========================================
+ Hits        10097    10309     +212
- Misses        647      651       +4
```
☔ View full report at Codecov.
Thank you for this submission. We will be looking at it shortly and will get back to you as soon as possible.
Hey @eliane-maalouf, thanks for that : )
I think this refactoring could be a nice improvement to the tabularization function.
My main propositions/comments would be the following:
- I would recommend keeping the calls to the tabularization code under the function name `_create_lagged_data`, and allowing the caller to specify whether the call is being made from `fit()` or from `predict()` (see the sketch after this list). IMO this is less risky in terms of breaking code that depends on this function (at the very least, the existing unit tests should still pass).
- Currently, only `fit()` uses the tabularization code being refactored; `predict()` redoes a similar process (without calling `_create_lagged_data`) and additionally allows previous predictions to be added when the prediction horizon `n > output_chunk_length`. It might be nice to unify the feature generation across the different functions - it would indeed make the code easier to maintain. As for the `is_training` argument: it is currently used in `shap_explainer.py`, which also needs to build features for inference.
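A minimal sketch of the kind of single entry point with an `is_training` switch being suggested here (the array-based signature and names are assumptions for illustration, not Darts' actual API):

```python
import numpy as np

def _create_lagged_data(vals: np.ndarray, lags, is_training: bool = True):
    # `vals` is a toy 1D series; `lags` are negative offsets, e.g. [-2, -1].
    max_lag = -min(lags)
    t = np.arange(max_lag, len(vals))         # label times with full lag history
    X = vals[t[:, None] + np.asarray(lags)]   # lagged feature matrix
    if not is_training:
        return X                              # predict(): skip building labels
    y = vals[t]                               # fit(): labels aligned with X rows
    return X, y

vals = np.arange(6.0)
X_train, y_train = _create_lagged_data(vals, lags=[-2, -1])            # from fit()
X_pred = _create_lagged_data(vals, lags=[-2, -1], is_training=False)   # from predict()
```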
Hey @eliane-maalouf - thanks for your insightful comments : ). Just a few comments/questions from me: …
Hello @mabilton, to follow up on this PR: …
Hi @mabilton and thanks a lot for this proposal! As far as I can tell it looks good and could be included in an upcoming release. We will need a bit more time to review thoroughly :) A few points already: …
Thanks again!
Hi @eliane-maalouf, @hrzn - just a quick update from me. I've managed to implement the 'sliding window' method for equal-frequency series that I mentioned in my initial PR; as expected, it tends to be 2-3 times faster than the 'time intersection' method when all of the specified series are of the same frequency (see the sketch just after this comment). In implementing this, I also refactored what I previously had. Notably: …
In its current form, …
Hopefully what I've written here makes sense - let me know if you have any questions. Once again, thanks for all your help : ). Cheers,
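For readers unfamiliar with the trick, a rough sketch of what a 'moving window' tabularization can look like for a single equal-frequency series, using `numpy.lib.stride_tricks.sliding_window_view` (toy names and shapes, not the PR's actual code):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def moving_window_features(vals: np.ndarray, lags):
    """Build (X, y) from a 1D series; `lags` are negative offsets."""
    max_lag = -min(lags)
    # Each window holds the `max_lag` values preceding one label time;
    # vals[:-1] excludes the final point from the features so that every
    # window pairs with a label inside the series.
    windows = sliding_window_view(vals[:-1], window_shape=max_lag)
    # Column j of `windows` corresponds to lag j - max_lag:
    X = windows[:, [max_lag + lag for lag in lags]]
    y = vals[max_lag:]
    return X, y

X, y = moving_window_features(np.arange(10.0), lags=[-3, -1])
# First sample predicts vals[3] from vals[0] (lag -3) and vals[2] (lag -1):
assert (X[0] == [0.0, 2.0]).all() and y[0] == 3.0
```

Because `sliding_window_view` returns a strided view, no per-lag copies are made until the final column selection.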
@mabilton thanks for the update, I will have a look next week.
Thanks a lot @mabilton ! This looks really promising. Please bear with us (we are slow to review atm - busy preparing v0.23). I'm hopeful we can add your refactoring and improvements into v0.24.
Hey @hrzn - no worries! All of this can definitely wait until after Christmas and the New Year; in any case, I'm currently in the process of writing up tests for what I have (and fixing bugs that I discover while doing this), so it's probably for the best that you hold off on reviewing what I have for the time being. Thanks for all your hard work @hrzn and @eliane-maalouf - have a Merry Christmas and Happy New Year : ). Cheers,
Thank you @mabilton , happy new year to you too!
Thanks @mabilton for this work. I have mainly gone through the static covariates part and added some comments.
```python
    *self.extreme_lags,
    past_covariates=past_covariates,
    future_covariates=future_covariates,
    max_samples_per_ts=max_samples_per_ts,
)
```
Wouldn't calling `_add_static_covariates()` be better done inside the preceding for loop? In my opinion, this would help simplify the internal logic of `_add_static_covariates()`: if it only takes a specific target (from the input sequence) and its specific features, then one can avoid looping twice over the series inside `_add_static_covariates()`, if I understood the implementation correctly. WDYT?
In the current `_add_static_covariates()`, my assumption was that the function would receive all the features and would have to compute back everything it needs in terms of length and width of features, since I was expecting a change in the `_create_lagged_data()` outputs - but this might not be relevant anymore with the changes you made.
It would still be necessary, though, to go through all the series in the input sequence once beforehand to collect the static covariates information from all of them.
Definitely - in my opinion, the static covariates would ideally be added inside of `create_lagged_data` after each 'block' has been formed, but that would probably be a bit clumsy to implement at the moment, since the process of computing the static covariates requires the `n_features_in_` attribute of the `RegressionModel` object. Perhaps something to think about for a future PR? (A rough sketch of the per-block idea follows below.)
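To make the discussion concrete, here is a rough sketch of appending one series' static covariates onto its feature 'block' (hypothetical helper and shapes; the real `_add_static_covariates` additionally has to account for `n_features_in_`):

```python
import numpy as np

def add_static_covariates_to_block(X_block: np.ndarray, static_covs: np.ndarray) -> np.ndarray:
    # Tile the single row of static covariate values so that every sample
    # (row) of this series' feature block carries the same extra columns.
    tiled = np.tile(static_covs.reshape(1, -1), (X_block.shape[0], 1))
    return np.hstack([X_block, tiled])

X_block = np.ones((4, 3))            # 4 samples x 3 lagged features
static_covs = np.array([0.5, 2.0])   # this series' static covariates
assert add_static_covariates_to_block(X_block, static_covs).shape == (4, 5)
```

Calling something like this once per series inside the loop would avoid a second pass over the sequence, at the cost of needing the final feature width up front.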
Concerning your previous question about a test to check the order of the static covariates, I think it would be a good idea to make sure that the implementation is working as expected in this regard.
Hey @hrzn, @eliane-maalouf - thank you both for all your comments. I'll try to work through them ASAP.
Hey @hrzn, @eliane-maalouf - just letting you both know that I've read over all your very useful comments and that I'm in the process of implementing the suggested changes. In particular, I've had to make a few adjustments to the tabularization and testing code to account for the case where …
…trings; test `Sequence[TimeSeries]` inputs and stochastic inputs.
Hey @hrzn, @eliane-maalouf. So (I think) I've managed to address pretty much all of your comments in my latest push - if I've missed something, please let me know and apologies in advance. There are two outstanding issues I haven't really addressed yet: …
Finally, after mulling over it some more, I think that the 'moving window' method can actually be used for series of different frequencies by correctly selecting the … Once again, any and all comments are welcome, and thanks in advance for any help - apologies for the delay in making these amendments. Cheers,
Looks like the …
Hi @mabilton and many thanks for revising your PR. Will check your updates very soon.
…columns are explicitly checked.
Hey @hrzn, thanks for the update + bug fix. I've finished amending one of the static covariates tests so that the static covariate values appended to the feature matrix are explicitly checked (as opposed to just checking the shape of the feature matrix returned by …).
To perhaps clarify what I mentioned in my previous comment about appending the static covariate columns as soon as each feature matrix block is constructed, it might pay to have a look at …
Cheers,
Great work @mabilton!
Only one small comment left, regarding the docstring about preventing `lag=0` for past covariates.
Regarding your comment about the alternative way of adding the static covariates to the features array, as far as I can tell it makes sense 👍 I would suggest we wait for another PR and close this one first though.
Many thanks again, great stuff!
darts/utils/data/tabularization.py (Outdated)
The `lags` specified for the `target_series` must all be less than or equal to -1 (i.e. one can't use the value of the target series at time `t` to predict the target series at the same time `t`). Conversely, the values in `lags_past_covariates` and/or `lags_future_covariates` must be less than or equal to 0 (i.e. we *are* able to use the value of the past/future covariates at time `t` to predict the target series at the same time `t`).
I think this part of the docstring still requires adaptation, right?
Hey @hrzn, thanks for your kind feedback. Good spot with the docstring - I've fixed that now. Thanks to both you and @eliane-maalouf for your help throughout this PR.
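As an aside for readers, the lag convention from the docstring excerpt above can be illustrated with made-up numbers:

```python
import numpy as np

target = np.array([10.0, 11.0, 12.0, 13.0])
cov = np.array([0.1, 0.2, 0.3, 0.4])

t = 2                      # we want to predict target[t]
target_lags = [-2, -1]     # must be <= -1: target[t] itself is off-limits
cov_lags = [-1, 0]         # may include 0: cov[t] is allowed as a feature

features = [target[t + lag] for lag in target_lags] \
         + [cov[t + lag] for lag in cov_lags]
# features == [10.0, 11.0, 0.2, 0.3]; the label is target[2] == 12.0
```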
Great thanks @mabilton :) merging now 🚀 |
Addresses #1308 - Speedup creation of lagged data.
Summary
Hi there.
I've started some preliminary work on refactoring `/darts/utils/data/tabularization.py`. Here's a basic outline of the changes I've made:
- I've split the `_create_lagged_data` function into two: one which creates the `X` and `y` arrays for training/validation (`create_lagged_features_and_labels`), and another that creates the `X` array for predicting (`create_lagged_features`). The rationale behind this change is twofold: … the `X` array, since the `y` array is not constructed at all (note that in the current implementation of `_create_lagged_data`, `y` is still assembled even when `is_training=False`).
- … `nan` columns; I instead first compute the intersection of all the dates found in each series, and then lag the index of these common dates to construct the features/labels (see the sketch just after this list). The notable benefit of this approach is that it completely eliminates the `for` loops over the `lag` values.
- Unlike `_create_lagged_data`, I've made the functions I've implemented explicitly public, in case users wish to implement their own algorithms which rely on assembling lagged values.
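A rough sketch of the 'time intersection' idea described in the list above, for two toy pandas series (illustrative only; the actual implementation lives in `darts/utils/data/tabularization.py` and handles far more cases):

```python
import numpy as np
import pandas as pd

target = pd.Series(np.arange(10.0), index=pd.date_range("2000-01-01", periods=10))
cov = pd.Series(np.arange(8.0), index=pd.date_range("2000-01-03", periods=8))
lags, lags_cov = [-2, -1], [-1, 0]

# 1. Label times shared by all series, trimmed so every lag has history:
max_lag = -min(lags + lags_cov)
shared = target.index.intersection(cov.index)[max_lag:]

# 2. Lag the *positions* of the shared dates; fancy indexing builds all
#    lagged columns at once, with no Python loop over the lag values:
pos_t = target.index.get_indexer(shared)
pos_c = cov.index.get_indexer(shared)
X = np.hstack([
    target.to_numpy()[pos_t[:, None] + np.asarray(lags)],      # target lags
    cov.to_numpy()[pos_c[:, None] + np.asarray(lags_cov)],     # covariate lags
])
y = target.to_numpy()[pos_t]
```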
Other Information
From my very informal experiments, I've observed a ~10-fold speed-up on 'simple' problems which involve a couple of lag values, and a ~40-fold speed-up on 'larger' problems which involve many lag values (i.e. more than 10). If you'd like to run these experiments yourself, feel free to check out the `tabularization_experiments.ipynb` notebook. For reference, these benchmarks were performed on a ~4 year old Dell XPS 15 laptop.
In saying all this, what I've done is still very much a work in progress. Notably:
- I've tested my implementation in `tabularization_experiments.ipynb` over a very large number of input parameter combinations (10k+ combos), so I'm pretty confident my implementation here is correct.
- I don't fully understand the behaviour of `_create_lagged_data` when `is_training=False`. To see what I mean by this, please refer to the 'Understanding Behaviour of `_create_lagged_data` when `is_training=False`' section of `tabularization_experiments.ipynb`. For this example, why is `X = [15., 32., 61.]` and `Ts = [6]`? In particular, the `61` value contributed by `future_series` to `X` is only `-1` lags away from the `Ts = 6` value, not `-3` lags away?
- `_create_lagged_data` has an outer `for` loop over all of the specified `target_series`. What's the precise reason for this? Is there any assumption we can make about the shapes of the timeseries it's iterating over? In particular, would it be possible to first concatenate all the specified timeseries together, create the lagged variables, and then `np.split` the result at the end? (A toy illustration of this follows after this list.)
- If `past_series`, `future_series`, and `target_series` are all sampled at the same frequency, using something like `numpy.lib.stride_tricks.sliding_window_view` would probably be even faster than what I've written here, so that might be worth looking at.
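On the `np.split` question above, a toy illustration of the current per-series loop and the bookkeeping a concatenate-then-split approach would need (assumed shapes, not Darts code):

```python
import numpy as np

series_list = [np.arange(5.0), np.arange(10.0, 17.0)]   # two toy target series
lags = np.asarray([-2, -1])
max_lag = 2

# Today: one lagged 'block' is built per series in an outer loop.
blocks, sizes = [], []
for vals in series_list:
    t = np.arange(max_lag, len(vals))
    blocks.append(vals[t[:, None] + lags])
    sizes.append(len(t))

# The open question: could the lagging be done once on a concatenated
# array? The catch is that lag windows must not straddle two series, so
# the split points below would have to be tracked either way.
X_all = np.vstack(blocks)
X_per_series = np.split(X_all, np.cumsum(sizes)[:-1])
assert [b.shape for b in X_per_series] == [(3, 2), (5, 2)]
```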
Any comments/feedback on what I've done here would be very welcome - thanks in advance for any help.
Cheers,
Matt.