Improvements for Scalers applied on multiple series #1288

Closed
maximilianreimer opened this issue Oct 13, 2022 · 9 comments
Labels
good first issue (Good for newcomers) · improvement (New feature or improvement)

Comments

@maximilianreimer

Describe the bug
If a Scaler is fitted on n_fitted sequences at once, it will always return only n_fitted sequences, not the number of sequences passed in.

To Reproduce

    import pandas as pd

    from darts import TimeSeries
    from darts.dataprocessing.transformers import Scaler

    scaler = Scaler()
    fitted_n = 2
    predicted_n = 3
    s = TimeSeries.from_times_and_values(
        pd.date_range("2022-01-01", "2022-01-10"), range(10)
    )

    scaler.fit_transform([s] * fitted_n)

    ss_scaled = scaler.transform([s] * predicted_n)
    ss_scaled_inverted = scaler.inverse_transform(ss_scaled)
    ss_inverted = scaler.inverse_transform([s] * predicted_n)

    assert len(ss_scaled) == predicted_n  # fails: length is fitted_n
    assert len(ss_scaled_inverted) == predicted_n  # fails: length is fitted_n
    assert len(ss_inverted) == predicted_n  # fails: length is fitted_n

Expected behavior
The Scaler should scale all series independently and return the same number of series as were passed in.

System (please complete the following information):

  • Python version: 3.7
  • darts version: 0.21.
@maximilianreimer maximilianreimer added bug Something isn't working triage Issue waiting for triaging labels Oct 13, 2022
@maximilianreimer
Author

maximilianreimer commented Oct 13, 2022

Ok, after some reading I think this might be intended behavior.

If so, I have a question / suggestion:

As I understand it now, if multiple TimeSeries are passed to fit a FittableDataTransformer (the Scaler is one), it effectively creates a separate "sub-FittableDataTransformer" for each position in the sequence. On transform, these sub-transformers are applied based on the position of each TimeSeries in the sequence.

This makes it impossible to run the FittableDataTransformer if only one TimeSeries is available (e.g. when a multi-series model like TFT is used in production). I would suggest selecting the "sub-FittableDataTransformer" based on the static_covariates, or switching from a Sequence to a Mappable as input.

And please, as a hot fix, add a warning if the length of the sequence of series passed to transform differs from the one seen during fitting. It took me ages to figure that out. The current behavior is to silently drop the extra series if the sequence is longer (a minimal sketch of such a check is shown below).
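
A minimal sketch of the kind of check meant here, assuming a hypothetical helper that is told how many series the transformer was fitted on (n_fitted); this is illustrative only, not the actual darts implementation:

    import warnings
    from typing import Sequence

    from darts import TimeSeries


    def warn_on_length_mismatch(n_fitted: int, series: Sequence[TimeSeries]) -> None:
        # Warn when transform() receives a different number of series than the
        # transformer was fitted on, instead of silently dropping the extras.
        if len(series) != n_fitted:
            warnings.warn(
                f"Transformer was fitted on {n_fitted} series but received "
                f"{len(series)}; only the first {min(n_fitted, len(series))} "
                "will be transformed."
            )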

@dennisbader
Collaborator

Hi @maximilianreimer, thanks for writing.

You are totally right. Our data transformers expect to receive the same input dimensions (and the same order of the list of time series, including their components) for fitting and transformation.

We should definitely raise a warning (or even an exception?) if there is a mismatch in dimensions.

I don't quite follow what the issue is with using transformers in production. Can't you fit/transform a new transformer on only the series that are available?

Regarding "sub-FittableDataTransformer":

  • we can't rely on static covariates because not all time series have static covariates
  • the mappable could be interesting, what do you think @hrzn ?

@maximilianreimer
Author

Regarding the production issue: let's say I have n areas I want to forecast electricity prices for. At training time I have a sequence of price time series that I pass through my target pipeline and model, but in production I might get a request to predict just one specific series. Model-wise that's not a problem, but how do I use the Scaler in this case?

from darts import TimeSeries
from darts.dataprocessing.transformers import Scaler
from darts.models import TFTModel

# Training time
# each series has static covariates so that the TFTModel can learn to predict
# differently for different areas
train_series = [
    series_area_1,
    series_area_2,
    ...
    series_area_n,
]

target_pipeline = Scaler()
model = TFTModel(input_chunk_length=..., output_chunk_length=...)

train_transformed = target_pipeline.fit_transform(train_series)
model.fit(train_transformed)

# In production
# Request: predict n time steps for area 5 for the next week
historical_data_area_5: TimeSeries = ...

# I would like to run
pred_transformed = target_pipeline.transform(historical_data_area_5)  # won't work with just one series
predicted = model.predict(7, pred_transformed)
predicted_rescaled = target_pipeline.inverse_transform(predicted)  # won't work with just one series

@dennisbader
Collaborator

Do you have all historical data for the specific series at prediction time? If so, then you can fit/transform with a new scaler just on this single series as you did before training.

@maximilianreimer
Author

maximilianreimer commented Oct 14, 2022

So you are suggesting to train a different Scaler for each series? Or one joint Scaler for training and individual ones for prediction time afterwards?

@dennisbader
Collaborator

One joint Scaler for training (which should come with a performance boost compared to multiple single Scalers), and afterwards an individual one.
If you have the historical data of the series of interest at prediction time:

  • you split the series at the same time step that you used for training
  • you fit a new Scaler on the left side of the split -> like this you get the same transform() output as with the joint Scaler
  • you can transform any part of the series with this Scaler and use it for prediction (a sketch of this workflow follows below)
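
A minimal sketch of this workflow, reusing names from the snippet above (train_series, historical_data_area_5, model); training_cutoff is a hypothetical placeholder for the last timestamp of the training period:

    from darts.dataprocessing.transformers import Scaler

    # Training: one joint Scaler fitted on all series at once
    joint_scaler = Scaler()
    train_transformed = joint_scaler.fit_transform(train_series)

    # Prediction: a new Scaler for the single series of interest, fitted only
    # on the part of that series that was seen during training
    train_part, _ = historical_data_area_5.split_after(training_cutoff)
    single_scaler = Scaler()
    single_scaler.fit(train_part)

    # This Scaler scales area 5 the same way the joint Scaler did, so it can be
    # used both to transform the model input and to inverse-transform the forecast
    pred_input = single_scaler.transform(historical_data_area_5)
    predicted = model.predict(7, series=pred_input)
    predicted_rescaled = single_scaler.inverse_transform(predicted)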

@cristof-r
Contributor

What approach would you recommend if we don't have the complete historical data (e.g., only the necessary data for input_chunk_length) at prediction time?

@hrzn
Contributor

hrzn commented Oct 30, 2022

> Hi @maximilianreimer, thanks for writing.
>
> You are totally right. Our data transformers expect to receive the same input dimensions (and the same order of the list of time series, including their components) for fitting and transformation.
>
> We should definitely raise a warning (or even an exception?) if there is a mismatch in dimensions.
>
> I don't quite follow what the issue is with using transformers in production. Can't you fit/transform a new transformer on only the series that are available?
>
> Regarding "sub-FittableDataTransformer":
>
>   • we can't rely on static covariates because not all time series have static covariates
>   • the mappable could be interesting, what do you think @hrzn ?

+1 for raising an exception if the number doesn't match, that's a good point.
Supporting a mappable could be a good idea too. We should still support sequences as well though, so it would come on top. Would you be interested in contributing, @maximilianreimer? Even just raising an exception would be a first step; we would be happy to receive a PR.

@hrzn hrzn changed the title [BUG] Scalar always only return number of Sequence used during fitting Improvements for Scalers applied on multiple series Oct 30, 2022
@hrzn hrzn added good first issue Good for newcomers improvement New feature or improvement and removed bug Something isn't working triage Issue waiting for triaging labels Oct 30, 2022
@madtoinou
Collaborator

madtoinou commented Mar 22, 2023

This is solved by #1409, which implements a mapping between the fit() and the transform() series and raises an error in case of a mismatch (code snippet).
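
A minimal illustration of that behaviour, reusing the reproduction snippet from the top of this issue (the exact exception type is an assumption):

    scaler = Scaler()
    scaler.fit([s] * 2)

    try:
        scaler.transform([s] * 3)  # more series than the Scaler was fitted on
    except Exception as e:
        # Instead of silently transforming only the first two series, the
        # mismatch in the number of series now raises an error.
        print(e)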
