Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/four theta #123

Merged
merged 49 commits into from
Jul 20, 2020
Merged
Show file tree
Hide file tree
Changes from 12 commits
Commits
Show all changes
49 commits
Select commit Hold shift + click to select a range
01e1837
feat(4Theta): naive implementation of 4Theta model
Droxef Jul 3, 2020
f6831be
fix(theta): avoid NaN values in theta, and unnecessary season test
Droxef Jul 3, 2020
53ceeab
feat(gridsearch): add possibility to compare with model.fitted_values
Droxef Jul 3, 2020
7ae0803
feat(4theta): add a method to auto select best model
Droxef Jul 3, 2020
4001009
Merge branch 'develop' of https://github.com/unit8co/darts into feat/…
Droxef Jul 3, 2020
264ddf3
refactor(4Theta): Specify univariate model
Droxef Jul 3, 2020
e5bc790
style(4Theta): Fix linter
Droxef Jul 3, 2020
e951554
style(4Theta): fix docstring
Droxef Jul 3, 2020
f596476
style(4Theta): Fix docstring
Droxef Jul 3, 2020
b712c7d
style(4Theta): Change link
Droxef Jul 3, 2020
dee5b0d
style(4Theta): Correct docstring
Droxef Jul 3, 2020
ba64351
style(4theta): correct ticks in docstring
Droxef Jul 6, 2020
36d99fe
Merge branch 'develop' of https://github.com/unit8co/darts into feat/…
Droxef Jul 7, 2020
1dc5bb3
Merge branch 'develop' of https://github.com/unit8co/darts into feat/…
Droxef Jul 7, 2020
4d24f75
refactor(4Theta): change different modes verification and add Enum
Droxef Jul 7, 2020
240cea4
refactor(theta): replace all string modes by Enum
Droxef Jul 8, 2020
7d5dffa
test(backtesting): Add a test to verify if fitted_values exist
Droxef Jul 8, 2020
b19cf97
Fix(Theta): Correct all Enums
Droxef Jul 8, 2020
bbd58cc
fix(Theta): compare with enum members value instead. Correct some min…
Droxef Jul 8, 2020
71aa8d3
fix(4theta): move the creation of enums in init file
Droxef Jul 8, 2020
8286062
test(4theta): Add 4Theta to autoregressive test. Move Enums to top in…
Droxef Jul 8, 2020
af69e63
test(4theta): Add 4Theta specific test
Droxef Jul 8, 2020
97f6956
style(backtesting): fix lint
Droxef Jul 8, 2020
97cff2a
test(4theta): Add another exception to test
Droxef Jul 9, 2020
f6df487
ref(4Theta): mode.fitted_values is now a TimeSeries to be consistent
Droxef Jul 13, 2020
2bad4f7
style(Theta): rename mode to season_mode to be consistent w/ FourTheta
Droxef Jul 13, 2020
49fafda
Merge branch 'develop' of https://github.com/unit8co/darts into feat/…
Droxef Jul 13, 2020
fe1b320
docs(thetas): correct errors in the different docs
Droxef Jul 13, 2020
fc04e68
refactor(4Theta): Correct
Droxef Jul 13, 2020
cae49f6
refactor(backtesting): add a 'use_fitted_values' parameter
Droxef Jul 13, 2020
27b9143
fix(4theta): correct select_best_model
Droxef Jul 13, 2020
2fa7f4f
test(4Theta): add a test for zero mean and correct others
Droxef Jul 13, 2020
5ed92bd
Merge branch 'develop' of https://github.com/unit8co/darts into feat/…
Droxef Jul 14, 2020
a2b64e4
style(backtesting): linter formatting
Droxef Jul 14, 2020
98d1b7a
refactor(4Theta): change Enums names, correct theta and backtesting docs
Droxef Jul 16, 2020
f405e8b
refactor(4theta): move creation of fitted_values timeseries to backte…
Droxef Jul 16, 2020
9cad7cf
Merge branch 'develop' of https://github.com/unit8co/darts into feat/…
Droxef Jul 16, 2020
1e6be8e
refactor(statistics): include Enums in extract and remove functions
Droxef Jul 16, 2020
2f9bc5b
refactor(4Theta): check earlier if univariate
Droxef Jul 16, 2020
a405e47
test(4Theta): correct backtesting and test best_model
Droxef Jul 16, 2020
25f3de2
test(4Theta): add new modes in test models
Droxef Jul 16, 2020
e15f52b
docs(4theta): Add a disclaimer for 4theta performance
Droxef Jul 16, 2020
61c0f17
refactor(Theta): change theta to have the same behavior as FourTheta
Droxef Jul 16, 2020
e63e215
examples(darts-intro): modify notebook to give the same results
Droxef Jul 16, 2020
be1427d
style(4Theta): correct deprecation warning for logger.warn
Droxef Jul 16, 2020
34a8c41
Merge branch 'develop' into feat/FourTheta
hrzn Jul 17, 2020
d8e890f
style(4theta): move comment to backtesting
Droxef Jul 20, 2020
e02d65d
Merge branch 'feat/FourTheta' of https://github.com/unit8co/darts int…
Droxef Jul 20, 2020
8e92eba
Merge branch 'develop' of https://github.com/unit8co/darts into feat/…
Droxef Jul 20, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 8 additions & 3 deletions darts/backtesting/backtesting.py
Original file line number Diff line number Diff line change
Expand Up @@ -80,7 +80,7 @@ def backtest_forecasting(series: TimeSeries,

# build the prediction times in advance (to be able to use tqdm)
pred_times = [start]
while pred_times[-1] <= last_pred_time:
while pred_times[-1] < last_pred_time:
pred_times.append(pred_times[-1] + series.freq())

# what we'll return
Expand Down Expand Up @@ -313,15 +313,16 @@ def backtest_gridsearch(model_class: type,

Parameters
----------
model
model_class
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I let this slip through, thanks!

The ForecastingModel subclass to be tuned for 'series'.
parameters
A dictionary containing as keys hyperparameter names, and as values lists of values for the
respective hyperparameter.
train_series
The univariate TimeSeries instance used for training (and also validation in split mode).
test_series
val_series
The univariate TimeSeries instance used for validation in split mode.
Use `train` to compare with model.fitted_values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is model.fitted_values for models that contain it, and why would it be useful to use it as a validation set in the grid search? I'm asking because this adds a corner case and increases the complexity (it relies on a convention that model.fitted_values represents a relevant time series for validation, and it abuses the type of val_series), so I wonder if that's a good idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

model.fitted_values are the values obtain when fitting a model (ExponentialSmoothing in our case) on a time series, the result of the fit (a TimeSeries with same index as the train time series). It is a quick way to assess the model (we don't have to compute the predictions each time), but I understand that we cannot see if the fitted model underfit or overfit and will give bad predictions.
It is a trade off between computation time and efficiency. Thus, it can be dropped if you think it is not a good thing to have.
I mainly used this version for the M4 benchmark because the original script used this method to select the best model.

fcast_horizon_n
The integer value of the forecasting horizon used in expanding window mode.
num_predictions:
Expand Down Expand Up @@ -360,6 +361,10 @@ def backtest_gridsearch(model_class: type,
if val_series is None: # expanding window mode
backtest_forecast = backtest_forecasting(train_series, model, backtest_start_time, fcast_horizon_n)
error = metric(backtest_forecast, train_series)
elif val_series == 'train':
model.fit(train_series)
# Use ndarray because casting to TimeSeries takes too much time
error = metric(model.fitted_values, train_series.univariate_values())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, FourTheta is the only model that possesses the attribute fitted_values. So it would be good to have a check here to test whether the given model supports this functionality, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least all ExponentialSmoothing have a fitted attribute that can be retrieved (So the current theta model could have one too, I think). If you approve this functionality, it might be interesting to add this attribute in other models.
But yes, a check is necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't respect the types. I think you should convert to TimeSeries here, because if the function metric() is changed later on to use any of the TimeSeries attribute this code will fail. You can just build a dummy time axis. I think in general correctness is more important than speed.

else: # split mode
model.fit(train_series)
error = metric(model.predict(len(val_series)), val_series)
Expand Down
223 changes: 219 additions & 4 deletions darts/models/theta.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
"""

import math
from typing import Optional
from typing import Optional, List

import numpy as np
import statsmodels.tsa.holtwinters as hw
Expand Down Expand Up @@ -78,7 +78,8 @@ def fit(self, series: TimeSeries, component_index: Optional[int] = None):
self.is_seasonal, self.season_period = check_seasonality(ts, self.season_period, max_lag=max_lag)
logger.info('Theta model inferred seasonality of training series: {}'.format(self.season_period))
else:
self.is_seasonal = True # force the user-defined seasonality to be considered as a true seasonal period.
# force the user-defined seasonality to be considered as a true seasonal period.
self.is_seasonal = self.season_period > 1

new_ts = ts

Expand All @@ -88,13 +89,13 @@ def fit(self, series: TimeSeries, component_index: Optional[int] = None):
new_ts = remove_seasonality(ts, self.season_period, model=self.mode)

# SES part of the decomposition.
self.model = hw.SimpleExpSmoothing(new_ts.values()).fit()
self.model = hw.SimpleExpSmoothing(new_ts.values()).fit(initial_level=0.2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the effect of this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with SES in statsmodels is that alpha is not constrained between [0.1, 0.99] as it should, thus giving NaN values when alpha is 0.
Setting the initial_level to 0.2 (or anything else) avoided all encountered cases where the optimization gave alpha=0.
It is more a hotfix rather than an actual solution before statsmodels corrects the problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it leads to different results. I will add it only in the case alpha is 0, and it will recompute it


# Linear Regression part of the decomposition. We select the degree one coefficient.
b_theta = np.polyfit(np.array([i for i in range(0, self.length)]), (1.0 - self.theta) * new_ts.values(), 1)[0]

# Normalization of the coefficient b_theta.
self.coef = b_theta / (2.0 - self.theta)
self.coef = b_theta / (2.0 - self.theta) # change to b_theta / (-self.theta) if classical theta
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you manage to figure out why is the formula written like this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mathematically, I'm still not quite sure about the rest, but I am sure it is correct for theta = 0 and 1.
And the definition of the theta lines are not quite the same, thus the confusing theta.

But comparing with the classical theta (4theta with default parameter), we have the exact same results if we change the normalization by -1/theta. (because there is a symmetry: 2-theta_1 = theta_2)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we unify the meaning of theta? E.g. by changing to b_theta / (-self.theta) and adapting the default value to theta=2 ?


self.alpha = self.model.params["smoothing_level"]

Expand Down Expand Up @@ -126,3 +127,217 @@ def predict(self, n: int) -> 'TimeSeries':

def __str__(self):
return 'Theta({})'.format(self.theta)


class FourTheta(UnivariateForecastingModel):
def __init__(self,
theta: int = 0,
seasonality_period: Optional[int] = None,
model_mode: str = 'multiplicative',
season_mode: str = 'multiplicative',
trend_mode: str = 'linear'):
"""
An implementation of the 4Theta method with configurable `theta` parameter.

See M4 competition `solution <https://github.com/Mcompetitions/M4-methods/blob/master/4Theta%20method.R>`_.

The training time series is de-seasonalized according to `seasonality_period`,
or an inferred seasonality period.

This model is similar to Theta, with `theta` = 2-theta, `model_mode` = additive and `trend_mode` = linear.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is important, but doesn't make it clear what is what. For instance, when reading only this sentence it's not clear whether FourTheta() is similar to Theta(trend_model="linear") or FourTheta(trend_mode="linear") = Theta().

You could reformulate, e.g. something like

When called with `theta = 2 - X`, `model_mode = "additive"` and `trend_mode = "linear"`, this model is equivalent to `Theta(theta=X)`.


Parameters
----------
theta
Value of the theta parameter. Defaults to 2.
If theta = 1, then the theta method restricts to a simple exponential smoothing (SES)
seasonality_period
User-defined seasonality period. If not set, will be tentatively inferred from the training series upon
calling `fit()`.
model_mode
Type of model combining the Theta lines. Either "additive" or "multiplicative".
season_mode
Type of seasonality. Either "additive" or "multiplicative".
trend_mode
Type of trend to fit. Either "linear" or "exponential".
"""

super().__init__()

self.model = None
self.drift = None
self.coef = 1
self.mean = 1
self.length = 0
self.theta = theta
self.is_seasonal = False
self.seasonality = None
self.season_period = seasonality_period
self.model_mode = model_mode
self.season_mode = season_mode
self.trend_mode = trend_mode
self.wses = 0 if self.theta == 0 else (1 / theta)
self.wdrift = 1 - self.wses
self.fitted_values = None

# Remark on the values of the theta parameter:
# - if theta = 1, then the theta method restricts to a simple exponential smoothing (SES)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is the same comment in the Theta model. But I don't understand how that's possible given that one can be the same as the other with 2-theta. Is it correct to leave it here?
Also: I would move this (helpful) comment to the docstring.

# - if theta = 0, then the theta method restricts to a simple `trend_mode` regression.

def fit(self, ts, component_index: Optional[int] = None):
super().fit(ts, component_index)

self.length = len(ts)
# normalization of data
self.mean = ts.mean()
new_ts = ts / self.mean

# Check for statistical significance of user-defined season period
# or infers season_period from the TimeSeries itself.
if self.season_period is None:
max_lag = len(ts) // 2
self.is_seasonal, self.season_period = check_seasonality(ts, self.season_period, max_lag=max_lag)
logger.info('Theta model inferred seasonality of training series: {}'.format(self.season_period))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FourTheta model

else:
# force the user-defined seasonality to be considered as a true seasonal period.
self.is_seasonal = self.season_period > 1

# Store and remove seasonality effect if there is any.
if self.is_seasonal:
_, self.seasonality = extract_trend_and_seasonality(new_ts, self.season_period, model=self.season_mode)
new_ts = remove_seasonality(new_ts, self.season_period, model=self.season_mode)

if (new_ts <= 0).values.any():
self.model_mode = 'additive'
self.trend_mode = 'linear'
logger.warn("Time series has negative values. Fallback to additive and linear model")

# Drift part of the decomposition
if self.trend_mode == 'linear':
linreg = new_ts.values()
elif self.trend_mode == 'exponential':
linreg = np.log(new_ts.values())
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be less code and more coherent with other implementations if we check for either linear or exponential with assert and throw an exception if none of the values is selected.

self.trend_mode = 'linear'
linreg = new_ts.values()
logger.warn("Unknown value for trend. Fallback to linear.")
self.drift = np.poly1d(np.polyfit(np.arange(self.length), linreg, 1))
theta0_in = self.drift(np.arange(self.length))
if self.trend_mode == 'exponential':
theta0_in = np.exp(theta0_in)

if self.model_mode == 'additive':
theta_t = self.theta * new_ts.values() + (1 - self.theta) * theta0_in
elif self.model_mode == 'multiplicative' and (theta0_in > 0).all():
theta_t = (new_ts.values() ** self.theta) * (theta0_in ** (1 - self.theta))
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment as before, we can assert at the beginning and throw an exception

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a first draft, I blindly replicated the step of the original alorithm. So I will change it.
But we still need to check if all values are positive if 'multiplicative' models are chosen or else it will fail.
I can either raise an error, or simply keep the fallback as it is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a more optimized implementation, like our theta. But we need to do the math to find it

self.model_mode = 'additive'
theta_t = self.theta * new_ts.values() + (1 - self.theta) * theta0_in
logger.warn("Negative Theta line. Fallback to additive model")

# SES part of the decomposition.
self.model = hw.SimpleExpSmoothing(theta_t).fit()
theta2_in = self.model.fittedvalues

if self.model_mode == 'additive':
self.fitted_values = self.wses * theta2_in + self.wdrift * theta0_in
elif self.model_mode == 'multiplicative' and (theta2_in > 0).all():
self.fitted_values = theta2_in**self.wses * theta0_in**self.wdrift
else:
# Fallback to additive model
self.model_mode = 'additive'
theta_t = self.theta * new_ts.values() + (1 - self.theta) * theta0_in
self.model = hw.SimpleExpSmoothing(theta_t).fit()
theta2_in = self.model.fittedvalues
self.fitted_values = self.wses * theta2_in + self.wdrift * theta0_in
logger.warn("Negative Theta line. Fallback to additive model")

if self.is_seasonal:
if self.season_mode == 'additive':
self.fitted_values += self.seasonality.values()
elif self.season_mode == 'multiplicative':
self.fitted_values *= self.seasonality.values()
self.fitted_values *= self.mean
# takes too much time to create a time series for fitted_values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the overhead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, could you add a comment explaining what fitted_values represent?

# self.fitted_values = TimeSeries.from_times_and_values(ts.time_index(), self.fitted_values)

def predict(self, n: int) -> 'TimeSeries':
super().predict(n)

# Forecast of the SES part.
forecast = self.model.forecast(n)

# Forecast of the Linear Regression part.
drift = self.drift(np.arange(self.length, self.length + n))
if self.trend_mode == 'exponential':
drift = np.exp(drift)

if self.model_mode == 'additive':
forecast = self.wses * forecast + self.wdrift * drift
elif self.model_mode == 'multiplicative':
forecast = forecast**self.wses * drift**self.wdrift
else:
raise_log(ValueError("model_mode cannot be {}".format(self.model_mode)))

# Re-apply the seasonal trend of the TimeSeries
if self.is_seasonal:

replicated_seasonality = np.tile(self.seasonality.pd_series()[-self.season_period:],
math.ceil(n / self.season_period))[:n]
if self.season_mode in ['multiplicative', 'mul']:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for simplicity I would also pick one of them, and also an ENUM that we pass here might be helpful to avoid typos https://docs.python.org/3/library/enum.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be sure I understand correctly, the different model arguments will accept an Enum value?
I can change the theta method as well in the same manner. Thus Enum classes will be declared exterior to the class

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this comment actually applies to both theta methods - instead of retyping "multiplicative"/"additive" string we can have an enum that would have 2 values - MULTIPLICATIVE or ADDITIVE and if you make a typo the IDE will tell you something is not matching (not the case for raw strings)

forecast *= replicated_seasonality
elif self.season_mode in ['additive', 'add']:
forecast += replicated_seasonality
else:
raise_log(ValueError("season_mode cannot be {}".format(self.season_mode)))

forecast *= self.mean

return self._build_forecast_series(forecast)

@staticmethod
def select_best_model(ts: TimeSeries, thetas: List[int] = None, m: Optional[int] = None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add type annotation for the return type?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, since you plan for the case thetas is None later, update thetas: Optional[List[int]]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I forgot about that. I will change it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will have to think where to put this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will depend on the refactoring of backtesting. If gridsearch become an attribute of ForcastingModel, it becomes possible to overload it here.

"""
Performs a grid search over all hyper parameters to select the best model.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps mention that it is using the fitted values on the training series here.


Parameters
----------
ts
The TimeSeries on which the model will be tested.
thetas
A list of thetas to loop over.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps mention the default values here.

m
Optionally, the season used to decompose the time series.
Returns
-------
theta
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you should have a type

The best performing model on the time series.
"""
# Only import if needed
from ..backtesting.backtesting import backtest_gridsearch
from sklearn.metrics import mean_absolute_error as mae
if thetas is None:
thetas = [1, 2, 3]
elif isinstance(thetas, int):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove this case, that's abusing the types, better let it crash IMO.

thetas = [thetas]
season_mode = ["additive", "multiplicative"]
model_mode = ["additive", "multiplicative"]
drift_mode = ["linear", "exponential"]
if (ts.values() <= 0).any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps add a log message if this happens?

drift_mode = ["linear"]
model_mode = ["additive"]
season_mode = ["additive"]

theta = backtest_gridsearch(FourTheta,
{"theta": thetas,
"model_mode": model_mode,
"season_mode": season_mode,
"trend_mode": drift_mode,
"seasonality_period": [m]
},
ts, val_series='train', metric=mae)
return theta

def __str__(self):
return '4Theta(theta:{}, curve:{}, model:{}, seasonality:{})'.format(self.theta, self.trend_mode,
self.model_mode, self.season_mode)