Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] timeseries split function doesn't work align with percentage properly #2063

Closed
hxuaj opened this issue Nov 10, 2023 · 1 comment · Fixed by #2091
Closed

[BUG] timeseries split function doesn't work align with percentage properly #2063

hxuaj opened this issue Nov 10, 2023 · 1 comment · Fixed by #2091
Labels
bug Something isn't working

Comments

@hxuaj
Copy link

hxuaj commented Nov 10, 2023

Hi,
I was going through the Quickstart and ran into series.split_before(0.75). As the AirPassengersDataset, total 144 lines(time steps) in the dataset. Given 75% percent we can get the index to split, which is 108. From my perspective, split_before index 108 means, the first half serie will have time steps from 0 to 107, which means there will be 108 time steps in the first half.

However, the result of series.split_before(0.75) splits the first half from index 0 to index 106 without the time step at 107, which is counter-intuitive. I've spotted the line of code which lead to this "issue". In timeseries.py, point_index = int((len(self) - 1) * point). The split index is calculated by the length - 1 times point(0.75 in the case above), which is int((144 -1 ) * 0.75)=107. Then the rest of the job is just split at index 107 without time step 107.

I'm not sure if its a determined bug, since the rest of Quickstart code split at index 108 other than 107. For example, just below the series.split_before(0.75), there is series1, series2 = series[:-36], series[-36:]. I guess the intention of the author is to split the dataset with 75%(108 elements) and 25%(36 elements). But with series.split_before(0.75), the dateset is split into 74%(107 elements) and 26%(37 elements).

Hope someone could address this, since the usage of percentage of series.split_before(0.75) is not completely align with the fraction.

To Reproduce
Simply run to the 3rd code block in the official Quickstart.

System (please complete the following information):

  • Python version: 3.9.18
  • darts version: u8darts-all 0.26.0
  • OS: Linux
@hxuaj hxuaj added bug Something isn't working triage Issue waiting for triaging labels Nov 10, 2023
@madtoinou madtoinou removed the triage Issue waiting for triaging label Nov 15, 2023
@madtoinou
Copy link
Collaborator

Hi @hxuaj,

When the argument of split_before is a float, it correspond to the proportion of the series to be included in the first split (splitting point excluded, as described in the docstring), not the explicit relative position of the index used for the split. Since 75% of 144 is exactly 108, this splitting point is contained in the second part of the split, resulting of series of length 107 and 37 respectively.

I agree that the example is a bit counter-intuitive with respect to the next cell and I think that an easy way to clarify it would be to use split_after instead, to obtain identical splits in the two cells.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Released
Development

Successfully merging a pull request may close this issue.

2 participants