[BUG] timeseries split function doesn't work align with percentage properly #2063

hxuaj · 2023-11-10T07:59:24Z

Hi,
I was going through the Quickstart and ran into series.split_before(0.75). As the AirPassengersDataset, total 144 lines(time steps) in the dataset. Given 75% percent we can get the index to split, which is 108. From my perspective, split_before index 108 means, the first half serie will have time steps from 0 to 107, which means there will be 108 time steps in the first half.

However, the result of series.split_before(0.75) splits the first half from index 0 to index 106 without the time step at 107, which is counter-intuitive. I've spotted the line of code which lead to this "issue". In timeseries.py, point_index = int((len(self) - 1) * point). The split index is calculated by the length - 1 times point(0.75 in the case above), which is int((144 -1 ) * 0.75)=107. Then the rest of the job is just split at index 107 without time step 107.

I'm not sure if its a determined bug, since the rest of Quickstart code split at index 108 other than 107. For example, just below the series.split_before(0.75), there is series1, series2 = series[:-36], series[-36:]. I guess the intention of the author is to split the dataset with 75%(108 elements) and 25%(36 elements). But with series.split_before(0.75), the dateset is split into 74%(107 elements) and 26%(37 elements).

Hope someone could address this, since the usage of percentage of series.split_before(0.75) is not completely align with the fraction.

To Reproduce
Simply run to the 3rd code block in the official Quickstart.

System (please complete the following information):

Python version: 3.9.18
darts version: u8darts-all 0.26.0
OS: Linux

The text was updated successfully, but these errors were encountered:

madtoinou · 2023-11-20T08:15:36Z

Hi @hxuaj,

When the argument of split_before is a float, it correspond to the proportion of the series to be included in the first split (splitting point excluded, as described in the docstring), not the explicit relative position of the index used for the split. Since 75% of 144 is exactly 108, this splitting point is contained in the second part of the split, resulting of series of length 107 and 37 respectively.

I agree that the example is a bit counter-intuitive with respect to the next cell and I think that an easy way to clarify it would be to use split_after instead, to obtain identical splits in the two cells.

hxuaj added bug Something isn't working triage Issue waiting for triaging labels Nov 10, 2023

madtoinou removed the triage Issue waiting for triaging label Nov 15, 2023

madtoinou mentioned this issue Nov 24, 2023

Fix/update doc #2091

Merged

madtoinou closed this as completed in #2091 Nov 29, 2023

github-project-automation bot added this to darts Aug 29, 2024

github-project-automation bot moved this to Released in darts Aug 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] timeseries split function doesn't work align with percentage properly #2063

[BUG] timeseries split function doesn't work align with percentage properly #2063

hxuaj commented Nov 10, 2023

madtoinou commented Nov 20, 2023

[BUG] timeseries split function doesn't work align with percentage properly #2063

[BUG] timeseries split function doesn't work align with percentage properly #2063

Comments

hxuaj commented Nov 10, 2023

madtoinou commented Nov 20, 2023