
pytorch-forecasting and dsipts v2 API design #39

Open · wants to merge 10 commits into base: main

Conversation

fkiraly
Contributor

@fkiraly fkiraly commented Jan 31, 2025

An enhancement proposal and design document towards v2 of pytorch-forecasting and dsipts.

More context here: sktime/pytorch-forecasting#1736

@felipeangelimvieira
Contributor

(bringing the discussion from discord to here)
Really interesting, Franz... I’ll post a more detailed comment here once I’ve organized my thoughts. The first question that comes to mind is: could we condense the information in the output of __getitem__ of BaseTSDataset into fewer objects?

@felipeangelimvieira
Contributor

About static_features: not all deep-learning TS models have static features. Example: N-BEATS. At the same time, I believe this is something really relevant to TFT, for example.

Given that, should this be positioned at "model" level instead?

@fkiraly
Contributor Author

fkiraly commented Feb 3, 2025

The first question that comes to mind is: could we condense the information in the output of __getitem__ of BaseTSDataset into fewer objects?

Yes, that is also what @thawn said, let's try to have as few tensors as possible.

My answer: perhaps - but it is hard to actually write it down.

Some comments:

  • a lot of the fields are metadata, about properties of variables (future-known or not, categorical/numerical, etc)
  • we need at least 2 tensors if we have only past data, and 3 if we have future data, simply given the native "shapes" of static, future and past data. Tensors can only be rectangular, and it is not possible to fit all that data into fewer than 3 because of the shapes (see the sketch below).
  • separating index is a conscious choice, otherwise you have an additional metadata field. And you are open to a future modification where you encode a RangeIndex or period index more efficiently.
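
For illustration only (not part of the proposal), a quick sketch of the shape argument with made-up sizes:

import torch

n_timepoints, n_timepoints_future = 48, 12
n_targets, n_features, n_static = 1, 3, 2

past = torch.zeros(n_timepoints, n_targets + n_features)    # past targets and past features
future = torch.zeros(n_timepoints_future, n_features)       # future-known features only
static = torch.zeros(n_static)                               # one value per static variable

# the three blocks have incompatible shapes, so as soon as future data is present,
# at least three rectangular tensors are needed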

@fkiraly
Contributor Author

fkiraly commented Feb 3, 2025

About static_features: not all deep-learning TS models have static features. Example: N-BEATS. At the same time, I believe this is something really relevant to TFT, for example.

Given that, should this be positioned at "model" level instead?

I think not, since it is a "data" concern:

  • The information on which variable is static is a data property, and the values of the static variables are data (not a model part or modelling choice)
  • the "minimal representation" of static variables involves a tensor of smaller shape (n_instances x n_static_variables) - 2D and not 3D (instance, variable, time)
  • the corresponding "model" concern is "handles static variables", which is a capability-type property. For a unified interface, all models need to be able to take a dataset with static variables; the ones without the capability will just ignore them.

@felipeangelimvieira
Contributor

What I'm thinking about is that static variables are a specific kind of past and future known exogenous variables, that are constant throughout time. In this sense, should we really separate them? I feel they are essentially the same in a "meta" level. The difference is how the model wants to use them.

One example: we could use Fourier features as static variables in TFT. I'm not sure there is anything that would prevent us from using them that way, since these features are known for every timepoint in past and future.

@felipeangelimvieira
Contributor

More thoughts:

  • Couldn't __getitem__ return just the relevant data, and metadata be stored in attributes self.y_cols_, self.x_cols_ of the dataset?
  • Should we have a BaseTorchTransformer that defines an interface for transformations (such as normalization of past values in a batch) at torch level? (replacing current encoders)
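
For the second bullet, a minimal sketch of what such a batch-level transformer interface could look like - all names here are hypothetical, not an agreed design:

import torch

class BaseTorchTransformer:
    """Hypothetical interface for tensor-native, batch-level transformations."""

    def fit(self, x: torch.Tensor) -> "BaseTorchTransformer":
        return self

    def transform(self, x: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError

    def inverse_transform(self, x: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError


class BatchStandardNormalizer(BaseTorchTransformer):
    """Example: normalize past values per sample, over the time dimension."""

    def fit(self, x: torch.Tensor) -> "BatchStandardNormalizer":
        # x: (batch, time, features)
        self.mean_ = x.mean(dim=1, keepdim=True)
        self.std_ = x.std(dim=1, keepdim=True).clamp_min(1e-8)
        return self

    def transform(self, x: torch.Tensor) -> torch.Tensor:
        return (x - self.mean_) / self.std_

    def inverse_transform(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.std_ + self.mean_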

@fkiraly
Contributor Author

fkiraly commented Feb 4, 2025

What I'm thinking about is that static variables are a specific kind of past and future known exogenous variables, that are constant throughout time. In this sense, should we really separate them? I feel they are essentially the same in a "meta" level. The difference is how the model wants to use them.

To me, this is not correct, on the conceptual level.

The formal model I typically adopt is that all data points are indexed. Time-varying (or "dynamic") variables change over time, and observations (entries, values, etc) are indexed by: (instance index, variable name, time index). Whereas, static variables are only indexed by (instance index, variable name). If you adopt the usual model of hierarchical data, where a combination of indices indexes an observation, it becomes clear that static variables live at different "levels" in the indexing hierarchy.

Mapping static variables onto a column at the lower level (plus time index) makes them constant only as a side effect of the representation, rather than as a conceptual feature.
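
To make the two index levels concrete, a small illustrative pandas example (made-up data):

import pandas as pd

# dynamic observations: indexed by (instance, time)
dynamic = pd.DataFrame(
    {"demand": [10.0, 12.0, 7.0, 8.0]},
    index=pd.MultiIndex.from_tuples(
        [("store_A", 0), ("store_A", 1), ("store_B", 0), ("store_B", 1)],
        names=["instance", "time"],
    ),
)

# static observations: indexed by instance only - one level higher in the hierarchy
static = pd.DataFrame(
    {"region": ["north", "south"], "floor_area": [120.0, 80.0]},
    index=pd.Index(["store_A", "store_B"], name="instance"),
)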

@fkiraly
Contributor Author

fkiraly commented Feb 4, 2025

  • Couldn't __getitem__ return just the relevant data, and metadata be stored in attributes self.y_cols_, self.x_cols_ of the dataset?

There are two ideas here, I believe: (a) nesting metadata, and (b) returning metadata not by __getitem__ but another API point.

I am very open to (a) and think this is perhaps a good idea. Regarding (b), could you explain how the retrieval of the metadata would work? The __getitem__ return seems to be the single intended interface point, if you go through the entire lightning workflow, e.g., DataLoader does not assume anything beyond __getitem__ (if I understand correctly). If you disagree, I would appreciate an explanation of how the information in the suggested attributes will reach the model, if it needs to.

Should we have a BaseTorchTransformer that defines an interface for transformations (such as normalization of past values in a batch) at torch level? (replacing current encoders)

Yes, that would be a great idea - do you want to draw up a design? And, surely this exists elsewhere, too?

@felipeangelimvieira
Contributor

felipeangelimvieira commented Feb 4, 2025

The formal model I typically adopt is that all data points are indexed. Time-varying (or "dynamic") variables change over time, and observations (entries, values, etc) are indexed by: (instance index, variable name, time index). Whereas, static variables are only indexed by (instance index, variable name). If you adopt the usual model of hierarchical data, where a combination of indices indexes an observation, it becomes clear that static variables live at different "levels" in the indexing hierarchy.

I understand your point that, since they can be indexed by a different number of indices while keeping the information, they are, and can be classified as, different things. Nevertheless, I believe that the challenge here is projecting this higher-dimensional space of concepts (e.g. classifying variables by instance, variable name, time index, discrete or continuous, etc.) into a lower-dimensional space that still makes the task (implementing different NN-based forecasters) feasible.

In this regard, the separation of data into 1. target variable, 2. features known only in the past, 3. features known in past and future is possibly the lowest-dimensional representation, and that can be further subdivided into other categories/concepts, such as features known in past and future that are static and do not change over time (we can imagine it as a tree).

Mapping static variables onto a column at the lower level (plus time index) makes them constant only as a side effect of the representation, rather than as a conceptual feature.

I am thinking in the opposite direction: static variables are features, they belong to this "feature" concept: relevant data other than the target variable used as input to predict the target variable. They are a special subset of features that do not change over time, but this does not place them outside the set of features. In particular, being part of this subset allows them to be represented with fewer index levels, as you mentioned.

Some models, such as DeepAR, may still want to use "static variables" as exogenous (I believe this is the case, by inspecting the implementation and the property self.reals inherited from the parent BaseModelWithCovariates). So the differentiation of static and non-static is actually model-dependent. TFT, on the other hand, uses static variables in a particular way.

@felipeangelimvieira
Contributor

felipeangelimvieira commented Feb 4, 2025

I am very open to (a) and think this is perhaps a good idea. Regarding (b), could you explain how the retrieval of the metadata would work? The __getitem__ return seems to be the single intended interface point, if you go through the entire lightning workflow, e.g., DataLoader does not assume anything beyond __getitem__ (if I understand correctly). If you disagree, I would appreciate an explanation of how the information in the suggested attributes will reach the model, if it needs to.

Good point... If this is a direction we would like to go, I believe there would possibly be two solutions:

  • Create a custom dataloader
  • Create a custom Tensor (inheriting from PyTorch's tensor, that keeps the metadata of the features).

@fkiraly
Contributor Author

fkiraly commented Feb 5, 2025

Some models, such as DeepAR, may still want to use "static variables" as exogenous (I believe this is the case, by inspecting the implementation and the property self.reals inherited from the parent BaseModelWithCovariates). So the differentiation of static and non-static is actually model-dependent. TFT, on the other hand, uses static variables in a particular way.

But, this is not an intrinsic property of the static variables. It is how the model chooses to use them. So it is not a data property but a modelling choice, and specifically a "reduction" choice where we reformat the data (like in the reduction forecaster, just a different type of making the data "rectangular-formatted").

Therefore, with the domain-driven design philosophy, the choice of reformatting the static variables falls with the model rather than the data layer.
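
To illustrate that reduction step (a sketch, not a proposed API): a model that wants to treat static variables as ordinary covariates can broadcast them over the time axis inside its own layers, leaving the data layer untouched:

import torch

def statics_as_covariates(x: torch.Tensor, st: torch.Tensor) -> torch.Tensor:
    """Concatenate static features onto the time-varying feature tensor.

    x:  (batch, time, n_features)   time-varying covariates
    st: (batch, n_static)           static covariates
    """
    st_expanded = st.unsqueeze(1).expand(-1, x.size(1), -1)  # (batch, time, n_static)
    return torch.cat([x, st_expanded], dim=-1)

# usage sketch
x = torch.randn(4, 24, 3)
st = torch.randn(4, 2)
assert statics_as_covariates(x, st).shape == (4, 24, 5)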

Create a custom dataloader

I wanted to do that too but was voted down and convinced by good arguments in the Jan 20 meeting, reference: sktime/pytorch-forecasting#1736
(open the dropdown and search for dataloader, for details)

The key argument for me was, current vignettes in all packages reviewed rely on the default DataLoader, and details seem unclear on how this would interact with model and lightning layer. Naive approaches seem to lead to coupling throughout.

Create a custom Tensor (inheriting from PyTorch's tensor, that keeps the metadata of the features).

I think that is a hypothetical possibility but a bad idea. We would be rewriting layers outside the scope of the package.

I still have nightmares from our attempts to rewrite DataFrame for sktime.
https://github.com/sktime/data-container

@prockenschaub did great work, and I think our designs would have been successful in the end - it just turned out we were getting pulled into the scope of rewriting pandas from scratch, with more resources required than we had (focusing on the core package and all).

The lesson from that is, I suppose, if you are building a house and suddenly find yourself wanting to tear down your neighbour's and then the entire street, there is maybe something architectural to rethink.

@phoeenniixx

phoeenniixx commented Feb 7, 2025

Hi everyone! I have an idea for the D1 layer; not sure if this will work, but I thought we could reduce the metadata by using dicts.
I would really appreciate comments.

D1 __getitem__ returns:

* t: tensor of shape (n_timepoints)
  Time index for each time point in the past or present. Aligned with ``y``, and ``x`` not ending in ``f``.
* y: tensor of shape (n_timepoints, n_targets)
  Target values for each time point. Rows are time points, aligned with ``t``.
* x: tensor of shape (n_timepoints, n_features)
  Features for each time point. Rows are time points, aligned with ``t``.
* group: tensor of shape (n_groups)
  Group identifiers for time series instances.
* st: tensor of shape (n_static_features)
  Static features.
* col_map: dict { 'y': list[str], 'x': list[str], 'st': list[str] }
  Names of columns for y, x, and static features.
* st_col_type: dict[str, int]
  Names of static features mapped to type (0=categorical, 1=numerical).
  Example: { 'region': 0, 'temp': 1 }
* y_types: list of int (length = n_targets)
  Type of each target variable: 0 (categorical), 1 (numerical).
* x_types: list of int (length = n_features)
  Type of each feature:  
  - 0 = categorical unknown  
  - 1 = categorical known  
  - 2 = numerical unknown  
  - 3 = numerical known
Optionally, the following str-keyed entries can be included:
    * t_f: tensor of shape (n_timepoints_future)
      Time index for each time point in the future.
      Aligned with ``x_f``.
    * x_f: tensor of shape (n_timepoints_future, n_features)
      Known features for each time point in the future.
      Rows are time points, aligned with ``t_f``.
    * weights: tensor of shape (n_timepoints), only if weight is not None
    * weight_f: tensor of shape (n_timepoints_future), only if weight is not None

In this way we have a little less metadata

@phoeenniixx

phoeenniixx commented Feb 7, 2025

There are two ideas here, I believe: (a) nesting metadata, and (b) returning metadata not by __getitem__ but another API point.

As we are using just lightning modules (please correct me if I am wrong), we can get the metadata in the setup function, which is part of the module: https://lightning.ai/docs/pytorch/LTS/common/lightning_module.html#setup

(Although not sure how much of it is feasible😅)

example:

from lightning import LightningModule

class MyModel(LightningModule):
    def setup(self, stage=None):
        dataset = self.trainer.datamodule.train_dataset
        self.col_map = dataset.col_map
        self.st_col_type = dataset.st_col_type
        self.y_types = dataset.y_types
        self.x_types = dataset.x_types

    def forward(self, x):
        # metadata is now available for the forward pass
        pass

    def training_step(self, batch, batch_idx):
        # training logic
        pass

[image: PyTorch Lightning workflow diagram]
Source: https://www.assemblyai.com/blog/pytorch-lightning-for-dummies/

More examples: https://lightning.ai/docs/pytorch/stable/data/datamodule.html#what-is-a-datamodule

Or we can do this:

from torch.utils.data import Dataset

class DecoderEncoderData(Dataset):
    def __init__(self, tsd: PandasTSDataSet, **params):
        self.tsd = tsd  # Store dataset reference

        # Access metadata from dataset (D1)
        self.col_map = tsd.col_map
        self.st_col_type = tsd.st_col_type
        self.y_types = tsd.y_types
        self.x_types = tsd.x_types

    def __getitem__(self, idx):
        sample = self.tsd[idx]
        return sample

@phoeenniixx

A doubt: are we using pytorch-lightning or lightning? This is the thing that confuses me every time I work on ptf: both are similar yet different, and although they have similar names and functions they can't be used interchangeably, as even their trainers are different (although both have setup support).
I am just asking this for future reference :)

@felipeangelimvieira
Contributor

felipeangelimvieira commented Feb 7, 2025

I think that is a hypothetical possibility but a bad idea. We would be rewriting layers outside the scope of the package.

What I meant is not using this tensor in every part of the code, just having code that encapsulates the input data tensor together with this metadata. This is something documented in PyTorch's documentation.

But, this is not an intrinsic property of the static variables. It is how the model chooses to use them. So it is not a data property but a modelling choice, and specifically a "reduction" choice where we reformat the data (like in the reduction forecaster, just a different type of making the data "rectangular-formatted")

Therefore, with the domain driven design philosophy, the choice of reformatting the static variables falls with the model rather than the data layer.

I would argue that "static" features are not a real phenomenon; they are a separation made in the original paper. Nothing impedes the usage of other past and future known features in the same way that "static" features are used.

One example of a use case that might encounter difficulties when separating static and non-static variables in the data layer is as follows: What if I want to use non-constant features (such as Fourier terms, which are known for both the past and future) in the same way that static variables are used in TFT? This approach could be interesting if we want the gating mechanisms to depend on seasonality. For example, holidays have different effects depending on the day of the week, and using sine and cosine functions as static features could be helpful in this scenario.
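
For concreteness, Fourier terms of this kind can be generated from the time index alone, so they are known arbitrarily far into the future (illustrative sketch):

import numpy as np

def fourier_terms(time_index: np.ndarray, period: float, order: int) -> np.ndarray:
    """Return sine/cosine features of shape (len(time_index), 2 * order)."""
    terms = []
    for k in range(1, order + 1):
        terms.append(np.sin(2 * np.pi * k * time_index / period))
        terms.append(np.cos(2 * np.pi * k * time_index / period))
    return np.stack(terms, axis=-1)

# weekly seasonality on a daily integer index, known for past and future alike
features = fourier_terms(np.arange(60), period=7.0, order=2)  # shape (60, 4)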

@felipeangelimvieira
Contributor

A doubt: are we using pytorch-lightning or lightning? This is the thing that confuses me every time I work on ptf: both are similar yet different, and although they have similar names and functions they can't be used interchangeably, as even their trainers are different (although both have setup support).
I am just asking this for future reference :)

hi @phoeenniixx ! I believe it is lightning, based on pyproject.toml's content.

@phoeenniixx

Thanks for clarifying this! @felipeangelimvieira

@fkiraly
Contributor Author

fkiraly commented Feb 8, 2025

@phoeenniixx, thanks a lot for your useful thoughts!

A doubt: are we using pytorch-lightning or lightning?

I think the latter is a rebrand of the former; the package and the project have been renamed.

we can get the metadata in setup function

Interesting - how would the entire workflow look if we do this?

Though, using setup for this seems to deviate slightly from its intended specification; the docs refer to data-processing-related setup rather than metadata handling.

I thought to reduce the metadata by using dicts.
I would really appreciate the comments

That's a very interesting idea!

Following up on this, I would actually put even more in single dicts, and use strings (like in pandas) for types:

D1 __getitem__ returns:

* t: tensor of shape (n_timepoints)
  Time index for each time point in the past or present. Aligned with ``y``, and ``x`` not ending in ``f``.
* y: tensor of shape (n_timepoints, n_targets)
  Target values for each time point. Rows are time points, aligned with ``t``.
* x: tensor of shape (n_timepoints, n_features)
  Features for each time point. Rows are time points, aligned with ``t``.
* group: tensor of shape (n_groups)
  Group identifiers for time series instances.
* st: tensor of shape (n_static_features)
  Static features.
* cols: dict { 'y': list[str], 'x': list[str], 'st': list[str] }
  Names of columns for y, x, and static features. List elements are in same order as column dimensions.
  Columns not appearing are assumed to be named x0, x1, etc, y0, y1, etc, st0, st1, etc.
* col_type: dict[str, str]
  maps column names to data types "F" (numerical) and "C" (categorical).
  Column names not occurring are assumed "F".
* col_known: dict[str, str]
  maps column names to "K" (future known) or "U" (future unknown).
  Column names not occurring are assumed "K".
  
Optionally, the following str-keyed entries can be included:
    * t_f: tensor of shape (n_timepoints_future)
      Time index for each time point in the future.
      Aligned with ``x_f``.
    * x_f: tensor of shape (n_timepoints_future, n_features)
      Known features for each time point in the future.
      Rows are time points, aligned with ``t_f``.
    * weights: tensor of shape (n_timepoints), only if weight is not None
    * weight_f: tensor of shape (n_timepoints_future), only if weight is not None
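
For illustration, one sample under this convention could carry metadata like the following (all values made up):

import torch

sample = {
    "t": torch.arange(48),
    "y": torch.randn(48, 1),
    "x": torch.randn(48, 3),
    "group": torch.tensor([0]),
    "st": torch.tensor([1.0, 120.0]),
    "cols": {"y": ["demand"], "x": ["price", "promo", "dow"], "st": ["region", "floor_area"]},
    "col_type": {"promo": "C", "dow": "C", "region": "C"},  # columns not listed default to "F"
    "col_known": {"price": "U"},                            # columns not listed default to "K"
}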

@phoeenniixx

Thanks for the comments @fkiraly !

how would the entire workflow look if we do this?

Well I have not thought this through but we can do something like this:

# Layer D1
class PandasTSDataSet:
    def __init__(self, df, y_cols, x_cols, st_cols, y_types, x_types, st_types):
        self.df = df
        # collect all metadata in one place, kept out of __getitem__
        self.metadata = {
            "y_cols": y_cols, "x_cols": x_cols, "st_cols": st_cols,
            "y_types": y_types, "x_types": x_types, "st_types": st_types,
        }

    def __getitem__(self, idx):
        # Returns only tensors, no metadata
        return {
            't': tensor(...),
            'y': tensor(...),
            'x': tensor(...),
            'group': tensor(...),
            'st': tensor(...),
            't_f': tensor(...),
            'x_f': tensor(...),
        }

    def get_metadata(self):
        return self.metadata

# Layer D2
class DecoderEncoderDataModule(LightningDataModule):
    def __init__(self, d1_dataset, batch_size=32, num_workers=4):
        super().__init__()
        # initialize other  params
        self.d1_dataset = d1_dataset
        self.batch_size = batch_size
        self.metadata = None  # Will be set in setup()
    
    def setup(self, stage=None):
        # Get metadata from D1 layer during setup
        self.metadata = self.d1_dataset.get_metadata()
        
        if stage == 'fit' or stage is None:
            # Any train-specific setup
            pass
        
        if stage == 'test' or stage is None:
            # Any test-specific setup
            pass
            
    def train_dataloader(self):
        return DataLoader(
            self.d1_dataset,
            batch_size=self.batch_size,
            **other_params
        )

# Layer M
class DecoderEncoderModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.metadata = None
        
    def setup(self, stage=None):
        # Get metadata from datamodule during setup
        self.metadata = self.trainer.datamodule.metadata
        
        # Initialize layer T model using metadata
    
    def forward(self, x):
     # forward logic
        pass

Though, using setup for this seems to deviate slightly from its intended specification; the docs refer to data-processing-related setup rather than metadata handling.

Well yes, but I found this was the closest place where data and related fields are handled in the model. Or, as I mentioned earlier, we can just pass the metadata directly to other layers by doing something similar to this.

Here we can even define a function like get_metadata() that returns everything that's necessary; that will reduce the chance of missing any metadata field. Like:

class DecoderEncoderData(Dataset):
    def __init__(self, tsd: PandasTSDataSet, **params):
        self.tsd = tsd  # Store dataset reference

        # Access metadata from dataset (D1)
        self.metadata = tsd.get_metadata()

    def __getitem__(self, idx):
        sample = self.tsd[idx]  
        # other required additions to ``sample``
        return sample

In this way, we don't deviate from setup documentation, pass the metadata as well to other layers and we don't have to return the metadata in the __getitem__.
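
A self-contained toy sketch of that flow, assuming the lightning package; the class names are illustrative stand-ins, not the proposed API:

import torch
from torch.utils.data import DataLoader, Dataset
from lightning import LightningDataModule, LightningModule, Trainer

class ToyD1(Dataset):
    def __init__(self):
        self.metadata = {"cols": {"y": ["y0"], "x": ["x0", "x1"]}}
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        return {"x": torch.randn(4, 2), "y": torch.randn(4, 1)}

class ToyDataModule(LightningDataModule):
    def __init__(self, d1):
        super().__init__()
        self.d1 = d1
        self.metadata = None
    def setup(self, stage=None):
        self.metadata = self.d1.metadata  # pulled from D1 during setup
    def train_dataloader(self):
        return DataLoader(self.d1, batch_size=4)

class ToyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(2, 1)
    def setup(self, stage=None):
        self.metadata = self.trainer.datamodule.metadata  # metadata reaches the model here
    def training_step(self, batch, batch_idx):
        return torch.nn.functional.mse_loss(self.net(batch["x"]), batch["y"])
    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

# Trainer calls datamodule.setup() and then model.setup() before training,
# so the metadata never has to travel through __getitem__.
Trainer(max_epochs=1, logger=False, enable_checkpointing=False).fit(
    ToyModel(), datamodule=ToyDataModule(ToyD1())
)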

Following up on this, I would actually put even more in single dicts, and use strings (like in pandas) for types:

Yeah! we can do that as well. Didn't strike me at that time. Thanks!

@fkiraly
Contributor Author

fkiraly commented Feb 8, 2025

@phoeenniixx, brilliant! Great thinking!

In retrospect - looking at your post, it looks like with layer D2 we were trying to reinvent the LightningDataModule interface!

Nice! I will need some time to digest, but I think we are getting close now.

FYI @agobbifbk, @felipeangelimvieira

@phoeenniixx

Thanks! @fkiraly

In retrospect - looking at your post, it looks like with layer D2 we were trying to reinvent the LightningDataModule interface!

Well, I didn't find any other way in the lightning module to handle this operation, so we need to do something :)

@agobbifbk
Collaborator

Hello there!
I'm trying to fit this structure in DSIPTS:

  1. Layer D1 PandasTSDataSet: easy, it is almost a copy-paste from the original implementation. There are some problems in the __getitem__ function, since we can not pass strings (categorical variables) and timestamps! The first is easy to solve, just use a LabelEncoder; the second one is more difficult, maybe again a LabelEncoder? It may depend on the type of time we have (integers, dates, datetimes) - see the encoding sketch at the end of this comment. Moreover, I thought it useful to use the IDX of __getitem__ for retrieving only data of a certain group. This will heavily affect the D2 layer implementation! But at least we are sure that one item of the D1 contains ONE time series. In the case we are reading from disk, I suppose that the IDX will be another thing, like a chunk of data, that can again contain groups that can differ among chunks (for example, group 1 may or may not be in chunk 1). I think that we can precompute the mapping chunk->groups (useful in the D2!)

  2. Layer D2: I'm finding it difficult to see this working properly in the case of a non-PandasTSDataSet. For sure this snippet is not what we want

    def train_dataloader(self):
        return DataLoader(
            self.d1_dataset, #--> the dataloader then feeds the deep learning model, so we need to slice data before 
            batch_size=self.batch_size,
            **other_params
        )

One solution is to create a custom dataset and:

    def train_dataloader(self):
        return DataLoader(
            MyDataset(self.d1_dataset), #--> custom dataloader
            batch_size=self.batch_size,
            **other_params
        )

where:

import random

from torch.utils.data import Dataset


class MyDataset(Dataset):

    def __init__(self, data, metadata):
        self.data = data
        self.metadata = metadata
        self.metadata_dataset = data.metadata

    def __len__(self):
        return ??  ## hard to define now :-)

    def __getitem__(self, idxs):  ## this is the idx from the dataloader, but the slicing also depends on which series IDX we pick ---> problem!
        sample = {}
        IDX = random.randint(0, len(self.data))  ## THIS IS THE IDX OF THE TIMESERIES.... IS RANDOM CORRECT???
        tmp = self.data[IDX]
        for k in self.data.keys:
            if k in self.metadata_dataset['past_variables']:
                sample[k] = tmp[k][idxs - self.metadata['past_steps']:idxs]
            if k in self.metadata_dataset['future_variables']:
                sample[k] = tmp[k][idxs:idxs + self.metadata['future_steps']]
        return sample

Before continuing I will study if and how it is managed inside PTF.

Finally it is hard for me to understand where the scalers should act in the case of chunked data. But we will find a solution. If you have any comment please drop me a message!
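
Regarding the string/timestamp problem from point 1, a small illustrative sketch of one possible encoding (sklearn LabelEncoder for the categories, int64 nanoseconds for the datetimes):

import pandas as pd
import torch
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=4, freq="D"),
    "group": ["a", "a", "b", "b"],
    "y": [1.0, 2.0, 3.0, 4.0],
})

group_encoder = LabelEncoder()
group_codes = group_encoder.fit_transform(df["group"])   # strings -> ints
time_int = df["time"].astype("int64").to_numpy()         # datetimes -> ns since epoch

sample = {
    "t": torch.as_tensor(time_int),
    "group": torch.as_tensor(group_codes, dtype=torch.long),
    "y": torch.as_tensor(df["y"].to_numpy(), dtype=torch.float32).unsqueeze(-1),
}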

@agobbifbk
Collaborator


Maybe I found a solution, I will post it as soon as possible!

@agobbifbk
Collaborator

Here https://github.com/agobbifbk/DSIPTS_PTF/blob/main/notebooks/TEST.ipynb you can find a working example with the proposed schema; maybe it can be optimized somehow. It is not clear where and how to perform the scaling and how to align it with a different TSDataset.

@fkiraly
Contributor Author

fkiraly commented Feb 17, 2025

@agobbifbk, @phoeenniixx, interesting prototypes!

From discussion on Feb 17: we have noted that the two D1 layers have slightly different __getitem__ return interfaces, we should try to align on something that satisfies requirements of both ptf and dsipts, and also some general requirements like "small number of tensors" and simplicity.

dsipts D1 prototype by @agobbifbk and team: https://github.com/agobbifbk/DSIPTS_PTF/blob/main/notebooks/TEST.ipynb

ptf D1 prototype by @phoeenniixx : sktime/pytorch-forecasting#1770

for comparison, early D1 prototype by @fkiraly: sktime/pytorch-forecasting#1757

@phoeenniixx

phoeenniixx commented Feb 17, 2025

In sktime/pytorch-forecasting#1770, the "separation" into categorical, numerical etc. is not done as explicitly as in https://github.com/agobbifbk/DSIPTS_PTF/blob/main/notebooks/TEST.ipynb.
In the PR, what we do is record in the metadata which columns are categorical etc., and then do the splitting in the D2 layer based on that; this way the number of tensors returned is smaller.

(That's what I was trying to say in the meeting, my headphone battery ran out so I had some mic problems T_T)

@fkiraly
Contributor Author

fkiraly commented Feb 17, 2025

In sktime/pytorch-forecasting#1770, the "separation" into categorical, numerical etc. is not done as explicitly as in https://github.com/agobbifbk/DSIPTS_PTF/blob/main/notebooks/TEST.ipynb. In the PR, what we do is record in the metadata which columns are categorical etc., and then do the splitting in the D2 layer based on that; this way the number of tensors returned is smaller.

Currently, I would indeed prefer that - because that way, the D2 layer (which may differ per model) can decide how it deals with categoricals, etc, and we minimize the number of tensors in D1.

What are the good reasons to keep this in D1? Genuine question.

@agobbifbk
Collaborator

In sktime/pytorch-forecasting#1770, the "separation" into categorical, numerical etc. is not done as explicitly as in https://github.com/agobbifbk/DSIPTS_PTF/blob/main/notebooks/TEST.ipynb. In the PR, what we do is record in the metadata which columns are categorical etc., and then do the splitting in the D2 layer based on that; this way the number of tensors returned is smaller.

I slightly disagree on that. If you filter out the unused columns in D1 you will have less data to pass to the D2....

Currently, I would indeed prefer that - because that way, the D2 layer (which may differ per model) can decide how it deals with categoricals, etc, and we minimize the number of tensors in D1.

... But if we want to create a D2 per model (I would say not per model but per type of model) maybe it is worth it, since the D2 will get rid of unused columns. Maybe it will be convenient to separate at least numerical and categorical in the D1, so we can transform strings into integers directly at the beginning.

I will update my notebook according to this and try to return as few tensors as possible.

@fkiraly
Contributor Author

fkiraly commented Feb 17, 2025

I slightly disagree on that. If you filter out the unused columns in D1 you will have less data to pass to the D2....

This sounds like a misunderstanding? I do not see any filtering happening in D1 design of @phoeenniixx. The categorical and numerical columns are simply kept in the same tensor (as floats). Please correct me if I am wrong, either of you.

Maybe it will be convenient to separate at least numerical and categorical in the D1 so we can transform strings into integers directly at the beginning.

I agree with half of this - I do not think separating the tensors is necessary for a conversion of strings to integers. The columns that were categorical strings, after conversion to integers, can be stored in the same tensor as numerical variables, as long as we keep track in the metadata which columns are meant to be categorical.

I am not saying this is something we must do, I am just trying to make the point that there is no logical necessity to separate the tensors.

Beyond this, I agree with @thawn on the general point that, all other things being equal, fewer tensors in D1 __getitem__ are a better design.
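
As a concrete (purely illustrative) example of keeping categoricals in the same tensor: integer-encode the category, store everything as float, and record which columns are categorical in the metadata:

import numpy as np
import pandas as pd
import torch

df = pd.DataFrame({"price": [9.5, 10.0, 8.75], "promo": ["none", "discount", "none"]})

promo_codes = df["promo"].astype("category").cat.codes   # strings -> small ints
x = torch.as_tensor(
    np.column_stack([df["price"].to_numpy(), promo_codes.to_numpy()]),
    dtype=torch.float32,
)  # single (n_timepoints, n_features) tensor

metadata = {"cols": {"x": ["price", "promo"]}, "col_type": {"promo": "C"}}
# a D2 layer can later cast x[:, 1] back to long and embed it, based on col_type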

@agobbifbk
Collaborator

agobbifbk commented Feb 18, 2025

Ok I got it! Can you please add @xandie985 to the thread? THX

@fkiraly
Contributor Author

fkiraly commented Feb 18, 2025

Ok I got it! Can you please add @xandie985 to the thread? THX

which thread? He can simply reply here, if that is what you mean, this repository is public.

@agobbifbk
Collaborator

Ok I got it! Can you please add @xandie985 to the thread? THX

which thread? He can simply reply here, if that is what you mean, this repository is public.

Ok sorry, I didn't find him using @ and autocompletion. We are working on the notebook, you can find some progress here: http://localhost:8889/lab/tree/TEST_v2.ipynb Sandeep will take care of completing the missing part; after that we can compare this and the @phoeenniixx implementation!

@fkiraly
Contributor Author

fkiraly commented Feb 19, 2025

Ok I got it! Can you please add @xandie985 to the thread? THX

which thread? He can simply reply here, if that is what you mean, this repository is public.

Ok sorry, I didn't find him using @ and autocompletion. We are working on the notebook, you can find some progress here: http://localhost:8889/lab/tree/TEST_v2.ipynb Sandeep will take care of completing the missing part; after that we can compare this and the @phoeenniixx implementation!

two questions:

  • I think you did correctly ping Sandeep on GitHub? That was xandie985 as you wrote above.
  • the notebook link seems to point to your local machine, not to a remote

@xandie985

Ok I got it! Can you please add @xandie985 to the thread? THX

which thread? He can simply reply here, if that is what you mean, this repository is public.

Ok sorry, I didn't find him using @ and autocompletion. We are working on the notebook, you can find some progress here: http://localhost:8889/lab/tree/TEST_v2.ipynb Sandeep will take care of completing the missing part; after that we can compare this and the @phoeenniixx implementation!

two questions:

  • I think you did correctly ping Sandeep on GitHub? That was xandie985 as you wrote above.
  • the notebook link seems to point to your local machine, not to a remote

Hi @fkiraly and @agobbifbk it's meee, Sandeep here. 👋
About the notebook, I have access to it. Andrea was referring to this notebook, which has been forked from the main DSIPTS repo.

@PranavBhatP

Hi! @agobbifbk, you had mentioned in the last weekly meeting the requirement for an XArrayDataset to handle larger-than-memory dataset inputs. I'm not sure how appropriate this might be for the redesign, and I'm open to comments on this, but I put some thought into it and came up with a rudimentary design with an implementation of basic features of the Dataset; it currently inherits from the PandasTSDataset design suggested in the TestV2.ipynb notebook. This would go with the D1 layer.

# sketch only - assumes the PandasTSDataset class from the notebook is available
import itertools
from typing import Dict, List, Optional, Union

import numpy as np
import pandas as pd
import torch
import xarray as xr


class XArrayTSDataset(PandasTSDataset):
    def __init__(
        self,
        data: xr.Dataset,  # todo: add zarr handling also
        time: Optional[str] = None,
        target: Optional[Union[str, List[str]]] = None,
        group: Optional[List[str]] = None,
        num: Optional[List[Union[str, List[str]]]] = None,
        cat: Optional[List[Union[str, List[str]]]] = None,
        known: Optional[List[Union[str, List[str]]]] = None,
        unknown: Optional[List[Union[str, List[str]]]] = None,
        static: Optional[List[Union[str, List[str]]]] = None,
        label_encoders: Optional[dict] = None,
        chunk_size: Optional[Dict[str, int]] = None,
    ):
        """Initial thoughts and design for an XArrayTSDataset for larger-than-memory datasets.

        Inherits from PandasTSDataset as suggested in the notebook, but handles data
        using xarray for larger-than-memory datasets, by pre-computing chunk indices.

        Additional Params
        -----------------
        chunk_size : Dict[str, int], optional
            Size of chunks for each dimension, e.g. {'time': 100, 'group': 10}
        """
        self.xr_data = data
        self.chunk_size = chunk_size

        if chunk_size:
            self.xr_data = self.xr_data.chunk(chunk_size)

        first_chunk = self._get_first_chunk()

        super().__init__(
            data=first_chunk,
            time=time,
            target=target,
            group=group,
            num=num,
            cat=cat,
            known=known,
            unknown=unknown,
            static=static,
            label_encoders=label_encoders,
        )
        self._setup_chunks()

    def _get_first_chunk(self) -> pd.DataFrame:
        """Get the first chunk of data as a pandas DataFrame."""
        if self.chunk_size:
            subset = {dim: slice(0, size) for dim, size in self.chunk_size.items()}
            return self.xr_data.isel(**subset).to_dataframe()
        return self.xr_data.to_dataframe()

    def _setup_chunks(self):
        """Set up chunk information for iteration at init time."""
        if not self.chunk_size:
            self.n_chunks = 1
            return

        self.chunks_per_dim = {}
        for dim in self.chunk_size:
            total_size = self.xr_data.sizes[dim]
            chunk_size = self.chunk_size[dim]
            self.chunks_per_dim[dim] = int(np.ceil(total_size / chunk_size))

        # total number of chunks is the product over all chunked dimensions
        self.n_chunks = int(np.prod(list(self.chunks_per_dim.values())))

        self.chunk_indices = list(itertools.product(
            *[range(n) for n in self.chunks_per_dim.values()]
        ))

    def _get_chunk(self, chunk_idx: int) -> xr.Dataset:
        """Getter for chunks; maps chunk_idx to per-dimension slices."""
        chunk_coords = self.chunk_indices[chunk_idx]

        chunk_slices = {}
        for dim_idx, dim in enumerate(self.chunk_size.keys()):
            start = chunk_coords[dim_idx] * self.chunk_size[dim]
            end = min(start + self.chunk_size[dim], self.xr_data.sizes[dim])
            chunk_slices[dim] = slice(start, end)

        return self.xr_data.isel(**chunk_slices)

    def __getitem__(self, index: int) -> Dict[str, torch.Tensor]:
        """Get time series data for a given index."""
        items_per_chunk = int(np.prod([self.chunk_size[dim] for dim in self.chunk_size]))

        # simple calculation for the chunk number of the current index
        chunk_idx = index // items_per_chunk
        # offset within the chunk
        local_idx = index % items_per_chunk

        chunk_data = self._get_chunk(chunk_idx)
        chunk_df = chunk_data.to_dataframe()
        # open question: the chunk DataFrame still needs to be handed to the parent
        # (e.g. by re-initialising its internal state) before delegating
        return super().__getitem__(local_idx)

    def __len__(self) -> int:
        if not self.chunk_size:
            return super().__len__()
        total_items = np.prod([
            self.xr_data.sizes[dim] for dim in self.chunk_size
        ])
        return int(total_items)

    def _process_chunk(self, chunk: xr.Dataset) -> Dict[str, torch.Tensor]:
        """Process chunks into tensors (to be implemented)."""

    def to_pandas(self) -> pd.DataFrame:
        """Load the xarray data into memory; included to show possible use cases for this class."""

@agobbifbk
Collaborator

Nice! I think a lot of stuff cannot be inherited from PandasTSDataset (for example the label encoder), but it is a starting point. We can implement it as a separate class and then see if there are common parts (in that case we can have a generic Dataset, and the pandas and xarray versions will derive from it).
Can you also add an example where you have a pandas dataframe and convert it to an xarray chunked by time and groups? You can start from the TEST_v2 dataset and try to work around it! THX
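
A rough sketch of the requested pandas-to-xarray conversion, chunked by time and group (dask-backed chunks, illustrative data only):

import pandas as pd
import xarray as xr

df = pd.DataFrame({
    "time": pd.date_range("2024-01-01", periods=6, freq="D").tolist() * 2,
    "group": ["a"] * 6 + ["b"] * 6,
    "y": list(range(12)),
})

# pivot to a (time, group) grid, then wrap as an xarray Dataset
ds = xr.Dataset.from_dataframe(df.set_index(["time", "group"]))

# chunk along both dimensions (requires dask to be installed)
ds_chunked = ds.chunk({"time": 3, "group": 1})
print(ds_chunked["y"].chunks)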

@fkiraly
Contributor Author

fkiraly commented Feb 21, 2025

@PranavBhatP, agree with @agobbifbk that the mid-term design is probably best to inherit from common base classes for the D1 dataset rather than from the pandas dataset.

I also think the class is nice, and proves how a larger-than-memory input could be handled.
