pytorch-forecasting and dsipts v2 API design #39
Conversation
(bringing the discussion from Discord to here) |
Given that, should this be positioned at the "model" level instead? |
Yes, that is also what @thawn said, let's try to have as few tensors as possible. My answer: perhaps - but it is hard to actually write it down. Some comments:
|
I think not, since it is a "data" concern:
|
What I'm thinking about is that static variables are a specific kind of past- and future-known exogenous variable, one that is constant throughout time. In this sense, should we really separate them? I feel they are essentially the same at a "meta" level. The difference is how the model wants to use them. One example: we could use Fourier features as static variables in TFT. I'm not sure there is anything that prevents us from using them that way, since these features are known for every timepoint in the past and future. |
More thoughts:
|
To me, this is not correct on the conceptual level. The formal model I typically adopt is that all data points are indexed. Time-varying (or "dynamic") variables change over time, and observations (entries, values, etc.) are indexed by (instance index, variable name, time index), whereas static variables are only indexed by (instance index, variable name). If you adopt the usual model of hierarchical data, where a combination of indices indexes an observation, it becomes clear that static variables live at different "levels" in the indexing hierarchy. Mapping static variables onto a column at the lower level (plus time index) makes them constant as a side effect of the representation, rather than as a conceptual feature. |
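As a minimal illustration of the two index levels (the frame and column names here are invented for the example, not part of any proposed API):

import pandas as pd

# dynamic observations: indexed by (instance, time), one column per variable
dynamic = pd.DataFrame(
    {"temperature": [20.1, 19.8, 21.0, 22.3]},
    index=pd.MultiIndex.from_product(
        [["store_1", "store_2"], [0, 1]], names=["instance", "time"]
    ),
)

# static observations: indexed by instance only, one column per variable
static = pd.DataFrame(
    {"region": ["north", "south"]},
    index=pd.Index(["store_1", "store_2"], name="instance"),
)

# mapping statics onto the lower level repeats the value along time:
# constancy becomes a side effect of the representation, not a concept
broadcast = static.reindex(dynamic.index.get_level_values("instance"))
broadcast.index = dynamic.index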
There are two ideas here, I believe: (a) nesting metadata, and (b) returning metadata not by I am very open to (a) and think this is perhaps a good idea. Regarding (b), could you explain how the retrieval of the metadata would work? The
Yes, that would be a great idea - do you want to draw up a design? And, surely this exists elsewhere, too? |
I understand your point that, since they can be indexed by a different number of indices while keeping the information, they are, and can be classified as, different things. Nevertheless, I believe the challenge here is projecting this higher-dimensional space of concepts (e.g. classifying variables by instance, variable name, time index, discrete or continuous, etc.) into a lower-dimensional space that still makes the task (implementing different NN-based forecasters) feasible. In this regard, the separation of data into 1. target variable, 2. features with past knowns, 3. features with past/future knowns is possibly the lowest-dimensional representation, and that can be further subdivided into other categories/concepts, such as features with past/future knowns that are static and do not change over time (we can imagine it as a tree).
I am thinking in the opposite direction: static variables are features, they belong to this "feature" concept: relevant data other than the target variable, used as input to predict the target variable. They are a special subset of features that do not change over time, but this does not place them outside the set of features. In particular, being part of this subset allows them to be represented with fewer index levels, as you mentioned. Some models, such as DeepAR, may still want to use "static variables" as exogenous (I believe this is the case, by inspecting the implementation and the property |
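As a rough illustration of the "static variables are just time-constant features" view (tensor names and shapes are assumptions, not a proposed interface):

import torch

batch, time_steps, n_dynamic, n_static = 4, 12, 3, 2

x_dynamic = torch.randn(batch, time_steps, n_dynamic)  # (instance, time, variable)
x_static = torch.randn(batch, n_static)                # (instance, variable)

# a model that wants to treat statics as ordinary exogenous features can simply
# broadcast them along the time axis and concatenate them with the dynamic ones
x_static_as_dynamic = x_static.unsqueeze(1).expand(-1, time_steps, -1)
x_all = torch.cat([x_dynamic, x_static_as_dynamic], dim=-1)  # shape (4, 12, 5)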
good point... If this is a direction we would like to go, I believe there would be possibly two solutions:
|
But this is not an intrinsic property of the static variables; it is how the model chooses to use them. So it is not a data property but a modelling choice, and specifically a "reduction" choice where we reformat the data (like in the reduction forecaster, just a different type of making the data "rectangular-formatted"). Therefore, following the domain-driven design philosophy, the choice of reformatting the static variables falls to the model rather than to the data layer.
I wanted to do that too, but was voted down and convinced by good arguments in the Jan 20 meeting; reference: sktime/pytorch-forecasting#1736. The key argument for me was that current vignettes in all packages reviewed rely on the default
I think that is a hypothetical possibility but a bad idea. We would be rewriting layers outside the scope of the package. I still have nightmares from our attempts to rewrite @prockenschaub did great work, and I think our designs would have been successful in the end - it just turned out we were getting pulled into the scope of rewriting The lesson from that is, I suppose: if you are building a house and suddenly find yourself wanting to tear down your neighbour's, and then the entire street, there is maybe something architectural to rethink. |
Hi everyone! I have an idea for the D1 layer; not sure if this will work, but I thought we could reduce the metadata by using dicts. D1:
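Purely as a hypothetical illustration of the dict idea (the attribute names mirror those used in the model example further below, but the exact keys and type codes are assumptions, not an agreed schema):

# hypothetical D1 metadata held in a few dicts rather than many separate attributes
col_map = {"y": ["sales"], "x": ["price", "promo"], "st": ["region"]}
st_col_type = {"region": "categorical"}
y_types = {"sales": "float"}
x_types = {"price": "float", "promo": "categorical"}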
In this way we have a little less metadata |
As we are using just (although not sure how much of it is feasible 😅), an example:

from lightning.pytorch import LightningModule  # assuming Lightning >= 2.0 import paths


class MyModel(LightningModule):
    def setup(self, stage=None):
        # pull the metadata attributes from the datamodule's training dataset
        dataset = self.trainer.datamodule.train_dataset
        self.col_map = dataset.col_map
        self.st_col_type = dataset.st_col_type
        self.y_types = dataset.y_types
        self.x_types = dataset.x_types

    def forward(self, x):
        # now we have the metadata available for processing in forward
        pass

    def training_step(self, batch, batch_idx):
        # training logic
        pass
More examples: https://lightning.ai/docs/pytorch/stable/data/datamodule.html#what-is-a-datamodule

Or we can do this:

class DecoderEncoderData(Dataset):
    def __init__(self, tsd: PandasTSDataSet, **params):
        self.tsd = tsd  # store dataset reference
        # access metadata from the dataset (D1)
        self.col_map = tsd.col_map
        self.st_col_type = tsd.st_col_type
        self.y_types = tsd.y_types
        self.x_types = tsd.x_types

    def __getitem__(self, idx):
        sample = self.tsd[idx]
        return sample
|
A doubt: are we using |
What I meant is not using this tensor in every part of the code, just having code encapsulating the input data tensor with this metadata. This is something documented in PyTorch's documentation
I would argue that "static" features are not a real phenomenon; they are a separation made in the original paper. Nothing impedes the usage of other past and future known features in the same way that "static" features are used. One example of a use case that might encounter difficulties when separating static and non-static variables in the data layer is as follows: What if I want to use non-constant features (such as Fourier terms, which are known for both the past and future) in the same way that static variables are used in TFT? This approach could be interesting if we want the gating mechanisms to depend on seasonality. For example, holidays have different effects depending on the day of the week, and using sine and cosine functions as static features could be helpful in this scenario. |
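A small sketch of the Fourier-terms example (the column names and the weekly period are assumptions, for illustration only):

import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=365, freq="D")
t = np.arange(len(idx))

# Fourier terms with a weekly period: known for past and future alike, so in
# principle they could be fed to a model wherever "static" inputs are consumed
fourier = pd.DataFrame(
    {
        "sin_weekly": np.sin(2 * np.pi * t / 7),
        "cos_weekly": np.cos(2 * np.pi * t / 7),
    },
    index=idx,
)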
hi @phoeenniixx ! I believe it is |
Thanks for clarifying this! @felipeangelimvieira |
@phoeenniixx, thanks a lot for your useful thoughts!
I think the latter is a rebrand of the former; the package and the project have been renamed.
Interesting - how would the entire workflow look if we do this? Though, using
That's a very interesting idea! Following up on this, I would actually put even more in single dicts, and use strings (like in pandas) for types: D1
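Hypothetically, putting everything into one dict and using pandas-style dtype strings could look like this (all names invented for illustration, not an agreed schema):

# a single D1 metadata dict, with pandas-style strings for the types
d1_metadata = {
    "cols": {"y": ["sales"], "x": ["price", "promo"], "st": ["region"]},
    "col_type": {
        "sales": "float64",
        "price": "float64",
        "promo": "category",
        "region": "category",
    },
    "col_known": {"price": "known", "promo": "known", "region": "known"},
}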
|
Thanks for the comments @fkiraly !
Well, I have not thought this through, but we can do something like this:

# Layer D1
class PandasTSDataSet:
    def __init__(self, df, y_cols, x_cols, st_cols, y_types, x_types, st_types):
        self.df = df
        # collect the column-level information into a single metadata dict
        self.metadata = {
            "y_cols": y_cols,
            "x_cols": x_cols,
            "st_cols": st_cols,
            "y_types": y_types,
            "x_types": x_types,
            "st_types": st_types,
        }

    def __getitem__(self, idx):
        # returns only tensors, no metadata
        return {
            't': tensor(...),
            'y': tensor(...),
            'x': tensor(...),
            'group': tensor(...),
            'st': tensor(...),
            't_f': tensor(...),
            'x_f': tensor(...),
        }

    def get_metadata(self):
        return self.metadata
# Layer D2
class DecoderEncoderDataModule(LightningDataModule):
    def __init__(self, d1_dataset, batch_size=32, num_workers=4):
        super().__init__()
        # initialize other params
        self.d1_dataset = d1_dataset
        self.batch_size = batch_size
        self.metadata = None  # will be set in setup()

    def setup(self, stage=None):
        # get metadata from the D1 layer during setup
        self.metadata = self.d1_dataset.get_metadata()
        if stage == 'fit' or stage is None:
            # any train-specific setup
            pass
        if stage == 'test' or stage is None:
            # any test-specific setup
            pass

    def train_dataloader(self):
        return DataLoader(
            self.d1_dataset,
            batch_size=self.batch_size,
            **other_params
        )
# Layer M
class DecoderEncoderModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.metadata = None

    def setup(self, stage=None):
        # get metadata from the datamodule during setup
        self.metadata = self.trainer.datamodule.metadata
        # initialize the layer T model using the metadata

    def forward(self, x):
        # forward logic
        pass
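A rough end-to-end sketch of how these three layers could be wired together, assuming the classes sketched above and Lightning >= 2.0 import paths (`df` and the column arguments are placeholders):

import lightning.pytorch as pl

# D1: raw data + metadata
d1 = PandasTSDataSet(df, y_cols=["y"], x_cols=["x1"], st_cols=["s1"],
                     y_types=["float"], x_types=["float"], st_types=["category"])

# D2: the datamodule pulls metadata from D1 in its setup() hook
datamodule = DecoderEncoderDataModule(d1, batch_size=64)

# M: the model pulls metadata from the datamodule in its setup() hook
model = DecoderEncoderModel()

trainer = pl.Trainer(max_epochs=1)
trainer.fit(model, datamodule=datamodule)  # setup() hooks run before training starts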
Well yes, but I found this was the closest way in which data and related fields are handled in the model. Or, as I mentioned earlier, we can just pass the metadata directly to other layers by doing something similar to this. Here we can even define a function like

class DecoderEncoderData(Dataset):
    def __init__(self, tsd: PandasTSDataSet, **params):
        self.tsd = tsd  # store dataset reference
        # access metadata from the dataset (D1)
        self.metadata = tsd.get_metadata()

    def __getitem__(self, idx):
        sample = self.tsd[idx]
        # other required additions to ``sample``
        return sample

In this way, we don't deviate from
Yeah! we can do that as well. Didn't strike me at that time. Thanks! |
@phoeenniixx, brilliant! Great thinking! In retrospect - looking at your post, it looks like with layer D2 we were trying to reinvent the Nice! I will need some time to digest, but I think we are getting close now. |
Thanks! @fkiraly
Well, I didn't find any other way in |
Hello there!
def train_dataloader(self):
    return DataLoader(
        self.d1_dataset,  # --> the dataloader then feeds the deep learning model, so we need to slice the data before
        batch_size=self.batch_size,
        **other_params
    )

One solution is to create a custom dataset and:

def train_dataloader(self):
    return DataLoader(
        MyDataset(self.d1_dataset, self.metadata),  # --> custom dataset wrapping the D1 data
        batch_size=self.batch_size,
        **other_params
    )

where:

import random

from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self, data, metadata):
        self.data = data
        self.metadata = metadata
        self.metadata_dataset = data.metadata

    def __len__(self):
        return ??  # hard to define now :-)

    def __getitem__(self, idxs):  # this is the idx of the dataloader, but it depends on IDX below ---> problem!
        sample = {}
        IDX = random.randint(0, len(self.data) - 1)  # this is the idx of the time series... is random correct???
        tmp = self.data[IDX]
        for k in self.data.keys:
            if k in self.metadata_dataset['past_variables']:
                sample[k] = tmp[k][idxs - self.metadata['past_steps']:idxs]
            if k in self.metadata_dataset['future_variables']:
                sample[k] = tmp[k][idxs:idxs + self.metadata['future_steps']]
        return sample

Before continuing, I will study if and how it is managed inside PTF. Finally, it is hard for me to understand where the scalers should act in the case of chunked data. But we will find a solution. If you have any comments please drop me a message! |
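One possible way around the "which idx?" problem flagged above would be to precompute a flat list of (series, window) pairs instead of sampling a random series. A rough sketch, under the assumption that the D1 dataset returns a dict of tensors per series (including a 't' key) and that the metadata carries past/future step counts and variable lists; all names are illustrative:

from torch.utils.data import Dataset


class WindowedDataset(Dataset):
    """Hypothetical wrapper: enumerate every (series, window) pair up front."""

    def __init__(self, data, metadata):
        self.data = data          # D1-style dataset: data[i] -> dict of tensors
        self.metadata = metadata  # assumed to contain step counts and variable lists
        past = metadata["past_steps"]
        future = metadata["future_steps"]
        # precompute all valid (series_idx, window_end) pairs, so the dataloader
        # index maps deterministically to exactly one window -- no random sampling
        self.index = []
        for series_idx in range(len(data)):
            series_len = data[series_idx]["t"].shape[0]
            for end in range(past, series_len - future + 1):
                self.index.append((series_idx, end))

    def __len__(self):
        return len(self.index)

    def __getitem__(self, idx):
        series_idx, end = self.index[idx]
        tmp = self.data[series_idx]
        past = self.metadata["past_steps"]
        future = self.metadata["future_steps"]
        sample = {}
        for k, values in tmp.items():
            if k in self.metadata["past_variables"]:
                sample[k] = values[end - past:end]
            elif k in self.metadata["future_variables"]:
                sample[k] = values[end:end + future]
        return sample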
Maybe I found a solution, I will post it as soon as possible! |
Here https://github.com/agobbifbk/DSIPTS_PTF/blob/main/notebooks/TEST.ipynb you can find a working example with the proposed schema; maybe it can be optimized somehow. It is not clear where and how to perform the scaling, and how to align it with a different TSDataset. |
@agobbifbk, @phoeenniixx, interesting prototypes! From discussion on Feb 17: we have noted that the two D1 layers have slightly different

dsipts D1 prototype by @agobbifbk and team: https://github.com/agobbifbk/DSIPTS_PTF/blob/main/notebooks/TEST.ipynb
ptf D1 prototype by @phoeenniixx: sktime/pytorch-forecasting#1770
for comparison, early D1 prototype by @fkiraly: sktime/pytorch-forecasting#1757 |
In sktime/pytorch-forecasting#1770, the "separation" into categorical, numerical, etc. is not done as explicitly as in https://github.com/agobbifbk/DSIPTS_PTF/blob/main/notebooks/TEST.ipynb. (That's what I was trying to say in the meeting; my headphones' battery ran out, so I had some mic problems T_T) |
Currently, I would indeed prefer that - because that way, the D2 layer (which may differ per model) can decide how it deals with categoricals, etc, and we minimize the number of tensors in D1. What are the good reasons to keep this in D1? Genuine question. |
I slightly disagree on that. If you filter out the unused columns in D1 you will have less data to pass to the D2....
... But if we want to create a D2 per model (I would say not per model but per type of model), maybe it can be worth it, since the D2 will get rid of unused columns. Maybe it will be convenient to separate at least numerical and categorical variables in D1, so we can transform strings into integers directly at the beginning. I will update my notebook according to this and try to return as few tensors as possible. |
This sounds like a misunderstanding? I do not see any filtering happening in D1 design of @phoeenniixx. The categorical and numerical columns are simply kept in the same tensor (as floats). Please correct me if I am wrong, either of you.
I agree with half of this - I do not think separating the tensors is necessary for a conversion of strings to integers. The columns that were categorical strings can, after conversion to integers, be stored in the same tensor as the numerical variables, as long as we keep track in the metadata of which columns are meant to be categorical. I am not saying this is something we must do; I am just trying to make the point that there is no logical necessity to separate the tensors. Beyond this, I agree with @thawn on the general point that, all other things being equal, fewer tensors in D1 are preferable. |
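To make the "same tensor, tracked via metadata" point concrete, a small sketch (the column names and metadata keys are invented for illustration):

import numpy as np
import pandas as pd
import torch

df = pd.DataFrame({"price": [1.5, 2.0, 1.8], "promo": ["none", "xmas", "none"]})

# encode the categorical column as integer codes, then store everything in one float tensor
codes = df["promo"].astype("category").cat.codes
x = torch.tensor(
    np.column_stack([df["price"].to_numpy(), codes.to_numpy()]),
    dtype=torch.float32,
)

# the only extra bookkeeping is metadata saying which columns are categorical
metadata = {"columns": ["price", "promo"], "categorical": ["promo"]}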
Ok I got it! Can you please add @xandie985 to the thread? THX |
which thread? He can simply reply here, if that is what you mean, this repository is public. |
Ok, sorry, I didn't find him using @ and autocompletion. We are working on the notebook; you can find some progress here: http://localhost:8889/lab/tree/TEST_v2.ipynb Sandeep will take care of completing the missing part; after that we can compare this and the @phoeenniixx implementation! |
two questions:
|
Hi @fkiraly and @agobbifbk, it's meee, Sandeep here. 👋 |
Hi! @agobbifbk, you had mentioned in the last weekly meeting the requirement for an XArrayDataset to handle larger-than-memory dataset inputs. I'm not sure how appropriate this might be for the redesign, and I'm open to comments on this, but I put some thought into it and came up with a rudimentary design, with an implementation of the basic features of the Dataset. It currently inherits from the PandasTSDataset design suggested in the TestV2.ipynb notebook. This would go with the D1 layer.

import itertools
from typing import Dict, List, Optional, Union

import numpy as np
import pandas as pd
import torch
import xarray as xr


class XArrayTSDataset(PandasTSDataset):
    def __init__(
        self,
        data: xr.Dataset,  # todo: add zarr handling also
        time: Optional[str] = None,
        target: Optional[Union[str, List[str]]] = None,
        group: Optional[List[str]] = None,
        num: Optional[List[Union[str, List[str]]]] = None,
        cat: Optional[List[Union[str, List[str]]]] = None,
        known: Optional[List[Union[str, List[str]]]] = None,
        unknown: Optional[List[Union[str, List[str]]]] = None,
        static: Optional[List[Union[str, List[str]]]] = None,
        label_encoders: Optional[dict] = None,
        chunk_size: Optional[Dict[str, int]] = None,
    ):
        """Initial thoughts and design for an XArrayTSDataset for larger-than-memory datasets.

        Inherits from PandasTSDataset as suggested in the notebook, but handles data
        using xarray for larger-than-memory datasets. Pre-computes chunk indices.

        Additional Parameters
        ---------------------
        chunk_size : Dict[str, int], optional
            Size of chunks for each dimension, e.g. {'time': 100, 'group': 10}
        """
        self.xr_data = data
        self.chunk_size = chunk_size  # set before _get_first_chunk, which reads it
        if chunk_size:
            self.xr_data = self.xr_data.chunk(chunk_size)
        first_chunk = self._get_first_chunk()
        super().__init__(
            data=first_chunk,
            time=time,
            target=target,
            group=group,
            num=num,
            cat=cat,
            known=known,
            unknown=unknown,
            static=static,
            label_encoders=label_encoders,
        )
        self._setup_chunks()

    def _get_first_chunk(self) -> pd.DataFrame:
        """Get the first chunk of data as a pandas DataFrame."""
        if self.chunk_size:
            subset = {}
            for dim, size in self.chunk_size.items():
                subset[dim] = slice(0, size)
            return self.xr_data.isel(**subset).to_dataframe()
        return self.xr_data.to_dataframe()

    def _setup_chunks(self):
        """Set up chunk information for iteration on init."""
        if not self.chunk_size:
            self.n_chunks = 1
            return
        self.chunks_per_dim = {}
        for dim in self.chunk_size:
            total_size = self.xr_data.sizes[dim]
            chunk_size = self.chunk_size[dim]
            self.chunks_per_dim[dim] = int(np.ceil(total_size / chunk_size))
        # the number of chunks is the product over all chunked dimensions
        self.n_chunks = np.prod(list(self.chunks_per_dim.values()))
        self.chunk_indices = list(itertools.product(
            *[range(n) for n in self.chunks_per_dim.values()]
        ))

    def _get_chunk(self, chunk_idx: int) -> xr.Dataset:
        """Getter for chunks; computes the chunk coordinates from chunk_idx."""
        chunk_coords = self.chunk_indices[chunk_idx]
        chunk_slices = {}
        for dim_idx, dim in enumerate(self.chunk_size.keys()):
            start = chunk_coords[dim_idx] * self.chunk_size[dim]
            end = min(start + self.chunk_size[dim], self.xr_data.sizes[dim])
            chunk_slices[dim] = slice(start, end)
        return self.xr_data.isel(**chunk_slices)

    def __getitem__(self, index: int) -> Dict[str, torch.Tensor]:
        """Get time series data for a given index."""
        items_per_chunk = np.prod([self.chunk_size[dim] for dim in self.chunk_size])
        # simple calculation to find which chunk the index falls into
        chunk_idx = index // items_per_chunk
        # offset within the chunk
        local_idx = index % items_per_chunk
        chunk_data = self._get_chunk(chunk_idx)
        chunk_df = chunk_data.to_dataframe()  # materialize only the current chunk
        return super().__getitem__(local_idx)

    def __len__(self) -> int:
        if not self.chunk_size:
            return super().__len__()
        total_items = np.prod([
            self.xr_data.sizes[dim] for dim in self.chunk_size
        ])
        return total_items

    def _process_chunk(self, chunk: xr.Dataset) -> Dict[str, torch.Tensor]:
        """Process a chunk into tensors (not implemented yet)."""

    def to_pandas(self) -> pd.DataFrame:
        """Dump the xarray data into memory; included to show possible use cases for this class."""
|
Nice! I think a lot of stuff cannot be inherited from PandasTSDataset (for example the label encoder), but it is a starting point. We can implement it as a separate class and then see if there are common parts (in that case we can have a generic Dataset, and the pandas and xarray versions will derive from it). |
@PranavBhatP, agree with @agobbifbk that the mid-term design is probably best to inherit from common base classes for the D1 dataset rather than from the pandas dataset. I also think the class is nice, and proves how a larger-than-memory input could be handled. |
An enhancement proposal and design document towards v2 of pytorch-forecasting and dsipts. More context here: sktime/pytorch-forecasting#1736