Tsd refactoring #173
Conversation
Hi Luigi, I should really publish a roadmap somewhere to help contributors get a sense of where this is going. In fact, you are right: there is a lot of redundancy in the core when it comes to methods, and I am working on simplifying it right now. Overall, I was thinking the architecture of the core could be something like this:
Any slicing of a TsdTensor after the first dimension can then return a TsdFrame or Tsd depending on the shape. Let me know what you think.
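A minimal sketch of that slicing rule, assuming a numpy-backed container with time on the first axis (class and attribute names here are illustrative, not pynapple's actual API):

```python
import numpy as np

class TsdTensor:
    """Hypothetical numpy-backed container: time on the first axis, data on the rest."""

    def __init__(self, t, d):
        self.t = np.asarray(t)
        self.d = np.asarray(d)

    def __getitem__(self, key):
        time_key = key[0] if isinstance(key, tuple) else key
        t, d = self.t[time_key], self.d[key]
        # Dispatch on the dimensionality of the sliced data:
        # 3D or more stays a TsdTensor, 2D becomes a TsdFrame, 1D a Tsd.
        if d.ndim >= 3:
            return TsdTensor(t, d)
        if d.ndim == 2:
            return TsdFrame(t, d)
        return Tsd(t, d)

class TsdFrame(TsdTensor):
    pass

class Tsd(TsdTensor):
    pass

movie = TsdTensor(t=np.arange(100), d=np.random.rand(100, 5, 5))
print(type(movie[:, :, 0]).__name__)  # TsdFrame (2D data left after slicing)
print(type(movie[:, 0, 0]).__name__)  # Tsd (1D data left after slicing)
```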
Hi @gviejo! Yes, it would actually be awesome to have a roadmap, even a super sparse one, just to get a sense of what you plan to address short- and mid-term! I would be very happy to give feedback and help out with the implementation, given how useful this is.

Glad to hear that we share observations on the old core implementation! I agree that pandas can be a bit brittle - I noticed it in this refactoring! - but I think some things have already improved with a more careful definition of init and constructors/slice constructors. Can you give a sense of how significant the inconsistencies you came across are?

In terms of where you are suggesting to move, I see the point you are making about going numpy-like. However, there is something I am not sure I get: NDArray-based objects would be great for uniform data (spikes/movies), but it would be quite hard to implement a pandas-like interface for mixed data types like ordinary data frames. Is there a smooth way to take care of this with the strategy you are envisioning? From what I understand, it would basically mean re-implementing convenient parts of pandas.

I agree that N-dimensionality would be powerful to have. However, when I think about my kind of data, I always end up working with extracted positions/postures over time, or fluorescence traces, so semantic columns would be way more important to me than N dimensions, which are most times dropped early in the analysis. But maybe I am missing a simple way to keep all of that with the strategy you are suggesting!
About pandas: version 2.0.0 is out and it does not play very well anymore. For example:

tsdframe = nap.TsdFrame(t = np.arange(100), d = np.random.rand(100, 3))
tsdframe.mean(1)

works for pandas<2.0.0 but not pandas>=2.0.0. I think they changed their backend to pyarrow and I have yet to figure out why. But the problem with relying on pandas is that this situation could potentially repeat, so numpy seems to me a safer choice for long-term stability. Other packages like cupy and dask have successfully used the numpy array container approach, so we can expect more support for it in the future. Instead of focusing on debugging pandas, I would rather focus on moving the backend of pynapple directly to numpy. Also, some pynapple dependencies have capped pandas to <2.0.0, which might trigger incompatibilities. The issues you are raising are valid, in fact.
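For what it's worth, the "numpy array container" approach can be as light as exposing the wrapped values through the array protocol, so plain numpy functions keep working whatever pandas does. A rough sketch (class name and attributes are illustrative, not pynapple's actual implementation):

```python
import numpy as np

class TimeSeries:
    """Rough sketch of a numpy-backed time series: wraps values plus a time index."""

    def __init__(self, t, d):
        self.index = np.asarray(t, dtype=np.float64)
        self.values = np.asarray(d)

    def __array__(self, dtype=None):
        # Lets numpy functions (np.mean, np.max, ...) operate on the values directly.
        return self.values if dtype is None else self.values.astype(dtype)

ts = TimeSeries(np.arange(100), np.random.rand(100, 3))
print(np.mean(ts, axis=1).shape)  # (100,), independent of the installed pandas version
```

cupy and dask go further (dispatch protocols such as __array_function__), but even this minimal hook keeps np.mean(tsdframe, axis=1)-style calls stable.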
Hi @gviejo, thank you for the additional explanations! I agree that pandas is probably still changing a bit more than numpy, although I guess it will plateau soon to numpy-like levels of stability. As for your example, although I understand that having it break is annoying, I also have to say that computing the mean across all columns of a dataframe is not such a common operation for me. Here are more inputs on those points - I still do not have a clear inclination at the moment, just brainstorming:
Let me know what you think!
Regarding the semantic indexing, what about this behavior?

In [2]: tsdframe = nap.TsdFrame(t = np.arange(10), d = np.random.rand(10, 3), columns=['a', 'b', 'c'])
In [3]: tsdframe
Out[3]:
Time (s) a b c
0.000000 0.216387 0.828438 0.932349
1.000000 0.527450 0.211969 0.467524
2.000000 0.660210 0.899291 0.998269
3.000000 0.459685 0.254648 0.325075
4.000000 0.381232 0.068764 0.305757
5.000000 0.869709 0.592541 0.713998
6.000000 0.279989 0.714232 0.697950
7.000000 0.317931 0.548663 0.145560
8.000000 0.765320 0.376656 0.996353
9.000000 0.966709 0.259720 0.947387
dtype: float64
In [4]: tsdframe[0:3,2]
Out[4]:
Time (s)
0.000000 0.932349
1.000000 0.467524
2.000000 0.998269
Length: 3, dtype: float64
In [5]: tsdframe[:,0:2]
Out[5]:
Time (s) a b
0.000000 0.216387 0.828438
1.000000 0.527450 0.211969
2.000000 0.660210 0.899291
3.000000 0.459685 0.254648
4.000000 0.381232 0.068764
5.000000 0.869709 0.592541
6.000000 0.279989 0.714232
7.000000 0.317931 0.548663
8.000000 0.765320 0.376656
9.000000 0.966709 0.259720
dtype: float64
In [6]: tsdframe.loc[['a', 'c']]
Out[6]:
Time (s) a c
0.000000 0.216387 0.932349
1.000000 0.527450 0.467524
2.000000 0.660210 0.998269
3.000000 0.459685 0.325075
4.000000 0.381232 0.305757
5.000000 0.869709 0.713998
6.000000 0.279989 0.697950
7.000000 0.317931 0.145560
8.000000 0.765320 0.996353
9.000000 0.966709 0.947387
dtype: float64
In [8]: tsdframe.loc['b']
Out[8]:
Time (s) b
0.000000 0.828438
1.000000 0.211969
2.000000 0.899291
3.000000 0.254648
4.000000 0.068764
5.000000 0.592541
6.000000 0.714232
7.000000 0.548663
8.000000 0.376656
9.000000 0.259720
dtype: float64
loc would be specific to the TsdFrame class, and tsdframe.columns is actually a pandas Index object.
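If the backend moves to numpy, one way to get that behavior is a tiny loc accessor that translates column labels into integer positions and then slices the underlying array. A sketch with illustrative names (the real classes would wrap the result back into Tsd/TsdFrame objects rather than returning raw arrays):

```python
import numpy as np
import pandas as pd

class _LocIndexer:
    """Translates column labels into integer positions, then slices the numpy values."""

    def __init__(self, frame):
        self.frame = frame

    def __getitem__(self, key):
        labels = [key] if isinstance(key, str) else list(key)
        cols = self.frame.columns.get_indexer(labels)  # label -> position via the pandas Index
        return self.frame.values[:, cols]

class MiniTsdFrame:
    def __init__(self, t, d, columns):
        self.index = np.asarray(t)
        self.values = np.asarray(d)
        self.columns = pd.Index(columns)  # columns stay a pandas Index, as described above

    @property
    def loc(self):
        return _LocIndexer(self)

frame = MiniTsdFrame(np.arange(10), np.random.rand(10, 3), columns=["a", "b", "c"])
print(frame.loc[["a", "c"]].shape)  # (10, 2)
print(frame.loc["b"].shape)         # (10, 1)
```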
This indexing behavior looks good to me!
This does look very good to me! If it is not painful to get it to be solid for both set and get, I think it can be a good option!
For the setter, do you mean something like this?

In [40]: tsd
Out[40]:
Time (s)
0.000000 0.000000
1.000000 0.000000
2.000000 0.000000
3.000000 0.000000
4.000000 0.000000
5.000000 0.000000
6.000000 0.000000
7.000000 0.000000
8.000000 0.000000
9.000000 0.000000
Length: 10, dtype: float64
In [41]: tsd[0] = 1
In [42]: tsd
Out[42]:
Time (s)
0.000000 1.000000
1.000000 0.000000
2.000000 0.000000
3.000000 0.000000
4.000000 0.000000
5.000000 0.000000
6.000000 0.000000
7.000000 0.000000
8.000000 0.000000
9.000000 0.000000
Length: 10, dtype: float64
Yes! (with semantic indexing as well). If you then plan to go forward with this, maybe this PR could be converted to an issue? Happy to beta-test once it is at feature parity, if that can be helpful!
Apparently it's not possible to convert a PR to an issue. Feel free to test it. I still need to update a lot of docs around it.
Thanks, great! Docs aside, it is not already good to go, right? I still have some testing errors, and I do not want to give feedback before it's time! Minor request: do you think it would be possible to add the option of passing data to time series as lists? It was one of the things I was adding in this PR, and it could be convenient!
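For reference, in a numpy-backed constructor the list option is usually just a matter of coercing inputs early. A minimal sketch (the helper name is made up):

```python
import numpy as np

def _coerce_inputs(t, d):
    """Accept lists, tuples or arrays for both time and data by funnelling
    everything through np.asarray before any further checks."""
    t = np.asarray(t, dtype=np.float64)
    d = np.asarray(d)
    if len(t) != len(d):
        raise ValueError("time index and data must have the same length")
    return t, d

# e.g. _coerce_inputs(t=[0, 1, 2], d=[[1.0], [2.0], [3.0]]) returns two numpy arrays
```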
Which tests failed? I am still working on
Edit: never mind, I was being stupid with fork syncing :)
Other feedback to smooth backward compatibility (maybe you are already addressing these as part of the numpy compatibility work though):

- a copy method - would something like this work?

def copy(self):
    return self.__class__(d=self.values.copy(), t=self.index.copy())

- some way of concatenating time series objects (a pd.concat-like join).
Yes for the copy method, I can add it now. For pd.concat, it's a good question. My approach was to prevent any operation that would increase the shape or dims of the object, so that you don't mess up the time index. You can only slice down, basically. But maybe it's too restrictive.
So, the usage I have is super basic: I just have files with consistent time indexes across each other (i.e. first 0-20, second 21-40, etc.) and I want to join them. In my case no reindexing would be needed.
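For that join-only use case, a minimal concatenation could simply stack the time indexes and values and check that the result stays sorted. A sketch assuming objects that expose index and values attributes:

```python
import numpy as np

def concat_time_series(objs):
    """Join time series whose time supports follow one another (e.g. 0-20, then 21-40)."""
    t = np.concatenate([np.asarray(o.index) for o in objs])
    d = np.concatenate([np.asarray(o.values) for o in objs])
    if np.any(np.diff(t) < 0):
        raise ValueError("time indexes overlap or are out of order; refusing to join")
    return t, d  # in practice these would be wrapped back into a Tsd/TsdFrame
```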
Question on the new indexing: would you consider the option of having strings interpreted as column names? Like being able to do tsdframe["a"] directly?
Yes, that is doable.
Great!
Yep, fully agree!
Also, it would be good to have the string indexing support item assignment: tsd["b"] = some_operation_on(tsd["a"])
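For what it's worth, string-keyed get and set can share the same label-to-position translation, so an assignment like tsd["b"] = some_operation_on(tsd["a"]) writes into the underlying numpy array in place. A sketch with illustrative names, not the actual implementation:

```python
import numpy as np
import pandas as pd

class ColumnIndexedFrame:
    """Sketch: string keys address columns for both reading and writing."""

    def __init__(self, t, d, columns):
        self.index = np.asarray(t)
        self.values = np.asarray(d, dtype=np.float64)
        self.columns = pd.Index(columns)

    def _col(self, key):
        return self.columns.get_loc(key)  # label -> integer position

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.values[:, self._col(key)]
        return self.values[key]

    def __setitem__(self, key, value):
        if isinstance(key, str):
            self.values[:, self._col(key)] = value
        else:
            self.values[key] = value

frame = ColumnIndexedFrame(np.arange(5), np.zeros((5, 2)), columns=["a", "b"])
frame["b"] = frame["a"] + 1.0  # item assignment through the string index
```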
The branch
Hi, can I close this pull request? I think most of the features we discussed here have been implemented.
Yup!
Hi,
I wanted to start working on #169, but while doing so I began to wonder if the time classes could be streamlined a bit to reduce double implementation.
A lot of the classes (the largest part, actually) is duplication that could be avoided by having a mix-in class that implements all the common methods shared across time classes. While I do realise that in general multiple inheritance is not the best pattern, in this case it seems unavoidable to me because of the direct pandas subclassing, and it still looks much better than having large chunks of duplicated code around.

I also slightly simplified the reimplementation scheme for the Series/DataFrame built-in methods. I realised that the _constructor property was failing to correctly handle new object constructions (e.g. with arithmetical operations) just because of the argument order in the Tsd init: when creating a Series, pandas assumes values as the first argument, while pynapple was inferring time. Therefore, I've put data as the first input to Tsd, which makes sense to me as there should always be data there, right? If no data is passed, then a Ts object is probably what the user wants. I was reassured in my understanding by the fact that all tests were passing after the change, but let me know if you feel this could impact some untested usage. The current solution is much cleaner: being more pandas-like, it avoids reimplementing a lot of built-in methods.

To be more coherent with pandas, I added the option of passing lists instead of arrays, and a related test. That can go if you don't like it though!

Let me know what you think! If this is good, I will next implement the loading in the mix-in class. All tests are currently passing.

Also, maybe in this PR after you have had a look, I would like to black the tests as well and add them to the linting, would that be ok? I think it makes sense to keep all the code tight.
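For context, the mix-in arrangement described above could look roughly like the sketch below. This is a simplified illustration of the pattern only (names are illustrative, and details such as _constructor handling differ across pandas versions), not the PR's actual class bodies:

```python
import numpy as np
import pandas as pd

class _TimeSeriesMixin:
    """Shared behavior implemented once instead of being duplicated in every time class."""

    def start_time(self):
        return self.index.values[0]

    def end_time(self):
        return self.index.values[-1]

class Tsd(_TimeSeriesMixin, pd.Series):
    """1D time series: data comes first, since a Tsd should always carry data."""

    def __init__(self, d=None, t=None, **kwargs):
        if t is not None:
            kwargs["index"] = np.asarray(t, dtype=np.float64)
        super().__init__(data=d, **kwargs)

    @property
    def _constructor(self):
        # With data as the first positional argument, pandas' internal
        # reconstruction (e.g. after arithmetic) maps back onto this class.
        return Tsd

ts = Tsd([1.0, 2.0, 3.0], t=[0.0, 1.0, 2.0])
print((ts + 1).start_time())  # the mixin method is available on derived results too
```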