Tsd refactoring #173

Closed · wants to merge 7 commits

Conversation

@vigji (Collaborator) commented Sep 12, 2023

Hi,

I wanted to start working on #169, but while doing so I began to wonder whether the time classes could be streamlined a bit to reduce duplicated implementation.

A lot (most, actually) of the class code is duplication that could be avoided by having a mix-in class that implements all the common methods shared across the time classes. While I realise that multiple inheritance is generally not the best pattern, it seems unavoidable to me here because of the direct pandas subclassing, and it still looks much better than having large chunks of duplicated code around.

I also slightly simplified the reimplementation scheme for the Series/DataFrame built-in methods. I realised that the _constructor property was failing to handle new object construction correctly (e.g. with arithmetic operations) just because of the argument order in the Tsd init: when creating a Series, pandas assumes values as the first argument, while pynapple was expecting time there.
Therefore, I've put data as the first input to Tsd, which makes sense to me as there should always be data there, right? If there is no data passed, then a Ts object is probably what the user wants. I was reassured in my understanding by the fact that all tests were still passing after the change, but let me know if you feel this could impact some untested usage. The current solution is much cleaner: being more pandas-like, it avoids reimplementing a lot of built-in methods.
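
To make the argument-order point concrete, here is a rough sketch of the pattern I mean (names like TsdSketch are placeholders, not the actual PR code; pandas constructor internals vary a bit across versions):

import pandas as pd


class TsdSketch(pd.Series):
    """Illustrative pandas.Series subclass with data as the first argument."""

    def __init__(self, d=None, t=None, **kwargs):
        # pandas rebuilds new objects as _constructor(values, index=...),
        # so fall back to `index` when `t` is not passed explicitly.
        if t is None:
            t = kwargs.pop("index", None)
        super().__init__(data=d, index=t, **kwargs)

    @property
    def _constructor(self):
        # pandas uses this class for arithmetic/slicing results; with data
        # first, the default (values, index) call order just works.
        return TsdSketch


tsd = TsdSketch(d=[1.0, 2.0, 3.0], t=[0.0, 0.5, 1.0])  # lists are accepted too
print(type(tsd * 2).__name__)  # the subclass is preserved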

To be more consistent with pandas, I added the option of passing lists instead of arrays, plus a related test. That can go if you don't like it, though!

Let me know what you think! If this is good, I will implement the loading in the mix-in class next. All tests are currently passing.

Also, maybe in this PR after you have had a look, I would like to black the tests as well and add them to the linting; would that be ok? I think it makes sense to keep all the code tight.

@vigji requested a review from gviejo as a code owner September 12, 2023 06:32
@gviejo (Contributor) commented Sep 12, 2023

Hi Luigi, I should really publish a roadmap somewhere to help contributors get a sense of where this is going. You are right, in fact: there is a lot of redundancy in the core when it comes to methods, and I am working on simplifying it right now.
The big update I would like to do is actually to remove pandas. Pandas is merely a placeholder right now, and I find it hard to control its behavior for some calls (especially since pandas 2.0.0). The idea is to create custom array containers as described in the numpy docs on custom array containers.
I tested it in the branch new_core and it plays nicely with the numpy API. It is also easier to control the returned object.
The other big advantage is that we can now define time series in more than 2 dimensions (TsdTensor for the name?), which will be useful for movies, for example.

Overall, I was thinking the architecture of the core could be something like this:

Abstract Class AbstractTsd(NDArrayOperatorsMixin)
│
├─ Concrete Subclass TsdTensor
│       │
│       ├─ Concrete Subclass TsdFrame -> can convert to pandas.DataFrame
│       │
│       └─ Concrete Subclass Tsd -> can convert to pandas.Series
│
└─ Concrete Subclass Ts -> the attribute `values` is empty.

Any slicing of a TsdTensor after the first dimension can then return a TsdFrame or a Tsd, depending on the shape. Let me know what you think.
I still need to test whether the abstract class plays nicely. What needs to be carefully tuned is __array_ufunc__ and __array_function__, but so far arithmetic operations work, as does, for example, np.mean(tsd).
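
To give a rough idea of the container approach, here is a purely illustrative sketch (not the new_core code; the real implementation also has to handle time support, out= arguments, etc.):

import numpy as np
from numpy.lib.mixins import NDArrayOperatorsMixin


class AbstractTsdSketch(NDArrayOperatorsMixin):
    """Illustrative container wrapping a time index plus a values array."""

    def __init__(self, t, d):
        self.index = np.asarray(t, dtype=np.float64)
        self.values = np.asarray(d)

    def __array__(self, dtype=None):
        # Lets np.asarray(obj) and most numpy functions see the raw values.
        return self.values if dtype is None else self.values.astype(dtype)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap container inputs, apply the ufunc, and rewrap the result
        # with the original time index when the first dimension is preserved.
        args = [x.values if isinstance(x, AbstractTsdSketch) else x for x in inputs]
        out = getattr(ufunc, method)(*args, **kwargs)
        if isinstance(out, np.ndarray) and out.shape[:1] == self.index.shape:
            return self.__class__(self.index, out)
        return out


tsd = AbstractTsdSketch(t=np.arange(5), d=np.random.rand(5))
print(type(tsd + 1).__name__)  # AbstractTsdSketch
print(np.mean(tsd))            # plain float, via __array__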

@vigji (Collaborator, Author) commented Sep 12, 2023

Hi @gviejo !

Yes, it would actually be awesome to have one, even a super sparse one, just to get a sense of what you plan to address short- and mid-term! I would be very happy to give feedback and to help out with the implementation, given how useful this is.

Glad to hear that we share observations on the old core implementation! I agree that pandas can be a bit brittle - I noticed it in this refactoring! - but I think some things already improved with a more careful definition of the init and of the constructors/slice constructors. Can you give a sense of how serious the inconsistencies you came across are?

In terms of where you are suggesting to move, I see the point you are making about going numpy-like. However, there's something I am not sure I get: if we end up having NDArray-based objects, this would be great for uniform data (spikes/movies), but it would be quite hard to implement a pandas-like interface for mixed data types like ordinary data frames. Is there a smooth way to take care of this with the strategy you are envisioning? From what I understand, it would basically mean re-implementing convenient parts of pandas, like semantic column indexing, efficient mixed-type memory usage, grouping operations...

I see how you would always have an as_dataframe() option, but arrays are more restrictive classes than data frames, so you will always end up losing representational power going ndarray->dataframe (unless you reimplement half of pandas, which sounds painful).

I agree that N-dimensionality would be powerful to have. However, if I think about my kind of data, I always end up working with extracted positions/postures over time, or fluorescence traces, so semantic columns would be far more important than N dimensions, which most of the time are dropped early in the analysis.

But maybe I am missing whether there's a simple way to keep all of that with the strategy you are suggesting!

@gviejo (Contributor) commented Sep 13, 2023

About pandas: version 2.0.0 is out and it does not play very well anymore. For example, this

tsdframe = nap.TsdFrame(t = np.arange(100), d = np.random.rand(100, 3))
tsdframe.mean(1)

works for pandas<2.0.0 but not for pandas>=2.0.0. I think they changed their backend to pyarrow and I have yet to figure out why. But the problem with relying on pandas is that this situation could repeat. Therefore numpy seems to me a safer choice for long-term stability. Other packages like cupy and dask have successfully used the numpy array container approach, so we can expect more support for this in the future. So instead of focusing on debugging pandas, I would rather focus on moving the backend of pynapple directly to numpy. Also, some pynapple dependencies have capped pandas to <2.0.0, which might trigger incompatibilities.

The issues you are raising are indeed valid:

  • semantic column indexing: this is an annoying one. In the new version I am testing, indexing behaves only like a numpy array. If you do tsdframe[0] in the new version, it returns one element along the first axis. This kind of indexing, df['col-name'], is trickier, but I suppose it is what you would like to have? It is easy to have a columns attribute, but how would you use it? One solution I was thinking of is a function tsdframe.get("col-name") or tsdframe.getcol("col-name"), but it's different from pandas (see the sketch after this list).
  • efficient mixed-types memory usage: is there an example case where this is non-trivial in systems neuroscience? My intuition is that if you need a dataframe with mixed types for a time series, you probably don't need pynapple functionalities and you are better off staying in pandas.
  • grouping operations: what kind of grouping operations would you need? One possibility is to do tsdframe.as_dataframe().groupby(...). I suppose that if you do a groupby operation, you probably don't care about the time support anymore?
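
To make the get("col-name") idea concrete, a hypothetical sketch (class, method, and attribute names are placeholders, not the new_core API):

import numpy as np
import pandas as pd


class TsdFrameSketch:
    """Illustrative 2D container with a pandas Index of column labels."""

    def __init__(self, t, d, columns=None):
        self.index = np.asarray(t, dtype=np.float64)
        self.values = np.asarray(d)
        self.columns = pd.Index(
            columns if columns is not None else np.arange(self.values.shape[1])
        )

    def get(self, key):
        # Map one or several column labels to integer positions, then
        # slice the underlying array along the second axis.
        pos = self.columns.get_indexer(np.atleast_1d(key))
        return self.values[:, pos]


tsdframe = TsdFrameSketch(
    t=np.arange(10), d=np.random.rand(10, 3), columns=["a", "b", "c"]
)
print(tsdframe.get("b").shape)         # (10, 1)
print(tsdframe.get(["a", "c"]).shape)  # (10, 2)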

@vigji (Collaborator, Author) commented Sep 13, 2023

Hi @gviejo, thank you for the additional explanations!

I agree that pandas is probably still changing a bit more than numpy, although I guess it will soon plateau to numpy-like levels of stability.

As for your example, although I understand that it is annoying not to have it working, I also have to say that computing the mean across all columns of a dataframe is not such a common operation for me, and given the highly numpy-like syntax of an operation like tsdframe.mean(1), tsdframe.values.mean(1) does not look horrible!

Here are more inputs on those points - I still do not have a clear inclination atm, just brainstorming:

  • for me, semantic indexing is crucial; if you end up implementing an interface that is not as smooth as pandas for getting e.g. columns from a dataframe mapping experiment quantities over time, it could get quite off-putting for users; most are experienced with pandas, and it is part of what makes the nap classes neat. Having to work with objects that are mostly arrays and that you have to index with new dedicated methods could be cumbersome; a full implementation in numpy could be tricky (but maybe I am overthinking it, and it is actually reasonable to include in __getitem__ a behavior that reasonably supports this, at least for string indexes, plus some pandas-emulating .loc[]).
  • what I was focusing on with point 2 was not efficiency, but having multiple types in the same object. For me it is common to keep track of multiple experiment quantities with different types, and even if, from what I've tested, this currently seems to cause trouble for the jitted jitvaluefromtsdframe, it feels addressable with pandas (I am actually not sure I understand the current logic of computing both the new index and the slicing inside the jitted function, as opposed to computing the indexes there and using them to slice in a pandas-like way, which would solve the issue). It would however not really be addressable with numpy. I agree this is not crucial, although it could be convenient to have.
  • grouping operations: I agree that grouping is often "terminal" with respect to time, so it would not be an issue if whatever comes out of it has lost the time support features. The interface could also be implemented to look like the usual pandas methods, with the methods just calling as_dataframe under the hood (see the sketch below).
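
For instance, something like this (purely illustrative; it assumes as_dataframe() hands back a plain pandas.DataFrame and that the time support is deliberately dropped at that point):

import numpy as np
import pandas as pd
import pynapple as nap

tsdframe = nap.TsdFrame(t=np.arange(6), d=np.random.rand(6, 2), columns=["x", "y"])

# Go through pandas and stay in pandas: the groupby result is a plain
# DataFrame, so the time support is intentionally lost here.
df = tsdframe.as_dataframe()
grouped = df.groupby(df["x"] > 0.5).mean()
print(grouped)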

Let me know what you think!

@gviejo (Contributor) commented Sep 15, 2023

Regarding the semantic indexing, what about this behavior?

In [2]: tsdframe = nap.TsdFrame(t = np.arange(10), d = np.random.rand(10, 3), columns=['a', 'b', 'c'])                                                              
                                                                                                                                                                    
In [3]: tsdframe                                                                                                                                                    
Out[3]:                                                                                                                                                             
Time (s)      a        b        c
0.000000    0.216387 0.828438 0.932349
1.000000    0.527450 0.211969 0.467524
2.000000    0.660210 0.899291 0.998269
3.000000    0.459685 0.254648 0.325075
4.000000    0.381232 0.068764 0.305757
5.000000    0.869709 0.592541 0.713998
6.000000    0.279989 0.714232 0.697950
7.000000    0.317931 0.548663 0.145560
8.000000    0.765320 0.376656 0.996353
9.000000    0.966709 0.259720 0.947387
dtype: float64

In [4]: tsdframe[0:3,2]                                                                                                                                             
Out[4]:                                                                                                                                                             
Time (s)
0.000000    0.932349
1.000000    0.467524
2.000000    0.998269
Length: 3, dtype: float64

In [5]: tsdframe[:,0:2]                                                                                                                                             
Out[5]:                                                                                                                                                             
Time (s)      a        b
0.000000    0.216387 0.828438
1.000000    0.527450 0.211969
2.000000    0.660210 0.899291
3.000000    0.459685 0.254648
4.000000    0.381232 0.068764
5.000000    0.869709 0.592541
6.000000    0.279989 0.714232
7.000000    0.317931 0.548663
8.000000    0.765320 0.376656
9.000000    0.966709 0.259720
dtype: float64

In [6]: tsdframe.loc[['a', 'c']]                                                                                                                                    
Out[6]:                                                                                                                                                             
Time (s)      a        c
0.000000    0.216387 0.932349
1.000000    0.527450 0.467524
2.000000    0.660210 0.998269
3.000000    0.459685 0.325075
4.000000    0.381232 0.305757
5.000000    0.869709 0.713998
6.000000    0.279989 0.697950
7.000000    0.317931 0.145560
8.000000    0.765320 0.996353
9.000000    0.966709 0.947387
dtype: float64

In [8]: tsdframe.loc['b']                                                                                                                                           
Out[8]:                                                                                                                                                             
Time (s)      b
0.000000    0.828438
1.000000    0.211969
2.000000    0.899291
3.000000    0.254648
4.000000    0.068764
5.000000    0.592541
6.000000    0.714232
7.000000    0.548663
8.000000    0.376656
9.000000    0.259720
dtype: float64

loc would be specific to the TsdFrame class and tsdframe.columns is actually a pandas Index object.

@dlevenstein (Contributor) commented:

This indexing behavior looks good to me!

@vigji (Collaborator, Author) commented Sep 15, 2023

This does look very good to me! If it is not painful to get it solid for both set and get, I think it can be a good option!

@gviejo (Contributor) commented Sep 15, 2023

For the setter, do you mean something like this?

In [40]: tsd                                                                                                                                                        
Out[40]:                                                                                                                                                            
Time (s)
0.000000    0.000000
1.000000    0.000000
2.000000    0.000000
3.000000    0.000000
4.000000    0.000000
5.000000    0.000000
6.000000    0.000000
7.000000    0.000000
8.000000    0.000000
9.000000    0.000000
Length: 10, dtype: float64

In [41]: tsd[0] = 1                                                                                                                                                 
                                                                                                                                                                    
In [42]: tsd                                                                                                                                                        
Out[42]:                                                                                                                                                            
Time (s)
0.000000    1.000000
1.000000    0.000000
2.000000    0.000000
3.000000    0.000000
4.000000    0.000000
5.000000    0.000000
6.000000    0.000000
7.000000    0.000000
8.000000    0.000000
9.000000    0.000000
Length: 10, dtype: float64

@vigji (Collaborator, Author) commented Sep 16, 2023

Yes! (with semantic indexing as well).

If you then plan to go forward with this core change, I guess I'll halt this PR and wait for you to merge new_core before moving on with #165 and #169. Maybe we can keep this thread open as a place to discuss features of the new core!

Happy to beta-test once it is at feature parity, if it can be helpful!

@gviejo marked this pull request as draft September 18, 2023 17:52
@gviejo (Contributor) commented Sep 18, 2023

Apparently it's not possible to convert a PR to an issue.
In any case, the branch new_core contains the new core of pynapple. The biggest changes are in pynapple/core/time_series.py.

Feel free to test it. I still need to update a lot of docs around it.

@vigji (Collaborator, Author) commented Sep 19, 2023

Thanks, great! Docs aside, it is not quite ready yet, right? I still have some test errors, and I do not want to give feedback before it's time!

Minor request: do you think it would be possible to add the option of passing data to time series as lists? It was one of the things I was adding in this PR, and it could be convenient!

@gviejo (Contributor) commented Sep 19, 2023

Which tests failed? I am still working on test_numpy_compatibility.py but everything else should work.

@vigji (Collaborator, Author) commented Sep 19, 2023

Edit: never mind, I was being stupid with fork syncing :)

@vigji (Collaborator, Author) commented Sep 19, 2023

Other feedback to smooth backward compatibility (maybe you are already addressing these for numpy compatibility, though):

  • Apparently .copy() was working before and does not work now (AttributeError: 'Tsd' object has no attribute 'copy'). Maybe something like:

def copy(self):
    # return a new instance of the same class, with copied values and time index
    return self.__class__(d=self.values.copy(), t=self.index.copy())

?

  • I used to concatenate Tsd subclasses with pd.concat(); now with np.concatenate I get TypeError: no implementation found for 'numpy.concatenate' on types that implement __array_function__: [<class 'pynapple.core.time_series.Tsd'>]. Can this be supported?

@gviejo (Contributor) commented Sep 19, 2023

Yes for the copy method. I can add it now.

For pd.concat, it's a good question. My approach was to prevent any operation that would increase the shape or dimensions of the object, so that you don't mess up the time index. You can basically only slice down. But maybe that's too restrictive.
The problem with np.concatenate, as I see it, is that you would have to gather time points that share the same time index and re-sort the whole time index, which might be a costly operation. Can you provide an example of the behavior you would expect?

@vigji (Collaborator, Author) commented Sep 19, 2023

So, the usage I have is super basic: I just have files with time indexes that are consistent with each other (i.e. the first covers 0-20, the second 21-40, etc.) and I want to join them; in my case no reindexing would be needed.

@vigji (Collaborator, Author) commented Sep 21, 2023

Question on the new indexing: would you consider the option of having strings interpreted as column names? That is, being able to do tsd["a"] in addition to tsd.loc["a"], at least for single columns? Or would the parsing become too convoluted? Were there other considerations behind forcing the .loc access?

@gviejo (Contributor) commented Sep 21, 2023

Yes, that is doable.
Also, for np.concatenate, I think it's possible, and the safest thing to do in my opinion is to raise an error if the indexes are not sorted or overlap.
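
A rough sketch of how that rule could look through the __array_function__ protocol (names and details hypothetical, not the actual new_core implementation):

import numpy as np

HANDLED_FUNCTIONS = {}


def implements(np_function):
    """Register a custom implementation for a numpy function."""
    def decorator(func):
        HANDLED_FUNCTIONS[np_function] = func
        return func
    return decorator


class TsdSketch:
    """Minimal stand-in for a Tsd-like container (time index + values)."""

    def __init__(self, t, d):
        self.index = np.asarray(t, dtype=np.float64)
        self.values = np.asarray(d)

    def __array_function__(self, func, types, args, kwargs):
        if func not in HANDLED_FUNCTIONS:
            return NotImplemented
        return HANDLED_FUNCTIONS[func](*args, **kwargs)


@implements(np.concatenate)
def concatenate(seq, axis=0):
    # Only concatenate along time, and only when the resulting time index
    # is strictly increasing: no overlap and no re-sorting needed.
    new_t = np.concatenate([s.index for s in seq])
    if not np.all(np.diff(new_t) > 0):
        raise ValueError("Time indexes overlap or are not sorted.")
    return TsdSketch(new_t, np.concatenate([s.values for s in seq], axis=axis))


a = TsdSketch(t=np.arange(0, 5), d=np.random.rand(5))
b = TsdSketch(t=np.arange(5, 10), d=np.random.rand(5))
print(np.concatenate([a, b]).index)  # strictly increasing time index 0..9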

@vigji (Collaborator, Author) commented Sep 21, 2023

> Yes, that is doable.

Great!

> Also, for np.concatenate, I think it's possible, and the safest thing to do in my opinion is to raise an error if the indexes are not sorted or overlap.

Yep, fully agree!

@vigji (Collaborator, Author) commented Sep 22, 2023

Also, it would be good to have the string indexing support item assignment:

tsd["b"] = some_operation_on(tsd["a"])

@gviejo (Contributor) commented Sep 22, 2023

The branch new_core should contain a working version. I will let it rest for a few days before bumping the version. Please let me know if you run into trouble. I will also close this PR, if that's ok with you.

@gviejo (Contributor) commented Dec 15, 2023

Hi, can I close this pull request? I think most of the features we discussed here have been implemented.

@vigji (Collaborator, Author) commented Dec 15, 2023

Yup!

@gviejo closed this Dec 15, 2023