pandas MultiIndex series "unstack" to scipy sparse array functionality #8048
Comments
Do you have any short, concrete examples of what you're doing? Also let us know why using a SparseSeries doesn't work for it. |
Here is a quick example of going from a scipy.sparse array to a pandas Series and back. Mostly I am thinking of the case where you are data munging, reading in labeled data, and then need to switch to a sparse matrix in order to use something in scipy.sparse (multiplication, sparse SVDs or whatever). I am running this on 0.14.1. Apologies if this functionality is available on the dev branch. I have seen related questions on Stack Overflow, but they all seem to indicate that this is not yet implemented and that no one has given a strong opinion that it should be included either (maybe there is a good reason to avoid this kind of thing).
|
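A minimal sketch of the round trip described above, assuming a 2-D COO matrix whose integer positions double as labels. The variable names are illustrative and are not taken from the original comment.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# A small 2x3 sparse matrix with three stored values.
A = coo_matrix(
    (np.array([3.0, 1.0, 2.0]), (np.array([0, 1, 1]), np.array([2, 0, 1]))),
    shape=(2, 3),
)

# scipy -> pandas: one Series entry per stored value, indexed by (row, col).
index = pd.MultiIndex.from_arrays([A.row, A.col], names=["row", "col"])
s = pd.Series(A.data, index=index)

# pandas -> scipy: the integer index levels are the coordinates again.
rows = s.index.get_level_values("row").values
cols = s.index.get_level_values("col").values
B = coo_matrix((s.values, (rows, cols)), shape=A.shape)
```

Only the stored values travel through the Series, so nothing is densified along the way. The shape has to be carried separately, because trailing empty rows or columns leave no trace in the index.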
Thanks for the example. I agree with your intuition that |
@cottrell This would definitely be an improvement. See the related issue I linked to. I think several of the sparse routines could easily have an option to return a scipy coo type (or another sparse matrix type). It should be straightforward from a pandas sparse structure (which is very similar to the coo type). |
Here's an idea of the internal structure.
|
Cool. I am not quite clear on whether the pandas.sparse structure should be the attachment point of some sort of unstack_to_scipy_sparse (this is a bad name) routine or not. Which do you think would make more sense (consider Series only for simplicity):
Basically, I think unstacking to scipy.sparse can happen with or without pandas.sparse storage. |
I think it should go from a sparse structure. If you have a Series, it is dense by definition (whether it is unstacked from something else or not). The key is that you can efficiently translate a pandas sparse structure to scipy without densifying. Anything else doesn't make much sense. |
@cottrell Think of |
Now that v0.15.2 has been released, has this feature been implemented? |
@byronyi I don't think so -- that's why this issue is still open! :) |
As a suggestion, you can simply do it like this:
from scipy.sparse import coo_matrix
# tuple(...) materializes the (rows, cols) pair so it also works where zip returns an iterator
coo = coo_matrix((series.values, tuple(zip(*series.index.values))))
This assumes the series has a two-level integer MultiIndex. When the index is not integer, some kind of mapping from labels to positions is necessary. |
I think that implementing this will be kind of tricky when the index is not numerical (or does not even start from 0). Nevertheless, we can still put it in the documentation on the Sparse page so people don't have to keep asking for answers. |
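A minimal sketch of the label-to-position mapping mentioned above, using `pd.factorize` on each level of a two-level MultiIndex. The example data and variable names are made up for illustration, and this is one possible approach rather than an agreed design.

```python
import pandas as pd
from scipy.sparse import coo_matrix

# A Series with a two-level, non-integer MultiIndex (illustrative data).
s = pd.Series(
    [3.0, 1.0, 2.0],
    index=pd.MultiIndex.from_tuples(
        [("x", "b"), ("y", "a"), ("y", "c")], names=["row", "col"]
    ),
)

# factorize maps every label to an integer position and returns the unique
# labels; sort=True makes the positions follow the sorted label order.
row_codes, row_labels = pd.factorize(s.index.get_level_values("row"), sort=True)
col_codes, col_labels = pd.factorize(s.index.get_level_values("col"), sort=True)

coo = coo_matrix(
    (s.values, (row_codes, col_codes)),
    shape=(len(row_labels), len(col_labels)),
)
```

`row_labels` and `col_labels` have to be kept next to the matrix, since `coo` itself only knows integer positions.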
@byronyi doc pull-requests are welcome! sparse is kind of a neglected step-child ATM. need some interest from contributors! will help you along. lmk. |
I have come back to this a few times and have yet to settle on what exactly is the right feature to implement (and haven't had much time to really play around unfortunately). One thing I should point out is that, as far as I know, there is no such thing as an n-dim sparse scipy array so I was having a false memory when I first wrote the comment above. I think there might be two separate features here?
I think 2 is simple and I will hopefully try this soon, but I am wondering if there is a simple method of handling the stack/unstack within the sparse framework. The only thing I could see quickly is to use the sparse constructor, but I am guessing that is not the right way to go. |
I hacked something together to demo how a SparseSeries.to_coo might behave (i.e. point 2 above but for Series). If this looks like it is moving in the right direction, let me know and I can try to take this a little further. https://gist.github.com/cottrell/a17fa777afd2cc4a7289 |
@cottrell For your feature (1), do you really mean "SparseDataFrame.stack/unstack -> SparseDataFrame" or should that last SparseDataFrame be a SparsePanel? If so, I understand and agree with your two features. It would indeed be nice to have an n-dimensional sparse scipy array -- too bad that doesn't exist! It would be an interesting side project to make that. Your gist looks like roughly the right direction to me -- though we'll want to break that |
@cottrell your approach looks reasonable. Having a pls do a PR and we can have a look at the impl. Other things that will be necessary:
Further, related to #4343, I think it would be straightforward to have the This would be especially useful for testing. You can implement this with a |
@shoyer Re: "that last SparseDataFrame be a SparsePanel" ... Does stack/unstack (on non-sparse DataFrames) ever take you to Panels? I only ever use stack/unstack to reshape values of DataFrame and modify the (MultiIndex) columns and indices. |
@cottrell I was confused. You are correct. |
I've created a PR as requested. #9076 There is still some work to do (haven't updated docs yet, for example) but it would be good to get some feedback. Also, am having trouble with Travis CI failures as of this weekend. Even 15.2 appears to be failing now. |
closed by #9076 |
Does anyone object to Discuss in #15634. |
I haven't read through the new SparseDataFrame api yet but I think the main convenience with the _coo methods was for the to_coo in the case len(row_levels)>1 or len(column_level)>1 where you need to effectively turn an index of tuples into a single index and get at the codes. There are probably better ways to do this by directly accessing the internals (index labels and hashing tricks on arrays). For my understanding, is there a replacement function that passes the tests for to_coo and only uses the SparseDataFrame api or would this be a bit of a feature drop (probably fine if no one else is using this stuff)? |
@cottrell sorry for the delay. Scipy sparse only supports 2d matrices, so with a multi-level indexed series, one would first transform/unstack into a 2d sparse dataframe and then call
# Old .to_coo()
>>> ss = pd.SparseSeries(
... [3.0, 1.0, 3.0],
... index=pd.MultiIndex.from_tuples([(1, 2, 'a', 0),
... (1, 1, 'b', 0),
... (1, 1, 'b', 1)],
... names=['A', 'B', 'C', 'D']))
>>> ss
A B C D
1 2 a 0 3.0
1 b 0 1.0
1 3.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([3], dtype=int32)
>>> A, rows, columns = ss.to_coo(row_levels=['A', 'B'],
... column_levels=['C', 'D'],
... sort_labels=True)
>>> A.toarray()
array([[ 0., 1., 3.],
[ 3., 0., 0.]])
>>> rows
[(1, 1), (1, 2)]
>>> columns
[('a', 0), ('b', 0), ('b', 1)]
# Instead, we could ... (new)
>>> ss2 = ss.copy()
# ... use any means to make a two-level index
>>> ss2.index = pd.MultiIndex.from_tuples([(v[:2], v[2:])
... for v in ss.index.values],
... names=['AB', 'CD'])
>>> ss2
AB CD
(1, 2) (a, 0) 3.0
(1, 1) (b, 0) 1.0
(b, 1) 3.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([3], dtype=int32)
# Would work either way because series unstacks into a sparse data frame ...
>>> sdf = ss2.unstack()
>>> sdf
CD (a, 0) (b, 0) (b, 1)
AB
(1, 1) NaN 1.0 3.0
(1, 2) 3.0 NaN NaN
# ... which has .to_coo()
>>> A = sdf.to_coo()
>>> A
<2x3 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
>>> A.toarray()
array([[ 0., 1., 3.],
[ 3., 0., 0.]])
>>> sdf.index # rows
Index([(1, 1), (1, 2)], dtype='object', name='AB')
>>> sdf.columns.tolist() # columns, as a list
[('a', 0), ('b', 0), ('b', 1)]
Unstacking |
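For readers on a pandas version where `SparseSeries` is no longer available, the sketch below repeats the "old" call above on a regular Series backed by a `SparseArray`, through the `.sparse` accessor. It assumes a pandas release that provides `pd.arrays.SparseArray` and `Series.sparse.to_coo`; the data is the same as in the example above.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(1, 2, "a", 0), (1, 1, "b", 0), (1, 1, "b", 1)],
    names=["A", "B", "C", "D"],
)

# A Series whose values are stored sparsely (fill value NaN by default).
ss = pd.Series(pd.arrays.SparseArray([3.0, 1.0, 3.0]), index=index)

# Group levels A, B into rows and C, D into columns of a scipy COO matrix.
A, rows, columns = ss.sparse.to_coo(
    row_levels=["A", "B"], column_levels=["C", "D"], sort_labels=True
)
dense = A.toarray()
```

`rows` and `columns` carry the tuple labels for each axis, which is the record keeping that the unstack-based variant reads off `sdf.index` and `sdf.columns`.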
Yeah, that functionally seems to cover it. I have a feeling from_tuples is pretty bad performance-wise. It seems like all the usefulness of that feature really comes down to being able to efficiently create codes for groups of levels (merging levels like you show above). After that, it is just record keeping to get the labels. Do you know if there is a more efficient version of something like this somewhere in the pandas code base?
|
@cottrell looks like you need generalized data hashing, try
|
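One way to build codes for groups of levels without going through `from_tuples` is to combine the integer codes that the MultiIndex already stores. The sketch below is a rough illustration only: `group_codes` is a hypothetical helper, not a pandas function, and it assumes a pandas version exposing `MultiIndex.codes` and an index without missing labels (no `-1` codes).

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

index = pd.MultiIndex.from_tuples(
    [(1, 2, "a", 0), (1, 1, "b", 0), (1, 1, "b", 1)],
    names=["A", "B", "C", "D"],
)
s = pd.Series([3.0, 1.0, 3.0], index=index)


def group_codes(index, levels):
    """Hypothetical helper: one integer code per distinct combination of `levels`."""
    nums = [index.names.index(name) for name in levels]
    codes = tuple(np.asarray(index.codes[n], dtype=np.intp) for n in nums)
    sizes = tuple(len(index.levels[n]) for n in nums)
    # Collapse the per-level codes into a single flat integer per entry.
    flat = np.ravel_multi_index(codes, sizes)
    # Compress the flat codes to 0..k-1 and remember one entry per group.
    _, first, inverse = np.unique(flat, return_index=True, return_inverse=True)
    # Record keeping: recover the tuple label of each group from that entry.
    labels = [tuple(index[i][n] for n in nums) for i in first]
    return inverse, labels


row_codes, rows = group_codes(s.index, ["A", "B"])
col_codes, columns = group_codes(s.index, ["C", "D"])
A = coo_matrix((s.values, (row_codes, col_codes)), shape=(len(rows), len(columns)))
```

The per-entry work is integer arithmetic on arrays, and tuples are only built once per distinct group for the labels.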
related #4343
I have found myself on occasion writing code to convert from a pandas Series (with n-level MultiIndex) to a scipy sparse n-dim array (plus dimension labels) and back. I am not terribly familiar with the newer sparse data structure features but my impression is that it provides a totally different kind of functionality. Does this kind of functionality exist somewhere? If not, would something like this be of interest?