Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add Numpy Array Interface support to Pandas objects #8321

Closed
wants to merge 1 commit into from

Conversation

dalejung
Copy link
Contributor

WIP

@dalejung dalejung force-pushed the numpy_array_interface branch from f394229 to c29c051 Compare September 19, 2014 17:18
@jreback
Copy link
Contributor

jreback commented Sep 19, 2014

  • pls run a perf check on this as well as its hits some critical paths.
  • tests!

@dalejung
Copy link
Contributor Author

oops yeah, didn't mean to imply this was done.

I actually threw this up real quick cuz I'm not sure the helper func belongs in common.py. afaik common.py shouldn't know about NDFrame, though maybe it's okay because the import happens within the function.

I need a guard against everything that implements the array interface, but that we explicitly handle. Right now, that appears to be NDFrame and ndarray.

@dalejung
Copy link
Contributor Author

So I found #5698.

I think the confusion lies in that __array__, __array_interface__, __array_struct__ all appear to make objects array-like. AFAIK, __array__ is meant for subclasses and is not the array interface per se, it's just that numpy doesn't appear to check.

I think we're supposed to implement the __array_interface__.

@dalejung dalejung force-pushed the numpy_array_interface branch 2 times, most recently from 25f0087 to c8b7eff Compare September 21, 2014 20:46
@shoyer
Copy link
Member

shoyer commented Sep 22, 2014

I know the numpy docs say that these methods are for ndarray subclasses, but my experience writing a handful of ndarray-like objects (non-subclasses) is that numpy checks for these attributes (duck typing) instead of checking for actual ndarray subclasses. I would consider this a documentation bug for numpy.

__array_interface__ appears to be design for low-level inoperability with numpy.ndarrays, and, as the docs say appears to be largely supplanted by the revised buffer protocol.

So, roughly speaking, these are two different ndarray interfaces that serve different needs. Are you need/want to be checking for the later? Maybe you could describe what exactly you are trying to do with this? :)

@dalejung
Copy link
Contributor Author

@shoyer +1 on the numpy docs being buggy! Yeah, looking through the numpy code it looks like checks for array-likeness in one big function through a series of if statements. I can use __array_interface__ with __array_wrap__ and everything is fine. I should've linked the original issue #8311.

I put the check above our big if block since DataFrame has explicit handling for ndarray. If it was just a no-op like in Series I would've just put it in the else. If I put it in the DataFrame else, I'd have to copypasta replicate the np.ndarray handling.

@jreback
Copy link
Contributor

jreback commented Sep 22, 2014

I think you should simply use __array__, see here: #7270

This would technicaly also allow Python Array classes too

the others are for 'c' checks IIRC (never clear on how this works)

@jreback
Copy link
Contributor

jreback commented Sep 22, 2014

@dalejung and would be nice to change the check in #7270 to integrate with yours (as a single function).

we have is_list_like, add is_array_like

@jreback jreback added API Design Dtype Conversions Unexpected or buggy dtype conversions labels Sep 22, 2014
@dalejung
Copy link
Contributor Author

@jreback yeah, for a long time I've only really known __array__. I only went through the other methods because of my tests breaking with rpy2 compat. I don't think it's rpy2 that's implementing things wrong, just the numpy api has features that overlap.

I'll look through and see how we're using __array__ for checks. Might be worthwhile to just create an ABCNumpyArray or something to replace checking for ndarray and its cousins.

@dalejung
Copy link
Contributor Author

@jreback so I notice that there was an effort to start abstracting away our dependency on numpy. I'm in need of the np.array equivalent inside of core/array.py. It's just a simple pointer but what should it be called? Do we want a pa.array when we have pa.Array?

@shoyer
Copy link
Member

shoyer commented Sep 22, 2014

My 2 cents is that it's not worth much effort to abstract away our dependency on numpy until there are any even remotely plausible alternatives in sight. Until then it's very hard to say even exactly what we will need.

@jreback
Copy link
Contributor

jreback commented Sep 22, 2014

@dalejung their was some work be @wesm to abstract the importing of numpy. That is mostly boilerplate (and should be removed, I'll make a separate issue for that).

@dalejung dalejung force-pushed the numpy_array_interface branch 4 times, most recently from 777a458 to ea11f59 Compare September 24, 2014 05:08
TST: test handling Numpy Array Interface, also explicitly test handling
rpy2 objects

Moved logic to is_array_like, added __array__ check

changed pandas check from NDFrame to pandasobject

updated with array_like and fixed rpy2 tests
@dalejung dalejung force-pushed the numpy_array_interface branch from ea11f59 to db00a07 Compare September 24, 2014 05:09
@dalejung
Copy link
Contributor Author

Here's a gist with 3 vbench runs. https://gist.github.com/dalejung/cdf7e3f059f0f80f6486

nothing really sticks out to me, it kind of jumps around with different tests showing up as slow.

----
Remember that ndarrays and NDFrames are array-like.
"""
# numpy ndarray subclass api
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps this function could just be:

array_like_attrs = ['__array__', '__array_interface__', '__array_struct__']
return any(hasattr(obj, attr) for attr in array_like_attrs)

I don't think you need to check the types of the attributes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's only appropriate to check for 'array' the others are not necessarily and c-level anyhow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that is true. PIL exposes its data to numpy via __array_interface__ and does not have a __array__. rpy2 implements __array_struct__ and not __array__. Looking around github there are projects that only implement __array_interface__.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nobody writes ndarray-like objects to spec :).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tell me about it. Problem is, it's hard to detangle should work vs does work. At this point, they are probably the same thing since there's enough code out there depending on current behaviors. :/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is too complicated. isn't this:

return any([ hasattr(obj, attr) for attr in ['__array__','__array_interface__','__array_true'] ])

equivalent / simplier / faster? (this is going to be called a lot, pls show a perf check as well)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback look at my first comment in this thread!

also, you definitely want to do the generator compression any(...) rather than the list compression any([...]) (the later would will always do every check).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shoyer I did see that but forgot :)

yes, @dalejung I think that expression is the correct one (with generators)

@jreback
Copy link
Contributor

jreback commented Sep 26, 2014

@dalejung I think to make things consistent, can you also _is_sequence -> is_sequence_like in the code base..thxs

if you happend to document this as well would be gr8!

@jreback
Copy link
Contributor

jreback commented Jan 18, 2015

@dalejung can you revisit. I like some of the things going on here, but this devolved a bit.

@jreback
Copy link
Contributor

jreback commented May 9, 2015

closing pls reopen if/when updated

@jreback jreback closed this May 9, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants