[ArrowStringArray] API: StringDtype parameterized by storage (python or pyarrow) #39908
Thanks for getting this started again!
if storage is None:
    storage = get_option("mode.string_storage")
We might want to delay looking up this option and allow the StringDtype to be initialized with storage=None. That could e.g. allow doing s.astype("string") while preserving the original dtype if the series already is of string dtype (regardless of the default).
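A minimal sketch of the deferred lookup suggested here, written against a stand-in class rather than pandas itself. `LazyStringDtype`, `OPTIONS` and `astype_string` are hypothetical names used only for illustration; only `storage=None` and the `mode.string_storage` option come from the discussion.

```python
from typing import Optional

# Stand-in for the pandas option registry (hypothetical, for illustration only).
OPTIONS = {"mode.string_storage": "python"}


class LazyStringDtype:
    """Toy dtype that keeps storage=None until a concrete storage is needed."""

    def __init__(self, storage: Optional[str] = None) -> None:
        if storage not in (None, "python", "pyarrow"):
            raise ValueError(f"unknown storage: {storage!r}")
        self._storage = storage  # None means "not decided yet"

    def resolve(self) -> str:
        # The global option is consulted only here, not in __init__.
        return self._storage or OPTIONS["mode.string_storage"]


def astype_string(current: LazyStringDtype, target: LazyStringDtype) -> LazyStringDtype:
    """Return the dtype that results from casting `current` data to `target`."""
    if target._storage is None:
        # Unparameterised "string": keep whatever storage the data already has.
        return current
    return target


arrow_backed = LazyStringDtype("pyarrow")
print(astype_string(arrow_backed, LazyStringDtype()).resolve())
print(astype_string(arrow_backed, LazyStringDtype("python")).resolve())
```

With this shape, the global default only matters for data that has no string storage yet; an explicit storage on the target always wins.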
The latest commit (wip) started to add a parameterised fixture for more testing prior to implementing this.
Since the dtype change is user facing, this change is worthwhile. There is still more to do, as many tests use dtype="string", but it does potentially make this PR harder to review. I will carry on and do the others if doing this makes sense.
Have added string[pyarrow] to this fixture. This gives some additional failures for outstanding parts of ArrowStringArray. It may be that, to keep this PR more cleanly scoped, we defer adding that to a follow-up.
(I don't think this comment is resolved, so "unresolved" it in the UI)
Sure, can you open an issue explaining the API you'd like? I have updated _from_sequence to accept the string "string", but when writing code, if you do a comparison as a conditional, you expect the storage type to be fixed, and not to then do something different. If the only place where this could be useful is astype, we could put the logic there.
def __eq__(self, other: Any) -> bool:
    if isinstance(other, str) and other == "string":
        return True
    return super().__eq__(other)
What is actually the behaviour here? Are "string[python]" and "string[arrow]" seen as not equal, I suppose?
That's correct. Not sure what the behavior should be if we delay looking up the storage. That could complicate things, so I will leave it for a follow-on once we better understand what we want and when the lookup would be final.
>>> with pd.option_context("string_storage", "python"):
... print(StringDtype() == "string")
...
True
>>> with pd.option_context("string_storage", "pyarrow"):
... print(StringDtype() == "string")
...
True
>>> with pd.option_context("string_storage", "python"):
... print(StringDtype() == "string[python]")
...
True
>>> with pd.option_context("string_storage", "python"):
... print(StringDtype() == "string[pyarrow]")
...
False
>>> with pd.option_context("string_storage", "pyarrow"):
... print(StringDtype() == "string[python]")
...
False
>>> with pd.option_context("string_storage", "pyarrow"):
... print(StringDtype() == "string[pyarrow]")
...
True
>>> StringDtype(storage="python") == "string"
True
>>>
>>> StringDtype(storage="pyarrow") == "string"
True
>>>
>>> StringDtype(storage="python") == StringDtype(storage="python")
True
>>>
>>> StringDtype(storage="pyarrow") == StringDtype(storage="pyarrow")
True
>>>
>>> StringDtype(storage="python") == StringDtype(storage="pyarrow")
False
>>>
>>> StringDtype(storage="pyarrow") == "string[pyarrow]"
True
>>>
>>> StringDtype(storage="pyarrow") == "string[python]"
False
i think this is ok, as we consider dtypes to be exactly equal. later on maybe we can allow equivalent tests (possibly open an issue about this)
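The comparison rules shown in the session above can be sketched with a small stand-in class. This is an illustration under assumptions, not the pandas implementation: `ParamStringDtype` is a hypothetical name, and the hash shown here is just one choice that stays consistent between dtype objects.

```python
from typing import Any


class ParamStringDtype:
    """Toy dtype parameterised by storage (hypothetical stand-in, not pandas)."""

    def __init__(self, storage: str) -> None:
        self.storage = storage

    @property
    def name(self) -> str:
        return f"string[{self.storage}]"

    def __eq__(self, other: Any) -> bool:
        if isinstance(other, str):
            # The bare alias matches any storage; the full name must match exactly.
            return other == "string" or other == self.name
        if isinstance(other, ParamStringDtype):
            return self.storage == other.storage
        return False

    def __hash__(self) -> int:
        # Defining __eq__ disables the inherited __hash__, so define one explicitly.
        return hash((type(self), self.storage))


py_dtype = ParamStringDtype("python")
arrow_dtype = ParamStringDtype("pyarrow")
print(py_dtype == "string", arrow_dtype == "string")
print(py_dtype == arrow_dtype)
print(arrow_dtype == "string[pyarrow]", arrow_dtype == "string[python]")
```

Note that equality against the bare alias is deliberately looser than equality between two dtype objects, which is the asymmetry the reviewers were asking about.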
pandas/core/arrays/string_.py
Outdated
  def __from_arrow__(
      self, array: Union[pyarrow.Array, pyarrow.ChunkedArray]
- ) -> StringArray:
+ ) -> ArrowStringArray:
This shouldn't change to always ArrowStringArray, but should still depend on what self.storage is (even when the data is coming from arrow, we still want to create a python objects-backed string array if that's the default, I think).
this should be done.
>>> arr
<pyarrow.lib.ChunkedArray object at 0x7f45ffd5e090>
[
[
"a",
"b",
"c"
]
]
>>>
>>> StringDtype(storage="python").__from_arrow__(arr)
<StringArray>
['a', 'b', 'c']
Length: 3, dtype: string[python]
>>>
>>> StringDtype(storage="pyarrow").__from_arrow__(arr)
<ArrowStringArray>
['a', 'b', 'c']
Length: 3, dtype: string[pyarrow]
>>>
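The dispatch on `self.storage` requested in this thread can be sketched without pyarrow by letting nested lists stand in for a chunked array. All class and method names below (`PyStringArray`, `ArrowBackedStringArray`, `SketchStringDtype`, `from_chunks`) are hypothetical; the real method is `__from_arrow__` and takes a pyarrow array.

```python
from typing import List, Sequence, Union


class PyStringArray:
    """Stand-in for an object-backed string array (hypothetical)."""

    def __init__(self, values: Sequence[str]) -> None:
        self.values: List[str] = list(values)


class ArrowBackedStringArray:
    """Stand-in for a pyarrow-backed string array (hypothetical)."""

    def __init__(self, chunks: Sequence[Sequence[str]]) -> None:
        self.chunks: List[List[str]] = [list(chunk) for chunk in chunks]


class SketchStringDtype:
    def __init__(self, storage: str) -> None:
        self.storage = storage

    def from_chunks(
        self, chunks: Sequence[Sequence[str]]
    ) -> Union[PyStringArray, ArrowBackedStringArray]:
        """Build the array type selected by self.storage, not by the input type."""
        if self.storage == "pyarrow":
            # Keep the chunked layout as-is for arrow-backed storage.
            return ArrowBackedStringArray(chunks)
        # Python storage: flatten the chunks into one list of str objects.
        return PyStringArray([value for chunk in chunks for value in chunk])


chunked = [["a", "b"], ["c"]]
print(type(SketchStringDtype("python").from_chunks(chunked)).__name__)
print(type(SketchStringDtype("pyarrow").from_chunks(chunked)).__name__)
```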
pandas/core/arrays/string_arrow.py
Outdated
try:
    import pyarrow as pa
except ImportError:
    pa = None
We indeed don't need to lazily import pyarrow anymore for the sake of the dtype object. But if we want to expose the ArrowStringArray in pandas.core.arrays like all other arrays, we are still going to need to keep this behind an ImportError.
(not sure if we want to expose ArrowStringArray, but until that's decided I would maybe leave those imports as is)
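One way to keep a module importable without pyarrow while still failing clearly when the arrow-backed array is actually used is sketched below. `optional_import` and `require_pyarrow` are hypothetical helper names, not pandas functions; only the try/except ImportError pattern comes from the diff.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the module if it can be imported, otherwise None."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Evaluated once at import time; stays None when pyarrow is not installed.
pa = optional_import("pyarrow")


def require_pyarrow() -> ModuleType:
    """Call this at the point of use, e.g. in the array constructor."""
    if pa is None:
        raise ImportError("pyarrow is required for string[pyarrow] storage")
    return pa


print("pyarrow available:", pa is not None)
```

The module-level import never raises, so the array class can be exposed alongside the others; the error is deferred to the first call that needs pyarrow.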
pandas/core/arrays/string_.py
Outdated
# custom __eq__ so have to override __hash__
return super().__hash__()

# TODO: this is a classmethod, but we need to know the storage type.
@simonjayhawkins do you have time to further update this PR?
I really need to walk away from typing activity for a few weeks, it's a time sink (and depressing). We should discuss in the dev meeting later today.
@simonjayhawkins a friendly ping here
Thanks @jorisvandenbossche. Merged master and pushed changes. Not ready for review; these failures still need fixing...
Looking at the error messages, you need to update the
@jorisvandenbossche The failures in #39908 (comment) are fixed. We now have these failures...
I need to make a start on the str accessor (separate PR), so will do that before investigating the first three errors. The last 2 will be either xfailed or disappear for now if we remove EDIT: removed fail that I always get locally that is not related.
Have opened #40679 as a precursor to remove these changes from this PR.
I'm not proposing delaying 1.3. We could go ahead without this.
It's an implementation detail. We should be able to convert between StringArray and ArrowStringArray and vice versa without data loss.
The storage could be an attribute of StringArray and an argument of the constructor. The array constructor is the only method where StringArray and ArrowStringArray differ; these could be unified with this argument.
I would merge this as is. We close off options if it's changed to be a single type. It's multiple types: the storage is not an implementation detail unless they are exactly the same, and they are not. Maybe in the future, but not now.
@jorisvandenbossche questioned whether Maybe a DataFrame/Series holding string data in pyarrow format should compare equal to another holding exactly the same data using the python memory model. Where we may want to consider multiple types for storage is ascii/unicode, as conversions between these would not be lossless.
Yes, this seems to be a working implementation. @jorisvandenbossche does seem to have some outstanding concerns, which I think the proposed alternative could address.
I'll mock up a POC in the next few days anyway. You have a good point about needing to use the global option to create pyarrow backed string arrays, and I can use the POC to look into this further (i.e. maybe we could pass kwargs from pd.array onto the array constructor and add a kwarg to _from_sequence).
A couple of outstanding points: #39908 (comment) and #39908 (comment). These could maybe be a follow-up.
I don't have a good idea whether composition would be nice, but just to point out: All to say: I don't think we need to consider leaving this out for the 1.3 release because of this open question. The exact array class is mostly an implementation detail we can still change later.
Even if we would have a single public StringArray through composition, wouldn't we keep the parametrization of the dtype? We still need to store the information about which storage to use somewhere, and we still need to give the user some control over it if they want to specify it explicitly. A parameter on the dtype seems fine for this.
I think my main open comment is about the potential "deferred" storage mode for
I think user control over the storage used is the sticking point if it's not on the dtype. I have updated #40962 and it works well apart from the user control. TBH I prefer that implementation, as basically the storage is controlled by the global option. I think it better represents the original intent of the StringArray. Maybe a third storage option, "auto", could use pyarrow backed storage if pyarrow > 1.0.0 is installed.
That's something that we can still add later? So no need to consider that now?
Sure.
Bringing the "deferred" storage mode lookup for
Currently, doing
>>> pd.StringDtype().storage
'python'
which also means that
>>> pd.api.types.pandas_dtype("string")
string[python]
As a consequence, doing
>>> s = pd.Series(['a', 'b'], dtype=pd.StringDtype(storage="pyarrow"))
>>> s.dtype
string[pyarrow]
>>> s.astype("string").dtype
string[python]
While I think it could make sense for We do something similar for CategoricalDtype ( We could still have the
@simonjayhawkins let's open an issue about @jorisvandenbossche's comment #39908 (comment). I agree we shouldn't astype something which is already a 'string' dtype even if its storage is different (though we may need to work through this case). If no objections, I think let's merge this.
I agree this makes sense; the question is just how to do it. (In #40962, the dtype check for astype is in the StringArray before dispatching to ObjectStringArray or PythonStringArray to achieve this, although there is no parameterisation of the dtype, so it is not a like-for-like comparison.) Will copy @jorisvandenbossche's comment to a new issue.
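If the check lives in astype rather than in the dtype, the decision could look roughly like the sketch below. This is an assumption about one possible shape, not the code in either PR: `astype_storage` and `OPTIONS` are hypothetical, and `current=None` stands for data that is not yet of string dtype.

```python
from typing import Optional

# Stand-in for the pandas option registry (hypothetical, for illustration only).
OPTIONS = {"mode.string_storage": "python"}


def astype_storage(current: Optional[str], requested: str) -> str:
    """Pick the storage of the result of astype(requested).

    `current` is the storage of the existing data, or None for non-string data.
    """
    if requested == "string":
        # Already a string dtype: keep its storage; otherwise use the global default.
        return current if current is not None else OPTIONS["mode.string_storage"]
    if requested.startswith("string[") and requested.endswith("]"):
        storage = requested[len("string["):-1]
        if storage in ("python", "pyarrow"):
            # An explicit storage always wins over the current one.
            return storage
    raise TypeError(f"not a string dtype alias: {requested!r}")


print(astype_storage("pyarrow", "string"))
print(astype_storage(None, "string"))
print(astype_storage("pyarrow", "string[python]"))
```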
No objections. Without the storage on the dtype it's very difficult for the user to control. And internally, where arrays are reconstructed, we use the dtype, so a pyarrow backed string array could easily become an object backed string array (e.g. in the str accessor methods that don't yet dispatch). If user control was not an issue and we only had one of the two dtypes active at any time through the global option, I'd advocate for a single dtype. The implementation detail would be hidden: pyarrow backed string arrays and object backed string arrays could compare equal, always roundtrip using this definition, and _concat_same_type would treat them as the same type. More importantly, we would not make ArrowStringArray public, but this may still be achievable with the parameterized dtype. Will update (merge master) #40962 after this is merged and see how well it works with the parameterized dtype.
@jorisvandenbossche OK to merge?
The failures seem unrelated? (codecov upload failure or so)
Yes, there are still some follow-ups to do, but I think this is more than good enough for an initial PR with the parametrized dtype.
Thanks a lot @simonjayhawkins (and @xhochy and @TomAugspurger for the precursors)!
Thanks Simon!
On Tue, Jun 8, 2021 at 8:37 AM Joris Van den Bossche wrote:
> Thanks a lot @simonjayhawkins (and @xhochy and @TomAugspurger for the precursors)!
@simonjayhawkins one other aspect I just realized isn't implemented here yet, is
While this at least should preserve the string dtype, giving priority to one of the two (which one is something to decide, or can depend on which is passed first).
Yep. I raise NotImplementedError in #40962 if mixed storage types are passed to _concat_same_type, since I had the same question. I think the two options are as you mention (whichever is passed first), but I think it may be better to use the global storage for the combined result. There's also the option of perhaps using the more common type, but that's probably additional complexity without adding any value to the user. The first two are more deterministic.
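The candidate policies for concatenating arrays with mixed storage can be compared side by side in a small sketch. `concat_storage`, its `policy` argument and `OPTIONS` are hypothetical names for illustration; none of this is the behaviour of `_concat_same_type` in either PR.

```python
from typing import Sequence

# Stand-in for the pandas option registry (hypothetical, for illustration only).
OPTIONS = {"mode.string_storage": "python"}


def concat_storage(storages: Sequence[str], policy: str = "first") -> str:
    """Choose the storage of the result when concatenating string arrays."""
    if not storages:
        raise ValueError("need at least one array")
    if len(set(storages)) == 1:
        # No conflict: every input already uses the same storage.
        return storages[0]
    if policy == "first":
        # Priority goes to whichever array was passed first.
        return storages[0]
    if policy == "global":
        # Ignore the inputs and follow the global option.
        return OPTIONS["mode.string_storage"]
    if policy == "raise":
        raise NotImplementedError("mixed storage types are not supported")
    raise ValueError(f"unknown policy: {policy!r}")


mixed = ["pyarrow", "python"]
print(concat_storage(mixed, policy="first"))
print(concat_storage(mixed, policy="global"))
```

Both the "first" and "global" policies give a result that can be predicted from the call alone, which matches the point above about preferring deterministic options.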
…or pyarrow) (pandas-dev#39908) Co-authored-by: Uwe L. Korn <[email protected]> Co-authored-by: Tom Augspurger <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]>
continuation of #36142
todo:
- pandas/core/arrays/string_arrow.py. We added since ArrowStringDtype should always be available. I think this is no longer the case and we want to move the import checks to StringDtype. I expect envs without pyarrow to fail for now.
- test_arrow_roundtrip is failing. StringDtype.__from_arrow__ needs updating.
  AssertionError: assert 'unknown-array' == 'string'