Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ArrowStringArray] API: StringDtype parameterized by storage (python or pyarrow) #39908

Merged

Conversation

simonjayhawkins
Copy link
Member

@simonjayhawkins simonjayhawkins commented Feb 19, 2021

continuation of #36142

todo:

  • during the merge I removed the import gymnastics from pandas/core/arrays/string_arrow.py. We added since ArrowStringDtype should always be available. I think this is no longer the case and we want to move the import checks to StringDtype. I expect envs without pyarrow to fail for now.
  • test_arrow_roundtrip is failing. StringDtype.__from_arrow__ needs updating.
  • address @jorisvandenbossche comments Arrow string array dtype #36142 (comment) and Arrow string array dtype #36142 (comment)
  • some other test failures AssertionError: assert 'unknown-array' == 'string'

@simonjayhawkins simonjayhawkins added the Strings String extension data type and string data label Feb 19, 2021
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for getting this started again!

Comment on lines +92 to +93
if storage is None:
storage = get_option("mode.string_storage")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to delay looking up this option and allow the StringDtype to be initialized with storage=None. That could eg allow doing s.astype("string") preserving the original dtype if the series already is of string dtype (regardless of the default).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest commit (wip) started to add a parameterised fixture for more testing prior to implementing this.

since the dtype change is user facing this change worthwhile, there is still more to do as many tests use dtype="string" but does potentially make this PR harder to review. I will carry on and do the others if doing this makes sense.

Have added string[pyarrow] to this fixture. This gives some additional failures for outstanding parts of ArrowStringArray. It maybe that to keep this PR more cleanly scoped, that we defer adding that for a follow-up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I don't think this comment is resolved, so "unresolved" it in the UI)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure can you open an issue explaining the api you'd like. I have updated _from_seqenece to accept the string "string", but if you are writing code if you do a comparison as a conditional, you expect the storage type to be fixed. and not then do something different. If the only places where this could be usefull is astype, we could put the logic there.

def __eq__(self, other: Any) -> bool:
if isinstance(other, str) and other == "string":
return True
return super().__eq__(other)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is actually the behaviour here? Are "string[python]" and "string[arrow]" seen as not equal, I suppose?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. Not sure what the behavior should be if we delay looking up the storage. that could complicate things so will leave for a follow-on after better understanding of what we want, and when the lookup would be final.

>>> with pd.option_context("string_storage", "python"):
...     print(StringDtype() == "string")
... 
True
>>> with pd.option_context("string_storage", "pyarrow"):
...     print(StringDtype() == "string")
... 
True
>>> with pd.option_context("string_storage", "python"):
...     print(StringDtype() == "string[python]")
... 
True
>>> with pd.option_context("string_storage", "python"):
...     print(StringDtype() == "string[pyarrow]")
... 
False
>>> with pd.option_context("string_storage", "pyarrow"):
...     print(StringDtype() == "string[python]")
... 
False
>>> with pd.option_context("string_storage", "pyarrow"):
...     print(StringDtype() == "string[pyarrow]")
... 
True
>>> StringDtype(storage="python") == "string"
True
>>> 
>>> StringDtype(storage="pyarrow") == "string"
True
>>> 
>>> StringDtype(storage="python") == StringDtype(storage="python")
True
>>> 
>>> StringDtype(storage="pyarrow") == StringDtype(storage="pyarrow")
True
>>> 
>>> StringDtype(storage="python") == StringDtype(storage="pyarrow")
False
>>> 
>>> StringDtype(storage="pyarrow") == "string[pyarrow]"
True
>>> 
>>> StringDtype(storage="pyarrow") == "string[python]"
False

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this is ok, as we consider dtypes to be exactly equal. later on maybe we can allow equivalent tests (possibly open an issue about this)


def __from_arrow__(
self, array: Union[pyarrow.Array, pyarrow.ChunkedArray]
) -> StringArray:
) -> ArrowStringArray:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't change to always ArrowStringArray, but still depend on what self.storage is (even when the data is coming from arrow, we still want to create python objects-backed string array if that's the default, I think)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be done.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>> arr
<pyarrow.lib.ChunkedArray object at 0x7f45ffd5e090>
[
  [
    "a",
    "b",
    "c"
  ]
]
>>> 
>>> StringDtype(storage="python").__from_arrow__(arr)
<StringArray>
['a', 'b', 'c']
Length: 3, dtype: string[python]
>>> 
>>> StringDtype(storage="pyarrow").__from_arrow__(arr)
<ArrowStringArray>
['a', 'b', 'c']
Length: 3, dtype: string[pyarrow]
>>> 

try:
import pyarrow as pa
except ImportError:
pa = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We indeed don't need to lazily import pyarrow anymore for the sake of the dtype object. But if we want to expose the ArrowStringArray in pandas.core.arrays like all other arrays, we are still going to need to keep this behind an ImportError ..

(not sure if we want to expose ArrowStringArray, but until that's decided I would maybe leave those imports as is)

# custom __eq__ so have to override __hash__
return super().__hash__()

# TODO: this is a classmethod, but we need to know the storage type.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue about this is #36126. Tom had a PR #36136 with some discussion about it. Would need to look back, but I think the conclusion was indeed that StringDtype can just make this a normal method for now.

@jorisvandenbossche
Copy link
Member

@simonjayhawkins do you have time to further update this PR?

@simonjayhawkins
Copy link
Member Author

@simonjayhawkins do you have time to further update this PR?

I really need to walk away from typing activity for a few weeks, it's a time sink (and depressing.) we should discuss in the dev-meeting later today.

@jorisvandenbossche
Copy link
Member

@simonjayhawkins a friendly ping here

@simonjayhawkins
Copy link
Member Author

@simonjayhawkins a friendly ping here

Thanks @jorisvandenbossche

merge master and pushed changes. not ready for review.

these failures to fix...

=================================================== short test summary info ===================================================
FAILED pandas/tests/arrays/string_/test_string.py::test_arrow_roundtrip[pyarrow] - AttributeError: 'StringArray' object has ...
FAILED pandas/tests/arrays/string_/test_string.py::test_arrow_roundtrip[python] - AttributeError: 'StringArray' object has n...
FAILED pandas/tests/dtypes/test_inference.py::TestTypeInference::test_string_dtype[data0-False-Series] - AssertionError: ass...
FAILED pandas/tests/dtypes/test_inference.py::TestTypeInference::test_string_dtype[data1-False-Series] - AssertionError: ass...
FAILED pandas/tests/dtypes/test_inference.py::TestTypeInference::test_string_dtype[data0-True-array] - AssertionError: asser...
FAILED pandas/tests/dtypes/test_inference.py::TestTypeInference::test_string_dtype[data0-False-array] - AssertionError: asse...
FAILED pandas/tests/dtypes/test_inference.py::TestTypeInference::test_string_dtype[data1-True-array] - AssertionError: asser...
FAILED pandas/tests/dtypes/test_inference.py::TestTypeInference::test_string_dtype[data1-False-array] - AssertionError: asse...
FAILED pandas/tests/dtypes/test_inference.py::TestTypeInference::test_string_dtype[data1-True-Series] - AssertionError: asse...
FAILED pandas/tests/dtypes/test_inference.py::TestTypeInference::test_string_dtype[data0-True-Series] - AssertionError: asse...
FAILED pandas/tests/io/test_parquet.py::TestParquetPyArrow::test_additional_extension_arrays - AttributeError: 'StringArray'...
FAILED pandas/tests/io/test_parquet.py::TestParquetPyArrow::test_use_nullable_dtypes - AttributeError: 'StringArray' object ...
============== 13 failed, 157195 passed, 3567 skipped, 992 xfailed, 6 xpassed, 219 warnings in 465.64s (0:07:45) ==============

@jorisvandenbossche
Copy link
Member

Looking at the error messages, you need to update the __from_arrow__ method for StringDtype. It hardcodes the use of StringArray._concat_same_type, but the class should depend on the dtype's backend

@simonjayhawkins
Copy link
Member Author

simonjayhawkins commented Mar 28, 2021

@jorisvandenbossche The fails in #39908 (comment) are fixed.

we now have these failures...

=================================================== short test summary info ===================================================
FAILED pandas/tests/arrays/string_/test_string.py::test_string_methods[string-input0-method0] - AssertionError: assert 'stri...
FAILED pandas/tests/arrays/string_/test_string.py::test_string_methods[string-input1-method1] - AssertionError: assert 'stri...
FAILED pandas/tests/arrays/string_/test_string.py::test_string_methods[string-input2-method2] - AssertionError: assert 'stri...
FAILED pandas/tests/extension/test_floating.py::TestCasting::test_astype_string[string[pyarrow]-Float32Dtype] - AssertionErr...
FAILED pandas/tests/extension/test_string.py::TestCasting::test_astype_string[string[pyarrow]-python-False] - TypeError: Can...
================================= 6 failed, 25 passed, 29311 deselected, 2 warnings in 6.82s ==================================

i need to make a start on the str accessor (separate PR). so will do that before investigating the first three errors

the last 2 will be either xfailed or disappear for now if we remove string[pyarrow] from the fixture and add in a follow-on instead. #39908 (comment)

EDIT: removed fail that I always get locally that is not related.

@simonjayhawkins simonjayhawkins changed the title Arrow string array dtype API: [ArrowStringArray] StringDtype parameterized by storage (python or pyarrow) Mar 29, 2021
@simonjayhawkins
Copy link
Member Author

the last 2 will be either xfailed or disappear for now if we remove string[pyarrow] from the fixture and add in a follow-on instead. #39908 (comment)

have opened #40679 as a pre-cursor to remove these changes from this PR

@simonjayhawkins
Copy link
Member Author

-1000 on delaying 1.3

I'm not proposing delaying 1.3. we could go ahead without this.

i disagree the storage is certainly part of the type

its an implementation detail. should be able to convert between StringArray and ArrowSringArray and vice-versa without data loss.

eg if we don't expose these then it's always a global which is a bad idea

the storage could be attribute of StringArray and storage an argument of the constructor. The array constructors is the only method where the StringArray and ArrowStringArray differ, these could be unified with this argument.

@jreback
Copy link
Contributor

jreback commented Jun 5, 2021

i would merge this as is

we close options if it's changed to be a single type

it's multiple types - the storage is not an implementation detail unless they r exactly the same and they r not

maybe in the future but not now

@simonjayhawkins
Copy link
Member Author

it's multiple types - the storage is not an implementation detail unless they r exactly the same and they r not

@jorisvandenbossche questioned whether string[python] == string[pyarrow] #39908 (comment)

Maybe a DataFrame/Series holding string data in pyarrow format should compare equal to another holding exactly the same data using the python memory model.

where we may want to consider mutliple types for storage is ascii/unicode as conversions between these would not be lossless.

i would merge this as is

yes. this seems to be a working implementation. @jorisvandenbossche does seem to have some outstanding concerns which I think the proposed alternative could address.

maybe in the future but not now

i'll mock up a POC in the next few days anyway. You have a good point about needing to use the global option to create pyarrow backed string arrays and can use the POC to look into this further (i.e. maybe we could pass kwargs from pd.array onto the array constructor and add a kwarg to _from_sequence)

@simonjayhawkins
Copy link
Member Author

i would merge this as is

a couple of outstanding points #39908 (comment) and #39908 (comment)

these could maybe a follow-up

@simonjayhawkins simonjayhawkins marked this pull request as ready for review June 5, 2021 14:28
@simonjayhawkins simonjayhawkins added this to the 1.3 milestone Jun 5, 2021
@jorisvandenbossche
Copy link
Member

I think we should maybe consider use composition in StringArray using ObjectStringArray and ArrowStringArray so that the user facing array type is always StringArray. The implementation detail could perhaps be hidden from the user completely. ... This would be a major 11th hour change so we may need to defer from 1.3

I don't have a good idea whether composition would be nice, but just to point out: ArrowStringArray is explicitly labeled as "experimental", so I don't think we need to worry about this right now for 1.3. We can still change it afterwards if that turns out to be a better option.
(all the methods on ArrowStringArray are consistent with StringArray, so that wouldn't change anything. The only "public" API entrypoint is the ArrowStringArray(..) constructor, and in principle no user should be using that (and that would also be easy to keep as backwards-compatible shim)).

All to say: I don't think we need to consider leaving this out for 1.3 release because of this open question. The exact array class is mostly an implementation detail we can still change later.

We could be on the wrong track with the parametrization of the dtype by storage.

Even if we would have a single public StringArray through composition, wouldn't we keep the parametrization of the dtype? We still need to store the information about which storage to use somewhere, and we still need to give the user some control over it if they want to specify it explicitly. A parameter on the dtype seems fine for this.
Yes, it would also be a property of the StringArray, but that makes it a lot harder to give a nice way to power users to control it.

@jorisvandenbossche does seem to have some outstanding concerns which I think the proposed alternative could address.

I think my main open comment is about the potential "deferred" storage mode for StringDtype(..). While I would like to see that discussed, I don't think that's worth it to block this PR for (it's mostly an implementation detail on eg how to deal with things like astype("string"))

@simonjayhawkins
Copy link
Member Author

We could be on the wrong track with the parametrization of the dtype by storage.

Even if we would have a single public StringArray through composition, wouldn't we keep the parametrization of the dtype? We still need to store the information about which storage to use somewhere, and we still need to give the user some control over it if they want to specify it explicitly. A parameter on the dtype seems fine for this.
Yes, it would also be a property of the StringArray, but that makes it a lot harder to give a nice way to power users to control it.

I think user control over the storage used is the sticking point if it's not on the dtype. I have updated #40962 and it works well apart from the user control. tbh I prefer that implementation as basically the storage is controlled by the global option. I think it better represents the original intent of the StringArray.

Maybe a 3rd storage option, "auto" to use pyarrow backed storage if pyarrow > 1.0.0 is installed.

@jorisvandenbossche
Copy link
Member

Maybe a 3rd storage option, "auto" to use pyarrow backed storage if pyarrow > 1.0.0 is installed.

That's something that we can still add later? So no need to consider that now?

@simonjayhawkins
Copy link
Member Author

Maybe a 3rd storage option, "auto" to use pyarrow backed storage if pyarrow > 1.0.0 is installed.

That's something that we can still add later? So no need to consider that now?

sure.

@jorisvandenbossche
Copy link
Member

Bringing the "deferred" storage mode lookup for StringDtype discussion (originally here: #39908 (comment)) in the main thread, and trying to recap.

Currently, doing pd.StringDtype() (without specifying the storage), will already look up the option. In the default case, you get:

>>> pd.StringDtype().storage
'python'

which also means that pandas_dtype() already "fully initializes" the string dtype:

>>> pd.api.types.pandas_dtype("string")
string[python]

As a consequence, doing astype("string") will actually convert the values if your string dtype doesn't match the globab setting:

>>> s = pd.Series(['a', 'b'], dtype=pd.StringDtype(storage="pyarrow"))
>>> s.dtype
string[pyarrow]
>>> s.astype("string").dtype
string[python]

While I think it could make sense for .astype("string") to mean: "ensure I have a string dtype", and thus don't convert to another storage backend if I already had a string dtype to start with.

We do something similar for CategoricalDtype ("category" means a categorical dtype with no categories, but astype("category") does not remove your categories, it preserves any existing categorical dtype as is).

We could still have the astype("string") behave in the way I suggest by special casing this in the astype implementations (as suggested in #39908 (comment))), but I think that's something we would ideally avoid (any astype implementation accepting string dtype values as input would need to handle this case?)

@jreback
Copy link
Contributor

jreback commented Jun 7, 2021

@simonjayhawkins let's open an issue about @jorisvandenbossche comment #39908 (comment)

I agree we shouldn't astype something which is already a 'string' dtype even if its storage is different (though maybe need to work thru this case).

if not objections i think let's merge this.

@simonjayhawkins
Copy link
Member Author

I agree we shouldn't astype something which is already a 'string' dtype even if its storage is different (though maybe need to work thru this case).

I agree this makes sense. just how to do it.

(In #40962, the dtype check for astype is in the StringArray before dispatching to ObjectStringArray or PythonStringArray to achieve this. although there is no parameterisation of the dtype so not a like for like comparison)

will copy @jorisvandenbossche comment to a new issue.

@simonjayhawkins
Copy link
Member Author

if not objections i think let's merge this.

no objections, without the storage on the dtype it's very difficult for the user to control. And internally where arrays are reconstructruced we use the dtype, so a pyarrow backed string array could easily become a object backed string array (e.g. in the str accessor methods that don't yet dispatch.)

If the user control was not an issue and we only had one of the two dtypes active at any time though the global option , i'd advocate for a single dtype

The implementation detail would be hidden. pyarrow backed string arrays and object backed string arrays could compare equal, always roundtrip using this definition and _concat_same_type would treat them as same type. More importantly, we would not make ArrowStringArray public, but this may still be achievable with the parameterized dtype. will update (merge master) #40962 after this is merged and see how well it works with the parameterized dtype

@simonjayhawkins
Copy link
Member Author

@jorisvandenbossche OK to merge?

@jorisvandenbossche
Copy link
Member

The failures seem unrelated? (codecov upload failure or so)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there are still some follow-ups to do, but I think this is more than good enough for an initial PR with the parametrized dtype.

@jorisvandenbossche jorisvandenbossche merged commit 9d77a56 into pandas-dev:master Jun 8, 2021
@jorisvandenbossche
Copy link
Member

Thanks a lot @simonjayhawkins (and @xhochy and @TomAugspurger for the precursors) !

@simonjayhawkins simonjayhawkins deleted the arrow-string-array-dtype branch June 8, 2021 13:39
@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jun 8, 2021 via email

@jorisvandenbossche
Copy link
Member

@simonjayhawkins one other aspect I just realized isn't implemented here yet, is StringDtype._get_common_dtype. Which means that combining the two types of the dtype gives object dtype:

In [1]: s1 = pd.Series(['a', 'b'], dtype="string[python]")

In [2]: s2 = pd.Series(['c', 'd'], dtype="string[pyarrow]")

In [3]: pd.concat([s1, s2])
Out[3]: 
0    a
1    b
0    c
1    d
dtype: object

While this at least should preserve the string dtype, giving priority to one of both (which one is something to decide, or can depend on the order which is passed first).

@simonjayhawkins
Copy link
Member Author

While this at least should preserve the string dtype, giving priority to one of both (which one is something to decide, or can depend on the order which is passed first)

yep. I raise NotImplementedError in #40962 if mixed storage types are passed to _concat_same_type, since I had the same question

I think the two options are as you mention, which is passed first but I think maybe better to use the global storage for the combined result.

There's also an option of perhaps using the more common type, but that's probably additional complexity without adding any value to the user.

The first two are more deterministic.

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
…or pyarrow) (pandas-dev#39908)

Co-authored-by: Uwe L. Korn <[email protected]>
Co-authored-by: Tom Augspurger <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants