[ArrowStringArray] API: StringDtype parameterized by storage (python or pyarrow) #39908

Merged
67 commits
4cb60e6
Implement BaseDtypeTests for ArrowStringDtype
xhochy Jul 10, 2020
d242f2d
Refactor to use parametrized StringDtype
TomAugspurger Sep 3, 2020
d39ab2c
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Feb 18, 2021
2367810
abs-imports
simonjayhawkins Feb 18, 2021
9166d3b
post merge fixup
simonjayhawkins Feb 19, 2021
8760705
StringDtype[python] -> string[python]
simonjayhawkins Feb 19, 2021
d5b3fec
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Mar 22, 2021
2c657df
pre-commit fix for inconsistent use of pandas namespace
simonjayhawkins Mar 22, 2021
647a6c2
fix typo
simonjayhawkins Mar 22, 2021
0596fd7
pre-commit fixup - undefined name 'ArrowStringDtype'
simonjayhawkins Mar 22, 2021
c5a19c5
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Mar 26, 2021
99680c9
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Mar 28, 2021
69a6cc1
"StringDtype[storage]" -> "string[storage]" misc
simonjayhawkins Mar 28, 2021
bd147ba
__from_arrow__
simonjayhawkins Mar 28, 2021
830275f
more testing (wip)
simonjayhawkins Mar 28, 2021
214e524
fix inference
simonjayhawkins Mar 28, 2021
c9ba03c
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Mar 29, 2021
7425536
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 1, 2021
68ac391
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 1, 2021
5cfa97a
post-merge fixup
simonjayhawkins Apr 1, 2021
74dbf96
remove changes to test_string_dtype - broken off in #40725
simonjayhawkins Apr 1, 2021
3985943
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 15, 2021
3bda421
post merge fix-up
simonjayhawkins Apr 15, 2021
0c108a4
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 15, 2021
523e24c
post merge fix-up
simonjayhawkins Apr 15, 2021
279624c
revert some changes made for pre-commit checks.
simonjayhawkins Apr 15, 2021
80d231e
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 16, 2021
c5ced5a
post merge fix-up
simonjayhawkins Apr 16, 2021
459812c
undo unrelated changes
simonjayhawkins Apr 16, 2021
d707b6b
undo changes to imports
simonjayhawkins Apr 16, 2021
71ccf24
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 17, 2021
daaac06
StringDtype.construct_array_type - add ref to issue
simonjayhawkins Apr 17, 2021
46626d1
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Apr 19, 2021
3677bfa
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins May 1, 2021
42d382f
post merge fixup
simonjayhawkins May 1, 2021
4fb1a0d
add draft release note
simonjayhawkins May 1, 2021
5d4eac1
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins May 12, 2021
15efb2e
post merge fix-up
simonjayhawkins May 12, 2021
b53cfe0
docstrings
simonjayhawkins May 12, 2021
b7db53f
benchmarks
simonjayhawkins May 12, 2021
3399f08
pyarrow min
simonjayhawkins May 12, 2021
e365f01
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins May 26, 2021
71d1e6c
post merge fixup
simonjayhawkins May 26, 2021
9e23c35
misc clean
simonjayhawkins May 26, 2021
c69a611
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins May 27, 2021
64b3206
update construct_from_string docstring
simonjayhawkins May 27, 2021
d83a4ff
update whatsnew for dtype="string"
simonjayhawkins May 27, 2021
ef38660
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins May 30, 2021
aef1162
update release note
simonjayhawkins May 30, 2021
6247a5b
parametrize test for df.convert_dtypes()
simonjayhawkins May 30, 2021
a6d066c
fixup pd.array and more testing of string_storage option
simonjayhawkins May 31, 2021
8adb08d
use string_storage fixture more
simonjayhawkins May 31, 2021
3ad0638
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins May 31, 2021
56714c9
post merge fixup
simonjayhawkins May 31, 2021
6a1cc2b
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Jun 2, 2021
1761a84
remove accessor methods section from release note
simonjayhawkins Jun 2, 2021
3e26baa
consistent dtype naming in benchmark
simonjayhawkins Jun 2, 2021
6b470b1
Apply suggestions from code review
simonjayhawkins Jun 2, 2021
2ec6de0
name and str() change to "string"
simonjayhawkins Jun 2, 2021
a0b7a70
remove testing of string dtype without storage specified.
simonjayhawkins Jun 2, 2021
d9dcd20
update StringDtype docstring
simonjayhawkins Jun 2, 2021
4a37470
add ArrowStringArray to pd.arrays namespace
simonjayhawkins Jun 2, 2021
1d59c7a
add common base class, BaseStringArray
simonjayhawkins Jun 2, 2021
e57c850
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Jun 4, 2021
51f1b1d
fixup roundtrip tests
simonjayhawkins Jun 4, 2021
fc95c06
Merge remote-tracking branch 'upstream/master' into arrow-string-arra…
simonjayhawkins Jun 7, 2021
ef02a43
remove link
simonjayhawkins Jun 7, 2021
23 changes: 10 additions & 13 deletions asv_bench/benchmarks/algorithms.py
@@ -23,41 +23,38 @@ class Factorize:
             "int",
             "uint",
             "float",
-            "string",
+            "object",
             "datetime64[ns]",
             "datetime64[ns, tz]",
             "Int64",
             "boolean",
-            "string_arrow",
+            "string[pyarrow]",
Contributor

I guess we should be consistent about using string[python] to be more explicit (rather than 'string'). I think it's worth it in benchmarks, for example (and you do it in other benchmarks).

Member Author

The "string" used here denotes an object Index. These are not dtypes but dictionary keys. There is no benchmark for StringArray.factorize.

The last 3 are benchmarking arrays, all the others are Indexes

Member Author

I could maybe call it ArrowStringArray and rename the others for clarity. (does that affect the asv history?)

Contributor

not sure about asv history, but nbd.

Member Author

string -> object in 3e26baa

         ],
     ]
     param_names = ["unique", "sort", "dtype"]

     def setup(self, unique, sort, dtype):
         N = 10 ** 5
         string_index = tm.makeStringIndex(N)
-        try:
-            from pandas.core.arrays.string_arrow import ArrowStringDtype
-
-            string_arrow = pd.array(string_index, dtype=ArrowStringDtype())
-        except ImportError:
-            string_arrow = None
-
-        if dtype == "string_arrow" and not string_arrow:
-            raise NotImplementedError
+        string_arrow = None
+        if dtype == "string[pyarrow]":
+            try:
+                string_arrow = pd.array(string_index, dtype="string[pyarrow]")
+            except ImportError:
+                raise NotImplementedError

         data = {
             "int": pd.Int64Index(np.arange(N)),
             "uint": pd.UInt64Index(np.arange(N)),
             "float": pd.Float64Index(np.random.randn(N)),
-            "string": string_index,
+            "object": string_index,
             "datetime64[ns]": pd.date_range("2011-01-01", freq="H", periods=N),
             "datetime64[ns, tz]": pd.date_range(
                 "2011-01-01", freq="H", periods=N, tz="Asia/Tokyo"
             ),
             "Int64": pd.array(np.arange(N), dtype="Int64"),
             "boolean": pd.array(np.random.randint(0, 2, N), dtype="boolean"),
-            "string_arrow": string_arrow,
+            "string[pyarrow]": string_arrow,
         }[dtype]
         if not unique:
             data = data.repeat(5)
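
The reworked `setup` builds the PyArrow-backed array only when that parameter is requested, and it turns a missing optional dependency into `NotImplementedError`, which asv treats as a signal to skip the benchmark. A minimal stdlib-only sketch of that pattern follows. `require_optional` and `setup_for` are hypothetical helper names, not part of pandas or asv.

```python
import importlib


def require_optional(module_name):
    """Import an optional dependency, or signal "skip" the way asv expects.

    Hypothetical helper: asv skips a benchmark when its setup raises
    NotImplementedError, so a missing module is translated into that.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        raise NotImplementedError(f"{module_name} is not installed")


def setup_for(dtype, optional_modules):
    """Return the module a parameter needs, importing it only on demand.

    `optional_modules` maps a benchmark parameter name to the module it
    requires; parameters without an entry need nothing extra.
    """
    needed = optional_modules.get(dtype)
    if needed is None:
        return None
    return require_optional(needed)
```

Deferring the import this way means parameters that do not need the optional package never pay for (or fail on) the import.
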
8 changes: 3 additions & 5 deletions asv_bench/benchmarks/algos/isin.py
@@ -25,8 +25,8 @@ class IsIn:
         "category[object]",
         "category[int]",
         "str",
-        "string",
-        "arrow_string",
+        "string[python]",
+        "string[pyarrow]",
     ]
     param_names = ["dtype"]

@@ -62,9 +62,7 @@ def setup(self, dtype):
             self.values = np.random.choice(arr, sample_size)
             self.series = Series(arr).astype("category")

-        elif dtype in ["str", "string", "arrow_string"]:
-            from pandas.core.arrays.string_arrow import ArrowStringDtype  # noqa: F401
-
+        elif dtype in ["str", "string[python]", "string[pyarrow]"]:
             try:
                 self.series = Series(tm.makeStringIndex(N), dtype=dtype)
             except ImportError:
4 changes: 1 addition & 3 deletions asv_bench/benchmarks/strings.py
@@ -12,12 +12,10 @@


 class Dtypes:
-    params = ["str", "string", "arrow_string"]
+    params = ["str", "string[python]", "string[pyarrow]"]
     param_names = ["dtype"]

     def setup(self, dtype):
-        from pandas.core.arrays.string_arrow import ArrowStringDtype  # noqa: F401
-
         try:
             self.s = Series(tm.makeStringIndex(10 ** 5), dtype=dtype)
         except ImportError:
1 change: 1 addition & 0 deletions doc/source/reference/arrays.rst
@@ -480,6 +480,7 @@ we recommend using :class:`StringDtype` (with the alias ``"string"``).
    :template: autosummary/class_without_autosummary.rst

    arrays.StringArray
+   arrays.ArrowStringArray

 .. autosummary::
    :toctree: api/
52 changes: 52 additions & 0 deletions doc/source/whatsnew/v1.3.0.rst
@@ -171,6 +171,58 @@ a copy will no longer be made (:issue:`32960`)
 The default behavior when not passing ``copy`` will remain unchanged, i.e.
 a copy will be made.

+.. _whatsnew_130.arrow_string:
+
+PyArrow backed string data type
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+We've enhanced the :class:`StringDtype`, an extension type dedicated to string data.
+(:issue:`39908`)
+
+It is now possible to specify a ``storage`` keyword option to :class:`StringDtype`. Use
+pandas options or specify the dtype using ``dtype='string[pyarrow]'`` to allow the
+StringArray to be backed by a PyArrow array instead of a NumPy array of Python objects.
+
+The PyArrow backed StringArray requires pyarrow 1.0.0 or greater to be installed.
+
+.. warning::
+
+   ``string[pyarrow]`` is currently considered experimental. The implementation
+   and parts of the API may change without warning.
+
+.. ipython:: python
+
+   pd.Series(['abc', None, 'def'], dtype=pd.StringDtype(storage="pyarrow"))
+
+You can use the alias ``"string[pyarrow]"`` as well.
+
+.. ipython:: python
+
+   s = pd.Series(['abc', None, 'def'], dtype="string[pyarrow]")
+   s
+
+You can also create a PyArrow backed string array using pandas options.
+
+.. ipython:: python
+
+   with pd.option_context("string_storage", "pyarrow"):
+       s = pd.Series(['abc', None, 'def'], dtype="string")
+   s
+
+The usual string accessor methods work. Where appropriate, the return type of the Series
+or columns of a DataFrame will also have string dtype.
+
+.. ipython:: python
+
+   s.str.upper()
+   s.str.split('b', expand=True).dtypes
+
+String accessor methods returning integers will return a value with :class:`Int64Dtype`
+
+.. ipython:: python
+
+   s.str.count("a")
+
 Centered Datetime-Like Rolling Windows
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

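
The release note above documents three spellings (`"string"`, `"string[python]"`, `"string[pyarrow]"`) that differ only in where the storage backend comes from. As a rough illustration, and not the pandas implementation, the resolution can be sketched in plain Python. `resolve_storage` is a hypothetical function, and its `default` argument stands in for the `string_storage` option.

```python
def resolve_storage(alias, default="python"):
    """Map a string dtype alias to its storage backend.

    Hypothetical stand-in: `default` plays the role of the pandas
    `string_storage` option, which applies when no storage is given.
    """
    if alias == "string":
        return default
    if alias == "string[python]":
        return "python"
    if alias == "string[pyarrow]":
        return "pyarrow"
    raise TypeError(f"Cannot resolve storage from {alias!r}")


print(resolve_storage("string"))                     # python
print(resolve_storage("string", default="pyarrow"))  # pyarrow
print(resolve_storage("string[pyarrow]"))            # pyarrow
```

Only the bare `"string"` alias consults the default; an explicit storage in brackets always wins.
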
13 changes: 11 additions & 2 deletions pandas/_testing/asserters.py
@@ -48,6 +48,7 @@
     TimedeltaArray,
 )
 from pandas.core.arrays.datetimelike import DatetimeLikeArrayMixin
+from pandas.core.arrays.string_ import StringDtype

 from pandas.io.formats.printing import pprint_thing

@@ -638,12 +639,20 @@ def raise_assert_detail(obj, message, left, right, diff=None, index_values=None)

     if isinstance(left, np.ndarray):
         left = pprint_thing(left)
-    elif is_categorical_dtype(left) or isinstance(left, PandasDtype):
+    elif (
+        is_categorical_dtype(left)
+        or isinstance(left, PandasDtype)
+        or isinstance(left, StringDtype)
+    ):
         left = repr(left)

     if isinstance(right, np.ndarray):
         right = pprint_thing(right)
-    elif is_categorical_dtype(right) or isinstance(right, PandasDtype):
+    elif (
+        is_categorical_dtype(right)
+        or isinstance(right, PandasDtype)
+        or isinstance(right, StringDtype)
+    ):
         right = repr(right)

     msg += f"""
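
The widened `elif` conditions chain two `isinstance` calls with `or`. `isinstance` also accepts a tuple of classes, so the same check could be written more compactly. The sketch below uses empty stand-in classes (`FakePandasDtype`, `FakeStringDtype`) rather than the real pandas types, `needs_repr` is a hypothetical name, and the `is_categorical_dtype` branch is omitted.

```python
class FakePandasDtype:
    """Stand-in for pandas' PandasDtype (assumption, for illustration)."""


class FakeStringDtype:
    """Stand-in for pandas' StringDtype (assumption, for illustration)."""


def needs_repr(obj):
    # One isinstance call with a tuple is equivalent to
    # isinstance(obj, A) or isinstance(obj, B).
    return isinstance(obj, (FakePandasDtype, FakeStringDtype))
```
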
2 changes: 2 additions & 0 deletions pandas/arrays/__init__.py
@@ -4,6 +4,7 @@
 See :ref:`extending.extension-types` for more.
 """
 from pandas.core.arrays import (
+    ArrowStringArray,
     BooleanArray,
     Categorical,
     DatetimeArray,
@@ -18,6 +19,7 @@
 )

 __all__ = [
+    "ArrowStringArray",
     "BooleanArray",
     "Categorical",
     "DatetimeArray",
38 changes: 27 additions & 11 deletions pandas/conftest.py
@@ -1120,24 +1120,42 @@ def string_dtype(request):

 @pytest.fixture(
     params=[
-        "string",
+        "string[python]",
         pytest.param(
-            "arrow_string", marks=td.skip_if_no("pyarrow", min_version="1.0.0")
+            "string[pyarrow]", marks=td.skip_if_no("pyarrow", min_version="1.0.0")
         ),
     ]
 )
 def nullable_string_dtype(request):
     """
     Parametrized fixture for string dtypes.

-    * 'string'
-    * 'arrow_string'
+    * 'string[python]'
+    * 'string[pyarrow]'
     """
+    return request.param
+
+
+@pytest.fixture(
+    params=[
+        "python",
+        pytest.param("pyarrow", marks=td.skip_if_no("pyarrow", min_version="1.0.0")),
+    ]
+)
+def string_storage(request):
+    """
-    from pandas.core.arrays.string_arrow import ArrowStringDtype  # noqa: F401
+    Parametrized fixture for pd.options.mode.string_storage.

+    * 'python'
+    * 'pyarrow'
+    """
     return request.param


+# Alias so we can test with cartesian product of string_storage
+string_storage2 = string_storage
+
+
 @pytest.fixture(params=tm.BYTES_DTYPES)
 def bytes_dtype(request):
     """
@@ -1163,21 +1181,19 @@ def object_dtype(request):
 @pytest.fixture(
     params=[
         "object",
-        "string",
+        "string[python]",
         pytest.param(
-            "arrow_string", marks=td.skip_if_no("pyarrow", min_version="1.0.0")
+            "string[pyarrow]", marks=td.skip_if_no("pyarrow", min_version="1.0.0")
         ),
     ]
 )
 def any_string_dtype(request):
     """
     Parametrized fixture for string dtypes.
     * 'object'
-    * 'string'
-    * 'arrow_string'
+    * 'string[python]'
+    * 'string[pyarrow]'
     """
-    from pandas.core.arrays.string_arrow import ArrowStringDtype  # noqa: F401
-
     return request.param


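
The `string_storage2` alias exists so that a test can request both fixtures and run once per pair of storage backends. Assuming both backends are installed (so neither parameter is skipped), the combinations such a test would cover can be enumerated with `itertools.product`. The names below are illustrative only and the order in which pytest generates the cases may differ.

```python
from itertools import product

# The two parameter values shared by both fixtures.
STORAGES = ["python", "pyarrow"]

# A test requesting both `string_storage` and `string_storage2` runs once
# for every combination of the two parameter lists.
pairs = list(product(STORAGES, STORAGES))

for left, right in pairs:
    print(left, right)
```

This covers mixed-backend cases such as `("python", "pyarrow")`, which a single fixture could not express.
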
2 changes: 2 additions & 0 deletions pandas/core/arrays/__init__.py
@@ -17,12 +17,14 @@
 )
 from pandas.core.arrays.sparse import SparseArray
 from pandas.core.arrays.string_ import StringArray
+from pandas.core.arrays.string_arrow import ArrowStringArray
 from pandas.core.arrays.timedeltas import TimedeltaArray

 __all__ = [
     "ExtensionArray",
     "ExtensionOpsMixin",
     "ExtensionScalarOpsMixin",
+    "ArrowStringArray",
     "BaseMaskedArray",
     "BooleanArray",
     "Categorical",