BUG: pandas 1.2.0 and Pyarrow [0.16.0, 1.0.0) are incompatible for some column types #38801

Closed
ADraginda opened this issue Dec 30, 2020 · 2 comments · Fixed by #38803
Labels
Bug Dependencies Required and optional dependencies ExtensionArray Extending pandas with custom dtypes or arrays. Regression Functionality that used to work in a prior pandas version Strings String extension data type and string data

@ADraginda
Contributor

ADraginda commented Dec 30, 2020

#35259 added optional imports from pyarrow. The currently listed minimum version of pyarrow is 0.15.1, and the logic of that PR guards against importing attributes from pyarrow.compute because the module is not available in 0.15.1.

Problem: pyarrow added the compute module in 0.16.0, but the attributes imported by pandas are not available in that module until 1.0.0.

Therefore: with pandas 1.2.0 and merging a DataFrame (with an Array String column?) :

| Pyarrow Version | Behavior |
| --- | --- |
| not installed | works |
| < 0.15.1 | not supported |
| [0.15.1, 0.16.0) | works (pyarrow.compute.equal not used?) |
| [0.16.0, 1.0.0) | ArgumentError |
| [1.0.0, 2.0.0 (latest)] | works (using pyarrow.compute.equal) |
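
For quick reference, the version ranges reported above can be written as a small lookup. This is only a sketch of the table as described in this issue: `parse_version` and `merge_behavior` are hypothetical helper names, not pandas or pyarrow APIs, and the parser assumes plain `X.Y.Z` version strings without pre-release tags.

```python
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Hypothetical helper: turn 'X.Y.Z' into a tuple of ints, padded to 3 parts."""
    parts = [int(part) for part in version.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def merge_behavior(pyarrow_version: Optional[str]) -> str:
    """Return the behavior this issue reports for pandas 1.2.0 with a given pyarrow.

    Pass None when pyarrow is not installed.
    """
    if pyarrow_version is None:
        return "works"
    version = parse_version(pyarrow_version)
    if version < (0, 15, 1):
        return "not supported"
    if version < (0, 16, 0):
        return "works"
    if version < (1, 0, 0):
        # compute module exists, but the attributes pandas imports do not
        return "ArgumentError"
    return "works"


print(merge_behavior("0.17.1"))
```

The half-open range [0.16.0, 1.0.0) is the only one where the error is reported, which is why the check for it sits between the two "works" branches.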

Adding another try/except around the comparison functions in pandas string_arrow.py will change the table of functionality to:

| Pyarrow Version | Behavior |
| --- | --- |
| not installed | works |
| < 0.15.1 | not supported |
| [0.15.1, 1.0.0) | works (pyarrow.compute.equal not used?) |
| [1.0.0, 2.0.0 (latest)] | works (using pyarrow.compute.equal) |
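
A minimal sketch of the kind of guard proposed here, assuming the goal is to fall back to a plain Python comparison whenever `pyarrow.compute` is missing or lacks the needed attribute. This is not the code from the linked PR; `resolve_comparison` is a hypothetical helper, and `operator.eq` stands in for whatever fallback pandas would actually use.

```python
import operator
from types import SimpleNamespace


def resolve_comparison(compute_module, name, fallback):
    """Return compute_module.<name> if it exists, otherwise the fallback.

    Works when compute_module is None (pyarrow not installed) or when the
    module exists but predates the attribute.
    """
    try:
        return getattr(compute_module, name)
    except AttributeError:
        return fallback


try:
    import pyarrow.compute as pc  # missing entirely on pyarrow 0.15.1
except ImportError:
    pc = None

# On pyarrow [0.16.0, 1.0.0) the module imports, but per this issue the
# attributes pandas needs are absent, so the fallback is selected.
equal = resolve_comparison(pc, "equal", operator.eq)

# Simulate an old compute module that has no `equal` attribute.
old_compute = SimpleNamespace()
assert resolve_comparison(old_compute, "equal", operator.eq) is operator.eq
```

Resolving the attribute once at import time keeps the comparison path itself free of version checks.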
@ADraginda ADraginda added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 30, 2020
@ADraginda ADraginda changed the title BUG: pandas 1.2.0 and Pyarrow (0.16.0, 1.0.0] are incompatible for some column types BUG: pandas 1.2.0 and Pyarrow [0.16.0, 1.0.0) are incompatible for some column types Dec 30, 2020
@jreback
Contributor

jreback commented Dec 30, 2020

see my comments on the PR. what causes this error exactly?

e.g. what is

Therefore: with pandas 1.2.0 and merging a DataFrame (with an Array String column?) :

@jreback jreback added Dependencies Required and optional dependencies ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 30, 2020
@simonjayhawkins simonjayhawkins added this to the 1.2.1 milestone Dec 30, 2020
@simonjayhawkins simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Dec 30, 2020
@ADraginda
Contributor Author

@jreback I replied in the PR as well but here is a reproduction:

You can reproduce with pyarrow 0.17.1 (that's what I'm on, but it breaks for other versions too) using the following example. It seems to be related to merging two tables, one on a column of type string and the other of type object.

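import pandas as pd

example_dict = {'i': {0: 'foo'}, 'j': {0: 1}}

# Left frame: 'i' uses the string extension dtype, 'j' the nullable Int64 dtype
foo = pd.DataFrame(example_dict)
foo['i'] = foo['i'].astype(pd.StringDtype.name)
foo['j'] = foo['j'].astype(pd.Int64Dtype.name)

# Right frame: 'i' stays object dtype, 'j' is converted to Int64
bar = pd.DataFrame(example_dict)
bar['j'] = bar['j'].astype(pd.Int64Dtype.name)

# Merging on the shared columns compares string dtype against object dtype
foo.merge(bar)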

3 participants