
"invalid dtype determination in get_concat_dtype" when concating dfs with certain columns #20597

Open
Mofef opened this issue Apr 3, 2018 · 11 comments
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@Mofef (Contributor) commented Apr 3, 2018

Code Sample, a copy-pastable example if possible

There might be a simpler minimal example, but I was already really struggling to identify this problem and to find this example. The problem seems to be related to strings reappearing in different positions of the tuples, tuples of different lengths, and unequal sets of columns.

(By the way, I'm aware of MultiIndex; I would like to convert the Index to a MultiIndex after the concatenation.)

items_a = [("b","e","c","a","b"),
         ("e","e","c","a","c"),
         ("e","a","c","a","d"),
         ("b","a","b","e"),
         ("e","b","a"),
         ("e","c","c","a")]
items_b = [("b","e","c","a","b"),
         ("a","a","d","b","d"),
         ("a","b","d","b","e"),
         ("c","b","c","a"),
         ("a","c","b"),
         ("a","d","d","b")]
df1=pd.DataFrame([range(6)], columns=items_a)
df2=pd.DataFrame([range(6)], columns=items_b)
pd.concat([df1, df2])

Problem description

This yields

AssertionError: invalid dtype determination in get_concat_dtype

Expected Output

Something similar to

df1.columns = [str(c) for c in df1.columns]
df2.columns = [str(c) for c in df2.columns]
pd.concat([df1, df2])
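[Editor's note] The full round trip the reporter describes (stringify the tuple labels, concatenate, then parse them back) can be sketched as below. This is a minimal illustration, not code from the issue; it assumes `ast.literal_eval` can parse `str()` of the tuples, and uses `tupleize_cols=False` so the restored labels stay plain tuples rather than being promoted to a MultiIndex:

```python
import ast

import pandas as pd

items_a = [("b", "e", "c", "a", "b"), ("e", "b", "a")]
items_b = [("b", "e", "c", "a", "b"), ("a", "c", "b")]

# Stringify the tuple column labels so concat only sees plain strings.
df1 = pd.DataFrame([range(2)], columns=[str(c) for c in items_a])
df2 = pd.DataFrame([range(2)], columns=[str(c) for c in items_b])

out = pd.concat([df1, df2])

# Parse the stringified tuples back into real tuple labels afterwards.
out.columns = pd.Index([ast.literal_eval(c) for c in out.columns],
                       tupleize_cols=False)
```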

Output of pd.show_versions()

(same result with pandas=0.17.1)

INSTALLED VERSIONS ------------------ commit: None python: 2.7.12.final.0 python-bits: 64 OS: Linux OS-release: 4.4.0-116-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.utf8 LANG: en_US.UTF-8 LOCALE: None.None

pandas: 0.23.0.dev0+38.g6552718
pytest: 2.8.7
pip: 9.0.1
setuptools: 20.7.0
Cython: 0.23.4
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: 1.3.6
patsy: 0.4.1
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.3.0
xlrd: 0.9.4
xlwt: 0.7.5
xlsxwriter: 0.7.3
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.9999999
sqlalchemy: 1.0.11
pymysql: None
psycopg2: 2.6.1 (dt dec mx pq3 ext lo64)
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback (Contributor) commented Apr 3, 2018

You are fighting pandas here. I suppose this could be supported, but it's not efficient in the least, nor very useful in terms of indexing.

You would very likely need a custom index type to have real support here, which is quite a major effort. If you wanted to contribute this, great.

@jreback (Contributor) commented Apr 3, 2018

cc @toobaz

@Mofef (Contributor, Author) commented Apr 4, 2018

Oh, so you are aware of the problem? Could you explain a bit more about why it fails, please? I can assure you that I'm not fighting pandas on purpose. ;)
But I don't really understand what is going on, so I can't find a workaround other than converting the column names to strings and back. I can't even consistently reproduce the error yet.
Maybe a more informative error message would already be enough to resolve this issue? I would certainly help as soon as I understand the problem.

In case you were wondering, what I actually do is convert tree structures to a pandas DataFrame, one row representing one tree. (The trees are very similar but not always identical in structure.) The tuples (columns) give the path through the tree, and the data is given by the leaves.
The problem apparently occurs when a child contains a similar object to its parent. For some cases it fails with the same error even if I use pd.MultiIndex.from_tuples, though not in the example described in the OP.
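[Editor's note] The tree-to-DataFrame conversion described above can be sketched roughly as follows. This assumes the trees are nested dicts with scalar leaves; `flatten_tree` is a hypothetical helper for illustration, not code from the issue:

```python
import pandas as pd

def flatten_tree(node, path=()):
    # Yield a (path, leaf_value) pair for every leaf of a nested dict.
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten_tree(child, path + (key,))
    else:
        yield path, node

# Two similar but not identical trees; note that "b" and "e" recur
# at different depths, as in the failing example.
tree1 = {"b": {"e": {"c": 1}}, "e": {"b": {"a": 2}}}
tree2 = {"b": {"e": {"c": 3}}, "a": {"c": {"b": 4}}}

# One row per tree; the tuple paths become the column labels.
df1 = pd.DataFrame([dict(flatten_tree(tree1))])
df2 = pd.DataFrame([dict(flatten_tree(tree2))])
```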

@jreback (Contributor) commented Apr 4, 2018

why are you not using a MultiIndex?

@Mofef (Contributor, Author) commented Apr 4, 2018

Originally I wanted to convert it to a MultiIndex after concatenating, but sure, that would be an acceptable workaround. Though in my case it also failed with the same error when concatenating (not for the example above).
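[Editor's note] A sketch of the MultiIndex workaround under discussion, not code from the issue. It relies on `MultiIndex.from_tuples` padding ragged tuples with NaN so all labels get the same number of levels; that is the behavior in recent pandas, but hedge accordingly on older versions:

```python
import pandas as pd

items_a = [("b", "e", "c", "a", "b"), ("e", "b", "a")]
items_b = [("b", "e", "c", "a", "b"), ("a", "c", "b")]

# from_tuples pads the shorter tuples with NaN, so every column
# label ends up with the same number of levels (here, five).
df1 = pd.DataFrame([range(2)], columns=pd.MultiIndex.from_tuples(items_a))
df2 = pd.DataFrame([range(2)], columns=pd.MultiIndex.from_tuples(items_b))

out = pd.concat([df1, df2])
```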

@jreback (Contributor) commented Apr 4, 2018

Then show an example using a MultiIndex that fails.

@Mofef (Contributor, Author) commented Apr 4, 2018

Weird... my test case must have been flawed; I can't reproduce it anymore. Thanks a lot for the help.

Still, if you have the patience to explain, I would be really interested in what is going wrong in the example above.

@jreback (Contributor) commented Apr 9, 2018

this actually breaks in a different place in master. cc @TomAugspurger

In [3]: items_a = [("b","e","c","a","b"),
   ...:          ("e","e","c","a","c"),
   ...:          ("e","a","c","a","d"),
   ...:          ("b","a","b","e"),
   ...:          ("e","b","a"),
   ...:          ("e","c","c","a")]
   ...: items_b = [("b","e","c","a","b"),
   ...:          ("a","a","d","b","d"),
   ...:          ("a","b","d","b","e"),
   ...:          ("c","b","c","a"),
   ...:          ("a","c","b"),
   ...:          ("a","d","d","b")]
   ...: df1=pd.DataFrame([range(6)], columns=items_a)
   ...: df2=pd.DataFrame([range(6)], columns=items_b)
   ...: pd.concat([df1, df2])
   ...:          
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-355de89f5317> in <module>()
     13 df1=pd.DataFrame([range(6)], columns=items_a)
     14 df2=pd.DataFrame([range(6)], columns=items_b)
---> 15 pd.concat([df1, df2])
     16 

~/pandas/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    211                        verify_integrity=verify_integrity,
    212                        copy=copy)
--> 213     return op.get_result()
    214 
    215 

~/pandas/pandas/core/reshape/concat.py in get_result(self)
    406             new_data = concatenate_block_managers(
    407                 mgrs_indexers, self.new_axes, concat_axis=self.axis,
--> 408                 copy=self.copy)
    409             if not self.copy:
    410                 new_data._consolidate_inplace()

~/pandas/pandas/core/internals.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
   5372                 values = values.view()
   5373             b = b.make_block_same_class(values, placement=placement)
-> 5374         elif is_uniform_join_units(join_units):
   5375             b = join_units[0].block.concat_same_type(
   5376                 [ju.block for ju in join_units], placement=placement)

~/pandas/pandas/core/internals.py in is_uniform_join_units(join_units)
   5396         # no blocks that would get missing values (can lead to type upcasts)
   5397         # unless we're an extension dtype.
-> 5398         all(not ju.is_na or ju.block.is_extension for ju in join_units) and
   5399         # no blocks with indexers (as then the dimensions do not fit)
   5400         all(not ju.indexers for ju in join_units) and
~/pandas/pandas/core/internals.py in <genexpr>(.0)
   5396         # no blocks that would get missing values (can lead to type upcasts)
   5397         # unless we're an extension dtype.
-> 5398         all(not ju.is_na or ju.block.is_extension for ju in join_units) and
   5399         # no blocks with indexers (as then the dimensions do not fit)
   5400         all(not ju.indexers for ju in join_units) and

AttributeError: 'NoneType' object has no attribute 'is_extension'
> /Users/jreback/pandas/pandas/core/internals.py(5398)<genexpr>()
   5396         # no blocks that would get missing values (can lead to type upcasts)
   5397         # unless we're an extension dtype.
-> 5398         all(not ju.is_na or ju.block.is_extension for ju in join_units) and
   5399         # no blocks with indexers (as then the dimensions do not fit)
   5400         all(not ju.indexers for ju in join_units) and

I didn't think a JoinUnit could be None

@gfyoung gfyoung added Indexing Related to indexing on series/frames, not to indexes themselves Dtype Conversions Unexpected or buggy dtype conversions labels Apr 10, 2018
@Mofef (Contributor, Author) commented Apr 20, 2018

#20757 might be what caused my observation that this issue also occurred when using a MultiIndex
(referring to @jreback's comment here: #20597 (comment))

@TomAugspurger (Contributor) commented

Is this a blocker for 0.23?

@jreback (Contributor) commented Apr 22, 2018

No, it's pretty unusual.

@mroeschke mroeschke added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Dtype Conversions Unexpected or buggy dtype conversions Indexing Related to indexing on series/frames, not to indexes themselves labels Jun 19, 2021