
"invalid dtype determination in get_concat_dtype" when concating dfs with certain columns #20597

Open
Mofef opened this issue Apr 3, 2018 · 11 comments
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@Mofef (Contributor) commented Apr 3, 2018

Code Sample, a copy-pastable example if possible

There might be a simpler minimal example, but I was already really struggling to identify this problem and to find this example. The problem seems to be related to strings reappearing in different positions of the tuples, tuples of different lengths, and unequal sets of columns.

(By the way, I'm aware of MultiIndex; I would like to convert the Index to a MultiIndex after the concatenation.)

items_a = [("b","e","c","a","b"),
         ("e","e","c","a","c"),
         ("e","a","c","a","d"),
         ("b","a","b","e"),
         ("e","b","a"),
         ("e","c","c","a")]
items_b = [("b","e","c","a","b"),
         ("a","a","d","b","d"),
         ("a","b","d","b","e"),
         ("c","b","c","a"),
         ("a","c","b"),
         ("a","d","d","b")]
df1=pd.DataFrame([range(6)], columns=items_a)
df2=pd.DataFrame([range(6)], columns=items_b)
pd.concat([df1, df2])

Problem description

This yields

AssertionError: invalid dtype determination in get_concat_dtype

Expected Output

Something similar to

df1.columns = [str(c) for c in df1.columns]
df2.columns = [str(c) for c in df2.columns]
pd.concat([df1, df2])
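[Editor's note] The full round trip the reporter describes (stringify the tuple labels, concatenate, then parse them back) can be sketched as below. This is a minimal illustration, not code from the issue; it assumes `ast.literal_eval` can parse `str()` of the tuples, and uses `tupleize_cols=False` so the restored labels stay plain tuples rather than being promoted to a MultiIndex:

```python
import ast

import pandas as pd

items_a = [("b", "e", "c", "a", "b"), ("e", "b", "a")]
items_b = [("b", "e", "c", "a", "b"), ("a", "c", "b")]

# Stringify the tuple column labels so concat only sees plain strings.
df1 = pd.DataFrame([range(2)], columns=[str(c) for c in items_a])
df2 = pd.DataFrame([range(2)], columns=[str(c) for c in items_b])

out = pd.concat([df1, df2])

# Parse the stringified tuples back into real tuple labels afterwards.
out.columns = pd.Index([ast.literal_eval(c) for c in out.columns],
                       tupleize_cols=False)
```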

Output of pd.show_versions()

(same result with pandas=0.17.1)

INSTALLED VERSIONS ------------------ commit: None python: 2.7.12.final.0 python-bits: 64 OS: Linux OS-release: 4.4.0-116-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.utf8 LANG: en_US.UTF-8 LOCALE: None.None

pandas: 0.23.0.dev0+38.g6552718
pytest: 2.8.7
pip: 9.0.1
setuptools: 20.7.0
Cython: 0.23.4
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: 1.3.6
patsy: 0.4.1
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.3.0
xlrd: 0.9.4
xlwt: 0.7.5
xlsxwriter: 0.7.3
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.9999999
sqlalchemy: 1.0.11
pymysql: None
psycopg2: 2.6.1 (dt dec mx pq3 ext lo64)
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback (Contributor) commented Apr 3, 2018

You are fighting pandas here. I suppose this could be supported, but it's not efficient in the least, nor very useful in terms of indexing.

You would very likely need a custom index type to have real support here, which is quite a major effort. If you wanted to contribute this, great.

@jreback (Contributor) commented Apr 3, 2018

cc @toobaz

@Mofef (Contributor, Author) commented Apr 4, 2018

Oh, so you are aware of the problem? Could you explain a bit more about why it fails, please? I can assure you that I'm not fighting pandas on purpose. ;)
But I don't really understand what is going on, so I can't find a workaround other than converting the column names to strings and back. I can't even consistently reproduce the error yet.
Maybe a more informative error message would already be enough to resolve this issue? I would certainly help as soon as I understand the problem.

In case you were wondering, what I actually do is convert tree structures to a pandas DataFrame, one row representing one tree. (The trees are very similar but not always identical in structure.) The tuples (columns) give the path through the tree, and the data is given by the leaves.
The problem apparently occurs when a child contains a similar object to its parent. For some cases it fails with the same error even if I use pd.MultiIndex.from_tuples, though not in the example described in the OP.
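[Editor's note] The tree-to-DataFrame conversion described above can be sketched roughly as follows. This assumes the trees are nested dicts with scalar leaves; `flatten_tree` is a hypothetical helper for illustration, not code from the issue:

```python
import pandas as pd

def flatten_tree(node, path=()):
    # Yield a (path, leaf_value) pair for every leaf of a nested dict.
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten_tree(child, path + (key,))
    else:
        yield path, node

# Two similar but not identical trees; note that "b" and "e" recur
# at different depths, as in the failing example.
tree1 = {"b": {"e": {"c": 1}}, "e": {"b": {"a": 2}}}
tree2 = {"b": {"e": {"c": 3}}, "a": {"c": {"b": 4}}}

# One row per tree; the tuple paths become the column labels.
df1 = pd.DataFrame([dict(flatten_tree(tree1))])
df2 = pd.DataFrame([dict(flatten_tree(tree2))])
```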

@jreback (Contributor) commented Apr 4, 2018

why are you not using a MultiIndex?

@Mofef (Contributor, Author) commented Apr 4, 2018

Originally I wanted to convert it to a MultiIndex after concatenating, but sure, that would be an acceptable workaround. Though in my case it also failed with the same error when concatenating (not for the example above).
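[Editor's note] A sketch of the MultiIndex workaround under discussion, not code from the issue. It relies on `MultiIndex.from_tuples` padding ragged tuples with NaN so all labels get the same number of levels; that is the behavior in recent pandas, but hedge accordingly on older versions:

```python
import pandas as pd

items_a = [("b", "e", "c", "a", "b"), ("e", "b", "a")]
items_b = [("b", "e", "c", "a", "b"), ("a", "c", "b")]

# from_tuples pads the shorter tuples with NaN, so every column
# label ends up with the same number of levels (here, five).
df1 = pd.DataFrame([range(2)], columns=pd.MultiIndex.from_tuples(items_a))
df2 = pd.DataFrame([range(2)], columns=pd.MultiIndex.from_tuples(items_b))

out = pd.concat([df1, df2])
```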

@jreback (Contributor) commented Apr 4, 2018

Then show an example using a MultiIndex that fails.

@Mofef (Contributor, Author) commented Apr 4, 2018

Weird... my test case must have been flawed; I can't reproduce it anymore. Thanks a lot for the help.

Still, if you have the patience to explain, I would be really interested in what is going wrong in the example above.

@jreback (Contributor) commented Apr 9, 2018

this actually breaks in a different place in master. cc @TomAugspurger

In [3]: items_a = [("b","e","c","a","b"),
   ...:          ("e","e","c","a","c"),
   ...:          ("e","a","c","a","d"),
   ...:          ("b","a","b","e"),
   ...:          ("e","b","a"),
   ...:          ("e","c","c","a")]
   ...: items_b = [("b","e","c","a","b"),
   ...:          ("a","a","d","b","d"),
   ...:          ("a","b","d","b","e"),
   ...:          ("c","b","c","a"),
   ...:          ("a","c","b"),
   ...:          ("a","d","d","b")]
   ...: df1=pd.DataFrame([range(6)], columns=items_a)
   ...: df2=pd.DataFrame([range(6)], columns=items_b)
   ...: pd.concat([df1, df2])
   ...:          
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-355de89f5317> in <module>()
     13 df1=pd.DataFrame([range(6)], columns=items_a)
     14 df2=pd.DataFrame([range(6)], columns=items_b)
---> 15 pd.concat([df1, df2])
     16 

~/pandas/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    211                        verify_integrity=verify_integrity,
    212                        copy=copy)
--> 213     return op.get_result()
    214 
    215 

~/pandas/pandas/core/reshape/concat.py in get_result(self)
    406             new_data = concatenate_block_managers(
    407                 mgrs_indexers, self.new_axes, concat_axis=self.axis,
--> 408                 copy=self.copy)
    409             if not self.copy:
    410                 new_data._consolidate_inplace()

~/pandas/pandas/core/internals.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
   5372                 values = values.view()
   5373             b = b.make_block_same_class(values, placement=placement)
-> 5374         elif is_uniform_join_units(join_units):
   5375             b = join_units[0].block.concat_same_type(
   5376                 [ju.block for ju in join_units], placement=placement)

~/pandas/pandas/core/internals.py in is_uniform_join_units(join_units)
   5396         # no blocks that would get missing values (can lead to type upcasts)
   5397         # unless we're an extension dtype.
-> 5398         all(not ju.is_na or ju.block.is_extension for ju in join_units) and
   5399         # no blocks with indexers (as then the dimensions do not fit)
   5400         all(not ju.indexers for ju in join_units) and
~/pandas/pandas/core/internals.py in <genexpr>(.0)
   5396         # no blocks that would get missing values (can lead to type upcasts)
   5397         # unless we're an extension dtype.
-> 5398         all(not ju.is_na or ju.block.is_extension for ju in join_units) and
   5399         # no blocks with indexers (as then the dimensions do not fit)
   5400         all(not ju.indexers for ju in join_units) and

AttributeError: 'NoneType' object has no attribute 'is_extension'
> /Users/jreback/pandas/pandas/core/internals.py(5398)<genexpr>()
   5396         # no blocks that would get missing values (can lead to type upcasts)
   5397         # unless we're an extension dtype.
-> 5398         all(not ju.is_na or ju.block.is_extension for ju in join_units) and
   5399         # no blocks with indexers (as then the dimensions do not fit)
   5400         all(not ju.indexers for ju in join_units) and

I didn't think a JoinUnit could be None

@gfyoung gfyoung added Indexing Related to indexing on series/frames, not to indexes themselves Dtype Conversions Unexpected or buggy dtype conversions labels Apr 10, 2018
@Mofef (Contributor, Author) commented Apr 20, 2018

#20757 might be what caused my observation that this issue also occurred when using a MultiIndex
(referring to @jreback's comment here: #20597 (comment))

@TomAugspurger (Contributor) commented

Is this a blocker for 0.23?

@jreback (Contributor) commented Apr 22, 2018

No, it's pretty unusual.

@mroeschke mroeschke added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Dtype Conversions Unexpected or buggy dtype conversions Indexing Related to indexing on series/frames, not to indexes themselves labels Jun 19, 2021