pd.concat doesn't preserve Categorical dtype when the categorical column is missing in one of the DataFrames. #25412

zachmoshe · 2019-02-22T15:19:30Z

import pandas as pd

a = pd.DataFrame({'f1': [1, 2, 3]})
b = pd.DataFrame({'f1': [2, 3, 1], 'f2': pd.Series([4, 4, 4]).astype('category')})

pd.concat((a,b), sort=True).dtypes
>> f1     int64
>> f2    object
>> dtype: object

Problem description

(Similar to #14016; I'm not sure whether it's caused by the same bug or a different one. Feel free to merge.)
When concatenating two DataFrames where one has a categorical column that the other is missing, the resulting column has dtype 'object', losing the "real" categorical dtype.

If we were to fill the missing column with Nones (but with the same categorical dtype), the concatenation would keep the dtype.
In the previous example, adding:

a['f2'] = pd.Series([None, None, None]).astype(b.dtypes['f2'])

before concatenating, will solve the problem.

I believe if a field is missing from one of the merged dataframes, a reasonable behavior would be to copy it and preserve its dtype.
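
The workaround above can be generalized. The following is a minimal sketch, not part of pandas: `concat_keep_categoricals` is a hypothetical helper name. It assumes each categorical column has the same dtype in every frame that contains it, and it fills that column with missing values in frames that lack it before calling pd.concat.

```python
import pandas as pd


def concat_keep_categoricals(frames, **concat_kwargs):
    """Hypothetical helper: add missing categorical columns, then concat."""
    # Remember the categorical dtype of each column, taken from the
    # first frame in which that column appears.
    cat_dtypes = {}
    for df in frames:
        for col, dtype in df.dtypes.items():
            if isinstance(dtype, pd.CategoricalDtype):
                cat_dtypes.setdefault(col, dtype)

    aligned = []
    for df in frames:
        df = df.copy()  # do not mutate the caller's frames
        for col, dtype in cat_dtypes.items():
            if col not in df.columns:
                # All-missing column carrying the same categorical dtype.
                df[col] = pd.Series([None] * len(df), index=df.index, dtype=dtype)
        aligned.append(df)
    return pd.concat(aligned, **concat_kwargs)


a = pd.DataFrame({'f1': [1, 2, 3]})
b = pd.DataFrame({'f1': [2, 3, 1], 'f2': pd.Series([4, 4, 4]).astype('category')})

result = concat_keep_categoricals([a, b], sort=True)
print(result.dtypes)
```

Copying each frame keeps the inputs untouched, at the cost of extra memory; for large frames it may be preferable to add the column in place, as in the one-line workaround above.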

Expected Output

Column 'f2' should be a categorical (same as b['f2']).

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.9.4
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


gfyoung · 2019-02-23T20:46:41Z

Bug indeed, though uncertain if this is a complete duplicate...

cc @jreback

climatebrad · 2019-12-07T05:44:46Z

This can have severe memory consequences.

zachmoshe · 2019-12-07T07:08:16Z

> This can have severe memory consequences.

That was exactly how I found that out...

climatebrad · 2019-12-07T07:28:47Z

This appears to be related to #10409.

pd.concat does not have the same behavior as DataFrame.merge, which can now handle combining categorical columns with different values in two dataframes.

mojones · 2020-05-28T09:25:19Z

For reference, we get the same effect if the column is present in both dataframes, but the categories themselves are different:

a = pd.DataFrame({'f1': [1,2,3], 'f2': pd.Series(['a', 'b', 'b']).astype('category')})
b = pd.DataFrame({'f1': [2,3,1], 'f2': pd.Series(['b', 'b', 'b']).astype('category')})

pd.concat([a,b]).dtypes

f1     int64
f2    object
dtype: object
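
One possible workaround for this case, sketched here under the assumption that both columns are unordered categoricals, is to build a shared dtype with `pandas.api.types.union_categoricals` and cast both columns to it before concatenating. The names `shared`, `a2`, and `b2` are illustrative only.

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.DataFrame({'f1': [1, 2, 3], 'f2': pd.Series(['a', 'b', 'b']).astype('category')})
b = pd.DataFrame({'f1': [2, 3, 1], 'f2': pd.Series(['b', 'b', 'b']).astype('category')})

# Union of the categories found in both columns.
shared = pd.CategoricalDtype(union_categoricals([a['f2'], b['f2']]).categories)

# Cast both columns to the shared dtype so that the dtypes match exactly.
a2 = a.assign(f2=a['f2'].astype(shared))
b2 = b.assign(f2=b['f2'].astype(shared))

result = pd.concat([a2, b2])
print(result.dtypes)
```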

mroeschke · 2021-06-27T18:32:31Z

This looks to work on master now. It could use a test.

In [15]: a = pd.DataFrame({'f1': [1,2,3]})
    ...: b = pd.DataFrame({'f1': [2,3,1], 'f2': pd.Series([4,4,4]).astype('category')})
    ...:
    ...: pd.concat((a,b), sort=True).dtypes
Out[15]:
f1       int64
f2    category
dtype: object
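
A regression test for this could look roughly like the sketch below. This is an assumption about the shape of such a test, not the test that was actually added to pandas; the function name is made up, and it is expected to pass only on versions where the behavior shown above holds.

```python
import pandas as pd


def test_concat_keeps_categorical_when_column_missing():
    # Hypothetical test name; mirrors the example from the original report.
    a = pd.DataFrame({'f1': [1, 2, 3]})
    b = pd.DataFrame({'f1': [2, 3, 1], 'f2': pd.Series([4, 4, 4]).astype('category')})

    result = pd.concat((a, b), sort=True)

    # The column missing from `a` should still be categorical.
    assert isinstance(result['f2'].dtype, pd.CategoricalDtype)
    # The shared integer column should keep its dtype.
    assert result['f1'].dtype == a['f1'].dtype
    # Rows coming from `a` have no value for f2.
    assert result['f2'].isna().sum() == len(a)
```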

yeyeric · 2022-02-23T17:01:15Z

Hello,

Since append is deprecated, I've migrated all my df.append(temp) calls to df = pd.concat([df, temp]).

Usually, I have processing where I do something like:

out = pd.DataFrame()
for _, temp in df.groupby('key'):
    # SOME PROCESSING OF DATA
    out = pd.concat([out, temp]) # before: out = out.append(temp)

Here, since out is an empty DataFrame at first, the result does not keep the dtypes from temp. For instance, if I have a datetime column, it's converted to object.

Is that expected? Considering that append is deprecated, this has a huge impact.
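
One way to sidestep the empty starting frame, sketched here with made-up sample data and a placeholder for the processing step, is to collect the pieces in a list and call pd.concat once at the end. Every input to the concatenation then carries real dtypes.

```python
import pandas as pd

# Illustrative data; the column names are assumptions for this sketch.
df = pd.DataFrame({
    'key': ['x', 'x', 'y'],
    'when': pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-03']),
})

pieces = []
for _, temp in df.groupby('key'):
    # SOME PROCESSING OF DATA would go here
    pieces.append(temp)

# A single concat over non-empty frames; no empty DataFrame is involved.
out = pd.concat(pieces)
print(out.dtypes)
```

Concatenating once also avoids copying the accumulated frame on every iteration of the loop.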

gfyoung added the Categorical and Bug labels Feb 23, 2019

jorisvandenbossche added this to the Contributions Welcome milestone Mar 4, 2019

jorisvandenbossche added good first issue and removed good first issue labels Mar 4, 2019

mroeschke added the Reshaping label Jun 28, 2020

arw2019 mentioned this issue Oct 29, 2020: BUG: category index levels casted to non-category dtype in merge #37480 (open, 3 tasks)

mroeschke added the good first issue and Needs Tests labels and removed the Bug label Jun 27, 2021

mroeschke mentioned this issue Dec 28, 2021: TST: Add regression tests for old issues #45095 (merged, 7 tasks)

jreback modified the milestones: Contributions Welcome, 1.4 Dec 28, 2021

jreback closed this as completed in #45095 Dec 28, 2021

ddrinka mentioned this issue Feb 13, 2023: BUG: pd.concat doesn't preserve categorical dtypes #51362 (open, 3 tasks)
