
1.3.0 PerformanceWarning: DataFrame is highly fragmented. #42477

Closed
xmatthias opened this issue Jul 10, 2021 · 5 comments · Fixed by #42579
Labels: DataFrame (DataFrame data structure) · Indexing (related to indexing on series/frames, not to indexes themselves) · Regression (functionality that used to work in a prior pandas version)
Milestone: 1.3.1

Comments
@xmatthias

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

Minimal sample

import pandas as pd
import numpy as np
df = pd.DataFrame({'a': np.random.randint(0, 100, size=55), 'b': np.random.randint(0, 100, size=55)})

# Assign > 100 new columns to the dataframe
for i in range(0, 100):
    df.loc[:, f'n_{i}'] = np.random.randint(0, 100, size=55)
    # Alternative assignment - triggers PerformanceWarnings here already.
    # df[f'n_{i}'] = np.random.randint(0, 100, size=55)

df1 = df.copy()
# Triggers performance warning again
df1['c'] = np.random.randint(0, 100, size=55)

# Visualize blocks
print(df._data.nblocks)
print(df1._data.nblocks)

Problem description

Since pandas 1.3.0, the above minimal sample produces a PerformanceWarning.
While I think I understand the warning, I don't understand how to mitigate it (the docs contain no help I could find for this, and the proposed solution, `copy()`, does not seem to work).

PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider using pd.concat instead.  To get a de-fragmented frame, use `newframe = frame.copy()`

While this is surely not an ideal scenario (assigning single columns one after the other), I also don't see how this can be changed in our use case.

The proposed df.copy() does not mitigate the warning, and the block count remains the same.
Based on my understanding, df.loc[:, 'colname'] = is the recommended way to assign new columns.
This creates a new block for every insert, and df.copy() (which is proposed in the warning message) does not consolidate the blocks into one, which means the warning can't really be mitigated.

Strangely enough, the behaviour of df['colname'] = and df.loc[:, 'colname'] = is not identical: the first triggers the PerformanceWarning while the second does not (although the problem is still there in the background).
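The asymmetry between the two assignment styles can be observed directly by recording warnings during the inserts. This is a minimal sketch written against the pandas 1.3.0 behaviour described above; later versions may warn for both styles or neither:

```python
import warnings
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55)})

# Record every warning raised while doing many bracket-style inserts.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    for i in range(150):
        df[f'x_{i}'] = np.random.randint(0, 100, size=55)

# Count only the fragmentation-related PerformanceWarnings.
perf = [w for w in caught if issubclass(w.category, pd.errors.PerformanceWarning)]
print(f'bracket assignment raised {len(perf)} PerformanceWarning(s)')
```

Swapping the assignment line for `df.loc[:, f'x_{i}'] = ...` and re-running shows whether the `.loc` path warns on your pandas version.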

So this leaves me with a few questions:

  • How should the above scenario correctly handle inserts to keep performance and avoid this warning?
  • How can the dataframe be effectively consolidated (the proposed frame.copy() in the warning message does not do that)?

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : f00ed8f47020034e752baf0250483053340971b0
python           : 3.9.2.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.12.11-arch1-1
Version          : #1 SMP PREEMPT Wed, 16 Jun 2021 15:25:28 +0000
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : None
LANG             : en_US.utf8
LOCALE           : en_US.UTF-8

pandas           : 1.3.0
numpy            : 1.21.0
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 21.1.3
setuptools       : 57.0.0
Cython           : None
pytest           : 6.2.4
hypothesis       : None
sphinx           : None
blosc            : 1.10.4
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : 1.0.2
psycopg2         : 2.8.6 (dt dec pq3 ext lo64)
jinja2           : 3.0.1
IPython          : 7.21.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.4.1
numexpr          : 2.7.3
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.7.0
sqlalchemy       : 1.4.20
tables           : 3.6.1
tabulate         : 0.8.9
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
@xmatthias xmatthias added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 10, 2021
@mzeitlin11
Member

Thanks for reporting this @xmatthias! Bisection indicates this was introduced in #38380 (appears to be intended, with warning now given instead of automatic consolidation, cc @jbrockmendel)

@mzeitlin11
Member

Marking as a regression though, since I don't think this was a documented change.

@mzeitlin11 mzeitlin11 added DataFrame DataFrame data structure Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 11, 2021
@mzeitlin11 mzeitlin11 added this to the 1.3.1 milestone Jul 11, 2021
@jbrockmendel
Member

Yes, this was intentional.

the proposed frame.copy() in the error does not do that

This is a bug that should be fixed.

How should the above scenario correctly handle inserts to keep performance and avoid this error?

If the .copy bug is fixed, then you should be fine if you do all your inserts and then do .copy(). A better option would be to use pd.concat to do it all at once.
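A minimal sketch of the insert-then-copy pattern, assuming the `.copy()` consolidation bug is fixed (as it was by #42579). `_mgr.nblocks` is internal API, used here only to inspect fragmentation:

```python
import warnings
import numpy as np
import pandas as pd

# Silence the fragmentation warning during the deliberately fragmenting loop.
warnings.simplefilter('ignore', pd.errors.PerformanceWarning)

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55)})
for i in range(100):
    df[f'n_{i}'] = np.random.randint(0, 100, size=55)  # one block per insert

before = df._mgr.nblocks  # internal: block count of the fragmented frame
df = df.copy()            # should return a de-fragmented (consolidated) frame
after = df._mgr.nblocks
print(before, after)
```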

@Alex-ley

Alex-ley commented Feb 22, 2022

here is a concrete example of how much faster concat can be if used properly - in keeping with the sample above:

before 28.6 ms ± 586 µs per loop (mean ± std. dev. of 7 runs, 100 loops each):

import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning) # only so stdout/stderr fits on 1 page in Jupyter

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55), 'b': np.random.randint(0, 100, size=55)})

# Assign > 100 new columns to the dataframe
for i in range(0, 100):
    # triggers PerformanceWarnings here already.
    df.loc[:, f'n_{i}'] = np.random.randint(0, 100, size=55)
    # Alternative assignment - also triggers PerformanceWarnings and same speed
    # df[f'n_{i}'] = np.random.randint(0, 100, size=55)

after 2.33 ms ± 92.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each):

import pandas as pd
import numpy as np
import warnings
# warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning) # no longer needed

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55), 'b': np.random.randint(0, 100, size=55)})

dict_of_cols = {}
# Assign > 100 new columns to the dataframe
for i in range(0, 100):
    dict_of_cols[f'n_{i}'] = np.random.randint(0,100,size=55)
    
df = pd.concat([df, pd.DataFrame(dict_of_cols)], axis=1)

Another example of something that might not be immediately intuitive but makes sense when you think about it (obviously for x**2 you could use pandas' vectorized methods, which would be even faster, but this is just to show the speedup of a list comprehension over apply; not every function you want to use in apply has a pandas built-in equivalent):

before 17 ms ± 628 µs per loop (mean ± std. dev. of 7 runs, 100 loops each):

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55), 'b': np.random.randint(0, 100, size=55)})

dict_of_cols = {}
# Assign > 100 new columns to the dataframe
for i in range(0, 100):
    dict_of_cols[f'a_{i}'] = df["a"].apply(
        lambda x: x**2
    )
    
df = pd.concat([df, pd.DataFrame(dict_of_cols)], axis=1)

after 5.99 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 100 loops each):

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55), 'b': np.random.randint(0, 100, size=55)})

dict_of_cols = {}
# Assign > 100 new columns to the dataframe
for i in range(0, 100):
    dict_of_cols[f'a_{i}'] = [
        x**2 for x in df["a"]
    ]
    
df = pd.concat([df, pd.DataFrame(dict_of_cols)], axis=1)
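For completeness, here is the fully vectorized variant alluded to above: square the column once as a numpy array, then build all 100 columns in a single concat. A sketch only; the `a_{i}` column names mirror the example above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55),
                   'b': np.random.randint(0, 100, size=55)})

# Compute the transform once, vectorized, instead of per element.
squared = df['a'].to_numpy() ** 2

# Build every new column from the precomputed array, then concat once.
new_cols = pd.DataFrame({f'a_{i}': squared for i in range(100)})
df = pd.concat([df, new_cols], axis=1)
```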

@plutonium-239

Thank you @Alex-ley for this wonderful example!
