
1.3.0 PerformanceWarning: DataFrame is highly fragmented. #42477

Closed
xmatthias opened this issue Jul 10, 2021 · 5 comments · Fixed by #42579
Labels: DataFrame (DataFrame data structure) · Indexing (related to indexing on series/frames, not to indexes themselves) · Regression (functionality that used to work in a prior pandas version)
Milestone: 1.3.1

Comments
@xmatthias

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

Minimal sample

import pandas as pd
import numpy as np
df = pd.DataFrame({'a': np.random.randint(0, 100, size=55), 'b': np.random.randint(0, 100, size=55)})

# Assign > 100 new columns to the dataframe
for i in range(0, 100):
    df.loc[:, f'n_{i}'] = np.random.randint(0, 100, size=55)
    # Alternative assignment - triggers PerformanceWarnings here already.
    # df[f'n_{i}'] = np.random.randint(0, 100, size=55)

df1 = df.copy()
# Triggers performance warning again
df1['c'] = np.random.randint(0, 100, size=55)

# Visualize blocks
print(df._data.nblocks)
print(df1._data.nblocks)

Problem description

Since pandas 1.3.0, the above minimal sample produces a PerformanceWarning.
While I think I understand the warning, I don't understand how to mitigate it (the docs contain no help I could find for this, and the proposed solution, `copy()`, does not seem to work).

PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider using pd.concat instead.  To get a de-fragmented frame, use `newframe = frame.copy()`

While this is surely not an ideal scenario (assigning single columns one after the other), I also don't see how this can be changed in our use case.

The proposed df.copy() does not mitigate the warning, and the block count remains the same.
Based on my understanding, df.loc[:, 'colname'] = is the recommended way to assign new columns.
This creates a new block for every insert, and df.copy() (which is proposed in the warning message) does not consolidate the blocks into one, which means the warning can't really be mitigated.

Strangely enough, the behaviour of df['colname'] = and df.loc[:, 'colname'] = is not identical: the first triggers the PerformanceWarning while the second does not (although the problem is still there in the background).
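The asymmetry between the two assignment styles can be observed directly by recording warnings during the inserts. This is a minimal sketch written against the pandas 1.3.0 behaviour described above; later versions may warn for both styles or neither:

```python
import warnings
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55)})

# Record every warning raised while doing many bracket-style inserts.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    for i in range(150):
        df[f'x_{i}'] = np.random.randint(0, 100, size=55)

# Count only the fragmentation-related PerformanceWarnings.
perf = [w for w in caught if issubclass(w.category, pd.errors.PerformanceWarning)]
print(f'bracket assignment raised {len(perf)} PerformanceWarning(s)')
```

Swapping the assignment line for `df.loc[:, f'x_{i}'] = ...` and re-running shows whether the `.loc` path warns on your pandas version.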

So this leaves me with a few questions:

  • How should the above scenario correctly handle inserts to keep performance and avoid this warning?
  • How can the dataframe be effectively consolidated (the proposed frame.copy() in the warning message does not do that)?

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : f00ed8f47020034e752baf0250483053340971b0
python           : 3.9.2.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.12.11-arch1-1
Version          : #1 SMP PREEMPT Wed, 16 Jun 2021 15:25:28 +0000
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : None
LANG             : en_US.utf8
LOCALE           : en_US.UTF-8

pandas           : 1.3.0
numpy            : 1.21.0
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 21.1.3
setuptools       : 57.0.0
Cython           : None
pytest           : 6.2.4
hypothesis       : None
sphinx           : None
blosc            : 1.10.4
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : 1.0.2
psycopg2         : 2.8.6 (dt dec pq3 ext lo64)
jinja2           : 3.0.1
IPython          : 7.21.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.4.1
numexpr          : 2.7.3
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.7.0
sqlalchemy       : 1.4.20
tables           : 3.6.1
tabulate         : 0.8.9
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
@xmatthias xmatthias added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 10, 2021
@mzeitlin11
Member

Thanks for reporting this @xmatthias! Bisection indicates this was introduced in #38380 (appears to be intended, with warning now given instead of automatic consolidation, cc @jbrockmendel)

@mzeitlin11
Member

Marking as a regression though, since I don't think this was a documented change.

@mzeitlin11 mzeitlin11 added DataFrame DataFrame data structure Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 11, 2021
@mzeitlin11 mzeitlin11 added this to the 1.3.1 milestone Jul 11, 2021
@jbrockmendel
Member

Yes, this was intentional.

the proposed frame.copy() in the error does not do that

This is a bug that should be fixed.

How should the above scenario correctly handle inserts to keep performance and avoid this error?

If the .copy bug is fixed, then you should be fine if you do all your inserts and then do .copy(). A better option would be to use pd.concat to do it all at once.
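A minimal sketch of the insert-then-copy pattern, assuming the `.copy()` consolidation bug is fixed (as it was by #42579). `_mgr.nblocks` is internal API, used here only to inspect fragmentation:

```python
import warnings
import numpy as np
import pandas as pd

# Silence the fragmentation warning during the deliberately fragmenting loop.
warnings.simplefilter('ignore', pd.errors.PerformanceWarning)

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55)})
for i in range(100):
    df[f'n_{i}'] = np.random.randint(0, 100, size=55)  # one block per insert

before = df._mgr.nblocks  # internal: block count of the fragmented frame
df = df.copy()            # should return a de-fragmented (consolidated) frame
after = df._mgr.nblocks
print(before, after)
```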

@Alex-ley

Alex-ley commented Feb 22, 2022

here is a concrete example of how much faster concat can be if used properly - in keeping with the sample above:

before 28.6 ms ± 586 µs per loop (mean ± std. dev. of 7 runs, 100 loops each):

import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning) # only so stdout/stderr fits on 1 page in Jupyter

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55), 'b': np.random.randint(0, 100, size=55)})

# Assign > 100 new columns to the dataframe
for i in range(0, 100):
    # triggers PerformanceWarnings here already.
    df.loc[:, f'n_{i}'] = np.random.randint(0, 100, size=55)
    # Alternative assignment - also triggers PerformanceWarnings and same speed
    # df[f'n_{i}'] = np.random.randint(0, 100, size=55)

after 2.33 ms ± 92.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each):

import pandas as pd
import numpy as np
import warnings
# warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning) # no longer needed

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55), 'b': np.random.randint(0, 100, size=55)})

dict_of_cols = {}
# Assign > 100 new columns to the dataframe
for i in range(0, 100):
    dict_of_cols[f'n_{i}'] = np.random.randint(0,100,size=55)
    
df = pd.concat([df, pd.DataFrame(dict_of_cols)], axis=1)

Another example of something that might not be immediately intuitive but makes sense when you think about it (obviously for x**2 you could use pandas' vectorized methods, which would be even faster, but this is just to show the speedup of a list comprehension over apply; not every function you want to use in apply has a pandas built-in equivalent):

before 17 ms ± 628 µs per loop (mean ± std. dev. of 7 runs, 100 loops each):

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55), 'b': np.random.randint(0, 100, size=55)})

dict_of_cols = {}
# Assign > 100 new columns to the dataframe
for i in range(0, 100):
    dict_of_cols[f'a_{i}'] = df["a"].apply(
        lambda x: x**2
    )
    
df = pd.concat([df, pd.DataFrame(dict_of_cols)], axis=1)

after 5.99 ms ± 250 µs per loop (mean ± std. dev. of 7 runs, 100 loops each):

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55), 'b': np.random.randint(0, 100, size=55)})

dict_of_cols = {}
# Assign > 100 new columns to the dataframe
for i in range(0, 100):
    dict_of_cols[f'a_{i}'] = [
        x**2 for x in df["a"]
    ]
    
df = pd.concat([df, pd.DataFrame(dict_of_cols)], axis=1)
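For completeness, here is the fully vectorized variant alluded to above: square the column once as a numpy array, then build all 100 columns in a single concat. A sketch only; the `a_{i}` column names mirror the example above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55),
                   'b': np.random.randint(0, 100, size=55)})

# Compute the transform once, vectorized, instead of per element.
squared = df['a'].to_numpy() ** 2

# Build every new column from the precomputed array, then concat once.
new_cols = pd.DataFrame({f'a_{i}': squared for i in range(100)})
df = pd.concat([df, new_cols], axis=1)
```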

@plutonium-239

Thank you @Alex-ley for this wonderful example!
