PERF: concat perf #23362

Closed
TomAugspurger opened this issue Oct 26, 2018 · 4 comments
Labels: Closing Candidate, Performance, Period, Reshaping


TomAugspurger commented Oct 26, 2018

For Series[period], pd.concat is about 6x slower than PeriodArray._concat_same_type. There's always going to be some overhead, but I wonder how much we can narrow this.

In [1]: import numpy as np
   ...: import pandas as pd
   ...:
   ...: a = np.random.randint(2000, 2100, size=1000)
   ...: b = np.random.randint(2000, 2100, size=1000)
   ...:
   ...: x = pd.core.arrays.period_array(a, freq='B')
   ...: y = pd.core.arrays.period_array(b, freq='B')
   ...:
   ...: s = pd.Series(x)
   ...: t = pd.Series(y)


In [2]: %timeit pd.concat([s, t], ignore_index=True)
523 µs ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [3]: %timeit x._concat_same_type([x, y])
90.1 µs ± 948 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
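
For anyone who wants to rerun the comparison outside IPython, here is a minimal standalone sketch using the standard-library `timeit` module. It makes a few assumptions that differ from the session above: the arrays come from `pd.period_range(...).array` with daily frequency rather than `period_array(..., freq='B')`, and the helper names `concat_series` and `concat_arrays` are made up for illustration. Absolute numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Assumption: daily frequency instead of the freq='B' used in the report.
x = pd.period_range("2000-01-01", periods=1000, freq="D").array
y = pd.period_range("2010-01-01", periods=1000, freq="D").array
s = pd.Series(x)
t = pd.Series(y)


def concat_series():
    # Public API path: Series in, Series out.
    return pd.concat([s, t], ignore_index=True)


def concat_arrays():
    # Private ExtensionArray classmethod, called directly on the arrays.
    return type(x)._concat_same_type([x, y])


n = 200
# Take the best of 5 repeats and convert to microseconds per call.
series_us = min(timeit.repeat(concat_series, number=n, repeat=5)) / n * 1e6
arrays_us = min(timeit.repeat(concat_arrays, number=n, repeat=5)) / n * 1e6

print(f"pd.concat:         {series_us:.1f} µs per call")
print(f"_concat_same_type: {arrays_us:.1f} µs per call")
print(f"ratio:             {series_us / arrays_us:.1f}x")
```

Taking the minimum over repeats is a deliberate choice here, since it is less sensitive to background noise than the mean that `%timeit` reports.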

TomAugspurger added the Performance and Reshaping labels Oct 26, 2018
TomAugspurger added this to the Contributions Welcome milestone Oct 26, 2018

qwhelan commented Oct 29, 2018

Based on my testing, this unfortunately isn't specific to PeriodArray. It's a more general performance issue caused by some unnecessary DataFrame creation when calling pd.concat() on only Series objects. I have a simple fix for one hot spot and am chasing down another now.
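
One rough way to check that the overhead is not tied to the period dtype is to repeat the measurement with plain int64 data, comparing `pd.concat` on two Series against `np.concatenate` on the underlying ndarrays. This is only a sketch: the data and the helper names `concat_series` and `concat_ndarrays` are assumptions, and it does not confirm the DataFrame-creation diagnosis, only that a gap also shows up for a non-period dtype.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: plain int64 data, so no extension array is involved.
rng = np.random.default_rng(0)
a = rng.integers(0, 100, size=1000)
b = rng.integers(0, 100, size=1000)
s = pd.Series(a)
t = pd.Series(b)


def concat_series():
    return pd.concat([s, t], ignore_index=True)


def concat_ndarrays():
    return np.concatenate([a, b])


n = 200
# Best of 5 repeats, converted to microseconds per call.
series_us = min(timeit.repeat(concat_series, number=n, repeat=5)) / n * 1e6
ndarray_us = min(timeit.repeat(concat_ndarrays, number=n, repeat=5)) / n * 1e6

print(f"pd.concat:      {series_us:.1f} µs per call")
print(f"np.concatenate: {ndarray_us:.1f} µs per call")
```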

TomAugspurger commented

With #23404, we're down to about 2x slower:

In [4]: %timeit x._concat_same_type([x, y])
99.8 µs ± 2.25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit pd.concat([s, t], ignore_index=True, copy=False)
194 µs ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

I looked very briefly at the snakeviz profile after that was merged. A decent chunk of it was in creating an Index. That's necessarily going to slow pd.concat down, but could perhaps be improved a bit.
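
A profile like the one described above can be collected with the standard-library `cProfile` module, which produces the data that snakeviz visualizes. The sketch below prints the top entries by cumulative time as text instead; the setup (daily-frequency periods, 500 iterations, 15 rows) is an assumption chosen for illustration, and the functions that show up will depend on the pandas version installed.

```python
import cProfile
import io
import pstats

import pandas as pd

# Assumption: daily frequency instead of the freq='B' used in the report.
x = pd.period_range("2000-01-01", periods=1000, freq="D").array
y = pd.period_range("2010-01-01", periods=1000, freq="D").array
s = pd.Series(x)
t = pd.Series(y)

profiler = cProfile.Profile()
profiler.enable()
for _ in range(500):
    pd.concat([s, t], ignore_index=True)
profiler.disable()

# Render the 15 entries with the highest cumulative time into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(15)
report = buffer.getvalue()
print(report)
```

To inspect the same data in snakeviz, the profile can be written to disk with `profiler.dump_stats(...)` and the resulting file opened with the `snakeviz` command.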

mroeschke added the Period label Dec 23, 2018

arw2019 commented Sep 24, 2020

On 1.2 master the difference is about 10x, though that appears to be because PeriodArray._concat_same_type got a significant (roughly 10x) speed-up that the improvement in pd.concat hasn't matched:

In [2]: %timeit x._concat_same_type([x, y])
9.53 µs ± 29.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [3]: %timeit pd.concat([s, t], ignore_index=True)
100 µs ± 2.59 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
This was referenced Mar 29, 2023
jbrockmendel commented

Still very close to 12x on main following #52290 and #52291, but the timings are now 4 µs for PeriodArray._concat_same_type and 47 µs for pd.concat.

About 11% of the runtime of concat is in __finalize__, xref #51280. Other than that I'm out of ideas on optimizing this any further.
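
A figure like the `__finalize__` share can be estimated from a `cProfile` run by summing the cumulative time of every profiled function with that name and dividing by the total profiled time. This is a sketch under assumptions (daily-frequency periods, 500 iterations), not necessarily how the 11% above was obtained, and profiler overhead tends to inflate the share of small, frequently called functions.

```python
import cProfile
import pstats

import pandas as pd

# Assumption: daily frequency instead of the freq='B' used in the report.
x = pd.period_range("2000-01-01", periods=1000, freq="D").array
y = pd.period_range("2010-01-01", periods=1000, freq="D").array
s = pd.Series(x)
t = pd.Series(y)

profiler = cProfile.Profile()
profiler.enable()
for _ in range(500):
    pd.concat([s, t], ignore_index=True)
profiler.disable()

stats = pstats.Stats(profiler)
# stats.stats maps (filename, line, function name) to
# (primitive calls, total calls, own time, cumulative time, callers).
finalize_time = sum(
    timing[3]
    for (_, _, func_name), timing in stats.stats.items()
    if func_name == "__finalize__"
)
share = finalize_time / stats.total_tt
print(f"__finalize__ share of profiled time: {share:.1%}")
```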

jbrockmendel added the Closing Candidate label May 3, 2023