-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed Inconsistent GroupBy Output Shape with Duplicate Column Labels #29124
Conversation
Quick summary of this change - a lot of groupby functions would store results in a dict where the label was the key and the result of the groupby op was in the values. That didn't work with duplicate labels, so this instead generates a list of series which are concatted together in the wrapping functions. Still need to run some benchmarks, but if we can settle on this or something similar for iterating columns it could make the entire set of modules much easier and consistent |
Some regressions introduced here outlined below - will see how to best tackle before after ratio
[9f038370] [5a3fcd7f]
<grpby-dup-cols~9^2> <grpby-dup-cols>
+ 1.78±0.01ms 2.16±0.1ms 1.22 groupby.RankWithTies.time_rank_ties('float32', 'min')
+ 1.76±0.02ms 2.14±0.06ms 1.22 groupby.RankWithTies.time_rank_ties('float64', 'first')
+ 1.83±0.02ms 2.21±0.1ms 1.21 groupby.RankWithTies.time_rank_ties('int64', 'max')
+ 1.73±0.04ms 2.08±0.04ms 1.20 groupby.RankWithTies.time_rank_ties('int64', 'dense')
+ 1.79±0.01ms 2.14±0.03ms 1.19 groupby.RankWithTies.time_rank_ties('datetime64', 'first')
+ 1.75±0.04ms 2.05±0.03ms 1.17 groupby.RankWithTies.time_rank_ties('int64', 'first')
+ 1.78±0.02ms 2.07±0.05ms 1.17 groupby.RankWithTies.time_rank_ties('float32', 'max')
+ 1.76±0.04ms 2.05±0.02ms 1.16 groupby.RankWithTies.time_rank_ties('datetime64', 'dense')
+ 1.79±0.01ms 2.09±0.03ms 1.16 groupby.RankWithTies.time_rank_ties('float32', 'dense')
+ 1.73±0.03ms 2.01±0.03ms 1.16 groupby.RankWithTies.time_rank_ties('float64', 'average')
+ 148±2μs 172±10μs 1.16 groupby.GroupByMethods.time_dtype_as_group('int', 'any', 'transformation')
+ 147±1μs 170±8μs 1.15 groupby.GroupByMethods.time_dtype_as_group('int', 'any', 'direct')
+ 1.79±0.01ms 2.06±0.03ms 1.15 groupby.RankWithTies.time_rank_ties('datetime64', 'min')
+ 149±4μs 171±2μs 1.15 groupby.GroupByMethods.time_dtype_as_group('float', 'any', 'transformation')
+ 1.80±0.02ms 2.07±0.02ms 1.15 groupby.RankWithTies.time_rank_ties('int64', 'min')
+ 1.79±0.02ms 2.04±0.06ms 1.14 groupby.RankWithTies.time_rank_ties('float32', 'average')
+ 147±1μs 167±2μs 1.14 groupby.GroupByMethods.time_dtype_as_group('datetime', 'any', 'transformation')
+ 1.77±0.02ms 2.02±0.04ms 1.14 groupby.RankWithTies.time_rank_ties('float64', 'min')
+ 143±2μs 162±3μs 1.13 groupby.GroupByMethods.time_dtype_as_group('datetime', 'shift', 'transformation')
+ 145±2μs 164±2μs 1.13 groupby.GroupByMethods.time_dtype_as_group('object', 'any', 'transformation')
+ 144±3μs 163±1μs 1.13 groupby.GroupByMethods.time_dtype_as_field('float', 'any', 'transformation')
+ 143±2μs 161±1μs 1.13 groupby.GroupByMethods.time_dtype_as_group('object', 'all', 'transformation')
+ 1.79±0ms 2.02±0.03ms 1.13 groupby.RankWithTies.time_rank_ties('int64', 'average')
+ 146±2μs 165±3μs 1.13 groupby.GroupByMethods.time_dtype_as_field('float', 'all', 'transformation')
+ 145±3μs 163±5μs 1.13 groupby.GroupByMethods.time_dtype_as_field('int', 'any', 'direct')
+ 142±2μs 160±2μs 1.13 groupby.GroupByMethods.time_dtype_as_field('datetime', 'shift', 'transformation')
+ 144±2μs 162±2μs 1.13 groupby.GroupByMethods.time_dtype_as_field('float', 'all', 'direct')
+ 145±2μs 164±2μs 1.13 groupby.GroupByMethods.time_dtype_as_group('float', 'shift', 'transformation')
+ 150±2μs 168±2μs 1.12 groupby.GroupByMethods.time_dtype_as_group('datetime', 'all', 'direct')
+ 1.77±0.02ms 1.98±0.02ms 1.12 groupby.RankWithTies.time_rank_ties('float64', 'max')
+ 145±2μs 162±3μs 1.12 groupby.GroupByMethods.time_dtype_as_field('datetime', 'any', 'direct')
+ 148±2μs 165±2μs 1.12 groupby.GroupByMethods.time_dtype_as_group('datetime', 'any', 'direct')
+ 146±2μs 163±3μs 1.12 groupby.GroupByMethods.time_dtype_as_field('int', 'all', 'transformation')
+ 144±0.7μs 161±1μs 1.12 groupby.GroupByMethods.time_dtype_as_field('float', 'shift', 'transformation')
+ 238±4μs 265±8μs 1.11 groupby.GroupByMethods.time_dtype_as_field('float', 'max', 'direct')
+ 144±2μs 160±2μs 1.11 groupby.GroupByMethods.time_dtype_as_field('int', 'any', 'transformation')
+ 151±2μs 169±2μs 1.11 groupby.GroupByMethods.time_dtype_as_group('float', 'any', 'direct')
+ 155±0.6μs 172±3μs 1.11 groupby.GroupByMethods.time_dtype_as_field('int', 'shift', 'direct')
+ 145±2μs 161±1μs 1.11 groupby.GroupByMethods.time_dtype_as_field('datetime', 'any', 'transformation')
+ 144±1μs 160±0.7μs 1.11 groupby.GroupByMethods.time_dtype_as_field('float', 'shift', 'direct')
+ 148±2μs 165±3μs 1.11 groupby.GroupByMethods.time_dtype_as_group('float', 'shift', 'direct')
+ 145±0.6μs 161±3μs 1.11 groupby.GroupByMethods.time_dtype_as_field('float', 'any', 'direct')
+ 186±1μs 206±5μs 1.11 groupby.GroupByMethods.time_dtype_as_group('object', 'bfill', 'transformation')
+ 152±2μs 168±1μs 1.11 groupby.GroupByMethods.time_dtype_as_group('datetime', 'all', 'transformation')
+ 214±2μs 236±2μs 1.11 groupby.GroupByMethods.time_dtype_as_field('datetime', 'first', 'transformation')
+ 148±1μs 163±2μs 1.11 groupby.GroupByMethods.time_dtype_as_field('datetime', 'all', 'transformation')
+ 185±2μs 205±2μs 1.10 groupby.GroupByMethods.time_dtype_as_group('datetime', 'ffill', 'direct')
+ 269±2μs 296±8μs 1.10 groupby.GroupByMethods.time_dtype_as_group('object', 'first', 'transformation')
+ 148±2μs 163±2μs 1.10 groupby.GroupByMethods.time_dtype_as_group('int', 'all', 'transformation')
+ 226±5μs 249±7μs 1.10 groupby.GroupByMethods.time_dtype_as_field('datetime', 'min', 'transformation')
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED. |
glad to see you're on top of this. |
Changed methodology to populate a dict where the keys are the position of the items in the columns, and a corresponding Here are updated benchmarks. Not sure which of these are legit or which are just noise . before after ratio
[ef77b570] [9eb7c739]
<krey-master^2> <grpby-dup-cols>
+ 142±0.4μs 190±20μs 1.34 groupby.GroupByMethods.time_dtype_as_group('float', 'shift', 'direct')
+ 294±3μs 389±20μs 1.32 groupby.GroupByMethods.time_dtype_as_field('int', 'max', 'direct')
+ 144±0.9μs 181±20μs 1.26 groupby.GroupByMethods.time_dtype_as_group('float', 'shift', 'transformation')
+ 144±2μs 176±10μs 1.22 groupby.GroupByMethods.time_dtype_as_group('float', 'any', 'transformation')
+ 502±2μs 607±60μs 1.21 groupby.GroupByMethods.time_dtype_as_group('float', 'sum', 'direct')
+ 469±1μs 567±30μs 1.21 groupby.GroupByMethods.time_dtype_as_group('int', 'mean', 'transformation')
+ 138±0.7μs 166±10μs 1.21 groupby.GroupByMethods.time_dtype_as_group('object', 'shift', 'transformation')
+ 293±0.9μs 345±30μs 1.18 groupby.GroupByMethods.time_dtype_as_group('float', 'min', 'transformation')
+ 138±0.7μs 162±0.9μs 1.17 groupby.GroupByMethods.time_dtype_as_group('object', 'all', 'direct')
+ 141±0.3μs 164±2μs 1.17 groupby.GroupByMethods.time_dtype_as_group('datetime', 'shift', 'transformation')
+ 1.70±0.01ms 1.97±0.09ms 1.16 groupby.RankWithTies.time_rank_ties('datetime64', 'first')
+ 139±0.2μs 160±0.3μs 1.15 groupby.GroupByMethods.time_dtype_as_field('datetime', 'shift', 'direct')
+ 140±0.2μs 160±2μs 1.15 groupby.GroupByMethods.time_dtype_as_group('object', 'any', 'transformation')
+ 144±0.9μs 165±0.9μs 1.14 groupby.GroupByMethods.time_dtype_as_group('float', 'any', 'direct')
+ 138±0.5μs 157±1μs 1.14 groupby.GroupByMethods.time_dtype_as_group('object', 'all', 'transformation')
+ 1.72±0.01ms 1.95±0.2ms 1.13 groupby.RankWithTies.time_rank_ties('int64', 'min')
+ 140±2μs 158±0.8μs 1.13 groupby.GroupByMethods.time_dtype_as_group('object', 'shift', 'direct')
+ 286±1μs 323±20μs 1.13 groupby.GroupByMethods.time_dtype_as_group('float', 'bfill', 'direct')
+ 180±0.4μs 203±3μs 1.13 groupby.GroupByMethods.time_dtype_as_group('object', 'bfill', 'transformation')
+ 1.03±0ms 1.16±0.04ms 1.13 groupby.GroupByMethods.time_dtype_as_field('int', 'value_counts', 'transformation')
+ 229±1μs 257±2μs 1.12 groupby.GroupByMethods.time_dtype_as_field('float', 'prod', 'direct')
+ 427±2μs 480±40μs 1.12 groupby.GroupByMethods.time_dtype_as_field('float', 'quantile', 'direct')
+ 284±4μs 319±20μs 1.12 groupby.GroupByMethods.time_dtype_as_group('int', 'first', 'transformation')
+ 506±4μs 568±60μs 1.12 groupby.GroupByMethods.time_dtype_as_group('float', 'prod', 'direct')
+ 179±0.8μs 201±0.7μs 1.12 groupby.GroupByMethods.time_dtype_as_group('object', 'bfill', 'direct')
+ 144±6μs 161±10μs 1.12 groupby.GroupByMethods.time_dtype_as_field('datetime', 'all', 'transformation')
+ 911±2μs 1.02±0.06ms 1.12 groupby.GroupByMethods.time_dtype_as_group('float', 'sem', 'direct')
+ 1.67±0.01ms 1.86±0.06ms 1.12 groupby.RankWithTies.time_rank_ties('float64', 'average')
+ 198±0.7μs 220±2μs 1.11 groupby.GroupByMethods.time_dtype_as_field('datetime', 'last', 'transformation')
+ 181±0.5μs 202±3μs 1.11 groupby.GroupByMethods.time_dtype_as_group('object', 'ffill', 'direct')
+ 233±2μs 260±20μs 1.11 groupby.GroupByMethods.time_dtype_as_field('float', 'prod', 'transformation')
+ 292±0.9μs 325±20μs 1.11 groupby.GroupByMethods.time_dtype_as_group('float', 'min', 'direct')
+ 442±4μs 491±20μs 1.11 groupby.GroupByMethods.time_dtype_as_group('float', 'std', 'transformation')
+ 181±2μs 201±3μs 1.11 groupby.GroupByMethods.time_dtype_as_group('object', 'ffill', 'transformation')
+ 213±2μs 235±3μs 1.10 groupby.GroupByMethods.time_dtype_as_field('datetime', 'first', 'direct')
+ 183±1μs 202±0.8μs 1.10 groupby.GroupByMethods.time_dtype_as_group('datetime', 'ffill', 'direct')
+ 228±4μs 251±10μs 1.10 groupby.GroupByMethods.time_dtype_as_field('float', 'first', 'direct')
+ 395±1μs 434±60μs 1.10 groupby.GroupByMethods.time_dtype_as_field('object', 'bfill', 'direct')
- 438±10μs 397±3μs 0.91 groupby.GroupByMethods.time_dtype_as_group('float', 'head', 'direct')
- 2.50±0.04ms 2.25±0.03ms 0.90 groupby.GroupManyLabels.time_sum(1)
- 178±5μs 156±0.6μs 0.87 groupby.GroupByMethods.time_dtype_as_field('datetime', 'count', 'direct')
- 308±10μs 269±1μs 0.87 groupby.GroupByMethods.time_dtype_as_field('float', 'median', 'direct')
- 1.16±0.04ms 981±5μs 0.85 groupby.GroupByMethods.time_dtype_as_field('datetime', 'value_counts', 'transformation')
- 152±10μs 127±0.5μs 0.84 groupby.GroupByMethods.time_dtype_as_field('int', 'size', 'direct')
- 192±20μs 160±5μs 0.83 groupby.GroupByMethods.time_dtype_as_field('int', 'count', 'direct')
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED. |
Assuming this fixes actually-incorrect behavior, then the perf impact shown here wouldn't be a deal-breaker |
Yea totally. I'm going to run again and see though; hopeful it was just noise |
pandas/core/groupby/generic.py
Outdated
element. The exception is operations that expand dimensions, like ohlc. | ||
""" | ||
assert len(output) == len(columns) | ||
assert list(output.keys()) == sorted(output.keys()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn’t this what you had asked me to add asserts for previously? If not then I am confused as to what asserts you wanted
not at all. an assert about a property of a dict is not very useful.
I asked about an assert about the sortness of output. you are assigning .columns = columns. The keys must be in the same order. There is no guarantee on this at all (except that dicts are sorted in > 3.6 and I think you must be adding them in the same order). but this is non-obvious, and non-trivial.
@WillAyd I think this is ready, but if some of the assertions etc are sticking points, are there parts of this that could be done in follow-ups? |
I can address assertions here
…Sent from my iPhone
On Nov 18, 2019, at 3:04 PM, jbrockmendel ***@***.***> wrote:
@WillAyd I think this is ready, but if some of the assertions etc are sticking points, are there parts of this that could be done in follow-ups?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
OK @jreback lmk what you think about this. Instead of handling the data and column labels in two separate objects I have created a namedtuple to house both and use that as the unique key to build groupby results. These are tightly bundled until the very end, so I think should alleviate prior concerns |
kooks better ; |
thanks @WillAyd nice cleanup! |
nice! this was a long slog, well done on seeing it through |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff