Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed Inconsistent GroupBy Output Shape with Duplicate Column Labels #29124

Merged
merged 61 commits into from
Nov 20, 2019

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Oct 21, 2019

@WillAyd
Copy link
Member Author

WillAyd commented Oct 21, 2019

Quick summary of this change - a lot of groupby functions would store results in a dict where the label was the key and the result of the groupby op was in the values. That didn't work with duplicate labels, so this instead generates a list of series which are concatted together in the wrapping functions.

Still need to run some benchmarks, but if we can settle on this or something similar for iterating columns it could make the entire set of modules much easier and consistent

@WillAyd
Copy link
Member Author

WillAyd commented Oct 21, 2019

Some regressions introduced here outlined below - will see how to best tackle

       before           after         ratio
     [9f038370]       [5a3fcd7f]
     <grpby-dup-cols~9^2>       <grpby-dup-cols>
+     1.78±0.01ms       2.16±0.1ms     1.22  groupby.RankWithTies.time_rank_ties('float32', 'min')
+     1.76±0.02ms      2.14±0.06ms     1.22  groupby.RankWithTies.time_rank_ties('float64', 'first')
+     1.83±0.02ms       2.21±0.1ms     1.21  groupby.RankWithTies.time_rank_ties('int64', 'max')
+     1.73±0.04ms      2.08±0.04ms     1.20  groupby.RankWithTies.time_rank_ties('int64', 'dense')
+     1.79±0.01ms      2.14±0.03ms     1.19  groupby.RankWithTies.time_rank_ties('datetime64', 'first')
+     1.75±0.04ms      2.05±0.03ms     1.17  groupby.RankWithTies.time_rank_ties('int64', 'first')
+     1.78±0.02ms      2.07±0.05ms     1.17  groupby.RankWithTies.time_rank_ties('float32', 'max')
+     1.76±0.04ms      2.05±0.02ms     1.16  groupby.RankWithTies.time_rank_ties('datetime64', 'dense')
+     1.79±0.01ms      2.09±0.03ms     1.16  groupby.RankWithTies.time_rank_ties('float32', 'dense')
+     1.73±0.03ms      2.01±0.03ms     1.16  groupby.RankWithTies.time_rank_ties('float64', 'average')
+         148±2μs         172±10μs     1.16  groupby.GroupByMethods.time_dtype_as_group('int', 'any', 'transformation')
+         147±1μs          170±8μs     1.15  groupby.GroupByMethods.time_dtype_as_group('int', 'any', 'direct')
+     1.79±0.01ms      2.06±0.03ms     1.15  groupby.RankWithTies.time_rank_ties('datetime64', 'min')
+         149±4μs          171±2μs     1.15  groupby.GroupByMethods.time_dtype_as_group('float', 'any', 'transformation')
+     1.80±0.02ms      2.07±0.02ms     1.15  groupby.RankWithTies.time_rank_ties('int64', 'min')
+     1.79±0.02ms      2.04±0.06ms     1.14  groupby.RankWithTies.time_rank_ties('float32', 'average')
+         147±1μs          167±2μs     1.14  groupby.GroupByMethods.time_dtype_as_group('datetime', 'any', 'transformation')
+     1.77±0.02ms      2.02±0.04ms     1.14  groupby.RankWithTies.time_rank_ties('float64', 'min')
+         143±2μs          162±3μs     1.13  groupby.GroupByMethods.time_dtype_as_group('datetime', 'shift', 'transformation')
+         145±2μs          164±2μs     1.13  groupby.GroupByMethods.time_dtype_as_group('object', 'any', 'transformation')
+         144±3μs          163±1μs     1.13  groupby.GroupByMethods.time_dtype_as_field('float', 'any', 'transformation')
+         143±2μs          161±1μs     1.13  groupby.GroupByMethods.time_dtype_as_group('object', 'all', 'transformation')
+        1.79±0ms      2.02±0.03ms     1.13  groupby.RankWithTies.time_rank_ties('int64', 'average')
+         146±2μs          165±3μs     1.13  groupby.GroupByMethods.time_dtype_as_field('float', 'all', 'transformation')
+         145±3μs          163±5μs     1.13  groupby.GroupByMethods.time_dtype_as_field('int', 'any', 'direct')
+         142±2μs          160±2μs     1.13  groupby.GroupByMethods.time_dtype_as_field('datetime', 'shift', 'transformation')
+         144±2μs          162±2μs     1.13  groupby.GroupByMethods.time_dtype_as_field('float', 'all', 'direct')
+         145±2μs          164±2μs     1.13  groupby.GroupByMethods.time_dtype_as_group('float', 'shift', 'transformation')
+         150±2μs          168±2μs     1.12  groupby.GroupByMethods.time_dtype_as_group('datetime', 'all', 'direct')
+     1.77±0.02ms      1.98±0.02ms     1.12  groupby.RankWithTies.time_rank_ties('float64', 'max')
+         145±2μs          162±3μs     1.12  groupby.GroupByMethods.time_dtype_as_field('datetime', 'any', 'direct')
+         148±2μs          165±2μs     1.12  groupby.GroupByMethods.time_dtype_as_group('datetime', 'any', 'direct')
+         146±2μs          163±3μs     1.12  groupby.GroupByMethods.time_dtype_as_field('int', 'all', 'transformation')
+       144±0.7μs          161±1μs     1.12  groupby.GroupByMethods.time_dtype_as_field('float', 'shift', 'transformation')
+         238±4μs          265±8μs     1.11  groupby.GroupByMethods.time_dtype_as_field('float', 'max', 'direct')
+         144±2μs          160±2μs     1.11  groupby.GroupByMethods.time_dtype_as_field('int', 'any', 'transformation')
+         151±2μs          169±2μs     1.11  groupby.GroupByMethods.time_dtype_as_group('float', 'any', 'direct')
+       155±0.6μs          172±3μs     1.11  groupby.GroupByMethods.time_dtype_as_field('int', 'shift', 'direct')
+         145±2μs          161±1μs     1.11  groupby.GroupByMethods.time_dtype_as_field('datetime', 'any', 'transformation')
+         144±1μs        160±0.7μs     1.11  groupby.GroupByMethods.time_dtype_as_field('float', 'shift', 'direct')
+         148±2μs          165±3μs     1.11  groupby.GroupByMethods.time_dtype_as_group('float', 'shift', 'direct')
+       145±0.6μs          161±3μs     1.11  groupby.GroupByMethods.time_dtype_as_field('float', 'any', 'direct')
+         186±1μs          206±5μs     1.11  groupby.GroupByMethods.time_dtype_as_group('object', 'bfill', 'transformation')
+         152±2μs          168±1μs     1.11  groupby.GroupByMethods.time_dtype_as_group('datetime', 'all', 'transformation')
+         214±2μs          236±2μs     1.11  groupby.GroupByMethods.time_dtype_as_field('datetime', 'first', 'transformation')
+         148±1μs          163±2μs     1.11  groupby.GroupByMethods.time_dtype_as_field('datetime', 'all', 'transformation')
+         185±2μs          205±2μs     1.10  groupby.GroupByMethods.time_dtype_as_group('datetime', 'ffill', 'direct')
+         269±2μs          296±8μs     1.10  groupby.GroupByMethods.time_dtype_as_group('object', 'first', 'transformation')
+         148±2μs          163±2μs     1.10  groupby.GroupByMethods.time_dtype_as_group('int', 'all', 'transformation')
+         226±5μs          249±7μs     1.10  groupby.GroupByMethods.time_dtype_as_field('datetime', 'min', 'transformation')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@jbrockmendel
Copy link
Member

That didn't work with duplicate labels

glad to see you're on top of this.

@WillAyd
Copy link
Member Author

WillAyd commented Oct 22, 2019

Changed methodology to populate a dict where the keys are the position of the items in the columns, and a corresponding names list with the appropriate labels. This seems a little faster than generating a list of Series and concatenating

Here are updated benchmarks. Not sure which of these are legit or which are just noise

.       before           after         ratio
     [ef77b570]       [9eb7c739]
     <krey-master^2>       <grpby-dup-cols>
+       142±0.4μs         190±20μs     1.34  groupby.GroupByMethods.time_dtype_as_group('float', 'shift', 'direct')
+         294±3μs         389±20μs     1.32  groupby.GroupByMethods.time_dtype_as_field('int', 'max', 'direct')
+       144±0.9μs         181±20μs     1.26  groupby.GroupByMethods.time_dtype_as_group('float', 'shift', 'transformation')
+         144±2μs         176±10μs     1.22  groupby.GroupByMethods.time_dtype_as_group('float', 'any', 'transformation')
+         502±2μs         607±60μs     1.21  groupby.GroupByMethods.time_dtype_as_group('float', 'sum', 'direct')
+         469±1μs         567±30μs     1.21  groupby.GroupByMethods.time_dtype_as_group('int', 'mean', 'transformation')
+       138±0.7μs         166±10μs     1.21  groupby.GroupByMethods.time_dtype_as_group('object', 'shift', 'transformation')
+       293±0.9μs         345±30μs     1.18  groupby.GroupByMethods.time_dtype_as_group('float', 'min', 'transformation')
+       138±0.7μs        162±0.9μs     1.17  groupby.GroupByMethods.time_dtype_as_group('object', 'all', 'direct')
+       141±0.3μs          164±2μs     1.17  groupby.GroupByMethods.time_dtype_as_group('datetime', 'shift', 'transformation')
+     1.70±0.01ms      1.97±0.09ms     1.16  groupby.RankWithTies.time_rank_ties('datetime64', 'first')
+       139±0.2μs        160±0.3μs     1.15  groupby.GroupByMethods.time_dtype_as_field('datetime', 'shift', 'direct')
+       140±0.2μs          160±2μs     1.15  groupby.GroupByMethods.time_dtype_as_group('object', 'any', 'transformation')
+       144±0.9μs        165±0.9μs     1.14  groupby.GroupByMethods.time_dtype_as_group('float', 'any', 'direct')
+       138±0.5μs          157±1μs     1.14  groupby.GroupByMethods.time_dtype_as_group('object', 'all', 'transformation')
+     1.72±0.01ms       1.95±0.2ms     1.13  groupby.RankWithTies.time_rank_ties('int64', 'min')
+         140±2μs        158±0.8μs     1.13  groupby.GroupByMethods.time_dtype_as_group('object', 'shift', 'direct')
+         286±1μs         323±20μs     1.13  groupby.GroupByMethods.time_dtype_as_group('float', 'bfill', 'direct')
+       180±0.4μs          203±3μs     1.13  groupby.GroupByMethods.time_dtype_as_group('object', 'bfill', 'transformation')
+        1.03±0ms      1.16±0.04ms     1.13  groupby.GroupByMethods.time_dtype_as_field('int', 'value_counts', 'transformation')
+         229±1μs          257±2μs     1.12  groupby.GroupByMethods.time_dtype_as_field('float', 'prod', 'direct')
+         427±2μs         480±40μs     1.12  groupby.GroupByMethods.time_dtype_as_field('float', 'quantile', 'direct')
+         284±4μs         319±20μs     1.12  groupby.GroupByMethods.time_dtype_as_group('int', 'first', 'transformation')
+         506±4μs         568±60μs     1.12  groupby.GroupByMethods.time_dtype_as_group('float', 'prod', 'direct')
+       179±0.8μs        201±0.7μs     1.12  groupby.GroupByMethods.time_dtype_as_group('object', 'bfill', 'direct')
+         144±6μs         161±10μs     1.12  groupby.GroupByMethods.time_dtype_as_field('datetime', 'all', 'transformation')
+         911±2μs      1.02±0.06ms     1.12  groupby.GroupByMethods.time_dtype_as_group('float', 'sem', 'direct')
+     1.67±0.01ms      1.86±0.06ms     1.12  groupby.RankWithTies.time_rank_ties('float64', 'average')
+       198±0.7μs          220±2μs     1.11  groupby.GroupByMethods.time_dtype_as_field('datetime', 'last', 'transformation')
+       181±0.5μs          202±3μs     1.11  groupby.GroupByMethods.time_dtype_as_group('object', 'ffill', 'direct')
+         233±2μs         260±20μs     1.11  groupby.GroupByMethods.time_dtype_as_field('float', 'prod', 'transformation')
+       292±0.9μs         325±20μs     1.11  groupby.GroupByMethods.time_dtype_as_group('float', 'min', 'direct')
+         442±4μs         491±20μs     1.11  groupby.GroupByMethods.time_dtype_as_group('float', 'std', 'transformation')
+         181±2μs          201±3μs     1.11  groupby.GroupByMethods.time_dtype_as_group('object', 'ffill', 'transformation')
+         213±2μs          235±3μs     1.10  groupby.GroupByMethods.time_dtype_as_field('datetime', 'first', 'direct')
+         183±1μs        202±0.8μs     1.10  groupby.GroupByMethods.time_dtype_as_group('datetime', 'ffill', 'direct')
+         228±4μs         251±10μs     1.10  groupby.GroupByMethods.time_dtype_as_field('float', 'first', 'direct')
+         395±1μs         434±60μs     1.10  groupby.GroupByMethods.time_dtype_as_field('object', 'bfill', 'direct')
-        438±10μs          397±3μs     0.91  groupby.GroupByMethods.time_dtype_as_group('float', 'head', 'direct')
-     2.50±0.04ms      2.25±0.03ms     0.90  groupby.GroupManyLabels.time_sum(1)
-         178±5μs        156±0.6μs     0.87  groupby.GroupByMethods.time_dtype_as_field('datetime', 'count', 'direct')
-        308±10μs          269±1μs     0.87  groupby.GroupByMethods.time_dtype_as_field('float', 'median', 'direct')
-     1.16±0.04ms          981±5μs     0.85  groupby.GroupByMethods.time_dtype_as_field('datetime', 'value_counts', 'transformation')
-        152±10μs        127±0.5μs     0.84  groupby.GroupByMethods.time_dtype_as_field('int', 'size', 'direct')
-        192±20μs          160±5μs     0.83  groupby.GroupByMethods.time_dtype_as_field('int', 'count', 'direct')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@jbrockmendel
Copy link
Member

Assuming this fixes actually-incorrect behavior, then the perf impact shown here wouldn't be a deal-breaker

@WillAyd
Copy link
Member Author

WillAyd commented Oct 22, 2019

Yea totally. I'm going to run again and see though; hopeful it was just noise

@jbrockmendel jbrockmendel mentioned this pull request Nov 18, 2019
element. The exception is operations that expand dimensions, like ohlc.
"""
assert len(output) == len(columns)
assert list(output.keys()) == sorted(output.keys())
Copy link
Contributor

@jreback jreback Nov 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn’t this what you had asked me to add asserts for previously? If not then I am confused as to what asserts you wanted

not at all. an assert about a property of a dict is not very useful.

I asked about an assert about the sortness of output. you are assigning .columns = columns. The keys must be in the same order. There is no guarantee on this at all (except that dicts are sorted in > 3.6 and I think you must be adding them in the same order). but this is non-obvious, and non-trivial.

@jbrockmendel
Copy link
Member

@WillAyd I think this is ready, but if some of the assertions etc are sticking points, are there parts of this that could be done in follow-ups?

@WillAyd
Copy link
Member Author

WillAyd commented Nov 18, 2019 via email

@WillAyd
Copy link
Member Author

WillAyd commented Nov 19, 2019

OK @jreback lmk what you think about this. Instead of handling the data and column labels in two separate objects I have created a namedtuple to house both and use that as the unique key to build groupby results. These are tightly bundled until the very end, so I think should alleviate prior concerns

@jreback
Copy link
Contributor

jreback commented Nov 19, 2019

kooks better ;
will have a more thorough look tomorrow

@jreback jreback merged commit bbc7173 into pandas-dev:master Nov 20, 2019
@jreback
Copy link
Contributor

jreback commented Nov 20, 2019

thanks @WillAyd nice cleanup!

@jbrockmendel
Copy link
Member

nice! this was a long slog, well done on seeing it through

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GroupBy any/all Fails with Duplicate Column Names
5 participants