CLN: use dispatch_to_series where possible #22534

Closed
wants to merge 6 commits into from

Conversation

jbrockmendel
Member

A bunch of PRs touching DataFrame ops have gone through recently. This does some follow-up cleanup to unify the way things are done across a few different methods.

@pep8speaks

pep8speaks commented Aug 29, 2018

Hello @jbrockmendel! Thanks for updating the PR.

Line 1646:13: W504 line break after binary operator

Comment last updated on September 07, 2018 at 16:19 Hours UTC

@jreback added the Numeric Operations (Arithmetic, Comparison, and Logical operations) and Clean labels Aug 31, 2018
@jreback added this to the 0.24.0 milestone Aug 31, 2018
@jreback
Contributor

jreback commented Aug 31, 2018

can you check perf. rebase as well.

@jbrockmendel
Member Author

> can you check perf.

First attempt to check perf turned up a bug in master:

import pandas as pd

dti = pd.date_range('2016-01-01', periods=10000)
tdi = pd.timedelta_range('1', periods=10000)
tser = pd.Series(tdi)
df = pd.DataFrame({0: dti, 1: tdi})
df.add(tser, axis=0)

Expected (which the PR gets right):

                              0                      1
0 2016-01-01 00:00:00.000000001 0 days 00:00:00.000000
1 2016-01-03 00:00:00.000000001 2 days 00:00:00.000000
2 2016-01-05 00:00:00.000000001 4 days 00:00:00.000000

master raises:

ValueError: operands could not be broadcast together with shapes (20000,) (10000,) 

I'll add a test for this.
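
A minimal sketch of the column-by-column idea behind the fix, assuming each column can be combined with the Series on its own. The helper name `add_columnwise` is hypothetical and is not pandas' internal `dispatch_to_series`; the frame is shrunk to 5 rows and the timedelta range starts at `'1 day'` purely for readability.

```python
import pandas as pd


def add_columnwise(df, ser):
    """Hypothetical helper: add `ser` to each column of `df`, one Series at a time."""
    # Each column is handled as its own Series, so the datetime64 and
    # timedelta64 columns never have to share a single 2D array.
    new_cols = {i: df.iloc[:, i] + ser for i in range(df.shape[1])}
    result = pd.DataFrame(new_cols, index=df.index)
    # Restore the original column labels.
    result.columns = df.columns
    return result


# Assumption: a small frame with the same dtype mix as the report above.
dti = pd.date_range('2016-01-01', periods=5)
tdi = pd.timedelta_range('1 day', periods=5)
tser = pd.Series(tdi)
df = pd.DataFrame({0: dti, 1: tdi})

result = add_columnwise(df, tser)
print(result.dtypes)
```

Because each column keeps its own dtype here, the two columns are never flattened together, which is what the `(20000,)` versus `(10000,)` shapes in the error message suggest happened on master.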

For the non-broken cases, first a many-column case where we expect master to perform well:

df = pd.DataFrame(np.random.randn(100000, 60))
df[10:20] = df[10:20].astype('f4')
df[20:30] = df[20:30].astype('i8')
df[30:40] = df[30:40].astype('i4')
df[40:50] = df[40:50].astype('u8')
df[50:60] = df[50:60].astype('u4')

In [29]: %timeit out = df.add(df[0], axis=0)
100 loops, best of 3: 18.1 ms per loop   <-- master
100 loops, best of 3: 18.3 ms per loop  <-- PR
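
One thing worth checking when reproducing the many-column setup: an integer slice inside `[]`, such as `df[10:20]`, selects rows rather than columns, so the frame above may not have ended up with mixed column dtypes. A minimal sketch of a column-wise variant, assuming mixed column dtypes were the intent; the reduced size, the non-negative data, and the use of only two target dtypes are choices made for this illustration.

```python
import numpy as np
import pandas as pd

# Non-negative data so integer casts stay well defined; the size is
# reduced from the thread's 100000 x 60 to keep the sketch quick.
df = pd.DataFrame(np.random.rand(1000, 60) * 100)

# An integer slice inside [] selects rows 10..19 across all 60 columns.
row_block = df[10:20]
print(row_block.shape)

# Casting one column at a time replaces that column, so the new dtype sticks.
for col in range(10, 20):
    df[col] = df[col].astype('f4')
for col in range(20, 30):
    df[col] = df[col].astype('i8')

# Count how many columns ended up with each dtype.
print(df.dtypes.value_counts())
```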

And a few-column case where we expect master to do poorly:

df = pd.DataFrame(np.random.randn(10000000, 6))
df[1] = df[1].astype('f4')
df[2] = df[2].astype('i8')
df[3] = df[3].astype('i4')
df[4] = df[4].astype('u8')
df[5] = df[5].astype('u4')

%timeit out = df.add(df[0], axis=0)
1 loop, best of 3: 903 ms per loop   <-- master
1 loop, best of 3: 582 ms per loop   <-- PR

As elsewhere, I expect perf to improve after #22284.
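
The `%timeit` lines above are IPython magics. Outside IPython, a rough equivalent with the standard-library `timeit` module might look like the sketch below. The helper name `time_add`, the smaller frame, and the non-negative random data are assumptions made for this illustration, so its numbers are not comparable to the timings quoted above.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 10,000,000-row frame in the thread so this runs quickly.
# Non-negative values keep the unsigned casts well defined.
df = pd.DataFrame(np.random.rand(100_000, 6) * 100)
for col, dtype in zip(range(1, 6), ['f4', 'i8', 'i4', 'u8', 'u4']):
    df[col] = df[col].astype(dtype)


def time_add(frame, repeat=3, number=5):
    """Hypothetical helper: best-of-`repeat` seconds per call of frame.add."""
    timings = timeit.repeat(
        lambda: frame.add(frame[0], axis=0), repeat=repeat, number=number
    )
    # Each timing covers `number` calls, so divide to get a per-call figure.
    return min(timings) / number


per_call = time_add(df)
print(f"{per_call * 1e3:.2f} ms per call")
```

Running the same script on two checkouts (master and the PR branch) would give a like-for-like comparison of the few-column case.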

@jreback
Contributor

jreback commented Sep 4, 2018

this duplicates #22572 a lot. is this the original?

@jbrockmendel
Member Author

> this duplicates #22572 a lot. is this the original?

This is the original. While profiling it I found that it fixes a previously unknown bug, so it now needs tests, a whatsnew entry, etc. #22572 splits off the still-easy part of this.

@jbrockmendel
Member Author

Fully finishing this off will require resolving #22614.

@codecov

codecov bot commented Sep 7, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@5eb9988).
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master   #22534   +/-   ##
=========================================
  Coverage          ?   92.05%           
=========================================
  Files             ?      169           
  Lines             ?    50787           
  Branches          ?        0           
=========================================
  Hits              ?    46753           
  Misses            ?     4034           
  Partials          ?        0

Flag                   Coverage Δ
#multiple              90.46% <100%> (?)
#single                42.29% <0%> (?)

Impacted Files         Coverage Δ
pandas/core/ops.py     96.91% <100%> (ø)
pandas/core/frame.py   97.2% <100%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5eb9988...8f0cdbc.

@jreback
Contributor

jreback commented Sep 8, 2018

can you rebase

@jbrockmendel
Member Author

> can you rebase

Sure. Big evening for merging. Note comment above about #22614.

@jbrockmendel
Member Author

Heh, looks like the other PRs merged this evening already cover this. Closing. Will need to follow up with a test for the bug that was accidentally fixed.

@jreback
Contributor

jreback commented Sep 8, 2018

great! I am not really sure about #22614, none of the options are really palatable.

@jbrockmendel
Member Author

> I am not really sure about #22614, none of the options are really palatable.

Well, we've de facto been going down the path of option 1. I actually prefer option 2 longer-term (better to discuss there), but for the time being correctness-first seems to favor option 1, and #22284 should take some of the pain out of it.
