BUG: 0/frame numeric ops buggy (GH9144) #9308
Conversation
see here: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#id31 also IIRC there was some discussion w.r.t. 0/0 (but might be in another issue). pls see if you can locate it.
Hmmm... I couldn't find any 0/0 discussion (there was a discussion in #3590 about mod 0 though), but in the release notes you linked to for v0.12.0, it said "Fix modulo and integer division on Series,DataFrames to act similarly to float dtypes to return np.nan or np.inf as appropriate" and it gave 4 examples. I think the fix (PR #3600) worked except for the third example as demonstrated here:
The results are the same for Python 2. This PR would correct this. I see NumPy has some strange things going on (although it sounds like they're discussing fixing this in numpy/numpy#899), for example,
but I haven't ever seen 0/0 being infinite. I think we should change that... BTW, I'll also write a release note once we decide how to handle this.
cc @seth-p IIRC didn't we have a discussion about this? what are your thoughts here?
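For context, a quick sketch of the NumPy inconsistency under discussion (this is plain NumPy, not the pandas code being changed): integer `0 // 0` silently yields 0, while float `0.0 / 0.0` yields nan.

```python
import numpy as np

# Suppress the RuntimeWarnings NumPy emits for zero division.
with np.errstate(divide="ignore", invalid="ignore"):
    int_result = np.array([0]) // np.array([0])       # integer floordiv
    float_result = np.array([0.0]) / np.array([0.0])  # float truediv

print(int_result)    # [0]
print(float_result)  # [nan]
```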
I agree that Is the issue with [A completely separate issue is that
missing and undefined values are both represented by nan. I know others have different ways of doing this - bottom line is pandas chose a long time ago to only have a single nan represent both - mainly this is a perf issue (having multiple missing value types)
@Garrett-R go ahead and make the change to make 0/0 be nan (both floats and ints)
@seth-p, right, integer types don't support NaN.
Given that we're already casting to float, and assuming no one's planning on changing that behavior, we might as well make it NaN.
@Garrett-R the int/float issue is already solved. Pandas handles this correctly. That said I am thinking about this again. So now
@jreback, I believe Anyway, just check out the different languages: C++
outputs
(the negative is irrelevant) Julia
R
Mathematica
pls add a release note (maybe a short example as well), just to make it obvious
Sorry for the delay. I've added a release note. I also modified one more thing. For modulus, there's an issue:
The modified behavior is:
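As a reference point for the modulus discussion above, here is the behavior that ended up shipping (checked against current pandas, not the pre-PR code): modulo by zero returns NaN throughout.

```python
import numpy as np
import pandas as pd

# Modulo by zero on a Series returns NaN for every element,
# regardless of the sign or value of the numerator.
s = pd.Series([1.0, 0.0, -1.0])
r = s % 0
print(r)  # NaN, NaN, NaN
```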
# Floor division must be treated specially since NumPy
# causes 0//0 to be 0, but we want it to be NaN. (PR 9308)
if "floordiv" in name:
    if not isinstance(x, np.ndarray) and x == 0:
you can do: np.isscalar(..)
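To illustrate the suggestion: `np.isscalar` distinguishes genuine scalars (including NumPy scalar types) from arrays, which is why it was proposed over `not isinstance(x, np.ndarray)`.

```python
import numpy as np

print(np.isscalar(0))              # True  -- Python int
print(np.isscalar(np.float64(0)))  # True  -- NumPy scalar
print(np.isscalar(np.array(0)))    # False -- 0-d array, not a scalar
```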
Force-pushed from e0fb040 to aecae18
if "floordiv" in name:  # (PR 9308)
    nan_mask = ((y == 0) & (x == 0)).ravel()
    np.putmask(result, nan_mask, np.nan)
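A toy, self-contained version of what this masking accomplishes (assumed example arrays, not the actual pandas internals): NumPy's integer `0 // 0` yields 0, so the result is cast to float and the `0 // 0` slots are overwritten with NaN.

```python
import numpy as np

x = np.array([0, 1, 4])
y = np.array([0, 0, 2])
with np.errstate(divide="ignore"):
    result = (x // y).astype("float64")  # NumPy gives [0., 0., 2.]
# Mark the 0 // 0 positions as NaN, as the PR does.
nan_mask = (y == 0) & (x == 0)
np.putmask(result, nan_mask, np.nan)
print(result)  # [nan  0.  2.]
```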
@jreback, you asked if it's possible to not special-case this. I rewrote it to be cleaner, and in fact, as it's written now, you can get rid of the if "floordiv" in name:
line and it'll be fine; it turns out it only affects the floordiv operators' results anyway. However, I think we should leave it as is for two reasons: (1) efficiency, and (2) I think it makes it more readable, since we really are special-casing floordiv to ameliorate NumPy's strange treatment of floor division.
As for efficiency, we don't need to run these two lines for any other operator, for example,
xx = pd.DataFrame(np.random.random((10000, 10000)))
yy = xx / 0
The second line took 1.59s on average, while without the if statement it took 2.05s on average.
ok
hmm 2s seems quite long for this type of operation
can u profile and see what's up?
I guess it's taking so long because my DataFrame is 1e8 floats (~1GB).
Profiling the _fill_zeros method (without the if statement), it spends 0.36s on the nan_mask = line (and nan_mask uses 94MB), and 0.06s on the following line.
hmm, maybe try taking out the .ravel(). I think they are making copies here (and prob are not necessary).
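For context on whether `ravel()` copies: for a C-contiguous array it returns a view, and a copy is made only when the memory layout forces one (e.g. raveling a transposed array). A quick check:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
print(np.shares_memory(a, a.ravel()))    # True  -- a view, no copy
print(np.shares_memory(a, a.T.ravel()))  # False -- layout forces a copy
```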
@jreback, I just attempted to refactor _fill_zeros to remove all the raveling and reshaping (which aren't necessary). It actually slowed down my test code (shown below) from 42s to 57s.
According to the NumPy docs for ravel, a "copy is made only if needed". In our case a copy wasn't being made. And the np.putmask operation seemed to take much longer on the original array than on the raveled one: the line np.putmask(result, mask, fill) (which I didn't modify) had its average runtime increase from 87ms to 548ms.
I'm also adding a comment to explain why we ravel and reshape.
Test code:
def f():
xx % 0
xx / 0
xx // 0
yy % 0
yy / 0
yy // 0
xx = pd.DataFrame(np.random.random((10000, 10000)))
yy = pd.DataFrame(np.random.random_integers(0,100, size=(10000,10000)))
for _ in range(5):
f()
profiling is the best as sometimes intuition is wrong :)
you might want to try reshaping explicitly first (just a guess) or taking a view
I think it's forcing a copy when one is not necessary
this op should be pretty fast - it's the copies that are getting in the way
Yeah, good call on the profiling!
Raveling and reshaping are helping; they're not causing any copies to be made, which I confirmed by memory profiling.
But actually, there was one thing causing an unnecessary copy to be made (independent of whether we ravel or not), which was using .astype('float64'). I've changed it to .astype('float', copy=False). The reason for changing 'float64' to 'float' is that I don't see any reason to force float64; isn't it better to let the computer architecture decide between float32 and float64?
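The copy question discussed here is easy to check directly: `astype` copies by default even when the dtype already matches, and `copy=False` returns the original array untouched in that case.

```python
import numpy as np

a = np.zeros(4, dtype="float64")
print(a.astype("float64") is a)              # False -- a fresh copy
print(a.astype("float64", copy=False) is a)  # True  -- same array back
```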
After this change, I tried time profiling again to verify that the raveling/reshaping is still helping.
BTW, are you surprised that a 1GB data frame takes 1.5s to have a division operator performed on it and all its values updated? That seemed pretty reasonable to me...
no, we always use float64 unless the user explicitly opts out
so you cannot change what it is
and astype usually copies btw, and I'm not sure it's actually necessary
So my comparison is direct numpy, e.g. what is the overhead that pandas is adding. Here is a 100MM frame (I constructed it the same as you). Note that using .values like this works (with no copy) only if it's a single dtype (as is the case here). So yes, I think a 3x slowdown is too much (granted we are doing more/better ops, but there should be only a limited overhead).
In [65]: xx.size
Out[65]: 100000000
In [66]: xx.size/1e6
Out[66]: 100.0
In [67]: %timeit xx.values*2
1 loops, best of 3: 521 ms per loop
In [68]: %timeit xx.values/0
1 loops, best of 3: 613 ms per loop
Force-pushed from aecae18 to 940a98b
@@ -131,6 +131,37 @@ methods (:issue:`9088`).
    dtype: int64

- During division involving a ``Series`` or ``DataFrame``, `0/0` and `0//0` now give `np.nan` instead of `np.inf`. (:issue:`9144`)
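The behavior this release note describes can be seen directly in current pandas: `0/0` and `0//0` give NaN, while division of a nonzero numerator by zero still gives infinity.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0])
print(s / 0)   # [nan, inf]
print(s // 0)  # [nan, inf]
```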
@jreback, I tried summarizing this more as suggested. Do you think it's fine now?
this is good
Force-pushed from 16529c3 to 19ecfc7
this should also fix #8445, yes? (if so, pls add that as a test as well); if not, see if it's simple to extend to fix.
Force-pushed from 5ba0f70 to 885407b
Thanks for explaining the float issue. I undid that change as suggested. Yes, it fixes #8445; I've added that to the commit message.
You're not gonna like this, but I added more special casing. It was to correct some new unexpected behavior I found.
As for the slow division, that was existent before this PR, but good call on checking into that. I found:
old behavior
In [1]: xx = pd.DataFrame(np.random.random((10000, 10000)))
In [2]: %timeit xx / 0
new behavior
In [1]: xx = pd.DataFrame(np.random.random((10000, 10000)))
In [2]: %timeit xx / 0
  shape = result.shape
- result = result.ravel().astype('float64')
+ result = result.ravel().astype('float64', copy=False)
you can also do .view('float64'), which IIRC is more idiomatic (but is basically the same)
I think view is slightly different since it "can cause a reinterpretation of the bytes of memory":
In [1]: x = np.array([1,2])
In [2]: x.astype('float64')
Out[2]: array([ 1., 2.])
In [3]: x.view('float64')
Out[3]: array([ 4.94065646e-324, 9.88131292e-324])
pls add a release note for #8445 (you can simply add it on where you have #9144). ok on the special casing, sometimes it is unavoidable. can you add a couple of vbenches (though use a smaller matrix, maybe 1000x1000; the proportion should still be the same)? do for float/int results (for floor/div), e.g. add 4 or so (or more if you think necessary, e.g. maybe for modulo too). name them consistently and add somewhere in the vbench suite. These are mainly to prevent performance regressions if/when things are changed in the future. Post the results in the top of the PR as well. thxs
Force-pushed from 885407b to 38856f3
@jreback, how's it looking now? Also, I've edited my first comment in the PR to include the results of the vbenches (including the ones I added). Is this what you meant by "the top of the PR"?
@Garrett-R only show the relevant vbenches and compare vs current master, e.g. bring your master up to date; then the top of the PR is correct (just replace it with the revised results).
@@ -131,6 +131,37 @@ methods (:issue:`9088`).
    dtype: int64

- During division involving a ``Series`` or ``DataFrame``, `0/0` and `0//0` now give `np.nan` instead of `np.inf`. (:issue:`9144`, :issue:`8445`)
specify these with 2 backticks on each side (like you did for Series/DataFrame); this will highlight it as code
@Garrett-R minor doc changes, and revise the vbench, otherwise looks good to go.
…ros handles div and mod by zero
Force-pushed from 38856f3 to 85342ee
@jreback, I've made the suggested doc changes and also revised the vbenches.
BUG: 0/frame numeric ops buggy (GH9144)
@Garrett-R thanks for this! nice work!
# GH 6178
if np.isinf(fill):
    np.putmask(result, (signs < 0) & mask, -fill)
if "floordiv" in name:  # (PR 9308)
Working on #19322 it looks like the problem would be solved by making this condition include other "div" operations. Was there a specific reason to only include floordiv here?
I'm afraid I don't remember why only floordiv is here, but I do tend to comment non-obvious, intentional choices, so I'm guessing it was not intended that only floordiv be here.
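For what it's worth, in current pandas the zero-division masking does apply to true division and floor division alike, preserving the sign of the numerator, which is the direction #19322 was heading:

```python
import numpy as np
import pandas as pd

s = pd.Series([-1.0, 0.0, 1.0])
print(s / 0)   # [-inf, nan, inf]
print(s // 0)  # [-inf, nan, inf]
```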
closes #9144
closes #8445
Here are the results from testing the vbenches related to DataFrames (I also added 6 vbenches).