-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix nonzero of a SparseArray #21175
Fix nonzero of a SparseArray #21175
Conversation
The nonzero operation returned the nonzero locations of the underlying index. However we need to get the nonzero locations in the real array. For this operation to be faster an inverse index structure would be beneficial or it could be implemented using binary search. ```python sa = pd.SparseArray([float('nan'), float('nan'), 1, 0, 0, 2, 0, 0, 0, 3, 0, 0]) ``` returned `0, 3, 7`. The index is shifted by two because of the two first `NaN`s and that's why the `0, 3, 7` are returned. The correct result would be `2, 5, 9` and is found in the method.
Hello @babky! Thanks for updating the PR.
Comment last updated on November 08, 2018 at 09:20 Hours UTC |
@babky int32 isn’t imported, needs to be changed to np.int32. |
@babky small fixup needed to get this working |
hi, i will fix them in near future, was on a vacation... |
needs tests, whatsnew entry, and should be a vectorized soln. |
cc @TomAugspurger pinging you here since you are working on sparse, to make sure this PR would not conflict with your work |
At a glance, this approach won’t work with my implementation since calls super to get ndarray.nonzero |
The issue looks gone in the current master. I've added some unit tests to cover the issue. From my really fast overlook of the tests I think that the tests could be added. |
Codecov Report
@@ Coverage Diff @@
## master #21175 +/- ##
==========================================
+ Coverage 92.25% 92.25% +<.01%
==========================================
Files 161 161
Lines 51390 51390
==========================================
+ Hits 47411 47412 +1
+ Misses 3979 3978 -1
Continue to review full report at Codecov.
|
@jbrockmendel @TomAugspurger: I believe the issue is gone in master branch and I just added the tests which should prevent a regression. |
pandas/tests/sparse/test_indexing.py
Outdated
@@ -455,6 +455,25 @@ def tests_indexing_with_sparse(self, kind, fill): | |||
s.iloc[indexer] | |||
|
|||
|
|||
class TestSparseArray(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move this to pandas/tests/arrays/sparse/test_array.py
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both things completed 👍, sorry for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add a whatsnew in 0.24.0.txt? I don't see one saying that we fixed nonzero.
As required in PR
Moved it one more time. We reorganized things. |
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -1353,6 +1353,7 @@ Sparse | |||
- Bug in ``DataFrame.groupby`` not including ``fill_value`` in the groups for non-NA ``fill_value`` when grouping by a sparse column (:issue:`5078`) | |||
- Bug in unary inversion operator (``~``) on a ``SparseSeries`` with boolean values. The performance of this has also been improved (:issue:`22835`) | |||
- Bug in :meth:`SparseArary.unique` not returning the unique values (:issue:`19595`) | |||
- Bug in ``SparseArray.nonzero`` and `SparseDataFrame.dropna` returning shifted/invalid results (:issue:`21172`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add :func:
to reference the api of these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
@@ -747,6 +747,22 @@ def test_fillna_overlap(self): | |||
exp = SparseArray([1, 3, 3, 3, 3], fill_value=0, dtype=np.float64) | |||
tm.assert_sp_array_equal(res, exp) | |||
|
|||
def test_nonzero(self): | |||
sa = pd.SparseArray([ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the issue number
2, 0, 0, 0, | ||
3, 0, 0 | ||
]) | ||
tm.assert_numpy_array_equal(np.array([2, 5, 9], dtype=np.int32), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use
result =
expected =
tm.assert_numpy_array_equal(result, expected)
@@ -1350,3 +1350,11 @@ def test_assign_with_sparse_frame(self): | |||
|
|||
for column in res.columns: | |||
assert type(res[column]) is SparseSeries | |||
|
|||
def test_dropna(self): | |||
tm.assert_sp_frame_equal( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the result= and expected= format
add the issue number as a comment
pd.SparseDataFrame({"F2": [0, 1]}), | ||
pd.SparseDataFrame( | ||
{"F1": [float('nan'), float('nan')], "F2": [0, 1]} | ||
).dropna(axis=1, inplace=False, how='all') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you test for inplace=True/False, how='all', 'any'
via parametrization
@jreback the comments should be resolved now |
df_info.txt
Outdated
@@ -0,0 +1,8 @@ | |||
<class 'pandas.core.frame.DataFrame'> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Committed by accident?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes :-( 👍
doc/source/whatsnew/v0.24.0.rst
Outdated
@@ -1433,6 +1433,7 @@ Sparse | |||
- Bug in ``DataFrame.groupby`` not including ``fill_value`` in the groups for non-NA ``fill_value`` when grouping by a sparse column (:issue:`5078`) | |||
- Bug in unary inversion operator (``~``) on a ``SparseSeries`` with boolean values. The performance of this has also been improved (:issue:`22835`) | |||
- Bug in :meth:`SparseArary.unique` not returning the unique values (:issue:`19595`) | |||
- Bug in :funct:``SparseArray.nonzero`` and :func:``SparseDataFrame.dropna`` returning shifted/invalid results (:issue:`21172`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
function -> func.
Just single backticks around SparseArray.nonzero and SparseDataFrame.dropna
def test_dropna(self): | ||
# Tests regression #21172. | ||
expected = pd.SparseDataFrame({"F2": [0, 1]}) | ||
for inplace, how in product((True, False), ('all', 'any')): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be parametrized over two hings: inplace
and how
: https://docs.pytest.org/en/latest/parametrize.html#pytest-mark-parametrize-parametrizing-test-functions
@pytest.mark.parametrize("inplcae", [True, False])
@pytest.mark.parametrize("how", ["all", "any"])
def test_dropna(self, how, inplace):
....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ouch, did not know that, thank you 🌮
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you remove the df_info file and ping on green.
thanks @babky |
The nonzero operation returned the nonzero locations of the underlying index. However we need to get the nonzero locations in the real array. For this operation to be faster an inverse index structure would be beneficial or it could be implemented using binary search.
returned
0, 3, 7
. The index is shifted by two because of the two firstNaN
s and that's why the0, 3, 7
are returned. The correct result would be2, 5, 9
and is found in the method.For the above sample the code works. However for other implementations of
SparseIndex
it could be broken.git diff upstream/master -u -- "*.py" | flake8 --diff