Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix nonzero of a SparseArray #21175

Merged
merged 22 commits into from
Nov 17, 2018
Merged

Fix nonzero of a SparseArray #21175

merged 22 commits into from
Nov 17, 2018

Conversation

babky
Copy link
Contributor

@babky babky commented May 22, 2018

The nonzero operation returned the nonzero locations of the underlying index. However we need to get the nonzero locations in the real array. For this operation to be faster an inverse index structure would be beneficial or it could be implemented using binary search.

sa = pd.SparseArray([float('nan'), float('nan'), 1, 0, 0, 2, 0, 0, 0, 3, 0, 0])

returned 0, 3, 7. The index is shifted by two because of the two first NaNs and that's why the 0, 3, 7 are returned. The correct result would be 2, 5, 9 and is found in the method.

For the above sample the code works. However for other implementations of SparseIndex it could be broken.

The nonzero operation returned the nonzero locations of the underlying index. However we need to get the nonzero locations in the real array. For this operation to be faster an inverse index structure would be beneficial or it could be implemented using binary search.

```python
sa = pd.SparseArray([float('nan'), float('nan'), 1, 0, 0, 2, 0, 0, 0, 3, 0, 0])
```
returned `0, 3, 7`. The index is shifted by two because of the two first `NaN`s and that's why the `0, 3, 7` are returned. The correct result would be `2, 5, 9` and is found in the method.
@pep8speaks
Copy link

pep8speaks commented May 22, 2018

Hello @babky! Thanks for updating the PR.

Comment last updated on November 08, 2018 at 09:20 Hours UTC

@mroeschke mroeschke added the Sparse Sparse Data Type label May 23, 2018
@jbrockmendel
Copy link
Member

@babky int32 isn’t imported, needs to be changed to np.int32.

@jbrockmendel
Copy link
Member

@babky small fixup needed to get this working

@babky
Copy link
Contributor Author

babky commented Aug 20, 2018

hi, i will fix them in near future, was on a vacation...

@babky babky closed this Aug 20, 2018
@babky babky reopened this Aug 20, 2018
@jreback
Copy link
Contributor

jreback commented Aug 20, 2018

needs tests, whatsnew entry, and should be a vectorized soln.

@jorisvandenbossche
Copy link
Member

cc @TomAugspurger pinging you here since you are working on sparse, to make sure this PR would not conflict with your work

@TomAugspurger
Copy link
Contributor

At a glance, this approach won’t work with my implementation since calls super to get ndarray.nonzero

@babky
Copy link
Contributor Author

babky commented Nov 8, 2018

The issue looks gone in the current master. I've added some unit tests to cover the issue. From my really fast overlook of the tests I think that the tests could be added.

@codecov
Copy link

codecov bot commented Nov 8, 2018

Codecov Report

Merging #21175 into master will increase coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21175      +/-   ##
==========================================
+ Coverage   92.25%   92.25%   +<.01%     
==========================================
  Files         161      161              
  Lines       51390    51390              
==========================================
+ Hits        47411    47412       +1     
+ Misses       3979     3978       -1
Flag Coverage Δ
#multiple 90.65% <ø> (ø) ⬆️
#single 42.31% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/sparse.py 91.94% <0%> (+0.12%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update db2066b...2c912f1. Read the comment docs.

@babky
Copy link
Contributor Author

babky commented Nov 9, 2018

@jbrockmendel @TomAugspurger: I believe the issue is gone in master branch and I just added the tests which should prevent a regression.

@@ -455,6 +455,25 @@ def tests_indexing_with_sparse(self, kind, fill):
s.iloc[indexer]


class TestSparseArray(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this to pandas/tests/arrays/sparse/test_array.py?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both things completed 👍, sorry for it.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add a whatsnew in 0.24.0.txt? I don't see one saying that we fixed nonzero.

@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Nov 11, 2018
@TomAugspurger
Copy link
Contributor

Moved it one more time. We reorganized things.

@@ -1353,6 +1353,7 @@ Sparse
- Bug in ``DataFrame.groupby`` not including ``fill_value`` in the groups for non-NA ``fill_value`` when grouping by a sparse column (:issue:`5078`)
- Bug in unary inversion operator (``~``) on a ``SparseSeries`` with boolean values. The performance of this has also been improved (:issue:`22835`)
- Bug in :meth:`SparseArary.unique` not returning the unique values (:issue:`19595`)
- Bug in ``SparseArray.nonzero`` and `SparseDataFrame.dropna` returning shifted/invalid results (:issue:`21172`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add :func: to reference the api of these

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@@ -747,6 +747,22 @@ def test_fillna_overlap(self):
exp = SparseArray([1, 3, 3, 3, 3], fill_value=0, dtype=np.float64)
tm.assert_sp_array_equal(res, exp)

def test_nonzero(self):
sa = pd.SparseArray([
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue number

2, 0, 0, 0,
3, 0, 0
])
tm.assert_numpy_array_equal(np.array([2, 5, 9], dtype=np.int32),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use

result = 
expected = 
tm.assert_numpy_array_equal(result, expected)

@@ -1350,3 +1350,11 @@ def test_assign_with_sparse_frame(self):

for column in res.columns:
assert type(res[column]) is SparseSeries

def test_dropna(self):
tm.assert_sp_frame_equal(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the result= and expected= format

add the issue number as a comment

pd.SparseDataFrame({"F2": [0, 1]}),
pd.SparseDataFrame(
{"F1": [float('nan'), float('nan')], "F2": [0, 1]}
).dropna(axis=1, inplace=False, how='all')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you test for inplace=True/False, how='all', 'any'

via parametrization

@babky
Copy link
Contributor Author

babky commented Nov 16, 2018

@jreback the comments should be resolved now

df_info.txt Outdated
@@ -0,0 +1,8 @@
<class 'pandas.core.frame.DataFrame'>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Committed by accident?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes :-( 👍

@@ -1433,6 +1433,7 @@ Sparse
- Bug in ``DataFrame.groupby`` not including ``fill_value`` in the groups for non-NA ``fill_value`` when grouping by a sparse column (:issue:`5078`)
- Bug in unary inversion operator (``~``) on a ``SparseSeries`` with boolean values. The performance of this has also been improved (:issue:`22835`)
- Bug in :meth:`SparseArary.unique` not returning the unique values (:issue:`19595`)
- Bug in :funct:``SparseArray.nonzero`` and :func:``SparseDataFrame.dropna`` returning shifted/invalid results (:issue:`21172`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

function -> func.

Just single backticks around SparseArray.nonzero and SparseDataFrame.dropna

def test_dropna(self):
# Tests regression #21172.
expected = pd.SparseDataFrame({"F2": [0, 1]})
for inplace, how in product((True, False), ('all', 'any')):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be parametrized over two hings: inplace and how: https://docs.pytest.org/en/latest/parametrize.html#pytest-mark-parametrize-parametrizing-test-functions

@pytest.mark.parametrize("inplcae", [True, False])
@pytest.mark.parametrize("how", ["all", "any"])
def test_dropna(self, how, inplace):
    ....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ouch, did not know that, thank you 🌮

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you remove the df_info file and ping on green.

@jreback jreback merged commit 3abff0d into pandas-dev:master Nov 17, 2018
@jreback
Copy link
Contributor

jreback commented Nov 17, 2018

thanks @babky

tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Method dropna does not work on SparseDataFrames
7 participants