BUG: Fix/test SparseSeries/SparseDataFrame stack/unstack #16616
88656a8 to 05aefcb
Codecov Report
@@ Coverage Diff @@
## master #16616 +/- ##
==========================================
- Coverage 91.26% 91.24% -0.02%
==========================================
Files 163 163
Lines 49776 49761 -15
==========================================
- Hits 45426 45405 -21
- Misses 4350 4356 +6
logic looks fine. see if you can move this to internals to follow existing style.
doc/source/whatsnew/v0.20.2.txt
Outdated
@@ -52,6 +52,9 @@ Bug Fixes
- Bug in :func:`cut` when ``labels`` are set, resulting in incorrect label ordering (:issue:`16459`)
- Fixed a compatibility issue with IPython 6.0's tab completion showing deprecation warnings on ``Categoricals`` (:issue:`16409`)

- Bug in ``SparseSeries.unstack()`` and ``SparseDataFrame.stack()`` (:issue:`16614`, :issue:`15045`)
move to 0.21.0
can you change to use :func:`SparseSeries.unstack` (etc)
pandas/core/reshape/reshape.py
Outdated
result = DataFrame(BlockManager(new_blocks, new_axes))
mask_frame = DataFrame(BlockManager(mask_blocks, new_axes))
return result.loc[:, mask_frame.sum(0) > 0]
# BlockManager can't handle SparseBlocks with multiple items,
I would like to move this logic to a function in internals. e.g. you define `BlockManager.unstack`, then you have an `unstack` on `Block`. This way the override logic is pretty straightforward as it's done on a block basis, and this is set up so that you can return multiple blocks (in case of sparse) from these routines.
you can pass the `_Unstacker` as a partial function, and so the entire loop looks something like `obj._data.unstack(_Unstacker)`, then call the `_Unstacker` inside the `block.unstack()` method.
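A minimal sketch of the delegation suggested above, using toy stand-ins rather than the real pandas internals. The classes `Unstacker`, `Block`, `SingleItemBlock` and `BlockManager` below are hypothetical and only mirror the names in the discussion; the point is that the manager loops over blocks, each block calls the partially applied unstacker, and a block may return more than one block.

```python
from functools import partial


class Unstacker:
    """Hypothetical stand-in for pandas' private _Unstacker.

    Pivots a flat list of values, keyed by (row, column) labels,
    into a row-major table.
    """

    def __init__(self, values, index, fill_value=None):
        self.values = values
        self.index = index  # list of (row_label, column_label) pairs
        self.fill_value = fill_value
        self.rows = sorted({r for r, _ in index})
        self.columns = sorted({c for _, c in index})

    def get_result(self):
        table = [[self.fill_value] * len(self.columns) for _ in self.rows]
        for value, (r, c) in zip(self.values, self.index):
            table[self.rows.index(r)][self.columns.index(c)] = value
        return table


class Block:
    """Toy block holding the values of one stacked column."""

    def __init__(self, values):
        self.values = values

    def unstack(self, unstacker_func):
        # Always return a list so that subclasses may emit several blocks.
        return [Block(unstacker_func(self.values).get_result())]


class SingleItemBlock(Block):
    """Toy analogue of a block that can hold a single item only."""

    def unstack(self, unstacker_func):
        table = unstacker_func(self.values).get_result()
        # One output block per unstacked column.
        return [SingleItemBlock(list(col)) for col in zip(*table)]


class BlockManager:
    def __init__(self, blocks):
        self.blocks = blocks

    def unstack(self, unstacker_func):
        new_blocks = []
        for blk in self.blocks:
            new_blocks.extend(blk.unstack(unstacker_func))
        return BlockManager(new_blocks)


index = [("a", "x"), ("a", "y"), ("b", "x")]
# Everything except the values is bound up front; each block supplies its own values.
unstacker_func = partial(Unstacker, index=index, fill_value=0)

dense = BlockManager([Block([1, 2, 3])]).unstack(unstacker_func)
single = BlockManager([SingleItemBlock([1, 2, 3])]).unstack(unstacker_func)
```

With this shape the manager never needs to know which block types split into several blocks; that decision lives entirely in the per-block override.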
if values.ndim == 1:
    if isinstance(values, Categorical):
        self.is_categorical = values
        values = np.array(values)
    elif self.is_sparse:
        # XXX: Makes SparseArray *dense*, but it's supposedly
is this a TODO? or a comment?
A comment with a mild TODO hint, so that whoever eventually refactors the whole thing takes the note into consideration.
pandas/tests/test_multilevel.py
Outdated
@@ -2381,6 +2381,29 @@ def test_iloc_mi(self):
        tm.assert_frame_equal(result, expected)


class TestSparse(object):
can you use more pytest style here
I have. But with pytest style promoting flat, module-level functions only, the module namespace will get polluted fast and often, so the test modules will likely benefit from becoming more fine-grained.
05aefcb to b7b76fd
pandas/core/categorical.py
Outdated
@@ -127,6 +127,8 @@ def maybe_to_categorical(array):
    """ coerce to a categorical if a series is given """
    if isinstance(array, (ABCSeries, ABCCategoricalIndex)):
        return array._values
    elif isinstance(array, np.ndarray):
how is this hit?
in `NonConsolidatingBlock._unstack()` calling `.make_block_same_class()`, making a `CategoricalBlock` from unstacked values (an `ndarray`). Otherwise fails `frame.test_reshape.TestDataFrameReshape.test_unstack_preserve_dtypes`.
Since the function is named `maybe_to_categorical` and accepts argument `array`, the change seems to make perfect sense.
can you add a comment to the doc-string that this is only an internal method.
May I prefix it with an underscore? It's used in a single place only.
pandas/tests/test_multilevel.py
Outdated
@@ -2381,6 +2381,39 @@ def test_iloc_mi(self):
        tm.assert_frame_equal(result, expected)


@pytest.fixture
def sparse_df():
move these to `pandas/tests/sparse/test_reshape.py` (new file); you can move other tests if appropriate (e.g. reshaping ones as well)
Any particular sparse reshaping tests you had in mind? There's one `test_stack_sparse_frame` in `sparse.test_frame`.
pandas/core/internals.py
Outdated
@@ -4066,6 +4076,27 @@ def canonicalize(block):
        return all(block.equals(oblock)
                   for block, oblock in zip(self_blocks, other_blocks))

    def unstack(self, unstacker):
        """Return blockmanager with all blocks unstacked"""
can you expand the doc string a bit (IOW document what unstacker is supposed to be)
pandas/core/internals.py
Outdated
    new_blocks.extend(blk._unstack(new_values.T, new_placement))

bm = BlockManager(new_blocks, [new_columns, new_index])
bm = bm.take(mask_columns.nonzero()[0], axis=0)
what is this take for? should be inside the loop on blocks (IOW this is on each block)
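For context, the `take` in the diff appears to drop unstacked columns that no block filled in, much like the earlier `result.loc[:, mask_frame.sum(0) > 0]`. The sketch below illustrates that reading with plain lists; the helpers `combine_masks`, `nonzero` and `take` are hypothetical stand-ins for the NumPy and BlockManager operations, not the code from the PR.

```python
def combine_masks(masks):
    """Element-wise OR of per-block column masks."""
    return [any(flags) for flags in zip(*masks)]


def nonzero(mask):
    """Positions of the True entries, like ndarray.nonzero()[0]."""
    return [i for i, flag in enumerate(mask) if flag]


def take(items, positions):
    """Select items by position, like a positional take along one axis."""
    return [items[i] for i in positions]


new_columns = ["x", "y", "z"]

# One mask per block: True where that block placed at least one value.
block_masks = [
    [True, False, False],
    [True, False, True],
]

mask_columns = combine_masks(block_masks)
kept = take(new_columns, nonzero(mask_columns))
```

Here column `y` is dropped because neither block wrote to it. Doing the filtering per block, as the review asks, would mean each block applies its own mask before its result is collected.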
pandas/core/internals.py
Outdated
mask_columns = np.zeros_like(new_columns, dtype=bool)

for blk in self.blocks:
    bunstacker = unstacker(
I would push almost all of the logic inside this loop into the `_unstack(...)` call on the block itself.
I think I managed something, but I wouldn't say it's much better now. It still couples block unstacking with the unstacked BlockManager's `new_columns`, and I'm not really sure how to get rid of that.
can you rebase / update according to comments
45cf34d to fcba171
looks good! just some comments on doc-strings and such. the more you can document the better :> ping when pushed / green.
pandas/core/internals.py
Outdated
@@ -1463,6 +1464,20 @@ def equals(self, other):
            return False
        return array_equivalent(self.values, other.values)

    def _unstack(self, unstacker_t, new_columns):
        """Return a list of unstacked blocks of self"""
can you add a doc-string here.
also rename `unstacker_t` -> `unstacker_func`
pandas/core/internals.py
Outdated
@@ -1706,6 +1721,22 @@ def _slice(self, slicer):
    def _try_cast_result(self, result, dtype=None):
        return result

    def _unstack(self, unstacker_t, new_columns):
        # NonConsolidatable blocks can have a single item only, so we return
add doc-string
pandas/core/internals.py
Outdated
@@ -4161,6 +4192,34 @@ def canonicalize(block):
        return all(block.equals(oblock)
                   for block, oblock in zip(self_blocks, other_blocks))

    def unstack(self, unstacker_t):
        """Return a blockmanager with all blocks unstacked.
change unstacker_t as above
is this comment addressed: #15045 (comment) (ok to raise / push off to another issue as well).
actually can you add a mini reshaping section to sparse.rst and put a pointer in the whatsnew?
fcba171 to d235492
It is not addressed (nor touched). Stacking Series acts as if the fill value is NaN.
Added. What would you have me put into it?
this generally looks ok, can you rebase and i'll give a once over again.
You sure the
uh, ok to remove. I hadn't really reviewed this much recently.
a36c602 to 8aad74c
Rebased.
Hmm, the appveyor failure looks entirely unrelated https://ci.appveyor.com/project/pandas-dev/pandas/build/1.0.4825/job/4kpjb91d5bhe0ier#L1215 @jreback have you seen that before?
My god, that code! Thank god it broke!
@TomAugspurger yeah let's make an issue for that 'unrelated code'. no idea what's going on there.
ok just a small comment needed. ping on green.
thanks @kernc always appreciate the nice PRs!
git diff upstream/master --name-only -- '*.py' | flake8 --diff