Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix/test SparseSeries/SparseDataFrame stack/unstack #16616

Merged
merged 8 commits into from
Sep 26, 2017

Conversation

kernc
Copy link
Contributor

@kernc kernc commented Jun 6, 2017

@kernc kernc force-pushed the sparse-stack-unstack branch 4 times, most recently from 88656a8 to 05aefcb Compare June 7, 2017 10:32
@codecov
Copy link

codecov bot commented Jun 7, 2017

Codecov Report

Merging #16616 into master will decrease coverage by 0.01%.
The diff coverage is 94.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16616      +/-   ##
==========================================
- Coverage   91.26%   91.24%   -0.02%     
==========================================
  Files         163      163              
  Lines       49776    49761      -15     
==========================================
- Hits        45426    45405      -21     
- Misses       4350     4356       +6
Flag Coverage Δ
#multiple 89.04% <94.44%> (ø) ⬆️
#single 40.33% <42.59%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/categorical.py 95.29% <100%> (+0.02%) ⬆️
pandas/core/reshape/reshape.py 99.27% <100%> (-0.02%) ⬇️
pandas/core/internals.py 94.37% <92.3%> (+0.08%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/accessor.py 93.65% <0%> (-6.35%) ⬇️
pandas/core/tools/datetimes.py 83.79% <0%> (-1.42%) ⬇️
pandas/core/indexes/interval.py 92.85% <0%> (-0.72%) ⬇️
pandas/core/frame.py 97.73% <0%> (-0.15%) ⬇️
pandas/core/indexes/multi.py 96.9% <0%> (ø) ⬆️
pandas/io/excel.py 80.37% <0%> (ø) ⬆️
... and 16 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d43aba8...cef5f47. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic looks fine. see if you can move this to internals to follow existing style.

@@ -52,6 +52,9 @@ Bug Fixes
- Bug in :func:`cut` when ``labels`` are set, resulting in incorrect label ordering (:issue:`16459`)
- Fixed a compatibility issue with IPython 6.0's tab completion showing deprecation warnings on ``Categoricals`` (:issue:`16409`)

- Bug in ``SparseSeries.unstack()`` and ``SparseDataFrame.stack()`` (:issue:`16614`, :issue:`15045`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 0.21.0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you change to use :func:SparseSeries.unstack (etc)

result = DataFrame(BlockManager(new_blocks, new_axes))
mask_frame = DataFrame(BlockManager(mask_blocks, new_axes))
return result.loc[:, mask_frame.sum(0) > 0]
# BlockManager can't handle SparseBlocks with multiple items,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to move this logic to a function in internals. e.g. you define BlockManager.unstack, then you have an unstack on Block. This way the override logic is pretty straightforward as its done on a block basis, and this is setup so that you can return multiple blocks (in case of sparse) from these routines.

you can pass the _Unstacker as a partial function, and so the entire loop looks something like

obj._data.unstack(_Unstacker)

then call the _Unstacker inside the block.unstack() method.

@jreback jreback added Bug Sparse Sparse Data Type labels Jun 8, 2017
@TomAugspurger TomAugspurger added this to the 0.21.0 milestone Jun 30, 2017
@@ -52,6 +52,9 @@ Bug Fixes
- Bug in :func:`cut` when ``labels`` are set, resulting in incorrect label ordering (:issue:`16459`)
- Fixed a compatibility issue with IPython 6.0's tab completion showing deprecation warnings on ``Categoricals`` (:issue:`16409`)

- Bug in ``SparseSeries.unstack()`` and ``SparseDataFrame.stack()`` (:issue:`16614`, :issue:`15045`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you change to use :func:SparseSeries.unstack (etc)

if values.ndim == 1:
if isinstance(values, Categorical):
self.is_categorical = values
values = np.array(values)
elif self.is_sparse:
# XXX: Makes SparseArray *dense*, but it's supposedly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a TODO? or a comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment with a mild TODO hint at whomever will be refactoring the whole thing eventually to take the note into consideration.

@@ -2381,6 +2381,29 @@ def test_iloc_mi(self):
tm.assert_frame_equal(result, expected)


class TestSparse(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use more pytest style here

Copy link
Contributor Author

@kernc kernc Jul 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have. But with pytest style promoting flat, module-level functions only, the module namespace will get polluted fast and often, so the tests modules will likely benefit from becoming more fine grained.

@jreback jreback removed this from the 0.21.0 milestone Jun 30, 2017
@kernc kernc force-pushed the sparse-stack-unstack branch from 05aefcb to b7b76fd Compare July 11, 2017 16:03
@@ -127,6 +127,8 @@ def maybe_to_categorical(array):
""" coerce to a categorical if a series is given """
if isinstance(array, (ABCSeries, ABCCategoricalIndex)):
return array._values
elif isinstance(array, np.ndarray):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this hit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in NonConsolidatingBlock._unstack() calling .make_block_same_class(), making a CategoricalBlock from unstacked values (an ndarray). Otherwise fails frame.test_reshape.TestDataFrameReshape.test_unstack_preserve_dtypes.
Since the function is named maybe_to_categorical and accepts argument array, the change seems like making perfect sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment to the doc-string that this is only an internal method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I prefix it with an underscore? It's used in a single place only.

@@ -2381,6 +2381,39 @@ def test_iloc_mi(self):
tm.assert_frame_equal(result, expected)


@pytest.fixture
def sparse_df():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move these to pandas/tests/sparse/test_reshape.py (new file); you can move other tests if appropriate (e.g. reshaping ones as well)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular sparse reshaping tests you had in mind? There's one test_stack_sparse_frame in sparse.test_frame.

@@ -4066,6 +4076,27 @@ def canonicalize(block):
return all(block.equals(oblock)
for block, oblock in zip(self_blocks, other_blocks))

def unstack(self, unstacker):
"""Return blockmanager with all blocks unstacked"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you expand the doc string a bit (IOW document what unstacker is supposed to be)

new_blocks.extend(blk._unstack(new_values.T, new_placement))

bm = BlockManager(new_blocks, [new_columns, new_index])
bm = bm.take(mask_columns.nonzero()[0], axis=0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this take for? should be inside the loop on bocks (IOW this is on each block)

mask_columns = np.zeros_like(new_columns, dtype=bool)

for blk in self.blocks:
bunstacker = unstacker(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would push almost all of the logic inside this loop into the _unstack(...) call on the block itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I managed something, but I wouldn't say it's much better now. It's still couples block unstacking with unstacked BlockManager's new_columns, and I'm not really sure how to get rid of that.

@jreback
Copy link
Contributor

jreback commented Jul 19, 2017

can you rebase / update according to comments

@kernc kernc force-pushed the sparse-stack-unstack branch from 45cf34d to fcba171 Compare July 22, 2017 13:55
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good!. just some comments on doc-strings and such. the more can document the better :> ping when pushed / green.

@@ -1463,6 +1464,20 @@ def equals(self, other):
return False
return array_equivalent(self.values, other.values)

def _unstack(self, unstacker_t, new_columns):
"""Return a list of unstacked blocks of self"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a doc-string here.
also name unstacker_t -> unstacker_func

@@ -1706,6 +1721,22 @@ def _slice(self, slicer):
def _try_cast_result(self, result, dtype=None):
return result

def _unstack(self, unstacker_t, new_columns):
# NonConsolidatable blocks can have a single item only, so we return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add doc-string

@@ -4161,6 +4192,34 @@ def canonicalize(block):
return all(block.equals(oblock)
for block, oblock in zip(self_blocks, other_blocks))

def unstack(self, unstacker_t):
"""Return a blockmanager with all blocks unstacked.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change unstacker_t as above

@jreback
Copy link
Contributor

jreback commented Jul 22, 2017

is this comment addressed: #15045 (comment) (ok to raise / push off to another issue as well).

@jreback
Copy link
Contributor

jreback commented Jul 22, 2017

actually can you add a mini reshaping section to sparse.rst and put a pointer in the whatsnew?

@jreback jreback added this to the 0.21.0 milestone Jul 22, 2017
@kernc kernc force-pushed the sparse-stack-unstack branch from fcba171 to d235492 Compare August 11, 2017 13:29
@kernc
Copy link
Contributor Author

kernc commented Aug 11, 2017

is this comment addressed: #15045 (comment)

It is not addressed (nor touched). Stacking Series acts as if the fill value is NaN.

actually can you add a mini reshaping section to sparse.rst

Added. What would you have me put into it?

@jreback
Copy link
Contributor

jreback commented Sep 23, 2017

this generally looks ok, can you rebase and i'll give a once over again.

@kernc
Copy link
Contributor Author

kernc commented Sep 23, 2017

You sure the sparse.reshaping docs section is OK? I was leaning towards removing it as methods on sparse objects should generally produce equivalent results to methods on their super counterparts, with any exceptions in fact being bugs.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2017

You sure the sparse.reshaping docs section is OK? I was leaning towards removing it as methods on sparse objects should generally produce equivalent results to methods on their super counterparts, with any exceptions in fact being bugs.

uh, ok to remove. I hadn't really reviewed this much recently.

@kernc kernc force-pushed the sparse-stack-unstack branch from a36c602 to 8aad74c Compare September 23, 2017 20:56
@kernc
Copy link
Contributor Author

kernc commented Sep 25, 2017

Rebased.

@TomAugspurger
Copy link
Contributor

Hmm, the appveyor failure looks entirely unrelated https://ci.appveyor.com/project/pandas-dev/pandas/build/1.0.4825/job/4kpjb91d5bhe0ier#L1215

@jreback have you seen that before?

@kernc
Copy link
Contributor Author

kernc commented Sep 25, 2017

entirely unrelated

My god, that code! Thank god it broke!

@jreback
Copy link
Contributor

jreback commented Sep 26, 2017

@TomAugspurger yeah let's make an issue for that 'unrelated code'. no idea what's going on there.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok just a small comment needed. ping on green.

@@ -127,6 +127,8 @@ def maybe_to_categorical(array):
""" coerce to a categorical if a series is given """
if isinstance(array, (ABCSeries, ABCCategoricalIndex)):
return array._values
elif isinstance(array, np.ndarray):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment to the doc-string that this is only an internal method.

@jreback jreback merged commit d3be81a into pandas-dev:master Sep 26, 2017
@jreback
Copy link
Contributor

jreback commented Sep 26, 2017

thanks @kernc always appreciate the nice PRs!

alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Sparse Sparse Data Type
Projects
None yet
3 participants