Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Add DataFrame method to explode a list-like column (GH #16538) #24366

Closed

Conversation

changhiskhan
Copy link
Contributor

@changhiskhan changhiskhan commented Dec 20, 2018

Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast this is available as a built-in operation
in say Spark and is a fairly common use case.

@changhiskhan changhiskhan force-pushed the explode-dataframe branch 2 times, most recently from 2c6f058 to dbd7515 Compare December 20, 2018 00:26
@codecov
Copy link

codecov bot commented Dec 20, 2018

Codecov Report

Merging #24366 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24366      +/-   ##
==========================================
- Coverage   92.29%   92.29%   -0.01%     
==========================================
  Files         162      162              
  Lines       51832    51843      +11     
==========================================
+ Hits        47839    47849      +10     
- Misses       3993     3994       +1
Flag Coverage Δ
#multiple 90.7% <100%> (ø) ⬆️
#single 42.97% <9.09%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 96.93% <100%> (+0.02%) ⬆️
pandas/util/testing.py 87.57% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f6cf7d9...dbd7515. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 20, 2018

Codecov Report

Merging #24366 into master will increase coverage by 50.33%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #24366       +/-   ##
===========================================
+ Coverage   41.96%   92.29%   +50.33%     
===========================================
  Files         180      162       -18     
  Lines       50718    51852     +1134     
===========================================
+ Hits        21283    47859    +26576     
+ Misses      29435     3993    -25442
Flag Coverage Δ
#multiple 90.7% <100%> (?)
#single 42.96% <9.09%> (+1%) ⬆️
Impacted Files Coverage Δ
pandas/core/frame.py 96.93% <100%> (+61.82%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/compat/__init__.py 58.3% <0%> (-33.7%) ⬇️
pandas/io/clipboard/clipboards.py 28.23% <0%> (-6.55%) ⬇️
pandas/plotting/_misc.py 38.68% <0%> (-4.56%) ⬇️
pandas/io/formats/console.py 74.24% <0%> (-3.89%) ⬇️
pandas/_libs/__init__.py 100% <0%> (ø) ⬆️
pandas/io/api.py 100% <0%> (ø) ⬆️
pandas/_libs/tslibs/__init__.py 100% <0%> (ø) ⬆️
pandas/core/api.py 100% <0%> (ø) ⬆️
... and 169 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f8b0b9f...0323ce3. Read the comment docs.

@changhiskhan changhiskhan force-pushed the explode-dataframe branch 2 times, most recently from caabc63 to 9e76b75 Compare December 20, 2018 06:56
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think dplyr calls the separate. https://tidyr.tidyverse.org/reference/separate.html

Do you have a preference for explode over separate? If not, I would say matching dplyr's verb makes sense.

----------------------------

.. ipython:: python
:suppress:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe don't suppress this? I think users may want to see the input.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. It'll be easier for them to just copy/paste into a repl to try it out.

Convenience to split a string `col_name` before exploding
dtype : str or dtype, default None
Optionally coerce the dtype of exploded column
-
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stray character here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOL, at first glance I thought that was a git diff acting weird. I clearly forgot to validate the docstring

dtype : str or dtype, default None
Optionally coerce the dtype of exploded column
-
Examples
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps add a See Also linking to Series.str.split, Series.str.extract? Maybe others?

Are we interested in implementing the inverse operation (what dplyr calls unite: https://tidyr.tidyverse.org/reference/unite.html)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I'll add Series.str.split and Series.str.extract here. Which other ones do you think would be relevant?

unite: so it would be like a groupby.agg(list/concat) type of operation? I'm not opposed to it but I think there's no urgency since we haven't had much user demand. My guess is because it maps to groupby so it's more natural to think about than the reverse.


Parameters
----------
col_name : str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we've been moving towards always using the full column instead of col for parameter names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how would we distinguish between the string name of the column and the column's data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess there's enough context here. I'll change to column for consistency then. But in general I'm still curious what the conclusion was for the question above.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thought: does this make sense on a Series too? Right now it's frame only.

Benchmark failures are fixed in #24372.

@@ -5980,6 +5980,49 @@ def melt(self, id_vars=None, value_vars=None, var_name=None,
var_name=var_name, value_name=value_name,
col_level=col_level)

def explode(self, col_name, sep=None, dtype=None):
"""
Create a new DataFrame where each element in each row
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this has to be a single line.

Then you can have a multi-line extended summary. scripts/validate_docstrings pandas.core.frame.explode should print out all the issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok. will fix. thanks

@jreback
Copy link
Contributor

jreback commented Dec 20, 2018

explode is a much better name here and more common i the sql world

@changhiskhan
Copy link
Contributor Author

Yeah, Spark has also adopted the explode terminology. AFAICT the name comes from the fact that in full implementations you can pass in function to be applied to a column with arbitrary nested data and extracts a flat sequence to then be separated onto their own rows.

@changhiskhan
Copy link
Contributor Author

re: Series.explode most of the use cases I've come across would use DataFrame.explode since I almost always still want to join it back to the rest of the data once exploded. That being said, I'm certainly not opposed to Series.explode, but my preference would be to keep this PR simple and add Series.explode if we see demand from the community.

@@ -5980,6 +5980,49 @@ def melt(self, id_vars=None, value_vars=None, var_name=None,
var_name=var_name, value_name=value_name,
col_level=col_level)

def explode(self, col_name, sep=None, dtype=None):
"""
Create a new DataFrame where each element in each row
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a few errors here which the validate_docstrings.py script will help identify for you

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. I'll push an update

@@ -5980,6 +5980,49 @@ def melt(self, id_vars=None, value_vars=None, var_name=None,
var_name=var_name, value_name=value_name,
col_level=col_level)

def explode(self, col_name, sep=None, dtype=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm would this be better as a Series method? Requiring col_name as a parameter makes it so it only operates as such anyway, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Started reviewing before I saw your above comment. Still think this is better served as a Series method instead of a frame method with a required col_name argument.

I think this would fail in cases where col_name is not unique

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least in use cases I've seen, you'd want to join it back to the rest of the data right away. I wouldn't be opposed to having both if people ask for it, but only having it as a Series method is less useful IMO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get your point though I think it would be better to simply return an object that a user can join themselves rather than try to take care of the merging within the method.

It's entirely reasonable to expect this against a Series object, so not offering that I think makes for a more confusing API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it is an entirely plausible/reasonable scenario. However I would prefer to wait until we see people asking about Series.explode on github/mailing-list/stackoverflow to add that to Series. Otherwise if people only actually ever reach for explode in the context of a DataFrame then why bother having it in Series?

Note that this is also consistent with SQL / Spark APIs so I think it's unlikely for a lot of confusion to arise

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this should be a Series method.

@@ -918,6 +918,90 @@ def test_unstack_swaplevel_sortlevel(self, level):
tm.assert_frame_equal(result, expected)


def test_explode():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you parametrize this test instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's usually my default preference but for this case it would mean putting a ton of things in a @pytest.mark.parametrize decorator, which didn't smell right. Maybe I'm missing a better pattern to parametrize though, can you point me to an example you think I should follow here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's quite a few options but I'd say generally you can even split this out into different test cases.

  1. Normal functionality test
  2. A _raises test
  3. Empty / NA handling

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so then this is more about splitting it out into multiple tests rather than parametrize right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I split the test cases out and just pushed an update

df = pd.DataFrame([['foo,bar', 'x', 42],
['fizz,buzz', 'y', 43]],
columns=columns)
rs = df.explode('a', sep=',')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also use result and expected here instead of rs and xp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in the current context rs and xp has clear meaning and is more concise to read. IMO it's the same as naming a variable df instead of dataframe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

result and expected are practically the standard. You'll see that in other tests in the module / larger code base so best to stay consistent. Will appear less verbose once parametrized

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that consistency is good. However, I see a lot of different variations throughout the code base on tests (r, res, rs, result, results, res*, xp, e, xp, exp, exp*, expected, etc). In addition, other abbreviations throughout the codebase are equally clear in context:

dataframe -> df
series -> s
index -> idx
columns -> cols
number_of_something -> n
operator -> op
function -> fun

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

result and expected should far and away be the most widely used. While not 100% there using rs and xp is moving away from consistency so please change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see reviewers up in arms about fun v f to denote a function, or n vs num, or s vs ser. Not everything needs to have a strictly enforced standard.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with the naming as long as it follows the module where the tests are. We are somewhat inconsistent across test modules. But in a single module should be consistent with the existing style.

We for sure prefer fully written out names:
result, expected, df, op, s are ok.
idx and cols are generally not. res, and xp are generally not

@pep8speaks
Copy link

pep8speaks commented Dec 20, 2018

Hello @changhiskhan! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-27 18:30:25 UTC

@changhiskhan
Copy link
Contributor Author

@TomAugspurger I'll rebase after you merge in #24372 to make sure the benchmarks pass

…ev#16538)

Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast this is available as a built-in operation
in say Spark and is a fairly common use case.
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a compelling reason to have this as a frame method? Makes more senses as a Series method.

params = [[100, 1000, 10000], [3, 5, 10]]

def setup(self, n_rows, max_list_length):
import string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imports at the top of the file

@@ -31,6 +31,7 @@ New features
- :func:`read_feather` now accepts ``columns`` as an argument, allowing the user to specify which columns should be read. (:issue:`24025`)
- :func:`DataFrame.to_html` now accepts ``render_links`` as an argument, allowing the user to generate HTML with links to any URLs that appear in the DataFrame.
See the :ref:`section on writing HTML <io.html>` in the IO docs for example usage. (:issue:`2679`)
- :func:`DataFrame.explode` to split list-like values onto individual rows. See :ref:`section on Exploding list-like column <reshaping.html>` in docs for more information (:issue:`16538`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will need a sub-section here, show a mini-example and also point to the docs (as you are doing)

@@ -5980,6 +5980,49 @@ def melt(self, id_vars=None, value_vars=None, var_name=None,
var_name=var_name, value_name=value_name,
col_level=col_level)

def explode(self, col_name, sep=None, dtype=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this should be a Series method.

@jreback
Copy link
Contributor

jreback commented Jan 3, 2019

@changhiskhan can you merge master and update

@jreback
Copy link
Contributor

jreback commented Jan 14, 2019

@changhiskhan can you merge master & update

@WillAyd
Copy link
Member

WillAyd commented Feb 6, 2019

Think this would be a welcome change but closing as stale. Ping if you'd like to continue

@WillAyd WillAyd closed this Feb 6, 2019
@erfannariman
Copy link
Member

erfannariman commented May 24, 2019

I spend some time on StackOverflow, and the need to be able to unnest/explode a list to rows is a question that comes by a lot. If you guys are interested, here are two SO posts that propose a range of solutions, that might be handy for you guys or bring up some new ideas.

@jreback
Copy link
Contributor

jreback commented May 24, 2019

@erfpy this is a fine PR just needs some love
be the first to resubmit it

@erfannariman
Copy link
Member

@jreback would love to, I have many ideas, also would like to contribute to extend the docs, just have a hard time how to do that through git honestly. Have to spend some more time on how to make a good PR first I guess.

@jreback
Copy link
Contributor

jreback commented Jun 27, 2019

reopening, I will see what I can do with this in next few days.

@jreback jreback reopened this Jun 27, 2019
@WillAyd
Copy link
Member

WillAyd commented Jul 15, 2019

superseded by #27267

@WillAyd WillAyd closed this Jul 15, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants