[ENH] Add DataFrame method to explode a list-like column (GH #16538) #24366

changhiskhan · 2018-12-20T00:13:32Z

Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast, this is available as a built-in operation
in, say, Spark, and it is a fairly common use case.

closes ENH: (explode) Splitting a column content over multiple rows while duplicating other columns content to these rows #16538
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry
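
For illustration only (this is not the PR's implementation): a minimal sketch of one way to chain existing pandas operations to get this behavior for a comma-separated string column. The frame mirrors the test data in this PR; the column names `b` and `c` are assumptions.

```python
import pandas as pd

# Hypothetical input; only column "a" is named in the PR's tests.
df = pd.DataFrame(
    [["foo,bar", "x", 42], ["fizz,buzz", "y", 43]],
    columns=["a", "b", "c"],
)

# 1. Split each string into a list of pieces.
pieces = df["a"].str.split(",")

# 2. Repeat every row once per piece, so the other columns stay aligned.
repeated = df.loc[df.index.repeat(pieces.map(len))]

# 3. Overwrite the column with the flattened pieces and renumber the rows.
flat = [value for row in pieces for value in row]
result = repeated.assign(a=flat).reset_index(drop=True)

print(result)
```

Each of the three steps is simple, but the combination is the kind of non-obvious sequence the description refers to.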

codecov · 2018-12-20T01:05:58Z

Codecov Report

Merging #24366 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #24366      +/-   ##
==========================================
- Coverage   92.29%   92.29%   -0.01%     
==========================================
  Files         162      162              
  Lines       51832    51843      +11     
==========================================
+ Hits        47839    47849      +10     
- Misses       3993     3994       +1

Flag	Coverage Δ
#multiple	`90.7% <100%> (ø)`	⬆️
#single	`42.97% <9.09%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/frame.py	`96.93% <100%> (+0.02%)`	⬆️
pandas/util/testing.py	`87.57% <0%> (-0.1%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f6cf7d9...dbd7515. Read the comment docs.

codecov · 2018-12-20T01:05:59Z

Codecov Report

Merging #24366 into master will increase coverage by 50.33%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           master   #24366       +/-   ##
===========================================
+ Coverage   41.96%   92.29%   +50.33%     
===========================================
  Files         180      162       -18     
  Lines       50718    51852     +1134     
===========================================
+ Hits        21283    47859    +26576     
+ Misses      29435     3993    -25442

Flag	Coverage Δ
#multiple	`90.7% <100%> (?)`
#single	`42.96% <9.09%> (+1%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/frame.py	`96.93% <100%> (+61.82%)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/compat/__init__.py	`58.3% <0%> (-33.7%)`	⬇️
pandas/io/clipboard/clipboards.py	`28.23% <0%> (-6.55%)`	⬇️
pandas/plotting/_misc.py	`38.68% <0%> (-4.56%)`	⬇️
pandas/io/formats/console.py	`74.24% <0%> (-3.89%)`	⬇️
pandas/_libs/__init__.py	`100% <0%> (ø)`	⬆️
pandas/io/api.py	`100% <0%> (ø)`	⬆️
pandas/_libs/tslibs/__init__.py	`100% <0%> (ø)`	⬆️
pandas/core/api.py	`100% <0%> (ø)`	⬆️
... and 169 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f8b0b9f...0323ce3. Read the comment docs.

TomAugspurger

I think dplyr calls this separate. https://tidyr.tidyverse.org/reference/separate.html

Do you have a preference for explode over separate? If not, I would say matching dplyr's verb makes sense.

TomAugspurger · 2018-12-20T14:50:22Z

doc/source/reshaping.rst

+----------------------------
+
+.. ipython:: python
+   :suppress:


Maybe don't suppress this? I think users may want to see the input.

Makes sense. It'll be easier for them to just copy/paste into a repl to try it out.

TomAugspurger · 2018-12-20T14:51:04Z

pandas/core/frame.py

+            Convenience to split a string `col_name` before exploding
+        dtype : str or dtype, default None
+            Optionally coerce the dtype of exploded column
+-


Stray character here?

LOL, at first glance I thought that was a git diff acting weird. I clearly forgot to validate the docstring

TomAugspurger · 2018-12-20T14:52:12Z

pandas/core/frame.py

+        dtype : str or dtype, default None
+            Optionally coerce the dtype of exploded column
+-
+        Examples


Perhaps add a See Also linking to Series.str.split, Series.str.extract? Maybe others?

Are we interested in implementing the inverse operation (what dplyr calls unite: https://tidyr.tidyverse.org/reference/unite.html)?

Good idea. I'll add Series.str.split and Series.str.extract here. Which other ones do you think would be relevant?

unite: so it would be like a groupby.agg(list/concat) type of operation? I'm not opposed to it but I think there's no urgency since we haven't had much user demand. My guess is because it maps to groupby so it's more natural to think about than the reverse.
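
A rough sketch of what that `groupby.agg(list)` inverse could look like, using hypothetical column names `key` and `value` (nothing like this is part of the PR):

```python
import pandas as pd

# Hypothetical "exploded" frame: one value per row.
long_df = pd.DataFrame(
    {"key": ["x", "x", "y", "y"], "value": ["foo", "bar", "fizz", "buzz"]}
)

# Collect the values of each key back into a single list per row.
collapsed = long_df.groupby("key", sort=False)["value"].agg(list).reset_index()

print(collapsed)
```

Because this is an ordinary groupby aggregation, users can already reach it without a dedicated method, which fits the guess about low demand.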

TomAugspurger · 2018-12-20T14:54:10Z

pandas/core/frame.py

+
+        Parameters
+        ----------
+        col_name : str


I think we've been moving towards always using the full column instead of col for parameter names.

how would we distinguish between the string name of the column and the column's data?

I guess there's enough context here. I'll change to column for consistency then. But in general I'm still curious what the conclusion was for the question above.

TomAugspurger

Another thought: does this make sense on a Series too? Right now it's frame only.

Benchmark failures are fixed in #24372.

TomAugspurger · 2018-12-20T14:55:54Z

pandas/core/frame.py

@@ -5980,6 +5980,49 @@ def melt(self, id_vars=None, value_vars=None, var_name=None,
                    var_name=var_name, value_name=value_name,
                    col_level=col_level)

+    def explode(self, col_name, sep=None, dtype=None):
+        """
+        Create a new DataFrame where each element in each row


I think this has to be a single line.

Then you can have a multi-line extended summary. scripts/validate_docstrings pandas.core.frame.explode should print out all the issues.

ah ok. will fix. thanks
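
A sketch of the docstring layout being asked for: a one-line summary, a blank line, then a multi-line extended summary and the numpydoc sections. The function name `explode_column`, its wording, and its body are hypothetical and are not the PR's code.

```python
import pandas as pd


def explode_column(frame, column):
    """
    Split list-like values in one column onto separate rows.

    The extended summary may span several lines. Values in the other
    columns are repeated for every element that is split out.

    Parameters
    ----------
    frame : DataFrame
        Input data. This sketch assumes a unique index.
    column : str
        Name of the column holding list-like values.

    Returns
    -------
    DataFrame
        Copy of ``frame`` with one row per list element.
    """
    lengths = frame[column].map(len)
    values = [value for row in frame[column] for value in row]
    # Repeat each row once per element, then overwrite the column.
    return frame.loc[frame.index.repeat(lengths)].assign(**{column: values})


out = explode_column(pd.DataFrame({"a": [[1, 2], [3]], "b": ["x", "y"]}), "a")
print(out)
```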

jreback · 2018-12-20T15:18:07Z

explode is a much better name here and more common in the SQL world

changhiskhan · 2018-12-20T17:38:42Z

Yeah, Spark has also adopted the explode terminology. AFAICT the name comes from the fact that in full implementations you can pass in a function that is applied to a column with arbitrarily nested data and extracts a flat sequence, whose elements are then separated onto their own rows.

changhiskhan · 2018-12-20T17:51:24Z

re: Series.explode most of the use cases I've come across would use DataFrame.explode since I almost always still want to join it back to the rest of the data once exploded. That being said, I'm certainly not opposed to Series.explode, but my preference would be to keep this PR simple and add Series.explode if we see demand from the community.

WillAyd · 2018-12-20T17:52:31Z

pandas/core/frame.py

@@ -5980,6 +5980,49 @@ def melt(self, id_vars=None, value_vars=None, var_name=None,
                    var_name=var_name, value_name=value_name,
                    col_level=col_level)

+    def explode(self, col_name, sep=None, dtype=None):
+        """
+        Create a new DataFrame where each element in each row


There are a few errors here which the validate_docstrings.py script will help identify for you

thanks. I'll push an update

WillAyd · 2018-12-20T17:53:52Z

pandas/core/frame.py

@@ -5980,6 +5980,49 @@ def melt(self, id_vars=None, value_vars=None, var_name=None,
                    var_name=var_name, value_name=value_name,
                    col_level=col_level)

+    def explode(self, col_name, sep=None, dtype=None):


Hmm would this be better as a Series method? Requiring col_name as a parameter makes it so it only operates as such anyway, no?

Started reviewing before I saw your above comment. Still think this is better served as a Series method instead of a frame method with a required col_name argument.

I think this would fail in cases where col_name is not unique

At least in use cases I've seen, you'd want to join it back to the rest of the data right away. I wouldn't be opposed to having both if people ask for it, but only having it as a Series method is less useful IMO.

I get your point though I think it would be better to simply return an object that a user can join themselves rather than try to take care of the merging within the method.

It's entirely reasonable to expect this against a Series object, so not offering that I think makes for a more confusing API.

I agree it is an entirely plausible/reasonable scenario. However I would prefer to wait until we see people asking about Series.explode on github/mailing-list/stackoverflow to add that to Series. Otherwise if people only actually ever reach for explode in the context of a DataFrame then why bother having it in Series?

Note that this is also consistent with SQL / Spark APIs so I think it's unlikely for a lot of confusion to arise

I agree this should be a Series method.
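
To make the two positions concrete, a sketch of the Series-based workflow: explode the column on its own, then join it back on the index. This assumes a pandas version that has `Series.explode` (the method added by the follow-up PR #27267), not the `DataFrame.explode(col_name, sep, dtype)` signature proposed here; the data is made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [["foo", "bar"], ["fizz", "buzz"]], "b": ["x", "y"]})

# Explode only the Series; the original row labels are repeated (0, 0, 1, 1).
exploded = df["a"].explode()

# The caller joins it back on the index to recover the other columns.
result = df.drop(columns="a").join(exploded)

print(result)
```

The repeated index labels are what make the join line up, so the merge step stays a one-liner for the user.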

WillAyd · 2018-12-20T17:54:23Z

pandas/tests/frame/test_reshape.py

@@ -918,6 +918,90 @@ def test_unstack_swaplevel_sortlevel(self, level):
        tm.assert_frame_equal(result, expected)


+def test_explode():


Can you parametrize this test instead?

That's usually my default preference but for this case it would mean putting a ton of things in a @pytest.mark.parametrize decorator, which didn't smell right. Maybe I'm missing a better pattern to parametrize though, can you point me to an example you think I should follow here?

There are quite a few options, but I'd say generally you can even split this out into different test cases:

Normal functionality test

A _raises test

Empty / NA handling

ok, so then this is more about splitting it out into multiple tests rather than parametrize right?

ok, I split the test cases out and just pushed an update
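
A sketch of how that three-way split might look. It is written against the `DataFrame.explode(column)` that eventually shipped in pandas rather than this PR's `sep`/`dtype` signature, and the test names and data are hypothetical:

```python
import pandas as pd
import pytest


def test_explode_basic():
    df = pd.DataFrame({"a": [["foo", "bar"], ["fizz"]], "b": ["x", "y"]})
    result = df.explode("a")
    assert result["a"].tolist() == ["foo", "bar", "fizz"]
    assert result["b"].tolist() == ["x", "x", "y"]


def test_explode_raises_on_duplicate_columns():
    df = pd.DataFrame({"a": [["foo"]], "b": ["x"]})
    df.columns = ["a", "a"]  # make the column labels ambiguous
    with pytest.raises(ValueError):
        df.explode("a")


def test_explode_empty_list():
    df = pd.DataFrame({"a": [[], ["foo"]], "b": ["x", "y"]})
    result = df.explode("a")
    # An empty list still yields one row, holding a missing value.
    assert result["a"].isna().tolist() == [True, False]
    assert result["b"].tolist() == ["x", "y"]
```

Each function covers one behavior, so a failure points directly at the case that broke without needing a large `@pytest.mark.parametrize` table.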

WillAyd · 2018-12-20T18:33:11Z

pandas/tests/frame/test_reshape.py

+    df = pd.DataFrame([['foo,bar', 'x', 42],
+                       ['fizz,buzz', 'y', 43]],
+                      columns=columns)
+    rs = df.explode('a', sep=',')


Also use result and expected here instead of rs and xp

I think in the current context rs and xp have clear meaning and are more concise to read. IMO it's the same as naming a variable df instead of dataframe.

result and expected are practically the standard. You'll see that in other tests in the module / larger code base so best to stay consistent. Will appear less verbose once parametrized

I agree that consistency is good. However, I see a lot of different variations throughout the code base on tests (r, res, rs, result, results, res*, xp, e, xp, exp, exp*, expected, etc). In addition, other abbreviations throughout the codebase are equally clear in context:

dataframe -> df
series -> s
index -> idx
columns -> cols
number_of_something -> n
operator -> op
function -> fun

result and expected should far and away be the most widely used. While we're not 100% there, using rs and xp moves away from consistency, so please change.

I don't see reviewers up in arms about fun v f to denote a function, or n vs num, or s vs ser. Not everything needs to have a strictly enforced standard.

I am ok with the naming as long as it follows the module where the tests are. We are somewhat inconsistent across test modules. But in a single module should be consistent with the existing style.

We for sure prefer fully written out names:
result, expected, df, op, s are ok.
idx and cols are generally not. res and xp are generally not either.

pep8speaks · 2018-12-20T19:02:34Z

Hello @changhiskhan! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-27 18:30:25 UTC

changhiskhan · 2018-12-20T20:16:10Z

@TomAugspurger I'll rebase after you merge in #24372 to make sure the benchmarks pass


jreback

Is there a compelling reason to have this as a frame method? It makes more sense as a Series method.

jreback · 2018-12-21T17:06:32Z

asv_bench/benchmarks/reshape.py

+    params = [[100, 1000, 10000], [3, 5, 10]]
+
+    def setup(self, n_rows, max_list_length):
+        import string


imports at the top of the file
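
A sketch of the benchmark with the import hoisted to module level, as requested. Only `params` and the `setup` signature come from the diff above; the class name, `param_names`, the data generation, and the timed call (which uses the `DataFrame.explode(column)` found in released pandas) are assumptions:

```python
import string

import numpy as np
import pandas as pd


class Explode:
    params = [[100, 1000, 10000], [3, 5, 10]]
    param_names = ["n_rows", "max_list_length"]

    def setup(self, n_rows, max_list_length):
        rng = np.random.default_rng(0)
        letters = list(string.ascii_letters)
        # Each row gets a list of 1..max_list_length random letters.
        lengths = rng.integers(1, max_list_length + 1, size=n_rows)
        values = [list(rng.choice(letters, size=k)) for k in lengths]
        self.df = pd.DataFrame({"a": values, "b": np.arange(n_rows)})

    def time_explode(self, n_rows, max_list_length):
        self.df.explode("a")
```

Keeping `import string` at the top means the import is not repeated on every `setup` call and matches the layout of the other benchmark modules.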

jreback · 2018-12-21T17:07:08Z

doc/source/whatsnew/v0.24.0.rst

@@ -31,6 +31,7 @@ New features
 - :func:`read_feather` now accepts ``columns`` as an argument, allowing the user to specify which columns should be read. (:issue:`24025`)
 - :func:`DataFrame.to_html` now accepts ``render_links`` as an argument, allowing the user to generate HTML with links to any URLs that appear in the DataFrame.
  See the :ref:`section on writing HTML <io.html>` in the IO docs for example usage. (:issue:`2679`)
+- :func:`DataFrame.explode` to split list-like values onto individual rows. See :ref:`section on Exploding list-like column <reshaping.html>` in docs for more information (:issue:`16538`)


will need a sub-section here, show a mini-example and also point to the docs (as you are doing)

jreback · 2018-12-21T17:07:18Z

pandas/core/frame.py

@@ -5980,6 +5980,49 @@ def melt(self, id_vars=None, value_vars=None, var_name=None,
                    var_name=var_name, value_name=value_name,
                    col_level=col_level)

+    def explode(self, col_name, sep=None, dtype=None):


I agree this should be a Series method.

jreback · 2019-01-03T23:12:35Z

@changhiskhan can you merge master and update

jreback · 2019-01-14T00:22:12Z

@changhiskhan can you merge master & update

WillAyd · 2019-02-06T03:33:40Z

Think this would be a welcome change but closing as stale. Ping if you'd like to continue

erfannariman · 2019-05-24T15:42:28Z

I spend some time on StackOverflow, and the need to be able to unnest/explode a list to rows is a question that comes by a lot. If you guys are interested, here are two SO posts that propose a range of solutions, that might be handy for you guys or bring up some new ideas.

First
Second

jreback · 2019-05-24T16:11:48Z

@erfpy this is a fine PR just needs some love
be the first to resubmit it

erfannariman · 2019-05-24T16:16:18Z

@jreback would love to, I have many ideas, also would like to contribute to extend the docs, just have a hard time how to do that through git honestly. Have to spend some more time on how to make a good PR first I guess.

jreback · 2019-06-27T18:28:38Z

reopening, I will see what I can do with this in next few days.

WillAyd · 2019-07-15T01:07:52Z

superseded by #27267

changhiskhan force-pushed the explode-dataframe branch 2 times, most recently from 2c6f058 to dbd7515 Compare December 20, 2018 00:26

changhiskhan force-pushed the explode-dataframe branch 2 times, most recently from caabc63 to 9e76b75 Compare December 20, 2018 06:56

TomAugspurger reviewed Dec 20, 2018


WillAyd requested changes Dec 20, 2018


WillAyd added the API Design label Dec 20, 2018

WillAyd requested changes Dec 20, 2018


changhiskhan force-pushed the explode-dataframe branch from 9e76b75 to 96c4525 Compare December 20, 2018 19:02

changhiskhan force-pushed the explode-dataframe branch from 96c4525 to bd49629 Compare December 20, 2018 20:59

changhiskhan force-pushed the explode-dataframe branch from bd49629 to 2138ef0 Compare December 20, 2018 21:32

jreback requested changes Dec 21, 2018


WillAyd closed this Feb 6, 2019

jreback reopened this Jun 27, 2019

Merge branch 'master' into PR_TOOL_MERGE_PR_24366

0323ce3

jreback mentioned this pull request Jul 6, 2019

ENH: Add Series method to explode a list-like column #27267

Merged

WillAyd closed this Jul 15, 2019
