-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Add DataFrame method to explode a list-like column (GH #16538) #24366
Conversation
2c6f058
to
dbd7515
Compare
Codecov Report
@@ Coverage Diff @@
## master #24366 +/- ##
==========================================
- Coverage 92.29% 92.29% -0.01%
==========================================
Files 162 162
Lines 51832 51843 +11
==========================================
+ Hits 47839 47849 +10
- Misses 3993 3994 +1
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24366 +/- ##
===========================================
+ Coverage 41.96% 92.29% +50.33%
===========================================
Files 180 162 -18
Lines 50718 51852 +1134
===========================================
+ Hits 21283 47859 +26576
+ Misses 29435 3993 -25442
Continue to review full report at Codecov.
|
caabc63
to
9e76b75
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think dplyr calls the separate. https://tidyr.tidyverse.org/reference/separate.html
Do you have a preference for explode over separate? If not, I would say matching dplyr's verb makes sense.
doc/source/reshaping.rst
Outdated
---------------------------- | ||
|
||
.. ipython:: python | ||
:suppress: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe don't suppress this? I think users may want to see the input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. It'll be easier for them to just copy/paste into a repl to try it out.
pandas/core/frame.py
Outdated
Convenience to split a string `col_name` before exploding | ||
dtype : str or dtype, default None | ||
Optionally coerce the dtype of exploded column | ||
- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stray character here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LOL, at first glance I thought that was a git diff acting weird. I clearly forgot to validate the docstring
dtype : str or dtype, default None | ||
Optionally coerce the dtype of exploded column | ||
- | ||
Examples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps add a See Also linking to Series.str.split
, Series.str.extract
? Maybe others?
Are we interested in implementing the inverse operation (what dplyr calls unite
: https://tidyr.tidyverse.org/reference/unite.html)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. I'll add Series.str.split and Series.str.extract here. Which other ones do you think would be relevant?
unite
: so it would be like a groupby.agg(list/concat) type of operation? I'm not opposed to it but I think there's no urgency since we haven't had much user demand. My guess is because it maps to groupby so it's more natural to think about than the reverse.
|
||
Parameters | ||
---------- | ||
col_name : str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we've been moving towards always using the full column
instead of col
for parameter names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how would we distinguish between the string name of the column and the column's data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess there's enough context here. I'll change to column
for consistency then. But in general I'm still curious what the conclusion was for the question above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thought: does this make sense on a Series too? Right now it's frame only.
Benchmark failures are fixed in #24372.
pandas/core/frame.py
Outdated
@@ -5980,6 +5980,49 @@ def melt(self, id_vars=None, value_vars=None, var_name=None, | |||
var_name=var_name, value_name=value_name, | |||
col_level=col_level) | |||
|
|||
def explode(self, col_name, sep=None, dtype=None): | |||
""" | |||
Create a new DataFrame where each element in each row |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this has to be a single line.
Then you can have a multi-line extended summary. scripts/validate_docstrings pandas.core.frame.explode
should print out all the issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah ok. will fix. thanks
explode is a much better name here and more common i the sql world |
Yeah, Spark has also adopted the explode terminology. AFAICT the name comes from the fact that in full implementations you can pass in function to be applied to a column with arbitrary nested data and extracts a flat sequence to then be separated onto their own rows. |
re: |
pandas/core/frame.py
Outdated
@@ -5980,6 +5980,49 @@ def melt(self, id_vars=None, value_vars=None, var_name=None, | |||
var_name=var_name, value_name=value_name, | |||
col_level=col_level) | |||
|
|||
def explode(self, col_name, sep=None, dtype=None): | |||
""" | |||
Create a new DataFrame where each element in each row |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a few errors here which the validate_docstrings.py script will help identify for you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks. I'll push an update
@@ -5980,6 +5980,49 @@ def melt(self, id_vars=None, value_vars=None, var_name=None, | |||
var_name=var_name, value_name=value_name, | |||
col_level=col_level) | |||
|
|||
def explode(self, col_name, sep=None, dtype=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm would this be better as a Series method? Requiring col_name
as a parameter makes it so it only operates as such anyway, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Started reviewing before I saw your above comment. Still think this is better served as a Series method instead of a frame method with a required col_name argument.
I think this would fail in cases where col_name
is not unique
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least in use cases I've seen, you'd want to join it back to the rest of the data right away. I wouldn't be opposed to having both if people ask for it, but only having it as a Series method is less useful IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get your point though I think it would be better to simply return an object that a user can join themselves rather than try to take care of the merging within the method.
It's entirely reasonable to expect this against a Series object, so not offering that I think makes for a more confusing API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it is an entirely plausible/reasonable scenario. However I would prefer to wait until we see people asking about Series.explode
on github/mailing-list/stackoverflow to add that to Series. Otherwise if people only actually ever reach for explode
in the context of a DataFrame
then why bother having it in Series?
Note that this is also consistent with SQL / Spark APIs so I think it's unlikely for a lot of confusion to arise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this should be a Series method.
pandas/tests/frame/test_reshape.py
Outdated
@@ -918,6 +918,90 @@ def test_unstack_swaplevel_sortlevel(self, level): | |||
tm.assert_frame_equal(result, expected) | |||
|
|||
|
|||
def test_explode(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you parametrize this test instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's usually my default preference but for this case it would mean putting a ton of things in a @pytest.mark.parametrize
decorator, which didn't smell right. Maybe I'm missing a better pattern to parametrize though, can you point me to an example you think I should follow here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's quite a few options but I'd say generally you can even split this out into different test cases.
- Normal functionality test
- A
_raises
test - Empty / NA handling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so then this is more about splitting it out into multiple tests rather than parametrize right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I split the test cases out and just pushed an update
pandas/tests/frame/test_reshape.py
Outdated
df = pd.DataFrame([['foo,bar', 'x', 42], | ||
['fizz,buzz', 'y', 43]], | ||
columns=columns) | ||
rs = df.explode('a', sep=',') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also use result
and expected
here instead of rs
and xp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in the current context rs
and xp
has clear meaning and is more concise to read. IMO it's the same as naming a variable df
instead of dataframe
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
result
and expected
are practically the standard. You'll see that in other tests in the module / larger code base so best to stay consistent. Will appear less verbose once parametrized
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that consistency is good. However, I see a lot of different variations throughout the code base on tests (r, res, rs, result, results, res*, xp, e, xp, exp, exp*, expected, etc). In addition, other abbreviations throughout the codebase are equally clear in context:
dataframe -> df
series -> s
index -> idx
columns -> cols
number_of_something -> n
operator -> op
function -> fun
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
result
and expected
should far and away be the most widely used. While not 100% there using rs
and xp
is moving away from consistency so please change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see reviewers up in arms about fun
v f
to denote a function, or n
vs num
, or s
vs ser
. Not everything needs to have a strictly enforced standard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ok with the naming as long as it follows the module where the tests are. We are somewhat inconsistent across test modules. But in a single module should be consistent with the existing style.
We for sure prefer fully written out names:
result, expected, df, op, s are ok.
idx and cols are generally not. res, and xp are generally not
9e76b75
to
96c4525
Compare
Hello @changhiskhan! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-06-27 18:30:25 UTC |
@TomAugspurger I'll rebase after you merge in #24372 to make sure the benchmarks pass |
96c4525
to
bd49629
Compare
…ev#16538) Sometimes a values column is presented with list-like values on one row. Instead we may want to split each individual value onto its own row, keeping the same mapping to the other key columns. While it's possible to chain together existing pandas operations (in fact that's exactly what this implementation is) to do this, the sequence of operations is not obvious. By contrast this is available as a built-in operation in say Spark and is a fairly common use case.
bd49629
to
2138ef0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a compelling reason to have this as a frame method? Makes more senses as a Series method.
params = [[100, 1000, 10000], [3, 5, 10]] | ||
|
||
def setup(self, n_rows, max_list_length): | ||
import string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imports at the top of the file
@@ -31,6 +31,7 @@ New features | |||
- :func:`read_feather` now accepts ``columns`` as an argument, allowing the user to specify which columns should be read. (:issue:`24025`) | |||
- :func:`DataFrame.to_html` now accepts ``render_links`` as an argument, allowing the user to generate HTML with links to any URLs that appear in the DataFrame. | |||
See the :ref:`section on writing HTML <io.html>` in the IO docs for example usage. (:issue:`2679`) | |||
- :func:`DataFrame.explode` to split list-like values onto individual rows. See :ref:`section on Exploding list-like column <reshaping.html>` in docs for more information (:issue:`16538`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will need a sub-section here, show a mini-example and also point to the docs (as you are doing)
@@ -5980,6 +5980,49 @@ def melt(self, id_vars=None, value_vars=None, var_name=None, | |||
var_name=var_name, value_name=value_name, | |||
col_level=col_level) | |||
|
|||
def explode(self, col_name, sep=None, dtype=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this should be a Series method.
@changhiskhan can you merge master and update |
@changhiskhan can you merge master & update |
Think this would be a welcome change but closing as stale. Ping if you'd like to continue |
@erfpy this is a fine PR just needs some love |
@jreback would love to, I have many ideas, also would like to contribute to extend the docs, just have a hard time how to do that through git honestly. Have to spend some more time on how to make a good PR first I guess. |
reopening, I will see what I can do with this in next few days. |
superseded by #27267 |
Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast this is available as a built-in operation
in say Spark and is a fairly common use case.
git diff upstream/master -u -- "*.py" | flake8 --diff