ENH: str.extractall for several matches #11386
Never embed DataFrames within a Series! Instead you could simply return a multi-indexed frame: the first level holds the original index of the Series, the second level numbers the matches, and the columns stay as you have them.
@tdhock further you would prob add a parameter on how to handle partial matches (e.g. the |
Thanks for the suggestions @jreback. Now extractall returns a DataFrame with a MultiIndex:
>>> import re
>>> import pandas as pd
>>> import numpy as np
>>> data_dict = {
... 'single': {
... "Dave":'[email protected]',
... "Toby":'[email protected]',
... "Maude":'[email protected]',
... },
... 'multiple': {
... "robAndSteve": '[email protected] some text [email protected]',
... "abcdef": '[email protected] some text [email protected] and [email protected]',
... },
... 'none': {
... "missing":np.nan,
... "empty":"",
... },
... }
>>> tuple_list = []
>>> subject_list = []
>>> for k1, d in data_dict.iteritems():
... for k2, subject in d.iteritems():
... k = (k1, k2)
... tuple_list.append(k)
... subject_list.append(subject)
...
>>> index = pd.MultiIndex.from_tuples(tuple_list, names=("matches", "subject"))
>>> Si = pd.Series(subject_list, index)
>>> named_pattern = r'''
... (?P<user>[a-z0-9]+)
... @
... (?P<domain>[a-z]+)
... \.
... (?P<tld>[a-z]{2,4})
... '''
>>> iresult = Si.str.extractall(named_pattern, re.VERBOSE)
>>> iresult
user domain tld
matches subject
single Dave dave google com
Maude maudelaperriere gmail com
Toby tdhock5 gmail com
multiple robAndSteve rob gmail com
robAndSteve steve gmail com
abcdef a b com
abcdef c d com
abcdef e f com
>>> S = pd.Series(subject_list)
>>> result = S.str.extractall(named_pattern, re.VERBOSE)
>>> result
user domain tld
0 dave google com
1 maudelaperriere gmail com
2 tdhock5 gmail com
3 rob gmail com
3 steve gmail com
4 a b com
4 c d com
4 e f com
>>>
So then we can access all the matches for the subject string with rob and steve via:
>>> iresult.loc["multiple", "robAndSteve"]
user domain tld
matches subject
multiple robAndSteve rob gmail com
robAndSteve steve gmail com
>>> result.loc[3]
user domain tld
3 rob gmail com
3 steve gmail com
For subjects that have 0 matches I think it would be more consistent and user-friendly if the following would return a DataFrame with 0 rows rather than an exception. Is that possible using some index options?
>>> result.loc[5]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pandas/core/indexing.py", line 1198, in __getitem__
return self._getitem_axis(key, axis=0)
File "pandas/core/indexing.py", line 1342, in _getitem_axis
self._has_valid_type(key, axis)
File "pandas/core/indexing.py", line 1304, in _has_valid_type
error()
File "pandas/core/indexing.py", line 1291, in error
(key, self.obj._get_axis_name(axis)))
KeyError: 'the label [5] is not in the [index]'
>>> iresult.loc["none", "empty"]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pandas/core/indexing.py", line 1196, in __getitem__
return self._getitem_tuple(key)
File "pandas/core/indexing.py", line 709, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "pandas/core/indexing.py", line 822, in _getitem_lowerdim
result = self._handle_lowerdim_multi_index_axis0(tup)
File "pandas/core/indexing.py", line 804, in _handle_lowerdim_multi_index_axis0
raise e1
KeyError: 'none'
>>> |
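One way to get an empty DataFrame instead of a KeyError is to filter with a boolean mask rather than a label lookup. The sketch below assumes the output shape that extractall has in released pandas (a MultiIndex whose first level holds the subject labels) and a subject Series with a single-level index. `matches_for` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd


def matches_for(extracted, key):
    """Return the extractall rows for one subject label (hypothetical helper).

    Assumes `extracted` comes from Series.str.extractall on a Series with a
    single-level index, so level 0 of the result holds the subject labels.
    """
    subject_labels = extracted.index.get_level_values(0)
    # A boolean mask never raises for an absent label; it selects zero rows.
    return extracted[subject_labels == key]


subjects = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
result = subjects.str.extractall(r"(?P<letter>[ab])(?P<digit>\d)")

two_rows = matches_for(result, "A")  # both matches found in "a1a2"
no_rows = matches_for(result, "C")   # "c1" has no match: empty frame, no KeyError
```

The empty result keeps the group columns, so code that selects a column afterwards still works for subjects without matches.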
I further propose the extractiter method:
>>> for df in Si.str.extractiter(named_pattern, re.VERBOSE):
... print df
...
user domain tld
0 dave google com
user domain tld
0 maudelaperriere gmail com
user domain tld
0 tdhock5 gmail com
user domain tld
0 rob gmail com
1 steve gmail com
user domain tld
0 a b com
1 c d com
2 e f com
Empty DataFrame
Columns: [user, domain, tld]
Index: []
Empty DataFrame
Columns: [user, domain, tld]
Index: []
>>> for df in Si.str.extractiter(named_pattern, re.VERBOSE):
... print df["domain"]
...
0 google
Name: domain, dtype: object
0 gmail
Name: domain, dtype: object
0 gmail
Name: domain, dtype: object
0 gmail
1 gmail
Name: domain, dtype: object
0 b
1 d
2 f
Name: domain, dtype: object
Series([], Name: domain, dtype: object)
Series([], Name: domain, dtype: object)
>>>
@tdhock let's just keep it straightforward for now
OK, in that case I deleted extractiter, and added some more tests and examples for extractall.
>>> S.str.extractall("(?P<letter>[ab])(?P<digit>\d)")
letter digit
A a 1
why would this NOT be a multi-index here? Having a duplicated index is not at all convenient.
not sure what you mean. Can you please clarify?
When the input Series is not multi-indexed, there is no reason the output DataFrame should be. This is the same as the behavior of the standard extract method:
>>> from pandas import Series
>>> S = Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
>>> S.str.extractall("(?P<letter>[ab])(?P<digit>\d)")
letter digit
A a 1
A a 2
B b 1
>>> S.str.extract("(?P<letter>[ab])(?P<digit>\d)")
letter digit
A a 1
B b 1
C NaN NaN
>>> e_df = S.str.extract("(?P<letter>[ab])(?P<digit>\d)")
>>> e_df.index
Index([u'A', u'B', u'C'], dtype='object')
>>> e_df.keys()
Index([u'letter', u'digit'], dtype='object')
>>> ea_df = S.str.extractall("(?P<letter>[ab])(?P<digit>\d)")
>>> ea_df.index
Index([u'A', u'A', u'B'], dtype='object')
>>> ea_df.keys()
Index([u'letter', u'digit'], dtype='object')
>>>
.extract returns an object with the same index as the original. The proposed .extractall will by definition have duplicates of some of the index elements. This is very different. By its very nature it should return a MultiIndex (or, if the input already has a multi-index, add a level).
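To illustrate the point about duplicated labels, here is a minimal sketch against the extractall that ships in released pandas, where the added level is named match. The variable names are illustrative only.

```python
import pandas as pd

subjects = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
all_matches = subjects.str.extractall(r"(?P<letter>[ab])(?P<digit>\d)")

# Each row is addressed by (subject label, match number), so no label repeats
# even though subject "A" contributes two rows.
index_is_unique = all_matches.index.is_unique

# Taking match 0 recovers the first match of every subject that matched,
# indexed by the original subject labels.
first_matches = all_matches.xs(0, level="match")
```

With a flat index the two rows for "A" would share one label; with the extra level every row can be selected unambiguously.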
can you rebase / squash and i'll take a look
4140521 to 21bc58f
OK @jreback this is my first time doing a rebase / squash. Did I do it correctly?
yes that looks right
>>> S.str.extractall("(?P<letter>[ab])?(?P<digit>\d)")
letter digit
match
I think the first level should take its Index name from the original Series (could be None); the 2nd level is ok as match. pls add tests for that as well.
In fact the name of the index is taken from the original Series, and it could be None.
In [3]: S = Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
In [5]: S.str.extractall("(?P<letter>[ab])?(?P<digit>\d)")
Out[5]:
letter digit
match
A 0 a 1
1 a 2
B 0 b 1
C 0 NaN 1
In [6]: S.str.extractall("(?P<letter>[ab])?(?P<digit>\d)").index
Out[6]:
MultiIndex(levels=[[u'A', u'B', u'C'], [0, 1]],
labels=[[0, 0, 1, 2], [0, 1, 0, 0]],
names=[None, u'match'])
In [7]: Sn = Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
In [10]: Sn.index.name = "capital"
In [12]: Sn.str.extractall("(?P<letter>[ab])?(?P<digit>\d)")
Out[12]:
letter digit
capital match
A 0 a 1
1 a 2
B 0 b 1
C 0 NaN 1
Do you think I should add a name to the Series used in the docstring?
not necessary, just make sure you have a test for the name. Checking names is now the default (on series/frame comparisons)
OK, indeed there is already a test for a subject Series with a named index.
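A test of that kind could look roughly like the following. This is a sketch written against released pandas, not the test from the PR, and the function name is hypothetical.

```python
import pandas as pd


def test_extractall_keeps_index_name():
    """The first index level keeps the subject's index name; the second is 'match'."""
    pattern = r"(?P<letter>[ab])?(?P<digit>\d)"

    named = pd.Series(
        ["a1a2", "b1", "c1"],
        index=pd.Index(["A", "B", "C"], name="capital"),
    )
    assert list(named.str.extractall(pattern).index.names) == ["capital", "match"]

    # Without a name on the subject index, the first level name stays None.
    unnamed = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
    assert list(unnamed.str.extractall(pattern).index.names) == [None, "match"]
```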
TODO update docstrings
TODOs
.str functions are all tested in test--categorical - only the ones that need args are special cased
S = pd.Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
S.str.extract("[ab](?P<digit>\d)")

the ``extractall`` method (introduced in version 0.18)
use a versionadded tag here
OK
d32db93 to 89be755
I rebased with master and removed the duplications in the whatsnew file. I ran the tests on my machine but I am getting many errors which are unrelated to my PR
did you rebuild the extensions, e.g.
thanks for the tip. After rebuilding the extensions all tests pass on my machine.
@@ -201,9 +207,106 @@ and optional groups like

.. ipython:: python

   pd.Series(['a1', 'b2', '3']).str.extract('(?P<letter>[ab])?(?P<digit>\d)')
   pd.Series(['a1', 'b2', '3']).str.extract('([ab])?(\d)')
make extract and extractall sub-sections (I think you might have to use ^^^^ for the sub-headings)
OK
@tdhock just a couple of doc changes. ping when pushed and green and we'll merge.
groups_or_na = _groups_or_na_fun(regex)

if regex.groups == 1:
    result = np.array([groups_or_na(val)[0] for val in arr], dtype=object)
groups_or_na(subject) should be easier to understand than f(subject)
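For readers without the diff at hand, a helper with this name could work roughly as sketched below. The body is an assumption for illustration, not the pandas source, and `groups_or_na_fun` (without the leading underscore) is a stand-in name.

```python
import re


def groups_or_na_fun(regex):
    """Build a function mapping a subject to the groups of its first match.

    Illustrative sketch only: the name mirrors the helper in the diff, but
    the body is an assumption, not the pandas implementation.
    """
    if regex.groups == 0:
        raise ValueError("pattern contains no capture groups")
    empty_row = [float("nan")] * regex.groups

    def groups_or_na(subject):
        # Non-string subjects (for example a missing value) give an all-NaN row.
        if not isinstance(subject, str):
            return empty_row
        match = regex.search(subject)
        if match is None:
            return empty_row
        # Optional groups that did not participate come back as None.
        return [float("nan") if g is None else g for g in match.groups()]

    return groups_or_na


groups_or_na = groups_or_na_fun(re.compile(r"([ab])?(\d)"))
first = groups_or_na("a1")    # both groups matched
partial = groups_or_na("3")   # the optional letter group did not participate
missing = groups_or_na(None)  # not a string at all
```

Naming the inner function after what it returns is what makes a call such as groups_or_na(subject) self-explanatory at the call site.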
OK @jreback I think I have addressed all your concerns.
@tdhock thanks! great PR! and you put up with our comments! only last thing: http://pandas-docs.github.io/pandas-docs-travis/ will have the built docs (may take a bit of time as Travis is sometimes queued). This builds all docs & doc-strings. Have a look and pls issue a followup-PR if anything needs clarification / formatting.
This PR clarifies the new documentation for extract and extractall. It was requested by @jreback in #11386 (comment) Author: Toby Dylan Hocking <[email protected]> Closes #12281 from tdhock/extract-docs and squashes the following commits: 2019d1b [Toby Dylan Hocking] DOC: extract/extractall clarifications
Author: Toby Dylan Hocking <[email protected]> Closes pandas-dev#11386 from tdhock/extractall and squashes the following commits: 0c1c3d1 [Toby Dylan Hocking] ENH: extract(expand), extractall
For a series S, the excellent S.str.extract method returns the first match in each subject of the series:
That's great, but sometimes we want to extract all matches in each element of the series. You can do that with S.str.findall but its result does not include the names specified in the capturing groups of the regular expression:
I propose the S.str.extractall method which returns a Series the same length as the subject S. Each element of the series is a DataFrame with a row for each match and a column for each group:
Before I write any more testing code, can we start a discussion about whether or not this is an acceptable design choice, in relation to the other functionality of pandas? @sinhrks @jorisvandenbossche @jreback @mortada since you seem to be discussing extract in #10103
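For comparison, the difference between findall and extractall can be sketched as follows. This uses extractall as it exists in released pandas, which returns a single MultiIndexed DataFrame rather than the Series of DataFrames proposed in this description. The example data is borrowed from the review comments.

```python
import pandas as pd

subjects = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
pattern = r"(?P<letter>[ab])(?P<digit>\d)"

# findall gives one list of tuples per subject; the group names are lost.
found = subjects.str.findall(pattern)

# extractall gives one row per match, with the group names as column labels.
extracted = subjects.str.extractall(pattern)
```

Here found["A"] is a plain list of (letter, digit) tuples, while extracted carries the same values under the columns letter and digit.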
Also do you have any ideas about how to get the result (a Series of DataFrames) to print more nicely? With my current fork we have
In R the equivalent functionality is provided by the https://github.com/tdhock/namedCapture package (str_match_all_named returns a list of data.frames), and the resulting printout is readable because of the way that R prints lists: