-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Series.combine() fails with ExtensionArray inside of Series #21183
Changes from 13 commits
7469ca9
bbb6640
339b23a
61a09e7
d862e83
4c925fc
27480ac
f96372e
677fe18
9fceee7
1010cb5
aceea9f
79506ac
0e4720b
2a21117
e08f832
d3ed2c7
4ca28b2
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -181,6 +181,7 @@ Reshaping | |
Other | ||
^^^^^ | ||
|
||
- | ||
- :meth:`Series.combine()` works correctly with :class:`~pandas.api.extensions.ExtensionArray` inside of :class:`Series` (:issue:`20825`) | ||
- :meth:`Series.combine()` with scalar argument now works for any function type (:issue:`21248`) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this one can go in reshaping bug fixes. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. or into EA bug fixes There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In EA bug fixes |
||
- | ||
- |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2196,7 +2196,7 @@ def _binop(self, other, func, level=None, fill_value=None): | |
result.name = None | ||
return result | ||
|
||
def combine(self, other, func, fill_value=np.nan): | ||
def combine(self, other, func, fill_value=None): | ||
""" | ||
Perform elementwise binary operation on two Series using given function | ||
with optional fill value when an index is missing from one Series or | ||
|
@@ -2208,6 +2208,8 @@ def combine(self, other, func, fill_value=np.nan): | |
func : function | ||
Function that takes two scalars as inputs and return a scalar | ||
fill_value : scalar value | ||
The default specifies to use the appropriate NaN value for | ||
the underlying dtype of the Series | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. does this need a versionchanged? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There should be no change in behaviour for normal Series I think, as the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with @jorisvandenbossche |
||
|
||
Returns | ||
------- | ||
|
@@ -2227,20 +2229,31 @@ def combine(self, other, func, fill_value=np.nan): | |
Series.combine_first : Combine Series values, choosing the calling | ||
Series's values first | ||
""" | ||
self_is_ext = is_extension_array_dtype(self.values) | ||
if fill_value is None: | ||
fill_value = na_value_for_dtype(self.dtype, False) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Use the keyword for False so it’s clearer what’s being specified. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK, that is done. |
||
|
||
if isinstance(other, Series): | ||
new_index = self.index.union(other.index) | ||
new_name = ops.get_op_result_name(self, other) | ||
new_values = np.empty(len(new_index), dtype=self.dtype) | ||
for i, idx in enumerate(new_index): | ||
new_values = [] | ||
for idx in new_index: | ||
lv = self.get(idx, fill_value) | ||
rv = other.get(idx, fill_value) | ||
with np.errstate(all='ignore'): | ||
new_values[i] = func(lv, rv) | ||
new_values.append(func(lv, rv)) | ||
else: | ||
new_index = self.index | ||
with np.errstate(all='ignore'): | ||
new_values = func(self._values, other) | ||
new_values = [func(lv, other) for lv in self._values] | ||
new_name = self.name | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you put a comment on what is going on here There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
if self_is_ext and not is_categorical_dtype(self.values): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am puzzled by all of these element-by-element operations. Is there a reason you don't simply define There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jreback There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sounds good to me There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. pls don't use
why is the try/except here? that is very odd There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I changed the if test as you suggested. @jreback wrote:
The idea is to first try to coerce to the same type as the extension dtype, and if that doesn't work, just call the regular constructor for the Series. |
||
try: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. so what kind of op hits the type error here? I find this a bit disconcerting that you need to catch a TypeError? is this a case that the combine op returns a result which is not an extension type (e.g. say its an int or someting), is that the reason? if so pls indicate via a comment. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jreback Here is an example:
Since the function passed to combine is an arbitrary function, it could return a result of any type, which may not be the type that will fit in the EA. I'll add a comment. |
||
new_values = self._values._from_sequence(new_values) | ||
except TypeError: | ||
pass | ||
|
||
return self._constructor(new_values, index=new_index, name=new_name) | ||
|
||
def combine_first(self, other): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -103,3 +103,36 @@ def test_factorize_equivalence(self, data_for_grouping, na_sentinel): | |
|
||
tm.assert_numpy_array_equal(l1, l2) | ||
self.assert_extension_array_equal(u1, u2) | ||
|
||
def test_combine_le(self, data_repeated): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you give a 1-liner explaining what this is testing. the name of the test is uninformative. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
# GH 20825 | ||
orig_data1, orig_data2 = data_repeated(2) | ||
s1 = pd.Series(orig_data1) | ||
s2 = pd.Series(orig_data2) | ||
result = s1.combine(s2, lambda x1, x2: x1 <= x2) | ||
expected = pd.Series([a <= b for (a, b) in | ||
zip(list(orig_data1), list(orig_data2))]) | ||
self.assert_series_equal(result, expected) | ||
|
||
val = s1.iloc[0] | ||
result = s1.combine(val, lambda x1, x2: x1 <= x2) | ||
expected = pd.Series([a <= val for a in list(orig_data1)]) | ||
self.assert_series_equal(result, expected) | ||
|
||
def test_combine_add(self, data_repeated): | ||
# GH 20825 | ||
orig_data1, orig_data2 = data_repeated(2) | ||
s1 = pd.Series(orig_data1) | ||
s2 = pd.Series(orig_data2) | ||
result = s1.combine(s2, lambda x1, x2: x1 + x2) | ||
expected = pd.Series( | ||
orig_data1._from_sequence([a + b for (a, b) in | ||
zip(list(orig_data1), | ||
list(orig_data2))])) | ||
self.assert_series_equal(result, expected) | ||
|
||
val = s1.iloc[0] | ||
result = s1.combine(val, lambda x1, x2: x1 + x2) | ||
expected = pd.Series( | ||
orig_data1._from_sequence([a + val for a in list(orig_data1)])) | ||
self.assert_series_equal(result, expected) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -29,6 +29,15 @@ def data_missing(): | |
return Categorical([np.nan, 'A']) | ||
|
||
|
||
@pytest.fixture | ||
def data_repeated(): | ||
"""Return different versions of data for count times""" | ||
def gen(count): | ||
for _ in range(count): | ||
yield Categorical(make_data()) | ||
yield gen | ||
|
||
|
||
@pytest.fixture | ||
def data_for_sorting(): | ||
return Categorical(['A', 'B', 'C'], categories=['C', 'A', 'B'], | ||
|
@@ -154,6 +163,10 @@ class TestMethods(base.BaseMethodsTests): | |
def test_value_counts(self, all_data, dropna): | ||
pass | ||
|
||
@pytest.mark.skip(reason="combine add for categorical not supported") | ||
def test_combine_add(self, data_repeated): | ||
pass | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. does it raise? if so pls tests that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is interesting. Since
In pandas 0.23.0, this fails. But the binary operation is defined on an element by element basis, so the new implementation will return:
IMHO, this is correct, because of what |
||
|
||
|
||
class TestCasting(base.BaseCastingTests): | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move the ExtensionArray to a sub-section (create a new one) as lots of changes in EA (you can make a new bug fix section).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just made a bug fix section