-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: Add Manager.column_setitem to set values into a single column (without intermediate series) #47074
REF: Add Manager.column_setitem to set values into a single column (without intermediate series) #47074
Changes from 4 commits
0e4c58e
a2aa8aa
ce0649b
103d1fe
d20b0cb
453eaba
be740ad
e63c7f6
025a3d4
caf7be8
8d7ee1a
25e903b
5e30199
9d4566f
faed070
ea063e6
3f30cab
db8e866
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3921,16 +3921,19 @@ def _set_value( | |
Sets whether or not index/col interpreted as indexers | ||
""" | ||
try: | ||
# setitem will do validation that may raise TypeError, | ||
# ValueError, or LossySetitemError | ||
if takeable: | ||
series = self._ixs(col, axis=1) | ||
loc = index | ||
icol = col | ||
iindex = index | ||
else: | ||
series = self._get_item_cache(col) | ||
loc = self.index.get_loc(index) | ||
|
||
# setitem_inplace will do validation that may raise TypeError, | ||
# ValueError, or LossySetitemError | ||
series._mgr.setitem_inplace(loc, value) | ||
icol = self.columns.get_loc(col) | ||
iindex = self.index.get_loc(index) | ||
# error: Argument 2 to "column_setitem" of "BlockManager" has | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. might be unnecessary with a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I tried that before adding the ignore, but I don't really understand why it didn't work There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I suppose I didn't try before to only do the cast inside the takeable branch, because that now seems to work (note, I did |
||
# incompatible type "Union[Hashable, Sequence[Hashable]]"; | ||
# expected "Union[int, slice, ndarray[Any, Any]]" | ||
self._mgr.column_setitem(icol, iindex, value) # type: ignore[arg-type] | ||
self._clear_item_cache() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. could we make column_setitem always-inplace and make the clear_item_cache unecesssary? API-wise i think the always-inplace method is a lot nicer than the less predictable one There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. or maybe could make setitem_inplace ignore CoW since it is explicitly inplace? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The main use case of The case here is only for I could add a separate inplace version or add an inplace keyword to
Even something that is explicitly inplace from a usage perspective will need to take care of CoW in #46958 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I pushed a version with an There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jbrockmendel do you have a preference on keeping this change with the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually it seems we still need to catch TypeError, for example for a MultiIndex case where So only removed LossySetitemError for now (and added a test case for setting with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK, can you add a comment about why TypeError is needed. what about ValueError? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Will add a comment about TypeError. Now, this catching of a ValueError alraedy was here before, so I am hesitant to remove it without looking further in detail at it. I would prefer to leave that for another PR. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So there are some cases where setitem actually raises a ValueError, eg when setting with an array-like that is not of length 1 (in this case of scalar indexer There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For example:
Without catching/reraising in iloc, the error would be "ValueError: setting an array element with a sequence", which is slightly less informative. I suppose we should actually see to make this handling consistent in the different code paths (so it directly raises the more informative error message in the first place), but that's out of scope for this PR. |
||
|
||
except (KeyError, TypeError, ValueError, LossySetitemError): | ||
# set using a non-recursive method & reset the cache | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -52,7 +52,6 @@ | |
from pandas.core.indexers import ( | ||
check_array_indexer, | ||
is_empty_indexer, | ||
is_exact_shape_match, | ||
is_list_like_indexer, | ||
is_scalar_indexer, | ||
length_of_indexer, | ||
|
@@ -1936,42 +1935,31 @@ def _setitem_single_column(self, loc: int, value, plane_indexer): | |
""" | ||
pi = plane_indexer | ||
jbrockmendel marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
ser = self.obj._ixs(loc, axis=1) | ||
|
||
# perform the equivalent of a setitem on the info axis | ||
# as we have a null slice or a slice with full bounds | ||
# which means essentially reassign to the columns of a | ||
# multi-dim object | ||
# GH#6149 (null slice), GH#10408 (full bounds) | ||
if com.is_null_slice(pi) or com.is_full_slice(pi, len(self.obj)): | ||
ser = value | ||
self.obj._iset_item(loc, value) | ||
elif ( | ||
is_array_like(value) | ||
and is_exact_shape_match(ser, value) | ||
jorisvandenbossche marked this conversation as resolved.
Show resolved
Hide resolved
|
||
and len(value.shape) > 0 | ||
and self.obj.shape[0] == value.shape[0] | ||
and not is_empty_indexer(pi) | ||
): | ||
if is_list_like(pi): | ||
ser = value[np.argsort(pi)] | ||
value = value[np.argsort(pi)] | ||
else: | ||
# in case of slice | ||
ser = value[pi] | ||
value = value[pi] | ||
self.obj._iset_item(loc, value) | ||
else: | ||
# set the item, first attempting to operate inplace, then | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is this comment still accurate/helpful? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, not really I think (but also not before this PR). The logic of "first attempt to operate inplace, then falling back to casting if necessary" is contained within the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Now, it was added in 573c063, and from that point of view: I would say the comment is still as accurate as it was when it was added :) But I think we can also remove it now, it's the general behaviour of loc/iloc to fallback to casting There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Might be worth discussing the gameplan for after the deprecation inside _setitem_single_column is enforced. What I had in mind was that inside of doing the can_hold_element check as part of take_split_path, we would do a try/except falling back to the split path. Does that idea play nicely with what you're doing here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would think that enforcing the deprecation would mean to remove the checks for "null/full slice" and so stop using |
||
# falling back to casting if necessary; see | ||
# _whatsnew_130.notable_bug_fixes.setitem_column_try_inplace | ||
|
||
orig_values = ser._values | ||
ser._mgr = ser._mgr.setitem((pi,), value) | ||
|
||
if ser._values is orig_values: | ||
# The setitem happened inplace, so the DataFrame's values | ||
# were modified inplace. | ||
return | ||
self.obj._iset_item(loc, ser) | ||
return | ||
|
||
# reset the sliced object if unique | ||
self.obj._iset_item(loc, ser) | ||
self.obj._mgr.column_setitem(loc, plane_indexer, value) | ||
self.obj._clear_item_cache() | ||
|
||
def _setitem_single_block(self, indexer, value, name: str): | ||
""" | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -869,6 +869,20 @@ def iset( | |
self.arrays[mgr_idx] = value_arr | ||
return | ||
|
||
def column_setitem(self, loc: int, idx: int | slice | np.ndarray, value) -> None: | ||
""" | ||
Set values ("setitem") into a single column (not setting the full column). | ||
|
||
This is a method on the ArrayManager level, to avoid creating an | ||
intermediate Series at the DataFrame level (`s = df[loc]; s[idx] = value`) | ||
""" | ||
arr = self.arrays[loc] | ||
# create temporary SingleArrayManager without ref to use setitem implementation | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is the "without ref" here why you're not using .iget? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. looks like if you used .iget here then the method could be shared between AM/BM? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The "without ref" comment is from copy pasting this from the initial CoW PR #41878 (so should probably remove that here). But that is indeed the reason to not use |
||
mgr = SingleArrayManager([arr], [self._axes[0]]) | ||
new_mgr = mgr.setitem((idx,), value) | ||
# update existing ArrayManager in-place | ||
self.arrays[loc] = new_mgr.arrays[0] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the "into" in the docstring suggests that the setting should occur into the existing array, so we shouldn't need to set a new array. am i misunderstanding? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, that's right. But the So it certainly looks a bit confusing here, as it indeed seems that I am fully replacing the array for the column in question. In principle I could check if both arrays are identical, and only if that is not the case, do this There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In general I find it also a bit confusing in our internal API that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
yah, id be open to a name change (separate PR) for that. this is related to why i like setitem_inplace |
||
|
||
def insert(self, loc: int, item: Hashable, value: ArrayLike) -> None: | ||
""" | ||
Insert item at selected position. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1188,6 +1188,17 @@ def _iset_single( | |
self.blocks = new_blocks | ||
return | ||
|
||
def column_setitem(self, loc: int, idx: int | slice | np.ndarray, value) -> None: | ||
""" | ||
Set values ("setitem") into a single column (not setting the full column). | ||
|
||
This is a method on the BlockManager level, to avoid creating an | ||
intermediate Series at the DataFrame level (`s = df[loc]; s[idx] = value`) | ||
""" | ||
col_mgr = self.iget(loc) | ||
new_mgr = col_mgr.setitem((idx,), value) | ||
self.iset(loc, new_mgr._block.values, inplace=True) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i think you can use _setitem_single here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
|
||
def insert(self, loc: int, item: Hashable, value: ArrayLike) -> None: | ||
""" | ||
Insert item at selected position. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1082,7 +1082,7 @@ def test_setitem_partial_column_inplace(self, consolidate, using_array_manager): | |
tm.assert_numpy_array_equal(zvals, expected.values) | ||
assert np.shares_memory(zvals, df["z"]._values) | ||
if not consolidate: | ||
assert df["z"]._values is zvals | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See #47074 (comment) for comment about this removal |
||
assert df["z"]._values.base is zvals.base | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this check could pass if both .base attrs are None There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, good catch (will update that). Now, I also not fully sure why this change was actually needed, will see if I can figure that out. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK, so before we had this piece of code in indexing.py (that was removed in this PR):
That ensures that it is actually still an identical array that is stored in Now this goes through Since the only important guarantee is that it shares the same data (which is already tested on the line above), I am inclined to just remove this additional Alternatively, I could do a similar check in |
||
|
||
def test_setitem_duplicate_columns_not_inplace(self): | ||
# GH#39510 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this comment refer to column_setitem instead of just setitem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it also looks like the new method doesn't actually do the validation this comment refers to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an existing comment (originally a few lines below), but so I suppose this comment was actually already not up to date anymore.
So before this PR the comment was about
setitem_inplace
, and that also doesn't do any validation.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setitem_inplace calls np_can_hold_element, which raises on failure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, I see. That is doing validation that can raise the LossySetitemError. So the new
column_setitem
callssetitem
, which will also callnp_can_hold_element
, but there catching the LossySetitemError and coercing to the target dtype if needed.That is something that the loc/iloc fallback below otherwise will also do, so I suppose this change is OK (but the comment is then indeed no longer correct, and we also don't need to catch LossySetitemError here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yah, possibly also TypeError and ValueError