-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP [ArrayManager] API: setitem to set new columns / loc+iloc to update inplace #39578
WIP [ArrayManager] API: setitem to set new columns / loc+iloc to update inplace #39578
Conversation
isinstance(arr, np.ndarray) | ||
and is_datetime64_dtype(arr.dtype) | ||
and is_scalar(value) | ||
and isna(value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls use is_valid_nat_for_dtype, otherwise this will incorrectly let through td64 nat
(i know i know, this is just draft and youre not looking for comments like this, but this could easily fall through the cracks)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is certainly a comment that is useful ;) (I was actually planning to comment on this and ask you, because I already realized the same would need to be done for timedelta, that just didn't occur in the tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, in this case we explicitly want to include np.nan (which I think is_valid_nat_for_dtype
doesn't do?), since we allow setting np.nan or None in a datetime column, but a datetime64[ns] numpy array doesn't allow this (and the same for timedelta64)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which I think is_valid_nat_for_dtype doesn't do?
incorrect
# TODO what is this used for? | ||
# def setitem(self, indexer, value) -> ArrayManager: | ||
# return self.apply_with_block("setitem", indexer=indexer, value=value) | ||
def setitem(self, indexer, value, column_idx) -> ArrayManager: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Block.setitem isn't always inplace, its a try-inplace-fallback-to-cast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, but not in ArrayManager (I know I need to open an issue about that to discuss this in general -> #39584)
@@ -454,7 +471,7 @@ def is_mixed_type(self) -> bool: | |||
|
|||
@property | |||
def is_numeric_mixed_type(self) -> bool: | |||
return False | |||
return all(is_numeric_dtype(t) for t in self.get_dtypes()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the "mixed" be meaningful here, i.e. what if we're homogeneous dtype?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then it will also return True, like it does for BlockManager (this basically only wants to know that all columns are numeric, while not necessarily being all the same dtype)
tm.assert_frame_equal(df, exp) | ||
|
||
def test_loc_setitem_single_row_categorical(self): | ||
if using_array_manager: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this desired behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pending outcome of the discussion in #39584, the "a" column has integer dtype and should preserve that, so string categorical cannot be set
@@ -579,6 +596,8 @@ def test_setitem_cast(self, float_frame): | |||
float_frame["something"] = 2.5 | |||
assert float_frame["something"].dtype == np.float64 | |||
|
|||
@td.skip_array_manager_invalid_test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you comment on why the test is invalid? (maybe skip_array_manager_invalid_test could be callable with a "reason" keyword)
assert df["timestamp"].dtype == np.object_ | ||
assert df.loc["b", "timestamp"] == iNaT | ||
if not using_array_manager: | ||
# TODO(ArrayManager) setting iNaT in DatetimeArray actually sets NaT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one way to avoid this is be using ensure_wrapped_if_datetimelike in ArrayManager.setitem to ensure you have a DTA and never a ndarray[dt64]
elif isinstance(indexer, (ABCSeries, ABCIndex, np.ndarray, list)): | ||
if isinstance(indexer, list): | ||
elif isinstance(indexer, (ABCSeries, ABCIndex, np.ndarray, list, range)): | ||
if isinstance(indexer, (list, range)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the range case can be implemented separately + more efficiently, similar to the slice case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean that a range object should have been converted to a slice before calling length_of_indexer
?
At the moment it's simply the _setitem_with_indexer_split_path
case that calls that method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, but it's probably because of calling self.iloc._setitem_with_indexer(..)
in frame.py instead of self.iloc[..] = ..
that the conversion doesn't happen anymore (iloc.__setitem__
calls indexer = self._get_setitem_indexer(key)
before passing the indexer to _setitem_with_indexer
and the _get_setitem_indexer
converts a range object in a list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i was just thinking (indexer.start - indexer.stop) // indexer.step
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this not work?
""" | ||
DataFrame with 3 level MultiIndex (year, month, day) covering | ||
first 100 business days from 2000-01-01 with random data | ||
""" | ||
if using_array_manager: | ||
# TODO(ArrayManager) groupby | ||
pytest.skip("Not yet implemented for ArrayManager") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can the decorator be used for this?
@@ -1797,7 +1799,7 @@ def _setitem_with_indexer_frame_value(self, indexer, value: DataFrame, name: str | |||
|
|||
self._setitem_single_column(loc, val, pi) | |||
|
|||
def _setitem_single_column(self, loc: int, value, plane_indexer): | |||
def _setitem_single_column(self, loc: int, value, plane_indexer, overwrite=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
annotate
@@ -1728,7 +1730,7 @@ def _setitem_with_indexer_split_path(self, indexer, value, name: str): | |||
|
|||
# scalar value | |||
for loc in ilocs: | |||
self._setitem_single_column(loc, value, pi) | |||
self._setitem_single_column(loc, value, pi, overwrite=name == "setitem") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe define overwrite once earlier on?
@@ -1684,7 +1684,9 @@ def _setitem_with_indexer_split_path(self, indexer, value, name: str): | |||
|
|||
elif len(ilocs) == 1 and lplane_indexer == len(value) and not is_scalar(pi): | |||
# We are setting multiple rows in a single column. | |||
self._setitem_single_column(ilocs[0], value, pi) | |||
self._setitem_single_column( | |||
ilocs[0], value, pi, overwrite=name == "setitem" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont think we ever get name=="setitem" ATM, maybe update the docstring to describe what this indicates?
This has some overlap with #39510; would that solve this issue? |
@jorisvandenbossche is #40380 helpful for this? if not im going to mothball it |
Sorry for the slow reply here. I need to look into it again, but in any case, a part of the test changes of this PR was broken off and is already in master (#40323, #40325). Another part of the code changes here were related to making loc and iloc update in place (#39584) I need to look again at #40380 to know if it would be helpful for this. (I generally plan to look at the indexing code starting next week, for the copy-on-write exploration) |
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
this is quite old, happen to reopen if actively worked on. |
xref indexing work item of #39146
This actually started as trying to get the tests passing for
tests/frame/indexing
, but while doing that I needed to make decisions on what the behaviour should be for setitem / loc / iloc assignment (so related other ongoing discussions about this).Very much a draft, but a good way to see what we actually want / the consequences of those choices.
cc @jbrockmendel