-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
INT: provide helpers for accessing the values of DataFrame columns #33252
Changes from 2 commits
2afaaad
18203e8
b1434fe
8e29685
f472232
562635c
a25cac8
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -23,6 +23,7 @@ | |
FrozenSet, | ||
Hashable, | ||
Iterable, | ||
Iterator, | ||
List, | ||
Optional, | ||
Sequence, | ||
|
@@ -2568,6 +2569,21 @@ def _ixs(self, i: int, axis: int = 0): | |
|
||
return result | ||
|
||
def _ixs_values(self, i: int) -> Union[np.ndarray, ExtensionArray]: | ||
""" | ||
Get the values of the ith column (ndarray or ExtensionArray, as stored | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "ith" -> "i'th" |
||
in the Block) | ||
""" | ||
return self._data.iget_values(i) | ||
|
||
def _iter_arrays(self) -> Iterator[Union[np.ndarray, ExtensionArray]]: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "Union[np.ndarray, ExtensionArray]" -> "ArrayLike:" |
||
""" | ||
Iterate over the arrays of all columns in order. | ||
This returns the values as stored in the Block (ndarray or ExtensionArray). | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i think another newline? |
||
""" | ||
for i in range(len(self.columns)): | ||
yield self._ixs_values(i) | ||
|
||
def __getitem__(self, key): | ||
key = lib.item_from_zerodim(key) | ||
key = com.apply_if_callable(key, self) | ||
|
@@ -7926,8 +7942,12 @@ def _reduce( | |
|
||
assert filter_type is None or filter_type == "bool", filter_type | ||
|
||
dtype_is_dt = self.dtypes.apply( | ||
lambda x: is_datetime64_any_dtype(x) or is_period_dtype(x) | ||
dtype_is_dt = np.array( | ||
[ | ||
is_datetime64_any_dtype(values.dtype) or is_period_dtype(values.dtype) | ||
for values in self._iter_arrays() | ||
], | ||
dtype=bool, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the perf issue here is in the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. as much as im trying to avoid There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Both the
Indeed, that would be even more appropriate here. Now, I am happy to change it to that, but that's not really the core of this PR. I mainly wanted to add one useful case as an illustration, but mainly want the helper function for other PRs. I am happy to merge it without already using it somewhere as well. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
When having a dataframe with only extension blocks, it's actually slower than There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. since we're just going to call |
||
) | ||
if numeric_only is None and name in ["mean", "median"] and dtype_is_dt.any(): | ||
warnings.warn( | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -31,6 +31,7 @@ | |
from pandas.core.dtypes.missing import isna | ||
|
||
import pandas.core.algorithms as algos | ||
from pandas.core.arrays import ExtensionArray | ||
from pandas.core.arrays.sparse import SparseDtype | ||
from pandas.core.base import PandasObject | ||
from pandas.core.construction import extract_array | ||
|
@@ -1001,6 +1002,14 @@ def iget(self, i: int) -> "SingleBlockManager": | |
self.axes[1], | ||
) | ||
|
||
def iget_values(self, i: int) -> Union[np.ndarray, ExtensionArray]: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ArrayLike |
||
""" | ||
Return the data for column i as the values (ndarray or ExtensionArray). | ||
""" | ||
block = self.blocks[self.blknos[i]] | ||
values = block.iget(self.blklocs[i]) | ||
return values | ||
|
||
def delete(self, item): | ||
""" | ||
Delete selected item (items if non-unique) in-place. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's name this _iget_values for consistency no? (ixs sounds like cross-section, e.g. rows which this is not)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no
_get_values
on DataFrame, so for DataFrame that gives no consistency, but of course more consistent with the block method. Anyway, I don't really have a preference, either way.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Union[np.ndarray, ExtensionArray]" should be "ArrayLike"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel do you have a preference for the name?