Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INT: provide helpers for accessing the values of DataFrame columns #33252

Merged
merged 7 commits into from
Apr 10, 2020
24 changes: 22 additions & 2 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@
FrozenSet,
Hashable,
Iterable,
Iterator,
List,
Optional,
Sequence,
Expand Down Expand Up @@ -2568,6 +2569,21 @@ def _ixs(self, i: int, axis: int = 0):

return result

def _ixs_values(self, i: int) -> Union[np.ndarray, ExtensionArray]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's name this _iget_values for consistency no? (ixs sounds like cross-section, e.g. rows which this is not)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no _get_values on DataFrame, so for DataFrame that gives no consistency, but of course more consistent with the block method. Anyway, I don't really have a preference, either way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Union[np.ndarray, ExtensionArray]" should be "ArrayLike"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel do you have a preference for the name?

"""
Get the values of the ith column (ndarray or ExtensionArray, as stored
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"ith" -> "i'th"

in the Block)
"""
return self._data.iget_values(i)

def _iter_arrays(self) -> Iterator[Union[np.ndarray, ExtensionArray]]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Union[np.ndarray, ExtensionArray]" -> "ArrayLike:"

"""
Iterate over the arrays of all columns in order.
This returns the values as stored in the Block (ndarray or ExtensionArray).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think another newline?

"""
for i in range(len(self.columns)):
yield self._ixs_values(i)

def __getitem__(self, key):
key = lib.item_from_zerodim(key)
key = com.apply_if_callable(key, self)
Expand Down Expand Up @@ -7926,8 +7942,12 @@ def _reduce(

assert filter_type is None or filter_type == "bool", filter_type

dtype_is_dt = self.dtypes.apply(
lambda x: is_datetime64_any_dtype(x) or is_period_dtype(x)
dtype_is_dt = np.array(
[
is_datetime64_any_dtype(values.dtype) or is_period_dtype(values.dtype)
for values in self._iter_arrays()
],
dtype=bool,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the perf issue here is in the self.dtypes call?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as much as im trying to avoid ._data, might be cleaner to use self._data.get_dtypes here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the perf issue here is in the self.dtypes call?

Both the self.dtypes that creates a Series, as the apply that creates another Series with inference.

as much as im trying to avoid ._data, might be cleaner to use self._data.get_dtypes here

Indeed, that would be even more appropriate here.

Now, I am happy to change it to that, but that's not really the core of this PR. I mainly wanted to add one useful case as an illustration, but mainly want the helper function for other PRs. I am happy to merge it without already using it somewhere as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as much as im trying to avoid ._data, might be cleaner to use self._data.get_dtypes here

When having a dataframe with only extension blocks, it's actually slower than [v.dtype for v in df._iter_arrays()] (for a dataframe with a 2D block it will of course be faster)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we're just going to call .any() on this anyway, could do the any up-front on the iterator, no big deal

)
if numeric_only is None and name in ["mean", "median"] and dtype_is_dt.any():
warnings.warn(
Expand Down
9 changes: 9 additions & 0 deletions pandas/core/internals/managers.py
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@
from pandas.core.dtypes.missing import isna

import pandas.core.algorithms as algos
from pandas.core.arrays import ExtensionArray
from pandas.core.arrays.sparse import SparseDtype
from pandas.core.base import PandasObject
from pandas.core.construction import extract_array
Expand Down Expand Up @@ -1001,6 +1002,14 @@ def iget(self, i: int) -> "SingleBlockManager":
self.axes[1],
)

def iget_values(self, i: int) -> Union[np.ndarray, ExtensionArray]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrayLike

"""
Return the data for column i as the values (ndarray or ExtensionArray).
"""
block = self.blocks[self.blknos[i]]
values = block.iget(self.blklocs[i])
return values

def delete(self, item):
"""
Delete selected item (items if non-unique) in-place.
Expand Down