Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INT: provide helpers for accessing the values of DataFrame columns #33252

Merged
merged 7 commits into from
Apr 10, 2020
35 changes: 32 additions & 3 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@
FrozenSet,
Hashable,
Iterable,
Iterator,
List,
Optional,
Sequence,
Expand All @@ -40,7 +41,16 @@
from pandas._config import get_option

from pandas._libs import algos as libalgos, lib, properties
from pandas._typing import Axes, Axis, Dtype, FilePathOrBuffer, Label, Level, Renamer
from pandas._typing import (
ArrayLike,
Axes,
Axis,
Dtype,
FilePathOrBuffer,
Label,
Level,
Renamer,
)
from pandas.compat import PY37
from pandas.compat._optional import import_optional_dependency
from pandas.compat.numpy import function as nv
Expand Down Expand Up @@ -2573,6 +2583,21 @@ def _ixs(self, i: int, axis: int = 0):

return result

def _get_column_array(self, i: int) -> ArrayLike:
"""
Get the values of the i'th column (ndarray or ExtensionArray, as stored
in the Block)
"""
return self._data.iget_values(i)

def _iter_column_arrays(self) -> Iterator[ArrayLike]:
"""
Iterate over the arrays of all columns in order.
This returns the values as stored in the Block (ndarray or ExtensionArray).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think another newline?

"""
for i in range(len(self.columns)):
yield self._get_column_array(i)

def __getitem__(self, key):
key = lib.item_from_zerodim(key)
key = com.apply_if_callable(key, self)
Expand Down Expand Up @@ -8031,8 +8056,12 @@ def _reduce(

assert filter_type is None or filter_type == "bool", filter_type

dtype_is_dt = self.dtypes.apply(
lambda x: is_datetime64_any_dtype(x) or is_period_dtype(x)
dtype_is_dt = np.array(
[
is_datetime64_any_dtype(values.dtype) or is_period_dtype(values.dtype)
for values in self._iter_column_arrays()
],
dtype=bool,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the perf issue here is in the self.dtypes call?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as much as im trying to avoid ._data, might be cleaner to use self._data.get_dtypes here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the perf issue here is in the self.dtypes call?

Both the self.dtypes that creates a Series, as the apply that creates another Series with inference.

as much as im trying to avoid ._data, might be cleaner to use self._data.get_dtypes here

Indeed, that would be even more appropriate here.

Now, I am happy to change it to that, but that's not really the core of this PR. I mainly wanted to add one useful case as an illustration, but mainly want the helper function for other PRs. I am happy to merge it without already using it somewhere as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as much as im trying to avoid ._data, might be cleaner to use self._data.get_dtypes here

When having a dataframe with only extension blocks, it's actually slower than [v.dtype for v in df._iter_arrays()] (for a dataframe with a 2D block it will of course be faster)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we're just going to call .any() on this anyway, could do the any up-front on the iterator, no big deal

)
if numeric_only is None and name in ["mean", "median"] and dtype_is_dt.any():
warnings.warn(
Expand Down
8 changes: 8 additions & 0 deletions pandas/core/internals/managers.py
Original file line number Diff line number Diff line change
Expand Up @@ -984,6 +984,14 @@ def iget(self, i: int) -> "SingleBlockManager":
self.axes[1],
)

def iget_values(self, i: int) -> ArrayLike:
"""
Return the data for column i as the values (ndarray or ExtensionArray).
"""
block = self.blocks[self.blknos[i]]
values = block.iget(self.blklocs[i])
return values

def idelete(self, indexer):
"""
Delete selected locations in-place (new block and array, same BlockManager)
Expand Down