BUG: DataFrame.quantile with NaNs (GH14357) #14536
np.percentile cannot handle a block with NaNs, and the masking approach only worked with regularly placed NaNs. Solution: when missing values are present, use np.nanpercentile when available; otherwise, use np.percentile applied along the axis.
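To illustrate the problem described above, here is a minimal sketch (not code from this PR) that compares np.percentile and np.nanpercentile on a 2D block with irregularly placed NaNs. The array contents are made up for illustration.

```python
import numpy as np

# Hypothetical 2D block: each row has a different number of NaNs,
# so dropping the NaNs cannot produce a rectangular array.
block = np.array([
    [1.0, np.nan, 3.0, 4.0],
    [np.nan, np.nan, 2.0, 6.0],
])

# np.percentile does not skip NaNs; on recent NumPy versions any row
# that contains a NaN yields NaN.
plain = np.percentile(block, 50, axis=1)

# np.nanpercentile (NumPy >= 1.9) ignores the NaNs row by row.
nan_aware = np.nanpercentile(block, 50, axis=1)

print(plain)
print(nan_aware)
```

Row-wise, the NaN-aware medians are computed from `[1, 3, 4]` and `[2, 6]` respectively, which is the behaviour DataFrame.quantile is expected to expose.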
Current coverage is 85.26% (diff: 86.95%)

@@             master   #14536   diff @@
==========================================
  Files           140      140
  Lines         50672    50685    +13
  Methods           0        0
  Messages          0        0
  Branches          0        0
==========================================
+ Hits          43208    43219    +11
- Misses         7464     7466     +2
  Partials          0        0
if self.ndim > 1:
    values = values.reshape(result_shape)
def _nanpercentile1D(values, mask, q, **kw):
    values = values[~mask]
from pandas.types.common import is_scalar
else:
    return np.array([self._na_value] * len(q),
                    dtype=values.dtype)
I might move some of this to pandas.core.nanops (though you might have to move slightly more, as that takes an axis arg). It's essentially what you are doing here, but in a slightly more general framework. Call it nanquantile (or nanpercentile).
The reason I did not put it there initially is that this is less general than the current functions in the nanops module. For example, I pass the mask alongside the values here because datetimelike values are already converted to integers at this point (where the NaTs are filled), since np.percentile cannot deal with datetime-like values.
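A small sketch of why the mask has to travel with the values, assuming NumPy datetime64 data viewed as int64 (the sample dates are invented, and this is not the PR's implementation): once the values are integers, NaT is indistinguishable from a real number, so the mask is the only record of what is missing.

```python
import numpy as np

# Datetime values with one missing entry (NaT).
dates = np.array(["2016-01-01", "NaT", "2016-01-03"], dtype="datetime64[ns]")

# Record the missing entries before converting to integers.
mask = np.isnat(dates)

# As int64, NaT is just a very negative integer; np.percentile would
# treat it as a real value if it were not masked out.
as_int = dates.view("i8")

# Compute on the valid integers only, then convert back to a datetime.
median_int = np.percentile(as_int[~mask], 50)
median = np.datetime64(int(median_int), "ns")

print(median)
```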
@jreback Opinion about this?
I meant move ALL of this; the nanops do everything (based on dtype) and are basically ufuncs per dtype. It's OK for now if you want to merge (to fix the bug), but let's open a new issue to move this code. All of the rest of it is there (for other ops). We don't do very much inside the block managers, mainly just assemble blocks; actual calculations are pushed to other routines (numpy or pandas).
Sounds good, I will open a new issue (and one for the failing empty cases as well).
Version 0.19.1
* tag 'v0.19.1': (43 commits)
  RLS: v0.19.1
  DOC: update whatsnew/release notes for 0.19.1 (pandas-dev#14573)
  [Backport pandas-dev#14545] BUG/API: Index.append with mixed object/Categorical indices (pandas-dev#14545)
  DOC: rst fixes
  [Backport pandas-dev#14567] DEPR: add deprecation warning for com.array_equivalent (pandas-dev#14567)
  [Backport pandas-dev#14551] PERF: casting loc to labels dtype before searchsorted (pandas-dev#14551)
  [Backport pandas-dev#14536] BUG: DataFrame.quantile with NaNs (GH14357) (pandas-dev#14536)
  [Backport pandas-dev#14520] BUG: don't close user-provided file handles in C parser (GH14418) (pandas-dev#14520)
  [Backport pandas-dev#14392] BUG: Dataframe constructor when given dict with None value (pandas-dev#14392)
  [Backport pandas-dev#14514] BUG: Don't parse inline quotes in skipped lines (pandas-dev#14514)
  [Bacport pandas-dev#14543] BUG: tseries ceil doc fix (pandas-dev#14543)
  [Backport pandas-dev#14541] DOC: Simplify the gbq integration testing procedure for contributors (pandas-dev#14541)
  [Backport pandas-dev#14527] BUG/ERR: raise correct error when sql driver is not installed (pandas-dev#14527)
  [Backport pandas-dev#14501] BUG: fix DatetimeIndex._maybe_cast_slice_bound for empty index (GH14354) (pandas-dev#14501)
  [Backport pandas-dev#14442] DOC: Expand on reference docs for read_json() (pandas-dev#14442)
  BLD: fix 3.4 build for cython to 0.24.1
  [Backport pandas-dev#14492] BUG: Accept unicode quotechars again in pd.read_csv
  [Backport pandas-dev#14496] BLD: Support Cython 0.25
  [Backport pandas-dev#14498] COMPAT/TST: fix test for range testing of negative integers to neg powers
  [Backport pandas-dev#14476] PERF: performance regression in Series.asof (pandas-dev#14476)
  ...
Closes #14357
I first used nanpercentile when available (np >= 1.9), but this can only handle numeric data, not datetimes. I therefore included a custom _nanpercentile.
I also added some tests for empty DataFrames/Series, commented out for now, as these are still failing (they also fail on 0.19.0).
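As a rough sketch of the kind of custom fallback described in this comment, the snippet below applies a masked 1D percentile to each row of a 2D array. The names percentile_1d and percentile_rows are hypothetical and chosen for illustration; this is not the PR's _nanpercentile, and it assumes float data with NaN as the missing-value marker.

```python
import numpy as np

def percentile_1d(values, mask, q):
    """Percentile of the unmasked entries of a 1D array.

    Returns NaN (shaped like q) when every entry is masked.
    """
    valid = values[~mask]
    if valid.size == 0:
        return np.full(np.shape(q), np.nan)
    return np.percentile(valid, q)

def percentile_rows(values, mask, q):
    """Apply percentile_1d to each row of a 2D array and its 2D mask."""
    return np.array([percentile_1d(v, m, q) for v, m in zip(values, mask)])

# Rows with one NaN, all NaNs, and no NaNs.
block = np.array([
    [1.0, np.nan, 3.0],
    [np.nan, np.nan, np.nan],
    [2.0, 4.0, 6.0],
])

# One output row per input row, one column per requested percentile.
result = percentile_rows(block, np.isnan(block), [25, 50])
print(result)
```

Because each row is reduced independently, rows may contain different numbers of missing values, which is the case the earlier masking approach could not handle.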