BUG: DataFrame.quantile with NaNs (GH14357) #14536

jorisvandenbossche · 2016-10-29T14:22:57Z

I first used nanpercentile when available (np >= 1.9), but it can only handle numeric data, not datetimes, so I included a custom _nanpercentile.

I also added some tests for empty DataFrames/Series, commented out, as these are still failing (they also fail on 0.19.0).

np.percentile cannot handle a block with NaNs, and the masking approach only worked with regularly placed NaNs. Solution: when missing values are present, use np.nanpercentile when available; otherwise, use np.percentile applied along the axis.
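
The behaviour described above can be illustrated with a small sketch. This is not the code from this PR: the array, the helper name `percentile_1d`, and the row-by-row loop are assumptions chosen for illustration, and the NaNs are deliberately placed irregularly (a different count per row).

```python
import numpy as np

# Two rows with a different number of NaNs each ("irregularly placed").
values = np.array([[1.0, 2.0, np.nan, 4.0],
                   [np.nan, np.nan, 3.0, 5.0]])

# Plain np.percentile does not skip NaNs; they propagate into the result.
plain = np.percentile(values, 50, axis=1)

# np.nanpercentile ignores NaNs per row (numeric data only).
nan_aware = np.nanpercentile(values, 50, axis=1)


def percentile_1d(row, q):
    """Hypothetical fallback: drop NaNs from one row, then use np.percentile."""
    return np.percentile(row[~np.isnan(row)], q)


# Fallback applied along the axis, one row at a time.
row_wise = np.array([percentile_1d(row, 50) for row in values])
```

Both NaN-aware variants give one median per row computed from the non-missing values only, while `plain` contains NaN for any row that has a missing value.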

codecov-io · 2016-10-29T16:57:22Z

Current coverage is 85.26% (diff: 86.95%)

Merging #14536 into master will decrease coverage by <.01%

@@             master     #14536   diff @@
==========================================
  Files           140        140          
  Lines         50672      50685    +13   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43208      43219    +11   
- Misses         7464       7466     +2   
  Partials          0          0

Powered by Codecov. Last update 7f5a45c...cdd247b

jreback · 2016-10-30T13:54:45Z

pandas/core/internals.py

-            if self.ndim > 1:
-                values = values.reshape(result_shape)
+        def _nanpercentile1D(values, mask, q, **kw):
+            values = values[~mask]


from pandas.types.common import is_scalar

jreback · 2016-10-30T13:57:27Z

pandas/core/internals.py

+                else:
+                    return np.array([self._na_value] * len(q),
+                                    dtype=values.dtype)
+


I might move some of this to pandas.core.nanops (though you might have to move slightly more, as that takes an axis arg). It's essentially what you are doing here, but in a slightly more general framework. Call it nanquantile (or nanpercentile).

The reason I did not put it there initially is that this is less general than the current functions in the nanops module. For example, I pass the mask alongside the values here because datetimelike values have already been converted to integers at this point (with the NaTs filled), since np.percentile cannot deal with datetime-like values.
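
A minimal sketch of why the mask has to travel with the values, under stated assumptions: the helper `nanpercentile_1d` is hypothetical (it is not the `_nanpercentile1D` from the diff), it returns `np.nan` for an all-missing input where the PR uses the block's NA value, and the dates are made up.

```python
import numpy as np


def nanpercentile_1d(values, mask, q):
    """Hypothetical helper: percentile of `values`, skipping entries where `mask` is True."""
    valid = values[~mask]
    if len(valid) == 0:
        # Assumption for this sketch; the PR returns the block's NA value here.
        return np.nan
    return np.percentile(valid, q)


dates = np.array(["2016-10-28", "NaT", "2016-10-30"], dtype="datetime64[ns]")

# The mask must be computed before the integer view: afterwards NaT is just
# a very large negative integer and is indistinguishable from real data.
mask = np.isnat(dates)
as_int = dates.view("i8")

result_int = nanpercentile_1d(as_int, mask, 50)
result = np.datetime64(int(result_int), "ns")
```

Here `result` is the midpoint of the two valid timestamps. Without the mask, the integer that stands in for NaT would be treated as an ordinary value and would distort the percentile.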

@jreback Opinion about this?

I meant move ALL of this; the nanops do everything (based on dtype) and are basically per-dtype ufuncs. It's OK for now if you want to merge (to fix the bug), but let's open a new issue to move this code. All of the rest of it is there (for other ops). We don't do very much inside the block managers, mainly just assemble blocks; actual calculations are pushed to other routines (numpy or pandas).

Sounds good, I will open a new issue (and one for the failing empty ones as well).

(cherry picked from commit 52f31d4)

Version 0.19.1

* tag 'v0.19.1': (43 commits)
  RLS: v0.19.1
  DOC: update whatsnew/release notes for 0.19.1 (pandas-dev#14573)
  [Backport pandas-dev#14545] BUG/API: Index.append with mixed object/Categorical indices (pandas-dev#14545)
  DOC: rst fixes
  [Backport pandas-dev#14567] DEPR: add deprecation warning for com.array_equivalent (pandas-dev#14567)
  [Backport pandas-dev#14551] PERF: casting loc to labels dtype before searchsorted (pandas-dev#14551)
  [Backport pandas-dev#14536] BUG: DataFrame.quantile with NaNs (GH14357) (pandas-dev#14536)
  [Backport pandas-dev#14520] BUG: don't close user-provided file handles in C parser (GH14418) (pandas-dev#14520)
  [Backport pandas-dev#14392] BUG: Dataframe constructor when given dict with None value (pandas-dev#14392)
  [Backport pandas-dev#14514] BUG: Don't parse inline quotes in skipped lines (pandas-dev#14514)
  [Bacport pandas-dev#14543] BUG: tseries ceil doc fix (pandas-dev#14543)
  [Backport pandas-dev#14541] DOC: Simplify the gbq integration testing procedure for contributors (pandas-dev#14541)
  [Backport pandas-dev#14527] BUG/ERR: raise correct error when sql driver is not installed (pandas-dev#14527)
  [Backport pandas-dev#14501] BUG: fix DatetimeIndex._maybe_cast_slice_bound for empty index (GH14354) (pandas-dev#14501)
  [Backport pandas-dev#14442] DOC: Expand on reference docs for read_json() (pandas-dev#14442)
  BLD: fix 3.4 build for cython to 0.24.1
  [Backport pandas-dev#14492] BUG: Accept unicode quotechars again in pd.read_csv
  [Backport pandas-dev#14496] BLD: Support Cython 0.25
  [Backport pandas-dev#14498] COMPAT/TST: fix test for range testing of negative integers to neg powers
  [Backport pandas-dev#14476] PERF: performance regression in Series.asof (pandas-dev#14476)
  ...

jorisvandenbossche added 3 commits October 28, 2016 18:08

BUG: DataFrame.quantile with NaNs (GH14357)

4b5a766

np.percentile cannot handle a block with NaNs, and the masking approach only worked with regularly placed NaNs. Solution: when missing values are present, use np.nanpercentile when available, otherwise use np.percentile applied along the axis

deal with empty / all NaN

1c646d7

deal with non-consolidatable difference in ndim

baa7b84

jorisvandenbossche added the Bug, Missing-data, and Numeric Operations labels Oct 29, 2016

jorisvandenbossche added this to the 0.19.1 milestone Oct 29, 2016

jreback reviewed Oct 30, 2016

use types.common.is_scalar instead of lib.isscalar

cdd247b

jorisvandenbossche merged commit 52f31d4 into pandas-dev:master Nov 2, 2016

This was referenced Nov 2, 2016

CLN: move nanpercentile functionality to nanops #14562

Closed

BUG: inconsistencies/errors in quantile on empty DataFrame #14564

Closed

jorisvandenbossche added a commit that referenced this pull request Nov 3, 2016

[Backport #14536] BUG: DataFrame.quantile with NaNs (GH14357) (#14536)

4c42422

(cherry picked from commit 52f31d4)

jreback mentioned this pull request Feb 20, 2017

Quantile fails when only NaNs on some rows/columns #15460

Closed

jbrockmendel mentioned this pull request Nov 13, 2021

TST: FIXMES in DataFrame.quantile tests #44437

Merged

4 tasks

jorisvandenbossche deleted the bug-quantile-nan2 branch November 15, 2021 07:32

rob-sil mentioned this pull request Aug 4, 2024

BUG: Handle floating point boundaries in qcut #59409

Merged

5 tasks
