Parametrized NA sentinel for factorize #20473

TomAugspurger · 2018-03-23T19:12:53Z

Adds a new keyword na_value to control the NA sentinel inside the factorize
routine.

In [3]: arr = np.array([0, 1, 0, 2], dtype='u8')

In [4]: pd.factorize(arr)
Out[4]: (array([0, 1, 0, 2]), array([0, 1, 2], dtype=uint64))

In [5]: pd.factorize(arr, na_value=0)
Out[5]: (array([-1,  0, -1,  1]), array([1, 2], dtype=uint64))

The basic idea is that our hashtables now have two "modes" for detecting NA values

The previous way, based on the templated na_condition
Equality to a user-specified NA value.

Note: specifying NA value does not "turn off" the old way of checking NAs.

In [2]: pd.factorize(np.array([0.0, np.nan, 1.0]), na_value=0.0)
Out[2]: (array([-1, -1,  0]), array([1.]))

I'm sure the implementation can be cleaned up, but I wanted to put this up for feedback. Hopefully someone can tell me how to avoid the use_default_na nonsense :)

cc @WillAyd

closes #20328

Adds a new keyword `na_value` to control the NA sentinel inside the factorize routine. ```python In [3]: arr = np.array([0, 1, 0, 2], dtype='u8') In [4]: pd.factorize(arr) Out[4]: (array([0, 1, 0, 2]), array([0, 1, 2], dtype=uint64)) In [5]: pd.factorize(arr, na_value=0) Out[5]: (array([-1, 0, -1, 1]), array([1, 2], dtype=uint64)) ```

TomAugspurger · 2018-03-23T19:22:54Z

pandas/_libs/hashtable_class_helper.pxi.in


 cdef class {{name}}HashTable(HashTable):

-    def __cinit__(self, size_hint=1):
+    def __cinit__(self, size_hint=1, {{dtype}}_t na_value={{default_na_value}},
+                  bint use_na_value=False):


I wonder, should this just be moved to get_labels?

WillAyd · 2018-03-23T20:03:35Z

pandas/_libs/hashtable_class_helper.pxi.in

        self.table = kh_init_{{dtype}}()
+        self.na_value = na_value
+        self.use_na_value = use_na_value


Do you even need use_na_value at all? Conceptually can't you just rely on the existing check_null implementation and simply use whatever na_value gets set to (whether default or provided by user)?

I'm not sure :) Without use_na_value, wouldn't that break

pd.factorize(np.array([0, 1], dtype='u8'))

On master the labels [0, 1]. But without use_na_value, we'd see that 0 is equal to the default UInt64HashTable na_value of 0, so the labels would be [-1, 0].

I'll try that now to make sure.

Hmm OK I see. I'll say this with the caveat that I haven't used factorize before, but does it even make sense to have a NA value for boolean types? Wouldn't the presence of NA force that conversion to float?

I think the best use-case to focus on is Categorical. In its codes, missing values are -1. So we'd like to pass an array of int64 to factorize, and have it think that -1 is NA.

We avoid NaN for datetime / categorical for that reason: it would force them to float. For the Int64HashTable, we use iNaT (the smallest int) as the missing value sentinel.

Ah OK thanks - that clears things up a lot. Shouldn't the default missing value for the type then be -1 instead of 0?

For Int64 or UInt64? If it's UInt64 it'll have to be 0, since -1 is negative :)

I need to better describe default_na_value. It's never actually used, since the only way it could be used is if the user explicitly passes na_value. Essentially, I want

cdef f(int64_t x=None): if x is not None: ...

But in a way that makes cython happy. There's probably a way to do that...

Good point on the -1 with the unsigned int - you'll have to excuse me (it's Friday).

So perhaps one thing to consider is instead of declaring x to be int64_t in your example (or whatever type is dictated by the template) you could keep it as an object in the signature. In the cdef block add the declaration for your variable, and then do your if x is not None construct in the body before the nogil, where within that conditional you could cast the object to the particular type and assign it to the variable you declared in the cdef section.

Kind of wordy so let me know if that's not clear

Something like

diff --git a/pandas/_libs/hashtable_class_helper.pxi.in b/pandas/_libs/hashtable_class_helper.pxi.in index af60eb4b1..07bdad241 100644 --- a/pandas/_libs/hashtable_class_helper.pxi.in +++ b/pandas/_libs/hashtable_class_helper.pxi.in @@ -409,26 +409,32 @@ cdef class {{name}}HashTable(HashTable): def get_labels(self, {{dtype}}_t[:] values, {{name}}Vector uniques, Py_ssize_t count_prior, Py_ssize_t na_sentinel, bint check_null=True, - {{dtype}}_t na_value={{default_na_value}}, - bint use_na_value=False): + object na_value=None): cdef: Py_ssize_t i, n = len(values) int64_t[:] labels Py_ssize_t idx, count = count_prior int ret = 0 - {{dtype}}_t val + {{dtype}}_t val, na_value2 khiter_t k {{name}}VectorData *ud + bint use_na_value labels = np.empty(n, dtype=np.int64) ud = uniques.data + use_na_value = na_value is not None + + if use_na_value: + na_value2 = <{{dtype}}_t>na_value + else: + na_value2 = {{default_na_value}} with nogil: for i in range(n): val = values[i] if ((check_null and {{null_condition}}) or - (use_na_value and val == na_value)): + (use_na_value and val == na_value2)): labels[i] = na_sentinel continue diff --git a/pandas/core/algorithms.py b/pandas/core/algorithms.py index bbf60c2ee..e668b4748 100644 --- a/pandas/core/algorithms.py +++ b/pandas/core/algorithms.py @@ -456,16 +456,9 @@ def _factorize_array(values, check_nulls, na_sentinel=-1, size_hint=None, """ (hash_klass, vec_klass), values = _get_data_algo(values, _hashtables) - use_na_value = na_value is not None - kwargs = dict(use_na_value=use_na_value) - - if use_na_value: - kwargs['na_value'] = na_value - table = hash_klass(size_hint or len(values)) uniques = vec_klass() - labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls, - **kwargs) + labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls, na_value=na_value) labels = _ensure_platform_int(labels) uniques = uniques.to_array()

seems to work. Is that what you had in mind?

Just to be a little more explicit something along the lines of

cdef f(object x=None): cdef: {{template_type}} x_val if x is not None: x_val = <{{template_type}}> x ... with nogil: ...

Should still work because you aren't trying to access any Python objects within the nogil - the compiler should know that x_val is of type {{template_type}} and work with that accordingly, even though that value doesn't get defined until runtime

Replied at about the same time but yes that's what I was thinking. Good to hear it works

WillAyd · 2018-03-23T20:04:15Z

pandas/_libs/hashtable_class_helper.pxi.in

-          ('Int64', 'int64', 'val == iNaT', False)]
+# name, dtype, null_condition, float_group, default_na_value
+dtypes = [('Float64', 'float64', 'val != val', True, 'nan'),
+          ('UInt64', 'uint64', 'False', False, 0),


I realize you didn't set this, but is the uint64 track for bool hashing? If so, any reason why we don't use uint8 to save space?

WillAyd · 2018-03-23T20:07:47Z

pandas/_libs/hashtable_class_helper.pxi.in


        with nogil:
            for i in range(n):
                val = values[i]

-                if check_null and {{null_condition}}:
+                if ((check_null and {{null_condition}}) or


Related to above comment, is it not possible to do something like

if (check_null and (val != val or val == na_value)):

To still cover all of the scenarios? (borrowed the above logic from groupby_helper)

codecov · 2018-03-23T20:37:12Z

Codecov Report

Merging #20473 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #20473   +/-   ##
=======================================
  Coverage   91.82%   91.82%           
=======================================
  Files         152      152           
  Lines       49235    49235           
=======================================
  Hits        45212    45212           
  Misses       4023     4023

Flag	Coverage Δ
#multiple	`90.21% <100%> (ø)`	⬆️
#single	`41.89% <87.5%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/dtypes/missing.py	`91.07% <ø> (ø)`	⬆️
pandas/core/arrays/categorical.py	`96.19% <100%> (-0.02%)`	⬇️
pandas/core/algorithms.py	`94.31% <100%> (+0.01%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6c0c277...465d458. Read the comment docs.

TomAugspurger · 2018-03-23T21:48:27Z

Thanks to @WillAyd we've cleaned up some of the hackiness.

Now I'd like to see if we can remove some of the strangeness around the two ways of having nulls, {{null_condition}} or a user-specified NA value.

And more tests.

WillAyd · 2018-03-23T23:21:34Z

@TomAugspurger did the below not work to consolidate the "two null methods"?

if (check_null and (val != val or val == na_value))

If so we could remove null_condition from the templating altogether (would make it more consistent with groupby). I don't think you need to have use_na_value as part of the conditional because that should be setting the appropriate na_value earlier in the method (maybe?)

Removed null_condition from the template. Had Categorical use na_value

TomAugspurger · 2018-03-24T01:30:36Z

👍 I had to modify check_null to be True whenever na_value is specified.

I want to add more tests around the behavior of specifying an na_value and existing NA values (NaN, NaT, None, etc.). Do we agree with the following behavior?

In [17]: pd.factorize([0.0, np.nan, 1.0], na_value=0.0)
Out[17]: (array([-1, -1,  0]), array([1.]))

i.e. NaN will always be NA. This is currently broken for NaT I think.

In [19]: pd.factorize(pd.DatetimeIndex([pd.Timestamp(0), pd.Timestamp(1), None]), na_value=0)
Out[19]:
(array([-1,  0,  1]),
 DatetimeIndex(['1970-01-01 00:00:00.000000001', 'NaT'], dtype='datetime64[ns]', freq=None))

That should be [-1, 0, -1] for the labels I think...

WillAyd · 2018-03-24T02:38:11Z

I actually think the first example should raise a ValueError, if only because I don't see a reason why a person would intentionally provide a na_value in an array that contains NA. The reasoning for that is unclear and I think it would be better to signal to the user that something may be awry (they probably wanted to fill the missing value first?)

For the second example, I also would prefer a ValueError gets thrown when the type of na_value does not match the type of the values being factorized (not applicable for object dtypes). pd.Timestamp(0) == 0 is False so I wouldn't expect factorize to roll its own equality comp.

Ignoring that, depending on how you feel about my stance on example one the second example should either raise a ValueError or go [-1, 0, -1] as you mentioned. Just my $.02

TomAugspurger · 2018-03-24T11:30:03Z

Thinking more on this, I think we should not make the na_value parameter public to end users. I think it should only be available in the private _factorize_array method. Two reasons

Some awkwardness around calling pd.factorize(Categorical, na_value=foo). That "works", but the na_value is ignored since the chain is pd.factorize -> Categorical.factorize -> pd.factorize, and Categorical.factorize doesn't take na_value so it's dropped.
The only concrete use-cases we have is extension arrays (including internal ones). In this case, we won't be mixing multiple missing markers. We can either tell EA authors that _factorize_array is public to them, or just ask them "what's your NA value" and pass it to _factorize_array in the default implementation for them.

TomAugspurger · 2018-03-24T11:54:56Z

Updated to move the parameter just to _factorize_array, and not the public factorize. I've tested a variant of #20361 built on this. It works well for JSONArray, IPArray (object hashtable) and MACArray (uint64 hashtable). And of course Categorical, which I've changed here to use na_value.

…orize-na-value

Added PyObject hashtable test

jreback · 2018-03-24T22:18:08Z

pandas/core/algorithms.py

@@ -445,19 +446,27 @@ def _factorize_array(values, check_nulls, na_sentinel=-1, size_hint=None):
    values : ndarray
    check_nulls : bool
        Whether to check for nulls in the hashtable's 'get_labels' method.
+        Nulls are always checked when `na_value` is specified.


if you are going to add this kwarg (which is ok), then remove check_nulls which is redundant.

Thought I posted a comment about that, but I guess I deleted it.

AFAICT check_nulls is only useful for DatetimeIndex, so that these two are different?

In [2]: pd.factorize([2**-63, 0]) Out[2]: (array([0, 1]), array([1.08420217e-19, 0.00000000e+00])) In [3]: pd.factorize(pd.DatetimeIndex([None, 0])) Out[3]: (array([-1, 0]), DatetimeIndex(['1970-01-01'], dtype='datetime64[ns]', freq=None))

Once DatetimeIndex is an ExtensionArray, pd.factorize will dispatch to DatetimeIndex.factorize, which can call _factorize_array(..., na_value=iNaT). I think I can do this now, it just makes the logic a bit ugly. Will give it a shot.

Ahh, may not be able to remove check_nulls entirely. I needed it for factorizing boolean arrays. might still be able to... Anyway, this is what I had in mind:

diff --git a/pandas/core/algorithms.py b/pandas/core/algorithms.py index bf192cdb2..21ec17054 100644 --- a/pandas/core/algorithms.py +++ b/pandas/core/algorithms.py @@ -511,7 +511,7 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None): values = _ensure_arraylike(values) original = values - if is_categorical_dtype(values): + if is_categorical_dtype(values) or is_datetime64_any_dtype(values): values = getattr(values, '_values', values) labels, uniques = values.factorize() dtype = original.dtype diff --git a/pandas/core/indexes/datetimelike.py b/pandas/core/indexes/datetimelike.py index b906ea0f4..bd33828f2 100644 --- a/pandas/core/indexes/datetimelike.py +++ b/pandas/core/indexes/datetimelike.py @@ -984,6 +984,13 @@ class DatetimeIndexOpsMixin(object): return algorithms.isin(self.asi8, values.asi8) + def factorize(self, sort=False, na_sentinel=-1): + from pandas.core.algorithms import _factorize_array + l, u = _factorize_array(self, True, na_sentinel=na_sentinel, + na_value=iNaT) + u = self._constructor(u) + return l, u + def shift(self, n, freq=None): """ Specialized shift which produces a DatetimeIndex

Will have to fix it to check for timedeltaindex too, handle sorting, but not too bad. Worth doing here, or a separate PR?

Got it. Will be able to remove check_nulls entirely without too much effort.

TomAugspurger · 2018-03-25T00:11:17Z

Removed check_nulls entirely.

I removed DatetimeIndexOpsMixin.factorize and put the logic in pandas.factorize. It has to go there so that we correctly handle ndarray[datetime64ns].

jreback · 2018-03-25T13:45:18Z

pandas/_libs/hashtable_class_helper.pxi.in

-dtypes = [('Float64', 'float64', 'val != val', True),
-          ('UInt64', 'uint64', 'False', False),
-          ('Int64', 'int64', 'val == iNaT', False)]
+# name, dtype, float_group, default_na_value


when is float_group used? seems superfluous?

Seems to be used in unique.

jreback · 2018-03-25T13:47:28Z

pandas/core/algorithms.py

@@ -508,10 +513,17 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
        dtype = original.dtype
    else:
        values, dtype, _ = _ensure_data(values)
-        check_nulls = not is_integer_dtype(original)
-        labels, uniques = _factorize_array(values, check_nulls,
+


use pandas.core.dtypes.missing.na_value_for_dtype (or maybe add a kwarg to return the underlying value). just want to try to keep this logic in 1 place.

Do you know why na_value_for_dtype(PeriodDtype) is nan? It should be NaT, right?

…orize-na-value

TomAugspurger · 2018-03-26T11:58:28Z

I'm running the tests locally with na_value_for_dtype(Period) returning NaT. If that passes I'll change factorize to use it.

TomAugspurger · 2018-03-26T13:09:24Z

@jreback are you OK with this now, including the changes to na_value_for_dtype(Period)? Would like to get this merged so I can finish of factorize, and make a PR for groupby.

TomAugspurger · 2018-03-26T16:06:04Z

Ran the ASV, and nothing stood out.

jreback · 2018-03-27T10:25:26Z

pandas/tests/dtypes/test_missing.py

+    (np.dtype("M8[ns]"), NaT),
+    (np.dtype("m8[ns]"), NaT),
+    (DatetimeTZDtype('datetime64[ns, US/Eastern]'), NaT),
+    (PeriodDtype("M"), NaT),


jreback · 2018-03-27T10:28:28Z

thanks @TomAugspurger

TomAugspurger added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Mar 23, 2018

TomAugspurger commented Mar 23, 2018

View reviewed changes

REF: Moved to get_labels

3c18428

WillAyd reviewed Mar 23, 2018

View reviewed changes

Remove python-level use_na_value

703ab8a

REF: More cleanup

ab32e0f

Removed null_condition from the template. Had Categorical use na_value

API: Make it non-public

62fa538

TomAugspurger changed the title ~~[WIP]: Parametrized NA sentinel for factorize~~ Parametrized NA sentinel for factorize Mar 24, 2018

Revert formatting changes in pxd

28fad50

TomAugspurger mentioned this pull request Mar 24, 2018

ENH/API: ExtensionArray.factorize #20361

Merged

TomAugspurger added 5 commits March 24, 2018 07:00

linting

8580754

Handle bool

cf14ee1

Merge remote-tracking branch 'upstream/master' into parametrized-fact…

8141131

…orize-na-value

Specify dtypes

a23d451

Remove unused variable.

b25f3d4

Added PyObject hashtable test

jreback requested changes Mar 24, 2018

View reviewed changes

REF: Removed check_nulls

dfcda85

jreback requested changes Mar 25, 2018

View reviewed changes

TomAugspurger added 2 commits March 26, 2018 06:37

BUG: NaT for period

eaff342

Merge remote-tracking branch 'upstream/master' into parametrized-fact…

c05c807

…orize-na-value

Other hashtable

e786253

na_value_for_dtype PeriodDtype

465d458

TomAugspurger mentioned this pull request Mar 26, 2018

Fixed factorize for MACArray ContinuumIO/cyberpandas#13

Merged

jreback reviewed Mar 27, 2018

View reviewed changes

jreback added this to the 0.23.0 milestone Mar 27, 2018

jreback approved these changes Mar 27, 2018

View reviewed changes

jreback merged commit 1b1b99b into pandas-dev:master Mar 27, 2018

javadnoorb pushed a commit to javadnoorb/pandas that referenced this pull request Mar 29, 2018

Parametrized NA sentinel for factorize (pandas-dev#20473)

a346778

dworvos pushed a commit to dworvos/pandas that referenced this pull request Apr 2, 2018

Parametrized NA sentinel for factorize (pandas-dev#20473)

c905e4a

kornilova203 pushed a commit to kornilova203/pandas that referenced this pull request Apr 23, 2018

Parametrized NA sentinel for factorize (pandas-dev#20473)

970f5d5

TomAugspurger deleted the parametrized-factorize-na-value branch May 2, 2018 13:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parametrized NA sentinel for factorize #20473

Parametrized NA sentinel for factorize #20473

TomAugspurger commented Mar 23, 2018 •

edited

Loading

TomAugspurger Mar 23, 2018

WillAyd Mar 23, 2018

TomAugspurger Mar 23, 2018

WillAyd Mar 23, 2018

TomAugspurger Mar 23, 2018 •

edited

Loading

WillAyd Mar 23, 2018

TomAugspurger Mar 23, 2018

WillAyd Mar 23, 2018

TomAugspurger Mar 23, 2018

WillAyd Mar 23, 2018

WillAyd Mar 23, 2018

WillAyd Mar 23, 2018 •

edited

Loading

WillAyd Mar 23, 2018 •

edited

Loading

codecov bot commented Mar 23, 2018 •

edited

Loading

TomAugspurger commented Mar 23, 2018

WillAyd commented Mar 23, 2018

TomAugspurger commented Mar 24, 2018

WillAyd commented Mar 24, 2018

TomAugspurger commented Mar 24, 2018

TomAugspurger commented Mar 24, 2018

jreback Mar 24, 2018 •

edited

Loading

TomAugspurger Mar 24, 2018

TomAugspurger Mar 24, 2018 •

edited

Loading

TomAugspurger Mar 24, 2018

TomAugspurger commented Mar 25, 2018

jreback Mar 25, 2018

TomAugspurger Mar 26, 2018

jreback Mar 25, 2018

TomAugspurger Mar 26, 2018

TomAugspurger commented Mar 26, 2018

TomAugspurger commented Mar 26, 2018

TomAugspurger commented Mar 26, 2018

jreback Mar 27, 2018

jreback commented Mar 27, 2018

Parametrized NA sentinel for factorize #20473

Parametrized NA sentinel for factorize #20473

Conversation

TomAugspurger commented Mar 23, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger Mar 23, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

WillAyd Mar 23, 2018 • edited Loading

Choose a reason for hiding this comment

WillAyd Mar 23, 2018 • edited Loading

Choose a reason for hiding this comment

codecov bot commented Mar 23, 2018 • edited Loading

Codecov Report

TomAugspurger commented Mar 23, 2018

WillAyd commented Mar 23, 2018

TomAugspurger commented Mar 24, 2018

WillAyd commented Mar 24, 2018

TomAugspurger commented Mar 24, 2018

TomAugspurger commented Mar 24, 2018

jreback Mar 24, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger Mar 24, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger commented Mar 25, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger commented Mar 26, 2018

TomAugspurger commented Mar 26, 2018

TomAugspurger commented Mar 26, 2018

Choose a reason for hiding this comment

jreback commented Mar 27, 2018

TomAugspurger commented Mar 23, 2018 •

edited

Loading

TomAugspurger Mar 23, 2018 •

edited

Loading

WillAyd Mar 23, 2018 •

edited

Loading

WillAyd Mar 23, 2018 •

edited

Loading

codecov bot commented Mar 23, 2018 •

edited

Loading

jreback Mar 24, 2018 •

edited

Loading

TomAugspurger Mar 24, 2018 •

edited

Loading