Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parametrized NA sentinel for factorize #20473

Merged

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Mar 23, 2018

Adds a new keyword na_value to control the NA sentinel inside the factorize
routine.

In [3]: arr = np.array([0, 1, 0, 2], dtype='u8')

In [4]: pd.factorize(arr)
Out[4]: (array([0, 1, 0, 2]), array([0, 1, 2], dtype=uint64))

In [5]: pd.factorize(arr, na_value=0)
Out[5]: (array([-1,  0, -1,  1]), array([1, 2], dtype=uint64))

The basic idea is that our hashtables now have two "modes" for detecting NA values

  1. The previous way, based on the templated na_condition
  2. Equality to a user-specified NA value.

Note: specifying NA value does not "turn off" the old way of checking NAs.

In [2]: pd.factorize(np.array([0.0, np.nan, 1.0]), na_value=0.0)
Out[2]: (array([-1, -1,  0]), array([1.]))

I'm sure the implementation can be cleaned up, but I wanted to put this up for feedback. Hopefully someone can tell me how to avoid the use_default_na nonsense :)

cc @WillAyd

closes #20328

Adds a new keyword `na_value` to control the NA sentinel inside the factorize
routine.

```python
In [3]: arr = np.array([0, 1, 0, 2], dtype='u8')

In [4]: pd.factorize(arr)
Out[4]: (array([0, 1, 0, 2]), array([0, 1, 2], dtype=uint64))

In [5]: pd.factorize(arr, na_value=0)
Out[5]: (array([-1,  0, -1,  1]), array([1, 2], dtype=uint64))
```
@TomAugspurger TomAugspurger added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Mar 23, 2018

cdef class {{name}}HashTable(HashTable):

def __cinit__(self, size_hint=1):
def __cinit__(self, size_hint=1, {{dtype}}_t na_value={{default_na_value}},
bint use_na_value=False):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, should this just be moved to get_labels?

self.table = kh_init_{{dtype}}()
self.na_value = na_value
self.use_na_value = use_na_value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you even need use_na_value at all? Conceptually can't you just rely on the existing check_null implementation and simply use whatever na_value gets set to (whether default or provided by user)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure :) Without use_na_value, wouldn't that break

pd.factorize(np.array([0, 1], dtype='u8'))

On master the labels [0, 1]. But without use_na_value, we'd see that 0 is equal to the default UInt64HashTable na_value of 0, so the labels would be [-1, 0].

I'll try that now to make sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm OK I see. I'll say this with the caveat that I haven't used factorize before, but does it even make sense to have a NA value for boolean types? Wouldn't the presence of NA force that conversion to float?

Copy link
Contributor Author

@TomAugspurger TomAugspurger Mar 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best use-case to focus on is Categorical. In its codes, missing values are -1. So we'd like to pass an array of int64 to factorize, and have it think that -1 is NA.

We avoid NaN for datetime / categorical for that reason: it would force them to float. For the Int64HashTable, we use iNaT (the smallest int) as the missing value sentinel.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK thanks - that clears things up a lot. Shouldn't the default missing value for the type then be -1 instead of 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Int64 or UInt64? If it's UInt64 it'll have to be 0, since -1 is negative :)

I need to better describe default_na_value. It's never actually used, since the only way it could be used is if the user explicitly passes na_value. Essentially, I want

cdef f(int64_t x=None):
    if x is not None:
        ...

But in a way that makes cython happy. There's probably a way to do that...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point on the -1 with the unsigned int - you'll have to excuse me (it's Friday).

So perhaps one thing to consider is instead of declaring x to be int64_t in your example (or whatever type is dictated by the template) you could keep it as an object in the signature. In the cdef block add the declaration for your variable, and then do your if x is not None construct in the body before the nogil, where within that conditional you could cast the object to the particular type and assign it to the variable you declared in the cdef section.

Kind of wordy so let me know if that's not clear

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like

diff --git a/pandas/_libs/hashtable_class_helper.pxi.in b/pandas/_libs/hashtable_class_helper.pxi.in
index af60eb4b1..07bdad241 100644
--- a/pandas/_libs/hashtable_class_helper.pxi.in
+++ b/pandas/_libs/hashtable_class_helper.pxi.in
@@ -409,26 +409,32 @@ cdef class {{name}}HashTable(HashTable):
     def get_labels(self, {{dtype}}_t[:] values, {{name}}Vector uniques,
                    Py_ssize_t count_prior, Py_ssize_t na_sentinel,
                    bint check_null=True,
-                   {{dtype}}_t na_value={{default_na_value}},
-                   bint use_na_value=False):
+                   object na_value=None):
         cdef:
             Py_ssize_t i, n = len(values)
             int64_t[:] labels
             Py_ssize_t idx, count = count_prior
             int ret = 0
-            {{dtype}}_t val
+            {{dtype}}_t val, na_value2
             khiter_t k
             {{name}}VectorData *ud
+            bint use_na_value
 
         labels = np.empty(n, dtype=np.int64)
         ud = uniques.data
+        use_na_value = na_value is not None
+
+        if use_na_value:
+            na_value2 = <{{dtype}}_t>na_value
+        else:
+            na_value2 = {{default_na_value}}
 
         with nogil:
             for i in range(n):
                 val = values[i]
 
                 if ((check_null and {{null_condition}}) or
-                        (use_na_value and val == na_value)):
+                        (use_na_value and val == na_value2)):
                     labels[i] = na_sentinel
                     continue
 
diff --git a/pandas/core/algorithms.py b/pandas/core/algorithms.py
index bbf60c2ee..e668b4748 100644
--- a/pandas/core/algorithms.py
+++ b/pandas/core/algorithms.py
@@ -456,16 +456,9 @@ def _factorize_array(values, check_nulls, na_sentinel=-1, size_hint=None,
     """
     (hash_klass, vec_klass), values = _get_data_algo(values, _hashtables)
 
-    use_na_value = na_value is not None
-    kwargs = dict(use_na_value=use_na_value)
-
-    if use_na_value:
-        kwargs['na_value'] = na_value
-
     table = hash_klass(size_hint or len(values))
     uniques = vec_klass()
-    labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls,
-                              **kwargs)
+    labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls, na_value=na_value)
 
     labels = _ensure_platform_int(labels)
     uniques = uniques.to_array()

seems to work. Is that what you had in mind?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be a little more explicit something along the lines of

cdef f(object x=None):
    cdef:
        {{template_type}} x_val

    if x is not None:
        x_val = <{{template_type}}> x
    ...
    with nogil:
         ...

Should still work because you aren't trying to access any Python objects within the nogil - the compiler should know that x_val is of type {{template_type}} and work with that accordingly, even though that value doesn't get defined until runtime

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replied at about the same time but yes that's what I was thinking. Good to hear it works

('Int64', 'int64', 'val == iNaT', False)]
# name, dtype, null_condition, float_group, default_na_value
dtypes = [('Float64', 'float64', 'val != val', True, 'nan'),
('UInt64', 'uint64', 'False', False, 0),
Copy link
Member

@WillAyd WillAyd Mar 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize you didn't set this, but is the uint64 track for bool hashing? If so, any reason why we don't use uint8 to save space?


with nogil:
for i in range(n):
val = values[i]

if check_null and {{null_condition}}:
if ((check_null and {{null_condition}}) or
Copy link
Member

@WillAyd WillAyd Mar 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to above comment, is it not possible to do something like

if (check_null and (val != val or val == na_value)):

To still cover all of the scenarios? (borrowed the above logic from groupby_helper)

@codecov
Copy link

codecov bot commented Mar 23, 2018

Codecov Report

Merging #20473 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #20473   +/-   ##
=======================================
  Coverage   91.82%   91.82%           
=======================================
  Files         152      152           
  Lines       49235    49235           
=======================================
  Hits        45212    45212           
  Misses       4023     4023
Flag Coverage Δ
#multiple 90.21% <100%> (ø) ⬆️
#single 41.89% <87.5%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/dtypes/missing.py 91.07% <ø> (ø) ⬆️
pandas/core/arrays/categorical.py 96.19% <100%> (-0.02%) ⬇️
pandas/core/algorithms.py 94.31% <100%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6c0c277...465d458. Read the comment docs.

@TomAugspurger
Copy link
Contributor Author

Thanks to @WillAyd we've cleaned up some of the hackiness.

Now I'd like to see if we can remove some of the strangeness around the two ways of having nulls, {{null_condition}} or a user-specified NA value.

And more tests.

@WillAyd
Copy link
Member

WillAyd commented Mar 23, 2018

@TomAugspurger did the below not work to consolidate the "two null methods"?

if (check_null and (val != val or val == na_value))

If so we could remove null_condition from the templating altogether (would make it more consistent with groupby). I don't think you need to have use_na_value as part of the conditional because that should be setting the appropriate na_value earlier in the method (maybe?)

Removed null_condition from the template.

Had Categorical use na_value
@TomAugspurger
Copy link
Contributor Author

👍 I had to modify check_null to be True whenever na_value is specified.

I want to add more tests around the behavior of specifying an na_value and existing NA values (NaN, NaT, None, etc.). Do we agree with the following behavior?

In [17]: pd.factorize([0.0, np.nan, 1.0], na_value=0.0)
Out[17]: (array([-1, -1,  0]), array([1.]))

i.e. NaN will always be NA. This is currently broken for NaT I think.

In [19]: pd.factorize(pd.DatetimeIndex([pd.Timestamp(0), pd.Timestamp(1), None]), na_value=0)
Out[19]:
(array([-1,  0,  1]),
 DatetimeIndex(['1970-01-01 00:00:00.000000001', 'NaT'], dtype='datetime64[ns]', freq=None))

That should be [-1, 0, -1] for the labels I think...

@WillAyd
Copy link
Member

WillAyd commented Mar 24, 2018

I actually think the first example should raise a ValueError, if only because I don't see a reason why a person would intentionally provide a na_value in an array that contains NA. The reasoning for that is unclear and I think it would be better to signal to the user that something may be awry (they probably wanted to fill the missing value first?)

For the second example, I also would prefer a ValueError gets thrown when the type of na_value does not match the type of the values being factorized (not applicable for object dtypes). pd.Timestamp(0) == 0 is False so I wouldn't expect factorize to roll its own equality comp.

Ignoring that, depending on how you feel about my stance on example one the second example should either raise a ValueError or go [-1, 0, -1] as you mentioned. Just my $.02

@TomAugspurger
Copy link
Contributor Author

Thinking more on this, I think we should not make the na_value parameter public to end users. I think it should only be available in the private _factorize_array method. Two reasons

  1. Some awkwardness around calling pd.factorize(Categorical, na_value=foo). That "works", but the na_value is ignored since the chain is pd.factorize -> Categorical.factorize -> pd.factorize, and Categorical.factorize doesn't take na_value so it's dropped.
  2. The only concrete use-cases we have is extension arrays (including internal ones). In this case, we won't be mixing multiple missing markers. We can either tell EA authors that _factorize_array is public to them, or just ask them "what's your NA value" and pass it to _factorize_array in the default implementation for them.

@TomAugspurger TomAugspurger changed the title [WIP]: Parametrized NA sentinel for factorize Parametrized NA sentinel for factorize Mar 24, 2018
@TomAugspurger
Copy link
Contributor Author

Updated to move the parameter just to _factorize_array, and not the public factorize. I've tested a variant of #20361 built on this. It works well for JSONArray, IPArray (object hashtable) and MACArray (uint64 hashtable). And of course Categorical, which I've changed here to use na_value.

@@ -445,19 +446,27 @@ def _factorize_array(values, check_nulls, na_sentinel=-1, size_hint=None):
values : ndarray
check_nulls : bool
Whether to check for nulls in the hashtable's 'get_labels' method.
Nulls are always checked when `na_value` is specified.
Copy link
Contributor

@jreback jreback Mar 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you are going to add this kwarg (which is ok), then remove check_nulls which is redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought I posted a comment about that, but I guess I deleted it.

AFAICT check_nulls is only useful for DatetimeIndex, so that these two are different?

In [2]: pd.factorize([2**-63, 0])
Out[2]: (array([0, 1]), array([1.08420217e-19, 0.00000000e+00]))

In [3]: pd.factorize(pd.DatetimeIndex([None, 0]))
Out[3]:
(array([-1,  0]),
 DatetimeIndex(['1970-01-01'], dtype='datetime64[ns]', freq=None))

Once DatetimeIndex is an ExtensionArray, pd.factorize will dispatch to DatetimeIndex.factorize, which can call _factorize_array(..., na_value=iNaT). I think I can do this now, it just makes the logic a bit ugly. Will give it a shot.

Copy link
Contributor Author

@TomAugspurger TomAugspurger Mar 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, may not be able to remove check_nulls entirely. I needed it for factorizing boolean arrays. might still be able to... Anyway, this is what I had in mind:

diff --git a/pandas/core/algorithms.py b/pandas/core/algorithms.py
index bf192cdb2..21ec17054 100644
--- a/pandas/core/algorithms.py
+++ b/pandas/core/algorithms.py
@@ -511,7 +511,7 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
     values = _ensure_arraylike(values)
     original = values
 
-    if is_categorical_dtype(values):
+    if is_categorical_dtype(values) or is_datetime64_any_dtype(values):
         values = getattr(values, '_values', values)
         labels, uniques = values.factorize()
         dtype = original.dtype
diff --git a/pandas/core/indexes/datetimelike.py b/pandas/core/indexes/datetimelike.py
index b906ea0f4..bd33828f2 100644
--- a/pandas/core/indexes/datetimelike.py
+++ b/pandas/core/indexes/datetimelike.py
@@ -984,6 +984,13 @@ class DatetimeIndexOpsMixin(object):
 
         return algorithms.isin(self.asi8, values.asi8)
 
+    def factorize(self, sort=False, na_sentinel=-1):
+        from pandas.core.algorithms import _factorize_array
+        l, u = _factorize_array(self, True, na_sentinel=na_sentinel,
+                                na_value=iNaT)
+        u = self._constructor(u)
+        return l, u
+
     def shift(self, n, freq=None):
         """
         Specialized shift which produces a DatetimeIndex

Will have to fix it to check for timedeltaindex too, handle sorting, but not too bad. Worth doing here, or a separate PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Will be able to remove check_nulls entirely without too much effort.

@TomAugspurger
Copy link
Contributor Author

Removed check_nulls entirely.

I removed DatetimeIndexOpsMixin.factorize and put the logic in pandas.factorize. It has to go there so that we correctly handle ndarray[datetime64ns].

dtypes = [('Float64', 'float64', 'val != val', True),
('UInt64', 'uint64', 'False', False),
('Int64', 'int64', 'val == iNaT', False)]
# name, dtype, float_group, default_na_value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when is float_group used? seems superfluous?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be used in unique.

@@ -508,10 +513,17 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
dtype = original.dtype
else:
values, dtype, _ = _ensure_data(values)
check_nulls = not is_integer_dtype(original)
labels, uniques = _factorize_array(values, check_nulls,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use pandas.core.dtypes.missing.na_value_for_dtype (or maybe add a kwarg to return the underlying value). just want to try to keep this logic in 1 place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know why na_value_for_dtype(PeriodDtype) is nan? It should be NaT, right?

@TomAugspurger
Copy link
Contributor Author

I'm running the tests locally with na_value_for_dtype(Period) returning NaT. If that passes I'll change factorize to use it.

@TomAugspurger
Copy link
Contributor Author

@jreback are you OK with this now, including the changes to na_value_for_dtype(Period)? Would like to get this merged so I can finish of factorize, and make a PR for groupby.

@TomAugspurger
Copy link
Contributor Author

Ran the ASV, and nothing stood out.

(np.dtype("M8[ns]"), NaT),
(np.dtype("m8[ns]"), NaT),
(DatetimeTZDtype('datetime64[ns, US/Eastern]'), NaT),
(PeriodDtype("M"), NaT),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

@jreback jreback added this to the 0.23.0 milestone Mar 27, 2018
@jreback jreback merged commit 1b1b99b into pandas-dev:master Mar 27, 2018
@jreback
Copy link
Contributor

jreback commented Mar 27, 2018

thanks @TomAugspurger

javadnoorb pushed a commit to javadnoorb/pandas that referenced this pull request Mar 29, 2018
dworvos pushed a commit to dworvos/pandas that referenced this pull request Apr 2, 2018
kornilova203 pushed a commit to kornilova203/pandas that referenced this pull request Apr 23, 2018
@TomAugspurger TomAugspurger deleted the parametrized-factorize-na-value branch May 2, 2018 13:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parameterize NA sentinel for hashtables
3 participants