Parametrized NA sentinel for factorize #20473
Conversation
Adds a new keyword `na_value` to control the NA sentinel inside the factorize routine.

```python
In [3]: arr = np.array([0, 1, 0, 2], dtype='u8')

In [4]: pd.factorize(arr)
Out[4]: (array([0, 1, 0, 2]), array([0, 1, 2], dtype=uint64))

In [5]: pd.factorize(arr, na_value=0)
Out[5]: (array([-1, 0, -1, 1]), array([1, 2], dtype=uint64))
```
```diff
 cdef class {{name}}HashTable(HashTable):

-    def __cinit__(self, size_hint=1):
+    def __cinit__(self, size_hint=1, {{dtype}}_t na_value={{default_na_value}},
+                  bint use_na_value=False):
```
I wonder, should this just be moved to `get_labels`?
```cython
        self.table = kh_init_{{dtype}}()
        self.na_value = na_value
        self.use_na_value = use_na_value
```
Do you even need `use_na_value` at all? Conceptually can't you just rely on the existing `check_null` implementation and simply use whatever `na_value` gets set to (whether default or provided by user)?
I'm not sure :) Without `use_na_value`, wouldn't that break `pd.factorize(np.array([0, 1], dtype='u8'))`? On master the labels are `[0, 1]`. But without `use_na_value`, we'd see that `0` is equal to the default UInt64HashTable `na_value` of `0`, so the labels would be `[-1, 0]`.
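To make the concern concrete, here is a small sketch of the two outcomes being compared (the expected output follows the examples above; the `use_na_value` flag is the guard proposed in this PR):

```python
import numpy as np
import pandas as pd

arr = np.array([0, 1], dtype='u8')

# Expected behaviour (current master): nothing in a uint64 array is treated as NA.
pd.factorize(arr)
# (array([0, 1]), array([0, 1], dtype=uint64))

# Without the use_na_value guard, 0 would collide with the hashtable's
# default na_value of 0 and be labelled as missing instead:
# (array([-1, 0]), array([1], dtype=uint64))
```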
I'll try that now to make sure.
Hmm OK I see. I'll say this with the caveat that I haven't used factorize before, but does it even make sense to have a NA value for boolean types? Wouldn't the presence of NA force that conversion to float?
I think the best use-case to focus on is Categorical. In its codes, missing values are -1. So we'd like to pass an array of int64 to factorize, and have it think that -1 is NA.
We avoid NaN for datetime / categorical for that reason: it would force them to float. For the Int64HashTable, we use iNaT (the smallest int) as the missing value sentinel.
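For reference, a small illustration of that Categorical convention; the last line shows how the keyword proposed in this PR is meant to be used against those codes, so treat it as hypothetical until the PR lands:

```python
import pandas as pd

cat = pd.Categorical(['a', 'b', None, 'a'])

cat.codes
# array([ 0,  1, -1,  0], dtype=int8) -- missing values are stored as -1

# With this PR, the codes could be factorized while treating -1 as NA:
# pd.factorize(cat.codes.astype('int64'), na_value=-1)
```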
Ah OK thanks - that clears things up a lot. Shouldn't the default missing value for the type then be -1 instead of 0?
For Int64 or UInt64? If it's UInt64 it'll have to be 0, since -1 is negative :)

I need to better describe `default_na_value`. It's never actually used, since the only way it could be used is if the user explicitly passes `na_value`. Essentially, I want

```cython
cdef f(int64_t x=None):
    if x is not None:
        ...
```

but in a way that makes Cython happy. There's probably a way to do that...
Good point on the -1 with the unsigned int - you'll have to excuse me (it's Friday).

So perhaps one thing to consider: instead of declaring x as `int64_t` in your example (or whatever type is dictated by the template), you could keep it as an `object` in the signature. In the `cdef` block add the declaration for your variable, and then do your `if x is not None` construct in the body before the `nogil`, where within that conditional you can cast the object to the particular type and assign it to the variable you declared in the `cdef` section.

Kind of wordy, so let me know if that's not clear.
Something like
```diff
diff --git a/pandas/_libs/hashtable_class_helper.pxi.in b/pandas/_libs/hashtable_class_helper.pxi.in
index af60eb4b1..07bdad241 100644
--- a/pandas/_libs/hashtable_class_helper.pxi.in
+++ b/pandas/_libs/hashtable_class_helper.pxi.in
@@ -409,26 +409,32 @@ cdef class {{name}}HashTable(HashTable):

     def get_labels(self, {{dtype}}_t[:] values, {{name}}Vector uniques,
                    Py_ssize_t count_prior, Py_ssize_t na_sentinel,
                    bint check_null=True,
-                   {{dtype}}_t na_value={{default_na_value}},
-                   bint use_na_value=False):
+                   object na_value=None):
         cdef:
             Py_ssize_t i, n = len(values)
             int64_t[:] labels
             Py_ssize_t idx, count = count_prior
             int ret = 0
-            {{dtype}}_t val
+            {{dtype}}_t val, na_value2
             khiter_t k
             {{name}}VectorData *ud
+            bint use_na_value

         labels = np.empty(n, dtype=np.int64)
         ud = uniques.data
+        use_na_value = na_value is not None
+
+        if use_na_value:
+            na_value2 = <{{dtype}}_t>na_value
+        else:
+            na_value2 = {{default_na_value}}

         with nogil:
             for i in range(n):
                 val = values[i]

                 if ((check_null and {{null_condition}}) or
-                    (use_na_value and val == na_value)):
+                    (use_na_value and val == na_value2)):
                     labels[i] = na_sentinel
                     continue
diff --git a/pandas/core/algorithms.py b/pandas/core/algorithms.py
index bbf60c2ee..e668b4748 100644
--- a/pandas/core/algorithms.py
+++ b/pandas/core/algorithms.py
@@ -456,16 +456,9 @@ def _factorize_array(values, check_nulls, na_sentinel=-1, size_hint=None,
     """
     (hash_klass, vec_klass), values = _get_data_algo(values, _hashtables)

-    use_na_value = na_value is not None
-    kwargs = dict(use_na_value=use_na_value)
-
-    if use_na_value:
-        kwargs['na_value'] = na_value
-
     table = hash_klass(size_hint or len(values))
     uniques = vec_klass()
-    labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls,
-                              **kwargs)
+    labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls, na_value=na_value)
     labels = _ensure_platform_int(labels)
     uniques = uniques.to_array()
```
seems to work. Is that what you had in mind?
Just to be a little more explicit, something along the lines of

```cython
cdef f(object x=None):
    cdef:
        {{template_type}} x_val

    if x is not None:
        x_val = <{{template_type}}> x
    ...
    with nogil:
        ...
```

should still work, because you aren't trying to access any Python objects within the nogil - the compiler should know that x_val is of type `{{template_type}}` and work with that accordingly, even though that value doesn't get defined until runtime.
Replied at about the same time but yes that's what I was thinking. Good to hear it works
```python
          ('Int64', 'int64', 'val == iNaT', False)]
# name, dtype, null_condition, float_group, default_na_value
dtypes = [('Float64', 'float64', 'val != val', True, 'nan'),
          ('UInt64', 'uint64', 'False', False, 0),
```
I realize you didn't set this, but is the `uint64` track used for bool hashing? If so, any reason why we don't use `uint8` to save space?
```diff
         with nogil:
             for i in range(n):
                 val = values[i]

-                if check_null and {{null_condition}}:
+                if ((check_null and {{null_condition}}) or
```
Related to the above comment, is it not possible to do something like

```cython
if (check_null and (val != val or val == na_value)):
```

to still cover all of the scenarios? (Borrowed the above logic from `groupby_helper`.)
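As an aside on the `val != val` idiom in that suggestion: NaN is the only floating-point value that compares unequal to itself, so the check flags NaN without any extra machinery. A quick sanity check:

```python
import numpy as np

print(np.nan != np.nan)   # True  -> NaN is flagged as null by `val != val`
print(0.0 != 0.0)         # False -> ordinary values pass through
```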
Codecov Report

```
@@           Coverage Diff            @@
##           master   #20473   +/-   ##
=======================================
  Coverage   91.82%   91.82%
=======================================
  Files         152      152
  Lines       49235    49235
=======================================
  Hits        45212    45212
  Misses       4023     4023
```

Continue to review full report at Codecov.
Thanks to @WillAyd we've cleaned up some of the hackiness. Now I'd like to see if we can remove some of the strangeness around the two ways of having nulls, and add more tests.
@TomAugspurger did the below not work to consolidate the "two null methods"?

```cython
if (check_null and (val != val or val == na_value))
```

If so we could remove
Removed `null_condition` from the template. Had Categorical use `na_value`. 👍

I had to modify

I want to add more tests around the behavior of specifying an `na_value`:

```python
In [17]: pd.factorize([0.0, np.nan, 1.0], na_value=0.0)
Out[17]: (array([-1, -1, 0]), array([1.]))
```

i.e. NaN will always be NA. This is currently broken for NaT I think:

```python
In [19]: pd.factorize(pd.DatetimeIndex([pd.Timestamp(0), pd.Timestamp(1), None]), na_value=0)
Out[19]:
(array([-1, 0, 1]),
 DatetimeIndex(['1970-01-01 00:00:00.000000001', 'NaT'], dtype='datetime64[ns]', freq=None))
```

That should be
I actually think the first example should raise a ValueError. For the second example, I also would prefer a ValueError gets thrown when the type of na_value does not match the type of the values being factorized (not applicable for object dtypes). Ignoring that, depending on how you feel about my stance on example one, the second example should either raise a ValueError or go
Thinking more on this, I think we should not make the
Updated to move the parameter just to
Added PyObject hashtable test
pandas/core/algorithms.py
```
@@ -445,19 +446,27 @@ def _factorize_array(values, check_nulls, na_sentinel=-1, size_hint=None):
    values : ndarray
    check_nulls : bool
        Whether to check for nulls in the hashtable's 'get_labels' method.
        Nulls are always checked when `na_value` is specified.
```
If you are going to add this kwarg (which is ok), then remove `check_nulls`, which is redundant.
Thought I posted a comment about that, but I guess I deleted it.

AFAICT `check_nulls` is only useful for DatetimeIndex, so that these two are different:

```python
In [2]: pd.factorize([2**-63, 0])
Out[2]: (array([0, 1]), array([1.08420217e-19, 0.00000000e+00]))

In [3]: pd.factorize(pd.DatetimeIndex([None, 0]))
Out[3]:
(array([-1, 0]),
 DatetimeIndex(['1970-01-01'], dtype='datetime64[ns]', freq=None))
```

Once DatetimeIndex is an ExtensionArray, `pd.factorize` will dispatch to `DatetimeIndex.factorize`, which can call `_factorize_array(..., na_value=iNaT)`. I think I can do this now, it just makes the logic a bit ugly. Will give it a shot.
Ahh, may not be able to remove `check_nulls` entirely. I needed it for factorizing boolean arrays. Might still be able to... Anyway, this is what I had in mind:

```diff
diff --git a/pandas/core/algorithms.py b/pandas/core/algorithms.py
index bf192cdb2..21ec17054 100644
--- a/pandas/core/algorithms.py
+++ b/pandas/core/algorithms.py
@@ -511,7 +511,7 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
     values = _ensure_arraylike(values)
     original = values

-    if is_categorical_dtype(values):
+    if is_categorical_dtype(values) or is_datetime64_any_dtype(values):
         values = getattr(values, '_values', values)
         labels, uniques = values.factorize()
         dtype = original.dtype
diff --git a/pandas/core/indexes/datetimelike.py b/pandas/core/indexes/datetimelike.py
index b906ea0f4..bd33828f2 100644
--- a/pandas/core/indexes/datetimelike.py
+++ b/pandas/core/indexes/datetimelike.py
@@ -984,6 +984,13 @@ class DatetimeIndexOpsMixin(object):
         return algorithms.isin(self.asi8, values.asi8)

+    def factorize(self, sort=False, na_sentinel=-1):
+        from pandas.core.algorithms import _factorize_array
+        l, u = _factorize_array(self, True, na_sentinel=na_sentinel,
+                                na_value=iNaT)
+        u = self._constructor(u)
+        return l, u
+
     def shift(self, n, freq=None):
         """
         Specialized shift which produces a DatetimeIndex
```

Will have to fix it to check for TimedeltaIndex too, and handle sorting, but not too bad. Worth doing here, or a separate PR?
Got it. Will be able to remove `check_nulls` entirely without too much effort.
Removed `check_nulls`.
```python
dtypes = [('Float64', 'float64', 'val != val', True),
          ('UInt64', 'uint64', 'False', False),
          ('Int64', 'int64', 'val == iNaT', False)]
# name, dtype, float_group, default_na_value
```
When is `float_group` used? Seems superfluous?
Seems to be used in `unique`.
```
@@ -508,10 +513,17 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
        dtype = original.dtype
    else:
        values, dtype, _ = _ensure_data(values)
        check_nulls = not is_integer_dtype(original)
        labels, uniques = _factorize_array(values, check_nulls,
```
Use `pandas.core.dtypes.missing.na_value_for_dtype` (or maybe add a kwarg to return the underlying value). Just want to try to keep this logic in one place.
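For context, roughly what that helper returns (my reading of the internal `na_value_for_dtype`, so treat the exact outputs as illustrative rather than authoritative):

```python
import numpy as np
from pandas.core.dtypes.missing import na_value_for_dtype

print(na_value_for_dtype(np.dtype('float64')))   # nan
print(na_value_for_dtype(np.dtype('M8[ns]')))    # NaT
print(na_value_for_dtype(np.dtype('int64')))     # 0
```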
Do you know why `na_value_for_dtype(PeriodDtype)` is nan? It should be NaT, right?
I'm running the tests locally with
@jreback are you OK with this now, including the changes to
Ran the ASV, and nothing stood out.
(np.dtype("M8[ns]"), NaT), | ||
(np.dtype("m8[ns]"), NaT), | ||
(DatetimeTZDtype('datetime64[ns, US/Eastern]'), NaT), | ||
(PeriodDtype("M"), NaT), |
nice!
thanks @TomAugspurger
Adds a new keyword `na_value` to control the NA sentinel inside the factorize routine.

The basic idea is that our hashtables now have two "modes" for detecting NA values. Note: specifying an NA value does not "turn off" the old way of checking NAs.
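A small illustration of the two checks working together, reusing the float example from the review thread above (it assumes the `na_value` keyword from this PR):

```python
import numpy as np
import pandas as pd

# Both 0.0 (the user-specified na_value) and np.nan (the existing NaN check)
# map to the -1 sentinel:
pd.factorize([0.0, np.nan, 1.0], na_value=0.0)
# (array([-1, -1,  0]), array([1.]))
```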
I'm sure the implementation can be cleaned up, but I wanted to put this up for feedback. Hopefully someone can tell me how to avoid the `use_default_na` nonsense :) cc @WillAyd
closes #20328