Index([1,2,.np.nan]).get_indexer([np.nan]) returns wrong value? #7820

jankatins · 2014-07-23T15:10:08Z

When this is merged, make the below change in Categorical as well!

Having np.nan in an Index is not returning the position of NaN but -1:

In[5]: from pandas.core.index import _ensure_index
In[6]: import numpy as np
In[7]: idx = _ensure_index([1,2,3,4, np.nan])
In[8]: np.nan in idx
Out[8]: True
In[9]: idx.get_indexer([np.nan])
Out[9]: array([-1])

I'm not sure if that's a bug or intended. What happens is that this (new) test for Categoricals fails:

        # if nan in levels, the proper code should be set!
        cat = pd.Categorical([1,2,3, np.nan], levels=[1,2,3])
        cat.levels = [1,2,3, np.nan]
        cat[1] = np.nan
        exp = np.array([0,3,2,-1])
        self.assert_numpy_array_equal(cat.codes, exp)

Traceback (most recent call last):
  File "C:\data\external\pandas\pandas\tests\test_categorical.py", line 555, in test_set_item_nan
    self.assert_numpy_array_equal(cat.codes, exp)
  File "C:\data\external\pandas\pandas\util\testing.py", line 99, in assert_numpy_array_equal
    raise AssertionError('{0} is not equal to {1}.'.format(np_array, assert_equal))
AssertionError: [ 0 -1  2 -1] is not equal to [ 0  3  2 -1].

The text was updated successfully, but these errors were encountered:

cpcloud · 2014-07-23T15:11:42Z

This is how nans are encoded. There's no way to order them, so they get a sentinel value for their indexes. In Float64Index their locations are stored using a boolean index.

cpcloud · 2014-07-23T15:12:24Z

@JanSchulz not sure if that answers your question, but maybe sheds some light on why you're getting this result

cpcloud · 2014-07-23T15:16:15Z

Where is this failing for you? I get a failure during cat[1] = np.nan.

cpcloud · 2014-07-23T15:17:04Z

also i would expect exp == np.array([0, -1, 2, -1]) if assignment did succeed.

jankatins · 2014-07-23T16:08:41Z

Note that the above is run on top of #7768, which already has a fix for encoding np.nan as -1 if np.nan is not in levels.

In Categoricals, the order which is passed in to the index is enough (no sorting necessary), so IMO it should return the position of the np.nan.

In some other cases this actually happen when I use the internal cython code, e.g. in pandas.core.categorical._get_codes_for_values, which correctly encodes np.nan as the position in the levels, but in that case uses a PyObjectHashTable instead of a Float64HashTable.

pd.Categorical([1,2,np.nan], levels=[1,2,np.nan]).codes
array([0, 1, 2])

So IMO a float index should also consider np.nan as a valid index value (or should refuse to have that included).

If NaN will be a proper value in a float index, I can use that. Otherwise I will probably ensure that I have a object index if NaN is in levels, because this works:

In[19]: from pandas.core.index import Index
In[20]: idx = Index([1,2,np.nan], dtype="object")
In[21]: idx.get_indexer([np.nan])
Out[21]: array([2])

jreback · 2014-07-23T16:10:42Z

@JanSchulz where are you directly using get_indexer? factorize already does this properly

jankatins · 2014-07-23T16:18:52Z

Categorical.__setitem__ is using get_indexer.

Categorical does not use factorize if a level is already passed in but _get_codes_for_values in Categorical. But that function also encodes NaN correctly because a "object" dtype is ensured

In[26]: _get_codes_for_values(np.array([1,2,np.nan]), np.array([1,2,np.nan]))
Out[26]: array([ 0,  1, -1])
In[27]: _get_codes_for_values(np.array([1,2,np.nan]), np.array([1,2,np.nan], dtype="object"))
Out[27]: array([0, 1, 2])

jankatins · 2014-07-23T16:22:09Z

Note that this bug is not so much about the Categorical useage, but about the inconsistency between Object and Float index, which handle np.nan differently.

IMO there ware three ways:

do nothing: leave in the inconsistency and don't prevent np.nan in float index (IMO a bug -> this here :-)
Don't allow np.nan in float index but raise in constructor if such a value is passed in
allow np.nan and change the hash table to accept np.nan like object does.

From my standpoint both the last two are fine: either I change the Categorical to change the levels to dtype object of a NaN is in levels or I don't have anything to do in Categorical :-)

jreback · 2014-07-23T16:23:35Z

hmm.

@cpcloud how tricky to change Float64HashTable to enumerate the indexing positions for nans?
I guess they would all have to be the same! for 1 nan this is fine/expected. Multi-nans are a problem though.

@JanSchulz

object dtypes are no longer used for Float64Index.

cpcloud · 2014-07-23T16:25:21Z

Hm i'll see ... can take a look tonight

jreback · 2014-07-23T16:28:04Z

@JanSchulz this ONLY works for a unique index and ONLY for object

e.g. I think these could return the same result

Index([1,2,np.nan]).get_indexer([np.nan])
Index([1,2,np.nan],dtype=object).get_indexer([np.nan])

These both don't know what to do (and so return -1)

Index([1,np.nan,2,np.nan]).get_indexer_for([np.nan])
Index([1,np.nan,2,np.nan],dtype=object).get_indexer_for([np.nan])

So @JanSchulz still need to be wary of the multiple np.nan case (e.g. if you have a -1 as an indexer I think its an error (separate from this)

jankatins · 2014-07-23T19:33:31Z

Uniqueness isn't a problem with (unique) Categorical levels.

But why not do the same as with this:

In[28]: Index([1,2,2,3]).get_indexer_for([2])
Out[28]: Int64Index([1, 2], dtype='int64')

jreback · 2014-07-23T19:46:04Z

since you cannot compare np.nan with np.nan at all, the routines treat them specially (for dtype=float64). For dtype==object (which is what the backend of Float64Index was prior to 0.14.0) this was correct (I guess), but WAY WAY slower (that's why it was changed).

This is what @cpcloud will look at (for a single nan case it IS possible to get an indexer).

cpcloud · 2014-07-27T16:25:21Z

@JanSchulz

Note that this bug is not so much about the Categorical useage, but about the inconsistency between Object and Float index, which handle np.nan differently.

NaNs are not well-behaved objects. They have the property that nan != nan, but this screws up some basic assumptions that many Python sequence objects make when comparing their contents with another sequence. For example,

In [3]: x = np.nan

In [4]: [x] == [x]
Out[4]: True

In [5]: x == x
Out[5]: False

I looked into the C code here to finally put my curiosity about nans to rest, and it's because the list_richcompare function calls PyObject_RichCompareBool (as opposed to PyObject_RichCompare, which does the right thing) on each element. PyObject_RichCompareBool contains these lines (v and w are both PyObject *s):

    /* Quick result when objects are the same.
       Guarantees that identity implies equality. */
    if (v == w) {
        if (op == Py_EQ)
            return 1;
        else if (op == Py_NE)
            return 0;
    }

So in the case of nan (and really any other object that one would choose to override this way), id(x) == id(x) does not imply that x == x which makes dealing with nan a huge PITA.

object arrays behave essentially like a list which leads to list-like comparison behavior:

In [12]: x = np.array([np.nan, np.nan], object)

In [13]: x == x
Out[13]: array([ True,  True], dtype=bool)

Finally, Float64Index is now (since 0.14.0 I think) backed by a real float64 array so it behaves like nan should behave. That of course breaks compatibility with older usage, where Float64Index was essentially just Index with a slightly different name.

I should note that id(nan) == id(nan) is just an implementation detail (e.g., id(float('nan')) != id(float('nan')), but it could), but one that certainly matters in this case.

jankatins · 2014-07-27T20:42:47Z

I still don't get what it would be so problematic to return the np.nan position if the position is known due to the "boolean index" (see first comment)? Or is that a performance problem due to the NaN check on the get_indexer(...) inputs? My idea of such a thing:

# in Float64Index
def get_indexer(self, *args, **kwargs):
    idx = super(Float64Index, self).get_indexer(*args, **kwargs)
    if self.hasnans and np.isnan(other): # from __contains__()
        # get_indexer only works if the index is unique so get_loc will return a single value
        idx[idx == -1] = self.get_loc(np.nan)
    return idx

jankatins · 2014-08-15T19:29:47Z

If this is fixed, the workaround in categorical.py should be removed:

        # FIXME: the following can be removed after https://github.com/pydata/pandas/issues/7820
        # is fixed.
        # float levels do currently return -1 for np.nan, even if np.nan is included in the index
        # "repair" this here
        if com.isnull(rvalue).any() and com.isnull(self.levels).any():
            nan_pos = np.where(com.isnull(self.levels))
            lindexer[lindexer == -1] = nan_pos

[This is in #8007, so not yet merged]

jreback · 2014-08-15T19:31:06Z

yep, @JanSchulz put this on the list of future fixmes for categorical (create a new issue)

jankatins · 2014-08-15T19:42:26Z

Isn't it easier to add a todo to this issue that this is taken out when this issue is fixed?

jankatins · 2014-08-15T19:44:45Z

Seems someone already did in #7855 :-) So I think this is taken care of.

jreback · 2014-08-15T19:47:16Z

I did that

yes it's easier if their is a specifc issue involved

the problem is that unless it's at the top it's not obvious and won't get done

jreback · 2018-07-06T21:36:32Z

this is ok in master, just needs tests to close

jankatins changed the title ~~Index([1,2,.np.nan]).get_indexer_for([np.nan]) returns wrong value?~~ Index([1,2,.np.nan]).get_indexer([np.nan]) returns wrong value? Jul 23, 2014

jreback added API Design labels Jul 23, 2014

jreback added this to the 0.15.0 milestone Jul 23, 2014

cpcloud mentioned this issue Jul 27, 2014

BUG/API: allow get_indexer to work with nans #7855

Closed

1 task

cpcloud self-assigned this Aug 8, 2014

jreback mentioned this issue Aug 12, 2014

CLN/DOC/TST: Categorical fixups (GH7768) #8007

Merged

jreback modified the milestones: 0.15.1, 0.15.0 Sep 29, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jreback added Difficulty Advanced labels Apr 8, 2015

jreback mentioned this issue Apr 8, 2015

PyCon 2015 sprints #9811

Closed

jreback modified the milestones: Contributions Welcome, 0.24.0 Jul 6, 2018

jreback added good first issue Needs Tests Unit test(s) needed to prevent regressions labels Jul 6, 2018

jreback modified the milestones: 0.24.0, Contributions Welcome Nov 6, 2018

jbrockmendel removed Effort Medium labels Oct 21, 2019

mroeschke mentioned this issue Dec 30, 2019

TST: Regression testing for fixed issues #30554

Merged

9 tasks

simonjayhawkins modified the milestones: Contributions Welcome, 1.0 Dec 30, 2019

TomAugspurger unassigned cpcloud Dec 30, 2019

TomAugspurger closed this as completed in #30554 Dec 31, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Index([1,2,.np.nan]).get_indexer([np.nan]) returns wrong value? #7820

Index([1,2,.np.nan]).get_indexer([np.nan]) returns wrong value? #7820

jankatins commented Jul 23, 2014

cpcloud commented Jul 23, 2014

cpcloud commented Jul 23, 2014

cpcloud commented Jul 23, 2014

cpcloud commented Jul 23, 2014

jankatins commented Jul 23, 2014

jreback commented Jul 23, 2014

jankatins commented Jul 23, 2014

jankatins commented Jul 23, 2014

jreback commented Jul 23, 2014

cpcloud commented Jul 23, 2014

jreback commented Jul 23, 2014

jankatins commented Jul 23, 2014

jreback commented Jul 23, 2014

cpcloud commented Jul 27, 2014

jankatins commented Jul 27, 2014

jankatins commented Aug 15, 2014

jreback commented Aug 15, 2014

jankatins commented Aug 15, 2014

jankatins commented Aug 15, 2014

jreback commented Aug 15, 2014

jreback commented Jul 6, 2018

Index([1,2,.np.nan]).get_indexer([np.nan]) returns wrong value? #7820

Index([1,2,.np.nan]).get_indexer([np.nan]) returns wrong value? #7820

Comments

jankatins commented Jul 23, 2014

cpcloud commented Jul 23, 2014

cpcloud commented Jul 23, 2014

cpcloud commented Jul 23, 2014

cpcloud commented Jul 23, 2014

jankatins commented Jul 23, 2014

jreback commented Jul 23, 2014

jankatins commented Jul 23, 2014

jankatins commented Jul 23, 2014

jreback commented Jul 23, 2014

cpcloud commented Jul 23, 2014

jreback commented Jul 23, 2014

jankatins commented Jul 23, 2014

jreback commented Jul 23, 2014

cpcloud commented Jul 27, 2014

jankatins commented Jul 27, 2014

jankatins commented Aug 15, 2014

jreback commented Aug 15, 2014

jankatins commented Aug 15, 2014

jankatins commented Aug 15, 2014

jreback commented Aug 15, 2014

jreback commented Jul 6, 2018