Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index([1,2,.np.nan]).get_indexer([np.nan]) returns wrong value? #7820

Closed
jankatins opened this issue Jul 23, 2014 · 21 comments · Fixed by #30554
Closed

Index([1,2,.np.nan]).get_indexer([np.nan]) returns wrong value? #7820

jankatins opened this issue Jul 23, 2014 · 21 comments · Fixed by #30554
Labels
API Design Dtype Conversions Unexpected or buggy dtype conversions good first issue Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@jankatins
Copy link
Contributor

When this is merged, make the below change in Categorical as well!

Having np.nan in an Index is not returning the position of NaN but -1:

In[5]: from pandas.core.index import _ensure_index
In[6]: import numpy as np
In[7]: idx = _ensure_index([1,2,3,4, np.nan])
In[8]: np.nan in idx
Out[8]: True
In[9]: idx.get_indexer([np.nan])
Out[9]: array([-1])

I'm not sure if that's a bug or intended. What happens is that this (new) test for Categoricals fails:

        # if nan in levels, the proper code should be set!
        cat = pd.Categorical([1,2,3, np.nan], levels=[1,2,3])
        cat.levels = [1,2,3, np.nan]
        cat[1] = np.nan
        exp = np.array([0,3,2,-1])
        self.assert_numpy_array_equal(cat.codes, exp)

Traceback (most recent call last):
  File "C:\data\external\pandas\pandas\tests\test_categorical.py", line 555, in test_set_item_nan
    self.assert_numpy_array_equal(cat.codes, exp)
  File "C:\data\external\pandas\pandas\util\testing.py", line 99, in assert_numpy_array_equal
    raise AssertionError('{0} is not equal to {1}.'.format(np_array, assert_equal))
AssertionError: [ 0 -1  2 -1] is not equal to [ 0  3  2 -1].
@jankatins jankatins changed the title Index([1,2,.np.nan]).get_indexer_for([np.nan]) returns wrong value? Index([1,2,.np.nan]).get_indexer([np.nan]) returns wrong value? Jul 23, 2014
@cpcloud
Copy link
Member

cpcloud commented Jul 23, 2014

This is how nans are encoded. There's no way to order them, so they get a sentinel value for their indexes. In Float64Index their locations are stored using a boolean index.

@cpcloud
Copy link
Member

cpcloud commented Jul 23, 2014

@JanSchulz not sure if that answers your question, but maybe sheds some light on why you're getting this result

@cpcloud
Copy link
Member

cpcloud commented Jul 23, 2014

Where is this failing for you? I get a failure during cat[1] = np.nan.

@cpcloud
Copy link
Member

cpcloud commented Jul 23, 2014

also i would expect exp == np.array([0, -1, 2, -1]) if assignment did succeed.

@jankatins
Copy link
Contributor Author

Note that the above is run on top of #7768, which already has a fix for encoding np.nan as -1 if np.nan is not in levels.

In Categoricals, the order which is passed in to the index is enough (no sorting necessary), so IMO it should return the position of the np.nan.

In some other cases this actually happen when I use the internal cython code, e.g. in pandas.core.categorical._get_codes_for_values, which correctly encodes np.nan as the position in the levels, but in that case uses a PyObjectHashTable instead of a Float64HashTable.

pd.Categorical([1,2,np.nan], levels=[1,2,np.nan]).codes
array([0, 1, 2])

So IMO a float index should also consider np.nan as a valid index value (or should refuse to have that included).

If NaN will be a proper value in a float index, I can use that. Otherwise I will probably ensure that I have a object index if NaN is in levels, because this works:

In[19]: from pandas.core.index import Index
In[20]: idx = Index([1,2,np.nan], dtype="object")
In[21]: idx.get_indexer([np.nan])
Out[21]: array([2])

@jreback
Copy link
Contributor

jreback commented Jul 23, 2014

@JanSchulz where are you directly using get_indexer? factorize already does this properly

@jankatins
Copy link
Contributor Author

Categorical.__setitem__ is using get_indexer.

Categorical does not use factorize if a level is already passed in but _get_codes_for_values in Categorical. But that function also encodes NaN correctly because a "object" dtype is ensured

In[26]: _get_codes_for_values(np.array([1,2,np.nan]), np.array([1,2,np.nan]))
Out[26]: array([ 0,  1, -1])
In[27]: _get_codes_for_values(np.array([1,2,np.nan]), np.array([1,2,np.nan], dtype="object"))
Out[27]: array([0, 1, 2])

@jankatins
Copy link
Contributor Author

Note that this bug is not so much about the Categorical useage, but about the inconsistency between Object and Float index, which handle np.nan differently.

IMO there ware three ways:

  • do nothing: leave in the inconsistency and don't prevent np.nan in float index (IMO a bug -> this here :-)
  • Don't allow np.nan in float index but raise in constructor if such a value is passed in
  • allow np.nan and change the hash table to accept np.nan like object does.

From my standpoint both the last two are fine: either I change the Categorical to change the levels to dtype object of a NaN is in levels or I don't have anything to do in Categorical :-)

@jreback
Copy link
Contributor

jreback commented Jul 23, 2014

hmm.

@cpcloud how tricky to change Float64HashTable to enumerate the indexing positions for nans?
I guess they would all have to be the same! for 1 nan this is fine/expected. Multi-nans are a problem though.

@JanSchulz

object dtypes are no longer used for Float64Index.

@cpcloud
Copy link
Member

cpcloud commented Jul 23, 2014

Hm i'll see ... can take a look tonight

@jreback
Copy link
Contributor

jreback commented Jul 23, 2014

@JanSchulz this ONLY works for a unique index and ONLY for object

e.g. I think these could return the same result

Index([1,2,np.nan]).get_indexer([np.nan])
Index([1,2,np.nan],dtype=object).get_indexer([np.nan])

These both don't know what to do (and so return -1)

Index([1,np.nan,2,np.nan]).get_indexer_for([np.nan])
Index([1,np.nan,2,np.nan],dtype=object).get_indexer_for([np.nan])

So @JanSchulz still need to be wary of the multiple np.nan case (e.g. if you have a -1 as an indexer I think its an error (separate from this)

@jreback jreback added this to the 0.15.0 milestone Jul 23, 2014
@jankatins
Copy link
Contributor Author

Uniqueness isn't a problem with (unique) Categorical levels.

But why not do the same as with this:

In[28]: Index([1,2,2,3]).get_indexer_for([2])
Out[28]: Int64Index([1, 2], dtype='int64')

@jreback
Copy link
Contributor

jreback commented Jul 23, 2014

since you cannot compare np.nan with np.nan at all, the routines treat them specially (for dtype=float64). For dtype==object (which is what the backend of Float64Index was prior to 0.14.0) this was correct (I guess), but WAY WAY slower (that's why it was changed).

This is what @cpcloud will look at (for a single nan case it IS possible to get an indexer).

@cpcloud
Copy link
Member

cpcloud commented Jul 27, 2014

@JanSchulz

Note that this bug is not so much about the Categorical useage, but about the inconsistency between Object and Float index, which handle np.nan differently.

NaNs are not well-behaved objects. They have the property that nan != nan, but this screws up some basic assumptions that many Python sequence objects make when comparing their contents with another sequence. For example,

In [3]: x = np.nan

In [4]: [x] == [x]
Out[4]: True

In [5]: x == x
Out[5]: False

I looked into the C code here to finally put my curiosity about nans to rest, and it's because the list_richcompare function calls PyObject_RichCompareBool (as opposed to PyObject_RichCompare, which does the right thing) on each element. PyObject_RichCompareBool contains these lines (v and w are both PyObject *s):

    /* Quick result when objects are the same.
       Guarantees that identity implies equality. */
    if (v == w) {
        if (op == Py_EQ)
            return 1;
        else if (op == Py_NE)
            return 0;
    }

So in the case of nan (and really any other object that one would choose to override this way), id(x) == id(x) does not imply that x == x which makes dealing with nan a huge PITA.

object arrays behave essentially like a list which leads to list-like comparison behavior:

In [12]: x = np.array([np.nan, np.nan], object)

In [13]: x == x
Out[13]: array([ True,  True], dtype=bool)

Finally, Float64Index is now (since 0.14.0 I think) backed by a real float64 array so it behaves like nan should behave. That of course breaks compatibility with older usage, where Float64Index was essentially just Index with a slightly different name.

I should note that id(nan) == id(nan) is just an implementation detail (e.g., id(float('nan')) != id(float('nan')), but it could), but one that certainly matters in this case.

@jankatins
Copy link
Contributor Author

I still don't get what it would be so problematic to return the np.nan position if the position is known due to the "boolean index" (see first comment)? Or is that a performance problem due to the NaN check on the get_indexer(...) inputs? My idea of such a thing:

# in Float64Index
def get_indexer(self, *args, **kwargs):
    idx = super(Float64Index, self).get_indexer(*args, **kwargs)
    if self.hasnans and np.isnan(other): # from __contains__()
        # get_indexer only works if the index is unique so get_loc will return a single value
        idx[idx == -1] = self.get_loc(np.nan)
    return idx

@jankatins
Copy link
Contributor Author

If this is fixed, the workaround in categorical.py should be removed:

        # FIXME: the following can be removed after https://github.com/pydata/pandas/issues/7820
        # is fixed.
        # float levels do currently return -1 for np.nan, even if np.nan is included in the index
        # "repair" this here
        if com.isnull(rvalue).any() and com.isnull(self.levels).any():
            nan_pos = np.where(com.isnull(self.levels))
            lindexer[lindexer == -1] = nan_pos

[This is in #8007, so not yet merged]

@jreback
Copy link
Contributor

jreback commented Aug 15, 2014

yep, @JanSchulz put this on the list of future fixmes for categorical (create a new issue)

@jankatins
Copy link
Contributor Author

Isn't it easier to add a todo to this issue that this is taken out when this issue is fixed?

@jankatins
Copy link
Contributor Author

Seems someone already did in #7855 :-) So I think this is taken care of.

@jreback
Copy link
Contributor

jreback commented Aug 15, 2014

I did that

yes it's easier if their is a specifc issue involved

the problem is that unless it's at the top it's not obvious and won't get done

@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 29, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@jreback jreback modified the milestones: Contributions Welcome, 0.24.0 Jul 6, 2018
@jreback jreback added good first issue Needs Tests Unit test(s) needed to prevent regressions labels Jul 6, 2018
@jreback
Copy link
Contributor

jreback commented Jul 6, 2018

this is ok in master, just needs tests to close

@jreback jreback modified the milestones: 0.24.0, Contributions Welcome Nov 6, 2018
@simonjayhawkins simonjayhawkins modified the milestones: Contributions Welcome, 1.0 Dec 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Dtype Conversions Unexpected or buggy dtype conversions good first issue Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
5 participants