BUG: fix multi-column sort that includes Categoricals / concat (GH7848/GH7864) #7850

Merged
merged 2 commits into from
Jul 29, 2014


jreback
Contributor

@jreback jreback commented Jul 26, 2014

CLN: refactor _lexsort_indexer to use Categoricals

closes #7848
closes #7864

@jreback jreback added this to the 0.15.0 milestone Jul 26, 2014
@jreback
Contributor Author

jreback commented Jul 26, 2014

xref #7768

cc @JanSchulz

I added _codes_as_ordered to deal with the na_position issue.

order/sort is implemented, but needs tests (with and w/o nan), see the FIXME

But this impacts your PR #7768 as you are doing essentially what this routine is doing.

t.map_locations(levels)
return com._ensure_platform_int(t.lookup(values))

def _codes_as_ordered(codes, levels, order=True, na_position='last', copy=False):
Contributor

This should have a different name or should also include the np.sort(...): When I first looked at this patch on my mobile, I expected from the name that this method does sorting, but actually it only does something with the NaN values.

Contributor Author

This was your original routine. It's internal; I think it's OK.

@jankatins
Contributor

From my understanding of that function (I haven't run it), it changes the NaN code from -1 to either 0 or len(levels), but

a) this is wrong: sorting the values shouldn't change the levels! and
b) the NaN is never added to the levels, nor are the levels changed.

So if I understand your patch correctly, sorting NaN to the beginning adds 1 to all codes. With NA sorted to the front (code 0), the code for the last level will then point to level[n+1], outside of the level index, and what was NaN will now have the value level[0].
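
A minimal NumPy sketch of the concern described above. It assumes that the -1 code is remapped to 0 and every other code is shifted by +1 while the levels stay unchanged; the arrays and variable names are made up for illustration, and this is not the code from the patch.

```python
import numpy as np

levels = np.array(["a", "b", "c", "d"])
codes = np.array([0, 2, -1, 3])  # -1 marks NaN

# Assumed remapping for na_position='first': NaN -> 0, all other codes + 1
shifted = np.where(codes == -1, 0, codes + 1)

# The former NaN entry now looks up the first level instead of being missing
former_nan_value = levels[shifted[codes == -1][0]]

# The code of the last level now points past the end of `levels`
try:
    levels[shifted]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(shifted, former_nan_value, out_of_bounds)
```

Under that assumption the lookup fails for the last level, and the NaN silently becomes a real value, which is why the codes should only be rearranged, not renumbered.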

I think adding na_position should simply rearrange the -1 entries to the start or end of the array.

# untested and you probably have a better idea how to do this :-)
import numpy as np

def order_codes(codes, ascending=True, na_position='last'):
    # standalone sketch of the method body; order_codes is a made-up name
    codes = np.sort(codes)  # returns a sorted copy: -1,0,1,2,3,4
    if not ascending:
        codes = codes[::-1]  # 4,3,2,1,0,-1
    # move the -1 (NaN) entries to the requested end; levels are untouched
    mask = (codes == -1)
    if na_position == 'first':
        return np.concatenate([codes[mask], codes[~mask]])
    return np.concatenate([codes[~mask], codes[mask]])

@jreback
Contributor Author

jreback commented Jul 28, 2014

@JanSchulz this doesn't change anything, except to add the ability to sort by na_position via a keyword to order (which is not tested; see the FIXME). This passes all current tests. If you have some additional tests, that would be fine.

@jankatins
Contributor

OK, I now see where the reverse sorting happens in _codes_as_ordered (that was too much magic for my amateur numpy knowledge), but I would still vote for inlining this function.

There are also still two bugs, which didn't show up in the unittests :-( :

edit: The second (and third) is the problem with the "-1 to 0/len(levels)" conversion. Not sure why the first shows up, but I think it is because np.where(mask, n, n-codes-1) only works in some specific cases, e.g. [0,1,2,3,4] becomes [5-0-1,5-1-1,...]=[4,3,2,1,0] but [0,1,1,2,3] becomes [3,2,2,1,0].
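
The two cases quoted above can be checked with a small NumPy sketch. It only evaluates the np.where expression from this comment on already sorted codes; `flip_codes` is a hypothetical helper name, not a function from the patch.

```python
import numpy as np

def flip_codes(codes, n):
    # The quoted expression: NaN (-1) -> n, every other code c -> n - c - 1
    mask = codes == -1
    return np.where(mask, n, n - codes - 1)

unique_sorted = np.array([0, 1, 2, 3, 4])  # every code occurs exactly once
dup_sorted = np.array([0, 1, 1, 2, 3])     # code 1 occurs twice

flipped_unique = flip_codes(unique_sorted, 5)
flipped_dup = flip_codes(dup_sorted, 4)

# Flipping code values equals reversing positions only without duplicates
same_unique = np.array_equal(flipped_unique, unique_sorted[::-1])
same_dup = np.array_equal(flipped_dup, dup_sorted[::-1])

print(flipped_unique, same_unique)
print(flipped_dup, dup_sorted[::-1], same_dup)
```

Flipping the code values changes which level each entry refers to, whereas a descending sort should only change the positions of the entries.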

class TestCategorical(tm.TestCase):
    [....]
    def test_sort(self):

        [...]
        # reverse
        cat = Categorical(["a","c","c","b","d"], ordered=True)
        res = cat.order(ascending=False)
        exp_val = np.array(["d","c", "c", "b","a"],dtype=object)
        exp_levels = np.array(["a","b","c","d"],dtype=object)
        # FIXME: res.__array__() ends up as ['d' 'c' 'b' 'b' 'a']
        #self.assert_numpy_array_equal(res.__array__(), exp_val)
        self.assert_numpy_array_equal(res.levels, exp_levels)

        # some NaN positions
        cat = Categorical(["a","c","b","d", np.nan], ordered=True)
        res = cat.order(ascending=False, na_position='last')
        exp_val = np.array(["d","c","b","a", np.nan],dtype=object)
        exp_levels = np.array(["a","b","c","d"],dtype=object)
        # FIXME: IndexError: Out of bounds on buffer access (axis 0)
        #self.assert_numpy_array_equal(res.__array__(), exp_val)
        self.assert_numpy_array_equal(res.levels, exp_levels)

        cat = Categorical(["a","c","b","d", np.nan], ordered=True)
        res = cat.order(ascending=False, na_position='first')
        exp_val = np.array([np.nan, "d","c","b","a"],dtype=object)
        exp_levels = np.array(["a","b","c","d"],dtype=object)
        # FIXME: IndexError: Out of bounds on buffer access (axis 0)
        #self.assert_numpy_array_equal(res.__array__(), exp_val)
        self.assert_numpy_array_equal(res.levels, exp_levels)

@jreback
Contributor Author

jreback commented Jul 28, 2014

I didn't inline it because I actually needed it for core/groupby/_lexsort_indexer (which is the 'main' multi-column sorter). It was basically a Categorical (but duplicating code). So this makes it nice and neat (and as a bonus it now has the ability to have na_position passed).
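
As a rough illustration of the idea (not the pandas implementation), a multi-column sort can use integer codes as sort keys so that a categorical column is ordered by its levels rather than lexically. The arrays and names below are assumptions made up for this sketch.

```python
import numpy as np

levels = np.array(["low", "mid", "high"])        # assumed level order
size = np.array(["mid", "high", "low", "high"])  # categorical-like column
count = np.array([2, 1, 3, 0])                   # second sort column

# Code of each value = its position within `levels`
codes = np.array([np.flatnonzero(levels == v)[0] for v in size])

# np.lexsort treats the LAST key as the primary key
indexer = np.lexsort([count, codes])

print(size[indexer], count[indexer])
```

Sorting on the codes keeps "low" before "mid" before "high", where a plain lexical sort of the strings would put "high" first.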

I'll look at those tests

@jreback
Contributor Author

jreback commented Jul 28, 2014

side issue:

From your first test:

cat = Categorical(["a","c","c","b","d"], ordered=True)
res = cat.order(ascending=False)

should res reverse the levels too? e.g. ['d','c','b','a']?

@jankatins
Contributor

No, ordering should not reverse (or in any way change) the levels, that's what reorder_levels(...) is for.

@jreback
Contributor Author

jreback commented Jul 28, 2014

hmm, so then you have cat.order(ascending=False) != cat.reorder_levels(reverse_the_levels) which is odd

@jankatins
Contributor

No: with normal ints you also wouldn't change the "underlying order" from "1 is in front of 2" to "2 is in front of 1" when you order an int array. cat.order(ascending=False) orders the values in descending order, but the order is determined by the levels, so the levels should stay stable, just as when ordering int arrays.

If you sorted both the codes array and the levels, the values would actually be in ascending order: afterwards a cat([1,2,3,4], levels=[1,2,3,4]) would end up as cat([4,3,2,1], levels=[4,3,2,1]), i.e. the lowest level (before 1, now 4) would again be at the first position.

cat.reorder_levels() changes the underlying order, something which is not possible with int arrays.
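
The argument can be traced with plain NumPy arrays standing in for the codes and levels of the cat([1,2,3,4], levels=[1,2,3,4]) example. This is a sketch under that assumption and does not use the Categorical API itself.

```python
import numpy as np

levels = np.array([1, 2, 3, 4])  # defines the order: 1 < 2 < 3 < 4
codes = np.array([0, 1, 2, 3])   # positions into `levels`

# Descending order: only the codes are rearranged, the levels stay stable
desc_codes = np.sort(codes)[::-1]
desc_values = levels[desc_codes]

# If the levels were reversed as well, the same values would get these codes
reversed_levels = levels[::-1]
recoded = np.array([np.flatnonzero(reversed_levels == v)[0] for v in desc_values])

print(desc_values, reversed_levels, recoded)
```

With the levels reversed too, the codes come out as 0, 1, 2, 3 again, so relative to the new level order the values are ascending, not descending.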

CLN: refactor _lexsort_indexer to use Categoricals
@jreback jreback changed the title BUG: fix multi-column sort that includes Categoricals (GH7848) BUG: fix multi-column sort that includes Categoricals / concat (GH7848/GH7864) Jul 29, 2014
jreback added a commit that referenced this pull request Jul 29, 2014
BUG: fix multi-column sort that includes Categoricals / concat (GH7848/GH7864)
@jreback jreback merged commit 9857a0e into pandas-dev:master Jul 29, 2014
@jreback
Contributor Author

jreback commented Jul 29, 2014

@JanSchulz I merged this in. After @cpcloud's fix, we can then deal with the remaining NaN issues (in your PR).

@has2k1
Contributor

has2k1 commented Jul 29, 2014

Thanks. Categorical is a real treat and all the fixes are working well.

@jreback
Contributor Author

jreback commented Jul 29, 2014

@has2k1 no, thank you for finding issues!

@jankatins
Contributor

@jreback I will try to get the Categorical fix into my PR, but it could happen that this takes another week, as I'm at a conference this weekend until Wednesday.

@jreback
Contributor Author

jreback commented Jul 31, 2014

no prob

Labels
Bug Categorical Categorical Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pd.concat reordering categorical levels lexically
Categorical in dataframe is sorted lexically.
3 participants