Categorical fixups #7768

jankatins · 2014-07-16T18:42:14Z

Some fixups for Categoricals.

Maybe change Series.cat after the discussion in API: revisit adding datetime-like ops in Series #7207 ?
remove Series.cat from tab completion if Series is not of dtype category
fix for the "FIXME" in unittests
Look at problems in docs (-> hdf support)
Fixup Comparison thingies...

jreback · 2014-07-16T18:44:12Z

pandas/core/categorical.py

@@ -220,9 +220,17 @@ def __init__(self, values, levels=None, ordered=None, name=None, fastpath=False,
            inferred = com._possibly_infer_to_datetimelike(values)
            if not isinstance(inferred, np.ndarray):
                from pandas.core.series import _sanitize_array
-                values = _sanitize_array(values, None)
+                safe_dtype = None
+                if isinstance(values, list) and np.nan in values:


I think you have to do isnull(values).any() instead of np.nan in values

In[47]: np.nan in [np.nan, 1]
Out[47]: True
In[48]: np.nan in [2, 1]
Out[48]: False

Not sure what's faster: converting to numpy array and doing the isnull(..).any() check or the "is in list" check.

this is too specific a check, instead do this:

dtype = 'object' if isnull(values).any() else None
values = _sanitize_array(values, dtype=dtype)
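The pitfall behind this suggestion can be reproduced without pandas. The following is a minimal stdlib-only sketch (the helper name `has_nan` is made up for illustration, and `float("nan")` stands in for `np.nan`). It shows that `in` finds a NaN only when the list holds the very same object, because NaN never compares equal to itself. In pandas the corresponding check would be the `isnull(values).any()` call suggested in this thread.

```python
import math

nan = float("nan")  # stands in for np.nan, which is also a plain Python float

# `in` tests identity first and equality second. NaN != NaN, so only the
# identical object is found in the list.
same_object = nan in [nan, 1]             # identity match
other_object = float("nan") in [nan, 1]   # a different NaN object: no match


def has_nan(values):
    """Return True if any float in `values` is NaN (hypothetical helper)."""
    return any(isinstance(v, float) and math.isnan(v) for v in values)


print(same_object, other_object)
print(has_nan([float("nan"), 1]), has_nan([2, 1]))
```

A value-based check like `has_nan` does not depend on which NaN object ended up in the list, which is why it is the safer test.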

jreback · 2014-07-16T20:50:48Z

see here: jreback@1fc80e6

I don't buy the 2 NaN output case. very odd thing to do. A nan is a nan

jankatins · 2014-07-16T20:51:56Z

This currently does not work:

cat = pd.Categorical([1,2,3, np.nan], levels=[1,2,3])
cat.levels = [1,2,3, np.nan]
cat[1] = np.nan
exp = np.array([0,3,2,-1])
self.assert_numpy_array_equal(cat.codes, exp)

I'm not sure if this here is a bug or actually as intended:

In[18]: idx = _ensure_index([1,2,3,np.nan]) 
In[19]: idx
Out[19]: Float64Index([1.0, 2.0, 3.0, nan], dtype='float64')
In[20]: idx.get_indexer([np.nan])
Out[20]: array([-1])

jreback · 2014-07-16T20:59:49Z

with a Floatindex u can't do that
u have to use hasnans and a check indexer

u really can only have 1 nan if you intend to do any kind of indexing operation, else how would u know where it goes?

it's not a value but an indicator of a missing value

so u can have more than 1 nan but then indexing (eg setitem/getitem) is impossible by value (but u can take by position)
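One way to read the "hasnans and a check indexer" advice is sketched below: look up ordinary values by equality and handle NaN separately by position. This is a stdlib-only illustration, not the pandas implementation. `code_for` is a hypothetical helper and `levels` is a plain list standing in for the index.

```python
import math


def _is_nan(value):
    return isinstance(value, float) and math.isnan(value)


def code_for(levels, value):
    """Return the position of `value` in `levels`, or -1 if it is absent.

    Hypothetical helper: NaN is matched with an explicit NaN check because
    NaN != NaN defeats an ordinary equality lookup.
    """
    if _is_nan(value):
        nan_positions = [i for i, lev in enumerate(levels) if _is_nan(lev)]
        if len(nan_positions) > 1:
            # With several NaN levels there is no way to tell which one is meant.
            raise ValueError("more than one NaN level: lookup by value is ambiguous")
        return nan_positions[0] if nan_positions else -1
    for i, lev in enumerate(levels):
        if lev == value:
            return i
    return -1


levels = [1.0, 2.0, 3.0, float("nan")]
print(code_for(levels, 2.0), code_for(levels, float("nan")), code_for(levels, 9.0))
```

With a second NaN in `levels`, lookup by value raises, while access by position still works. That matches the point made here about setitem/getitem versus take.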

jankatins · 2014-07-16T21:02:30Z

Re "two times nan in describe": R has the same problem with "NA as a level vs NA as missing value": http://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html

If NA is a level, the way to set a code to be missing (as opposed to the code of the missing level) is to use is.na on the left-hand-side of an assignment (as in is.na(f)[i] <- TRUE; indexing inside is.na does not work). Under those circumstances missing values are currently printed as <NA>, i.e., identical to entries of level NA.

R does the same thing, printing NA twice, but once indicating that it is a level <NA> and once that it is missing NA's:

> f <- factor(c(1,2,3,NA), exclude=F)
> f
[1] 1    2    3    <NA>
Levels: 1 2 3 <NA>
> is.na(f)[1] <- TRUE
> f
[1] <NA> 2    3    <NA>
Levels: 1 2 3 <NA>
> table(f)
f
   1    2    3 <NA> 
   0    1    1    1 
> summary(f)
   1    2    3 <NA> NA's 
   0    1    1    1    1

jreback · 2014-07-23T22:38:23Z

pandas/core/categorical.py

@@ -120,9 +120,9 @@ class Categorical(PandasObject):

    Attributes
    ----------
-    levels : ndarray
+    levels : Index


why don't u say both of these are index-like (very confusing before and now)

cat.levels returns an Index, but cat.codes returns a (readonly) numpy.array. codes : index implies certain methods, which are not present!?

So I think this is correct now...

they are both Indexes (at least in the current impl) - they don't necessarily have to be but they are

Not sure what makes it "index-like", but currently cat._levels and the public accessor cat.levels is an index, but cat._codes and the public accessor cat.codes is a np.ndarray:

In[47]: type(pd.Categorical([1]).codes)
Out[47]: numpy.ndarray
In[48]: type(pd.Categorical([1])._codes)
Out[48]: numpy.ndarray
In[49]: type(pd.Categorical([1])._levels)
Out[49]: pandas.core.index.Int64Index

Internally, _codes is mutable (at least in __setitem__()), but _levels is always replaced, so this fits the ndarray vs index.

jankatins · 2014-07-23T22:41:23Z

@jreback Please have a look ... I added a workaround for NaN in index in 39a7531 and implemented all the stuff which came up at the end of #7217.

So, if Travis passes and you are satisfied, I think this is ready to merge.

jreback · 2014-07-23T22:42:14Z

I want to fix the issue with nan first

jankatins · 2014-07-23T22:47:31Z

pandas/core/categorical.py

+                # assignment step.
+                # tuple are list_like but com.isnull(<tuple>) will return a single bool,
+                # which then raises an AttributeError: 'bool' object has no attribute 'any'
+                has_null = (com.is_list_like(values) and not isinstance(values, tuple)


BTW: did I use the right method to test for lists? I had expected that if com.is_list_like(a): np.something(a) would always work, but not if a is a tuple :-(

let me have a look ; you are jumping through too many hoops here

jreback · 2014-08-08T12:23:36Z

@JanSchulz relating to your comment from #7207
What more do you want here? I think this groups very nicely, don't show excess stuff.

In [1]: s = Series(list('aabbccd')).astype('category')

In [2]: s
Out[2]: 
0    a
1    a
2    b
3    b
4    c
5    c
6    d
dtype: category
Levels (4, object): [a < b < c < d]

In [3]: s.values.
s.values.T                     s.values.describe              s.values.from_array            s.values.max                   s.values.order                 s.values.reorder_levels        s.values.take_nd               
s.values.argsort               s.values.dtype                 s.values.from_codes            s.values.min                   s.values.ordered               s.values.shape                 s.values.to_dense              
s.values.codes                 s.values.equals                s.values.get_values            s.values.mode                  s.values.ravel                 s.values.sort                  s.values.unique                
s.values.copy                  s.values.fillna                s.values.levels                s.values.ndim                  s.values.remove_unused_levels  s.values.take                  s.values.view                  

In [3]: s.values.

jankatins · 2014-08-08T16:24:47Z

[The following is just about Series.cat, not about Series.values, which should be the underlying data structure (either numpy.array or pd.Categorical)]

From my perspective, the above has way too many methods visible (and some relevant ones are not even visible): when you have a Series of type category, only a few methods should be available under Series.cat: .levels, .reorder_levels(), .remove_unused_levels() and .codes. Everything else is an implementation detail and the functionality is exposed via the normal Series methods (unique, min, max,...).

Only declaring the above 4 methods/properties as API (and only exposing them) will mean that we can switch to a numpy category implementation (which I expect to happen) much more easily without breaking user code and it makes for a nicer tab completion...

jreback · 2014-08-08T16:32:16Z

numpy cat implementation is pie-in-the-sky. if it happens great. but don't hold your breath. And if it actually does, and we actually switch to it (a big IF, why would it really be any different/better?), so we make a change, so what.

what methods are relevant but not visible?

what about .codes,fillna,copy,astype these seem relevant?

yes they are available (and delegate from series). but as a user wouldn't I expect to see what I can do with the object?

jankatins · 2014-08-08T16:58:35Z

.codes is, but .fillna is done via Series.fillna() and not via Series.cat.fillna() (same for copy and astype). If you know what you are doing, fine, just go for Series.values and do whatever you want with it, but you also have to watch out for bugs then.

From my perspective (using pandas for some not-so-advanced data munging and statistics) I never noticed how numpy arrays are wrapped until I worked on Categorical, so I also never tried to work with Series.values and I also think it would be wrong to use methods on the numpy array directly because that would (as far as my understanding goes) lead to strange results like in the Series(numpyarray)-case where you can manipulate the numpyarray and the Series changes in some cases (depending on view or not).

I'm also a fan of only promoting to API as few methods/properties as possible to have some room for future developments, but that's probably not as problematic in python as it is in java.

jreback · 2014-08-08T17:04:17Z

ok, I am on-board with this then. this is actually really easy,

just provide a new Categorical.__dir__() with the methods/properties you want (just a list of strings)

It was a bit non-trivial with datetime, because ipython searches all of the base classes (which complicates things). Here the base class is pretty trivial so it's not an issue (and they ignore anything that starts with '_').

and need a test as well (see what I did for the .dt tests)
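The `__dir__` idea can be sketched in a few lines. This is a toy class, not the pandas `Categorical`: the attribute `_public_api` and the method bodies are assumptions made for illustration. It shows that `dir()` (and therefore tab completion that relies on it) reports only the listed names, while everything else stays callable.

```python
class Categorical:
    """Toy stand-in for the real class; only the __dir__ idea matters here."""

    # Names that should show up in tab completion (hypothetical selection).
    _public_api = ["levels", "codes", "remove_unused_levels"]

    def __init__(self, levels, codes):
        self.levels = list(levels)
        self.codes = list(codes)

    def remove_unused_levels(self):
        # Drop levels that no code points to and renumber the codes.
        used = sorted(set(self.codes))
        remap = {old: new for new, old in enumerate(used)}
        self.levels = [self.levels[i] for i in used]
        self.codes = [remap[c] for c in self.codes]

    def to_dense(self):
        # Still callable, just not advertised by dir().
        return [self.levels[c] for c in self.codes]

    def __dir__(self):
        # Just a list of strings; dir() sorts whatever is returned here.
        return list(self._public_api)


cat = Categorical(["a", "b", "c"], [0, 1, 0])
print(dir(cat))
print(cat.to_dense())
```

A test along the lines of the `.dt` tests could then simply compare `dir(cat)` against the expected list of names.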

jankatins · 2014-08-08T17:08:39Z

But that has a problem too: Looking at the Categorical object is like looking at the numpy array and that object should expose every method. So s.cat.<tab> should give you a different list of methods than s.values.<tab> or Categorical(...).<tab>. So I still think it makes sense to replace s.cat with a similar object, as is now done with s.dt (which has the same objective of hiding the internals of the implementation from the user).

jreback · 2014-08-08T17:24:39Z

oh, I c what you want to do, like a CategoricalDelegate or something that is like DatetimelikeProperties, delegating as needed (but this is only for a limited purpose).

ok then go for it, should be straightforward.

jankatins · 2014-08-08T17:30:30Z

I'm currently trying out my idea of generating the dt accessors, and if that works I will base the cat accessor on that.

jreback · 2014-08-08T17:33:03Z

I already did it. https://github.com/pydata/pandas/pull/7953/files

jankatins · 2014-08-08T17:36:37Z

Just saw it. Damn, you are too fast! :-)

If _add_accessors(cls) gets changed to _add_accessors(cls, names), this will work for cat as well. Then it can also be put into a more generic place.
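A minimal sketch of what such a `_add_accessors(cls, names)` could look like follows. It is not the code from the linked PR: the wrapper class, the `_target` attribute, and `FakeCategorical` are hypothetical names chosen for this illustration. Each name in `names` becomes a read-only property that forwards to the wrapped object, so anything not listed is simply absent from the wrapper.

```python
def _add_accessors(cls, names):
    """Attach one read-only property per name, forwarding to `self._target`.

    Hypothetical generalisation of the helper discussed in this thread; the
    attribute name `_target` is an assumption of this sketch.
    """
    def make_property(name):
        def getter(self):
            return getattr(self._target, name)
        getter.__name__ = name
        return property(getter, doc="Delegates to the wrapped object's %s." % name)

    for name in names:
        setattr(cls, name, make_property(name))
    return cls


class CategoricalProperties:
    """Thin wrapper exposing only the names registered below."""

    def __init__(self, target):
        self._target = target


class FakeCategorical:
    """Stand-in for the wrapped object, with one attribute to hide."""
    levels = ["a", "b", "c"]
    codes = [0, 0, 1, 2]
    internal_detail = "not exposed"


_add_accessors(CategoricalProperties, ["levels", "codes"])

cat = CategoricalProperties(FakeCategorical())
print(cat.levels, cat.codes, hasattr(cat, "internal_detail"))
```

Because the list of names is a parameter, the same function could serve both a datetime-like wrapper and a categorical one.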

In the normal constructor `ordered=True` is only assumed if the levels are given or the values are sortable (which is most of the cases), but in `from_codes(...)` we can't assume this so the default should be `False`.

s.values is the underlying Categorical object, s.cat will be changed to only expose the API methods/properties.

….array Categorical can only be compared to another Categorical with the same levels and the same ordering or to a scalar value. If the Categorical has no order defined (cat.ordered == False), only equal (and not equal) are defined.
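The comparison rules in that commit message can be illustrated with a toy class. This is a sketch only: `ToyCategorical` is hypothetical, it stores plain lists, and it assumes a scalar operand is one of the levels. It is not the pandas implementation.

```python
import operator


class ToyCategorical:
    """Toy model of the comparison rules (hypothetical, list-based)."""

    def __init__(self, values, levels, ordered):
        self.levels = list(levels)
        self.ordered = ordered
        self.codes = [self.levels.index(v) for v in values]

    def _compare(self, other, op, needs_order):
        if needs_order and not self.ordered:
            raise TypeError("unordered categorical supports only == and !=")
        if isinstance(other, ToyCategorical):
            if other.levels != self.levels or other.ordered != self.ordered:
                raise TypeError("levels and ordering must match")
            other_codes = other.codes
        else:
            # Scalar operand: assumed to be one of the levels in this sketch.
            other_codes = [self.levels.index(other)] * len(self.codes)
        # Compare position in `levels`, not the raw values.
        return [op(a, b) for a, b in zip(self.codes, other_codes)]

    def __eq__(self, other):
        return self._compare(other, operator.eq, needs_order=False)

    def __ne__(self, other):
        return self._compare(other, operator.ne, needs_order=False)

    def __lt__(self, other):
        return self._compare(other, operator.lt, needs_order=True)

    def __gt__(self, other):
        return self._compare(other, operator.gt, needs_order=True)


sizes = ToyCategorical(["s", "l", "m"], levels=["s", "m", "l"], ordered=True)
colours = ToyCategorical(["red", "blue"], levels=["blue", "red"], ordered=False)
print(sizes > "s")
print(colours == "red")
```

Here `colours < "red"` raises `TypeError` because no order is defined, while `sizes > "s"` compares by level position rather than alphabetically.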

jankatins · 2014-08-11T22:23:47Z

@jreback So, I hope I got every unittest regression... If you (or anybody else :-) ) have any comments, I will address them tomorrow, I'm off to bed... :-)

jreback · 2014-08-11T22:30:32Z

@JanSchulz haha, ok will take a look

jreback · 2014-08-12T01:52:08Z

doc/source/api.rst

@@ -547,7 +553,7 @@ the Categorical back to a numpy array, so levels and order information is not pr
   Categorical.__array__

 To create compatibility with `pandas.Series` and `numpy` arrays, the following (non-API) methods
-are also introduced.


what does this mean?

non-API = they may change, so don't rely on them in production code

jreback · 2014-08-12T13:03:08Z

https://github.com/jreback/pandas/tree/cats

added a couple of commits to fix docs / reorg isnull

otherwise looks ok

give a look and I will rebase and merge

jreback · 2014-08-12T15:26:55Z

I'll have to fix this when we merge: https://travis-ci.org/jreback/pandas/jobs/32332841

jreback · 2014-08-12T16:18:56Z

closing in favor of #8006

jankatins · 2014-08-12T17:12:58Z

Argh, this was so nicely put into different commits :-(

jreback · 2014-08-12T17:13:49Z

doesn't matter it has to get squashed anyhow. (that's just the convention)

jankatins · 2014-08-12T17:29:46Z

Why the squash? I always thought that's because of the problems with rebasing (which I did in my branch via patch export and apply).

I think all commits pass the unittests and only have one logical step, so they are no "work in progress" commits.

jreback · 2014-08-12T17:34:23Z

It's just simpler when looking at the log. It could be a couple, but these are all interrelated so just easier.

jankatins · 2014-08-12T18:12:05Z

I've picked your changes (damn squash :-) ), but not the :okexcept: one, as there are too many raised exceptions in the doc, which makes it kind of unreadable because some stacktraces are just too distracting and long.

Also squashed the whole thing together into one commit which lists the logical commits (i.e. the CLN is part of another commit)

Please pull https://github.com/JanSchulz/pandas/tree/categorical_fixups

jankatins · 2014-08-12T18:13:26Z

It seems that this PR is not updated anymore, should I open another PR? Or can you reopen this one?

jreback · 2014-08-12T18:14:02Z

hmm, why don't you open a new one. not sure why it's not updating.

jankatins · 2014-08-12T18:14:51Z

#8007

jreback added Categorical labels Jul 16, 2014

jreback added this to the 0.15.0 milestone Jul 16, 2014

jreback reviewed Jul 16, 2014

jankatins mentioned this pull request Jul 16, 2014

Categoricals with NaNs #3678

Closed

jankatins changed the title ~~Categorical: preserve ints when NaN are present~~ Categorical and NaN fixups Jul 16, 2014

This was referenced Jul 23, 2014

Index([1,2,.np.nan]).get_indexer([np.nan]) returns wrong value? #7820

Closed

WIP: categoricals as an internal CategoricalBlock GH5313 #7217

Merged

jankatins changed the title ~~Categorical and NaN fixups~~ Categorical fixups Jul 23, 2014

jreback reviewed Jul 23, 2014

jankatins reviewed Jul 23, 2014

jreback mentioned this pull request Jul 26, 2014

BUG: fix multi-column sort that includes Categoricals / concat (GH7848/GH7864) #7850

Merged

has2k1 mentioned this pull request Jul 29, 2014

pd.concat reordering categorical levels lexically #7864

Closed

jankatins mentioned this pull request Aug 8, 2014

API: revisit adding datetime-like ops in Series #7207

Closed

jankatins added 6 commits August 12, 2014 00:18

API: change default Categorical.from_codes() to ordered=False

65d9d6e

In the normal constructor `ordered=True` is only assumed if the levels are given or the values are sortable (which is most of the cases), but in `from_codes(...)` we can't assume this so the default should be `False`.

Categorical: add some links to Categorical in the other docs

5c4f1bd

Categorical: use s.values when calling private methods

0438a30

s.values is the underlying Categorical object, s.cat will be changed to only expose the API methods/properties.

Categorical: Change series.cat to only expose the API

19f4d46

Categorical: Fix order and na_position

47953a2

jreback reviewed Aug 12, 2014

jreback mentioned this pull request Aug 12, 2014

CLN/DOC/TST: categorical fixups (GH7768) #8006

Closed

jreback closed this Aug 12, 2014

jankatins mentioned this pull request Aug 12, 2014

CLN/DOC/TST: Categorical fixups (GH7768) #8007

Merged
