Categorical fixups #7768

Closed
wants to merge 10 commits into from

Conversation

jankatins
Contributor

Some fixups for Categoricals.

Fixes: #3678

  • Maybe change Series.cat after the discussion in API: revisit adding datetime-like ops in Series #7207 ?
  • remove Series.cat from tab completion if Series is not of dtype category
  • fix for the "FIXME" in unittests
  • Look at problems in docs (-> hdf support)
  • Fixup Comparison thingies...

@jreback jreback added this to the 0.15.0 milestone Jul 16, 2014
@@ -220,9 +220,17 @@ def __init__(self, values, levels=None, ordered=None, name=None, fastpath=False,
inferred = com._possibly_infer_to_datetimelike(values)
if not isinstance(inferred, np.ndarray):
from pandas.core.series import _sanitize_array
values = _sanitize_array(values, None)
safe_dtype = None
if isinstance(values, list) and np.nan in values:
Contributor

I think you have to do isnull(values).any() instead of np.nan in values

Contributor Author

In[47]: np.nan in [np.nan, 1]
Out[47]: True
In[48]: np.nan in [2, 1]
Out[48]: False

Not sure what's faster: converting to numpy array and doing the isnull(..).any() check or the "is in list" check.

Contributor

this is too specific a check, instead do this:

dtype = 'object' if isnull(values).any() else None
values = _sanitize_array(values, dtype=dtype)
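
To illustrate why the `in` check is considered too specific, here is a minimal sketch, assuming a recent pandas and NumPy. List membership matches by identity first, so it only finds the `np.nan` singleton itself, while `pd.isnull(...).any()` detects any missing value. The helper name `pick_dtype` is hypothetical and only mirrors the suggestion above.

```python
import numpy as np
import pandas as pd

def pick_dtype(values):
    """Return 'object' when any element is missing, else None (let pandas infer)."""
    return "object" if pd.isnull(values).any() else None

same_object = [1, 2, np.nan]          # holds the np.nan singleton itself
other_object = [1, 2, float("nan")]   # holds a different NaN object

in_same = np.nan in same_object       # found through the identity check
in_other = np.nan in other_object     # not found: different object, and nan != nan

dtype_for_other = pick_dtype(other_object)
dtype_for_clean = pick_dtype([1, 2])
```

The `in` test misses the second list even though it contains a NaN, which is the case the `isnull` based check covers.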

Contributor Author

done

@jreback
Contributor

jreback commented Jul 16, 2014

see here: jreback@1fc80e6

I don't buy the 2 NaN output case. very odd thing to do. A nan is a nan

@jankatins
Contributor Author

This currently does not work:

cat = pd.Categorical([1,2,3, np.nan], levels=[1,2,3])
cat.levels = [1,2,3, np.nan]
cat[1] = np.nan
exp = np.array([0,3,2,-1])
self.assert_numpy_array_equal(cat.codes, exp)

I'm not sure if this here is a bug or actually as intended:

In[18]: idx = _ensure_index([1,2,3,np.nan]) 
In[19]: idx
Out[19]: Float64Index([1.0, 2.0, 3.0, nan], dtype='float64')
In[20]: idx.get_indexer([np.nan])
Out[20]: array([-1])

@jreback
Contributor

jreback commented Jul 16, 2014

With a Float64Index you can't do that; you have to use hasnans and check the indexer.

You really can only have one NaN if you intend to do any kind of indexing operation, else how would you know where it goes?

It's not a value but an indicator of a missing value.

So you can have more than one NaN, but then indexing (e.g. setitem/getitem) by value is impossible (but you can take by position).
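
A minimal sketch of that idea, assuming a recent pandas: use `hasnans` as a cheap guard, then locate the NaN by position with `isnull` instead of looking it up by value. The function name `nan_position` is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd

def nan_position(index):
    """Return the position of the single NaN in `index`, or -1 if there is none."""
    if not index.hasnans:
        return -1
    positions = np.flatnonzero(pd.isnull(index))
    if len(positions) > 1:
        # With several NaNs a lookup by value has no unique answer.
        raise ValueError("more than one NaN: lookup by value is ambiguous")
    return int(positions[0])

idx = pd.Index([1.0, 2.0, 3.0, np.nan])
position = nan_position(idx)
no_nan_position = nan_position(pd.Index([1.0, 2.0]))
```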

@jankatins
Contributor Author

Re "two times nan in describe": R has the same problem with "NA as a level vs NA as missing value": http://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html

If NA is a level, the way to set a code to be missing (as opposed to the code of the missing level) is to use is.na on the left-hand-side of an assignment (as in is.na(f)[i] <- TRUE; indexing inside is.na does not work). Under those circumstances missing values are currently printed as <NA>, i.e., identical to entries of level NA.

R does the same thing, printing NA twice, but once indicating that it is a level <NA> and once that it is missing NA's:

> f <- factor(c(1,2,3,NA), exclude=F)
> f
[1] 1    2    3    <NA>
Levels: 1 2 3 <NA>
> is.na(f)[1] <- TRUE
> f
[1] <NA> 2    3    <NA>
Levels: 1 2 3 <NA>
> table(f)
f
   1    2    3 <NA> 
   0    1    1    1 
> summary(f)
   1    2    3 <NA> NA's 
   0    1    1    1    1 

@jankatins jankatins changed the title from "Categorical: preserve ints when NaN are present" to "Categorical and NaN fixups" Jul 16, 2014
@jankatins jankatins changed the title from "Categorical and NaN fixups" to "Categorical fixups" Jul 23, 2014
@@ -120,9 +120,9 @@ class Categorical(PandasObject):

Attributes
----------
levels : ndarray
levels : Index
Contributor

why don't u say both of these are index-like (very confusing before and now)

Contributor Author

cat.levels returns an Index, but cat.codes returns a (readonly) numpy.array. codes : index implies certain methods, which are not present!?

So I think this is correct now...

Contributor

they are both Indexes (at least in the current impl) - they don't necessarily have to be but they are

Contributor Author

Not sure what makes it "index-like", but currently cat._levels and the public accessor cat.levels is an index, but cat._codes and the public accessor cat.codes is a np.ndarray:

In[47]: type(pd.Categorical([1]).codes)
Out[47]: numpy.ndarray
In[48]: type(pd.Categorical([1])._codes)
Out[48]: numpy.ndarray
In[49]: type(pd.Categorical([1])._levels)
Out[49]: pandas.core.index.Int64Index

Internally, _codes is mutable (at least in __setitem__()), but _levels is always replaced, so this fits the ndarray vs index.
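
The same check can be repeated as a small script. This is a sketch that assumes a recent pandas, where the attribute discussed here as `levels` is exposed under the name `categories`; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical([1, 2, 1])

# Integer positions into the categories, returned as a plain NumPy array.
codes = cat.codes
# The distinct values, returned as a pandas Index.
categories = cat.categories

codes_is_ndarray = isinstance(codes, np.ndarray)
categories_is_index = isinstance(categories, pd.Index)
```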

@jankatins
Contributor Author

@jreback Please have a look ... I added a workaround for NaN in index in 39a7531 and implemented all the stuff which came up at the end of #7217.

So, if Travis passes and you are satisfied, I think this is ready to merge.

@jreback
Contributor

jreback commented Jul 23, 2014

I want to fix the issue with nan first

# assignment step.
# tuple are list_like but com.isnull(<tuple>) will return a single bool,
# which then raises an AttributeError: 'bool' object has no attribute 'any'
has_null = (com.is_list_like(values) and not isinstance(values, tuple)
Contributor Author

BTW: did I use the right method to test for lists? I had expected that if com.is_list_like(a): np.something(a) would always work, but not if a is a tuple :-(

Contributor

let me have a look ; you are jumping through too many hoops here
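
One way to avoid special-casing tuples is to normalise every list-like input to an object array before the null check. The sketch below assumes a recent pandas; `has_null` is a hypothetical helper, not the code that was eventually merged.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_list_like

def has_null(values):
    """Return True if a scalar is missing or a list-like contains a missing value."""
    if not is_list_like(values):
        return bool(pd.isnull(values))
    # Converting first means lists, tuples and arrays all take the same path.
    as_array = np.asarray(list(values), dtype=object)
    return bool(pd.isnull(as_array).any())

tuple_result = has_null((1, np.nan))
list_result = has_null([1, 2])
scalar_result = has_null(np.nan)
```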

@jreback
Contributor

jreback commented Aug 8, 2014

@JanSchulz relating to your comment from #7207
What more do you want here? I think this groups very nicely, don't show excess stuff.

In [1]: s = Series(list('aabbccd')).astype('category')

In [2]: s
Out[2]: 
0    a
1    a
2    b
3    b
4    c
5    c
6    d
dtype: category
Levels (4, object): [a < b < c < d]

In [3]: s.values.
s.values.T                     s.values.describe              s.values.from_array            s.values.max                   s.values.order                 s.values.reorder_levels        s.values.take_nd               
s.values.argsort               s.values.dtype                 s.values.from_codes            s.values.min                   s.values.ordered               s.values.shape                 s.values.to_dense              
s.values.codes                 s.values.equals                s.values.get_values            s.values.mode                  s.values.ravel                 s.values.sort                  s.values.unique                
s.values.copy                  s.values.fillna                s.values.levels                s.values.ndim                  s.values.remove_unused_levels  s.values.take                  s.values.view                  

In [3]: s.values.

@jankatins
Contributor Author

[The following is just about Series.cat, not about Series.values, which should be the underlying data structure (either numpy.array or pd.Categorical).]

From my perspective, the above has way too many methods visible (and some relevant ones are not even visible): when you have a Series of type category, only a few methods should be available under Series.cat: .levels, .reorder_levels(), .remove_unused_levels() and .codes. Everything else is an implementation detail and the functionality is exposed via the normal Series methods (unique, min, max,...).

Only declaring the above 4 methods/properties as API (and only exposing them) will mean that we can switch to a numpy category implementation (which I expect to happen) much more easily without breaking user code and it makes for a nicer tab completion...

@jreback
Contributor

jreback commented Aug 8, 2014

numpy cat implementation is pie-in-the-sky. if it happens great. but don't hold your breath. And if it actually does, and we actually switch to it (a big IF, why would it really be any different/better?), so we make a change, so what.

what methods are relevant but not visible?

what about .codes,fillna,copy,astype these seem relevant?

yes they are available (and delegate from series). but as a user wouldn't I expect to see what I can do with the object?

@jankatins
Contributor Author

.codes is, but .fillna is done via Series.fillna() and not via Series.cat.fillna() (same for copy and astype). If you know what you are doing, fine, just go for Series.values and do whatever you want with it, but you also have to watch out for bugs then.

From my perspective (using pandas for some not-so-advanced data munging and statistics) I never noticed how numpy arrays are wrapped until I worked on Categorical, so I also never tried to work with Series.values and I also think it would be wrong to use methods on the numpy array directly because that would (as far as my understanding goes) lead to strange results like in the Series(numpyarray)-case where you can manipulate the numpyarray and the Series changes in some cases (depending on view or not).

I'm also a fan of only promoting to API as few methods/properties as possible to have some room for future developments, but that's probably not as problematic in python as it is in java.

@jreback
Contributor

jreback commented Aug 8, 2014

ok, I am on-board with this then. this is actually really easy,

just provide a new Categorical.__dir__() with the methods/properties you want (just a list of strings)

It was a bit non-trivial with datetime, because IPython searches all of the base classes (which complicates things). Here the base class is pretty trivial, so it's not an issue (and they ignore anything that starts with '_').

and need a test as well (see what I did for the .dt tests)
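
The `__dir__` suggestion can be sketched in plain Python. The class below is a hypothetical stand-in, not the pandas `Categorical`; it only shows that `dir()` (and therefore tab completion that relies on it) reports the names returned by `__dir__`, while other attributes stay reachable.

```python
class LimitedDir:
    """Stand-in object that advertises only a few names to dir()."""

    _public_names = ("levels", "codes", "reorder_levels", "remove_unused_levels")

    def __init__(self):
        # Still accessible, just not advertised.
        self.internal_detail = 42

    def __dir__(self):
        # A plain list of strings is enough; dir() sorts the result.
        return list(self._public_names)

obj = LimitedDir()
visible = dir(obj)
hidden_value = obj.internal_detail
```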

@jankatins
Contributor Author

But that has a problem too: Looking at the Categorical object is like looking at the numpy array and that object should expose every method. So s.cat.<tab> should give you a different list of methods than s.values.<tab> or Categorical(...).<tab>. So I still think it makes sense to replace s.cat with a similar object like it is now done with s.dt (which has the same objective to hide the internals of the implementation from the user).

@jreback
Contributor

jreback commented Aug 8, 2014

oh, I see what you want to do, like a CategoricalDelegate or something that is like DatetimelikeProperties, delegating as needed (but this is only for a limited purpose).

ok then go for it, should be straightforward.
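
A rough sketch of such a delegate, assuming a recent pandas where the relevant attributes are named `categories`, `codes`, `reorder_categories` and `remove_unused_categories`. The class is hypothetical and is not the accessor pandas ships; it only shows the delegation pattern.

```python
import pandas as pd

class CategoricalDelegate:
    """Hypothetical accessor exposing a whitelisted subset of a Categorical."""

    _allowed = ("categories", "codes", "remove_unused_categories", "reorder_categories")

    def __init__(self, categorical):
        self._categorical = categorical

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name in self._allowed:
            return getattr(self._categorical, name)
        raise AttributeError(f"{type(self).__name__!r} has no attribute {name!r}")

    def __dir__(self):
        return list(self._allowed)

accessor = CategoricalDelegate(pd.Categorical(list("aab"), categories=list("abc")))
all_categories = list(accessor.categories)
used_categories = list(accessor.remove_unused_categories().categories)
```

Anything outside the whitelist, such as `fillna`, raises `AttributeError` on the delegate, while the wrapped object itself is unchanged.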

@jankatins
Contributor Author

I'm currently trying my idea with the "generate the dt accessors" approach, and if that works I will base the cat accessor on that.

@jreback
Contributor

jreback commented Aug 8, 2014

@jankatins
Contributor Author

Just saw it. Damn, you are too fast! :-)

If _add_accessors(cls) gets changed to _add_accessors(cls, names), this will work for cat as well. Then it can also be put into a more generic place.
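
What a name-parameterised helper could look like, as a plain-Python sketch. `add_accessors`, `Accessor` and `Target` are hypothetical names for illustration; the actual `_add_accessors` in the referenced commit is not shown in this thread and may differ.

```python
def add_accessors(cls, names):
    """Attach one read-only delegating property per name to `cls`."""

    def make_property(name):
        # A factory function so each property captures its own name.
        def getter(self):
            return getattr(self._target, name)

        getter.__name__ = name
        return property(getter)

    for name in names:
        setattr(cls, name, make_property(name))
    return cls

class Target:
    levels = ["a", "b"]
    codes = [0, 1, 0]

class Accessor:
    def __init__(self, target):
        self._target = target

add_accessors(Accessor, ["levels", "codes"])
accessor = Accessor(Target())
```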

In the normal constructor `ordered=True` is only assumed if the levels
are given or the values are sortable (which is most of the cases), but
in `from_codes(...)` we can't assume this so the default should be
`False`.
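
For illustration, a small `from_codes` sketch, assuming a recent pandas where the keyword is `categories` rather than `levels`. In current pandas the regular constructor is also unordered unless `ordered=True` is passed, so this only shows the `from_codes` side of the commit message.

```python
import pandas as pd

# -1 is the code for a missing value.
cat = pd.Categorical.from_codes([0, 1, 0, -1], categories=["a", "b"])
ordered_cat = pd.Categorical.from_codes(
    [0, 1, 0, -1], categories=["a", "b"], ordered=True
)

default_is_ordered = bool(cat.ordered)
explicit_is_ordered = bool(ordered_cat.ordered)
last_is_missing = bool(pd.isnull(cat[3]))
```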
s.values is the underlying Categorical object, s.cat will be changed
to only expose the API methods/properties.
….array

Categorical can only be compared to another Categorical with the same levels
and the same ordering or to a scalar value.

If the Categorical has no order defined (cat.ordered == False), only equal
(and not equal) are defined.
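
These comparison rules can be exercised with a short sketch, assuming a recent pandas (keyword `categories` instead of `levels`). The helper `raises_type_error` is hypothetical and only used to record which comparisons are rejected.

```python
import pandas as pd

levels = ["a", "b"]
unordered = pd.Categorical(["a", "b", "a"], categories=levels)
ordered = pd.Categorical(["a", "b", "a"], categories=levels, ordered=True)
other_levels = pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"], ordered=True)

def raises_type_error(operation):
    """Return True if calling `operation` raises TypeError."""
    try:
        operation()
    except TypeError:
        return True
    return False

equal_mask = unordered == "a"   # equality works without an order
less_mask = ordered < "b"       # ordering comparisons need ordered=True
unordered_lt_fails = raises_type_error(lambda: unordered < "b")
mismatch_fails = raises_type_error(lambda: ordered == other_levels)
```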
@jankatins
Contributor Author

@jreback So, I hope I got every unittest regression... If you (or anybody else :-) ) have any comments, I will address them tomorrow, I'm off to bed... :-)

@jreback
Contributor

jreback commented Aug 11, 2014

@JanSchulz haha, ok will take a look

@@ -547,7 +553,7 @@ the Categorical back to a numpy array, so levels and order information is not pr
Categorical.__array__

To create compatibility with `pandas.Series` and `numpy` arrays, the following (non-API) methods
are also introduced.
Contributor

what does this mean?

Contributor Author

non-API = they may change, so don't rely on them in production code

@jreback
Contributor

jreback commented Aug 12, 2014

https://github.com/jreback/pandas/tree/cats

added a couple of commits to fix docs / reorg isnull

otherwise looks ok

give a look and I will rebase and merge

@jreback
Contributor

jreback commented Aug 12, 2014

I'll have to fix this when we merge: https://travis-ci.org/jreback/pandas/jobs/32332841

@jreback
Contributor

jreback commented Aug 12, 2014

closing in favor of #8006

@jreback jreback closed this Aug 12, 2014
@jankatins
Contributor Author

Argh, this was so nicely put into different commits :-(

@jreback
Contributor

jreback commented Aug 12, 2014

doesn't matter it has to get squashed anyhow. (that's just the convention)

@jankatins
Contributor Author

Why the squash? I always thought that's because of the problems with rebasing (which I did in my branch via patch export and apply).

I think all commits pass the unittests and only have one logical step, so they are no "work in progress" commits.

@jreback
Contributor

jreback commented Aug 12, 2014

It's just simpler when looking at the log. It could be a couple, but these are all interrelated, so it's just easier.

@jankatins
Contributor Author

I've picked your changes (damn squash :-) ), but not the :okexcept: one, as there are too many raised exceptions in the doc which makes it kind of unreadable because some stacktraces are just too distracting and long.

Also squashed the whole together into one commit which lists the logical commits (i.e. the CLN is part of another commit)

Please pull https://github.com/JanSchulz/pandas/tree/categorical_fixups

@jankatins
Contributor Author

It seems that this PR is not updated anymore, should I open another PR? Or can you reopen this one?

@jreback
Contributor

jreback commented Aug 12, 2014

hmm, why don't you open a new one. not sure why it's not updating.

@jankatins
Contributor Author

#8007

Labels
Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions
Development

Successfully merging this pull request may close these issues.

Categoricals with NaNs
2 participants