Categorical fixups #7768
@@ -220,9 +220,17 @@ def __init__(self, values, levels=None, ordered=None, name=None, fastpath=False,
inferred = com._possibly_infer_to_datetimelike(values)
if not isinstance(inferred, np.ndarray):
from pandas.core.series import _sanitize_array
values = _sanitize_array(values, None)
safe_dtype = None
if isinstance(values, list) and np.nan in values:
I think you have to do `isnull(values).any()` instead of `np.nan in values`
In[47]: np.nan in [np.nan, 1]
Out[47]: True
In[48]: np.nan in [2, 1]
Out[48]: False
Not sure what's faster: converting to a numpy array and doing the `isnull(..).any()` check or the "is in list" check.
this is too specific a check, instead do this:
dtype = 'object' if isnull(values).any() else None
values = _sanitize_array(values, dtype=dtype)
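To illustrate why `np.nan in values` is too specific a check: Python's `in` on a list tests identity first and equality second, and NaN never compares equal to itself, so only the very same NaN object is found. A stdlib-only sketch, not the pandas code path; `has_nan` is a made-up helper that stands in for what `isnull(values).any()` does in pandas:

```python
import math

nan_a = float("nan")
nan_b = float("nan")  # a second, distinct NaN object

# `in` finds nan_a only because it is the same object as the list element.
same_object_found = nan_a in [nan_a, 1]
# A different NaN object is missed, because nan_b == nan_a is False.
other_object_found = nan_b in [nan_a, 1]


def has_nan(values):
    """Hypothetical helper: value-based NaN check, independent of object identity."""
    return any(isinstance(v, float) and math.isnan(v) for v in values)


print(same_object_found, other_object_found, has_nan([nan_b, 1]), has_nan([2, 1]))
```

So the membership test in the `In[47]` example above succeeds because both sides are the one `np.nan` object, not because the list "contains a NaN" in general.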
done
see here: jreback@1fc80e6 I don't buy the 2 NaN output case. very odd thing to do. A nan is a nan
This currently does not work:
cat = pd.Categorical([1,2,3, np.nan], levels=[1,2,3])
cat.levels = [1,2,3, np.nan]
cat[1] = np.nan
exp = np.array([0,3,2,-1])
self.assert_numpy_array_equal(cat.codes, exp)
I'm not sure if this here is a bug or actually as intended:
With a Float64Index you can't do that. You really can only have 1 nan if you intend to do any kind of indexing operation, else how would you know where it goes? It is not a value but an indicator of a missing value, so you can have more than 1 nan, but then indexing (e.g. setitem/getitem) is impossible by value (but you can take by position).
Re "two times nan in describe": R has the same problem with "NA as a level vs NA as missing value": http://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html
R does the same thing, printing NA twice, but once indicating that it is a level:
> f <- factor(c(1,2,3,NA), exclude=F)
> f
[1] 1 2 3 <NA>
Levels: 1 2 3 <NA>
> is.na(f)[1] <- TRUE
> f
[1] <NA> 2 3 <NA>
Levels: 1 2 3 <NA>
> table(f)
f
1 2 3 <NA>
0 1 1 1
> summary(f)
1 2 3 <NA> NA's
0 1 1 1 1
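The R output above can be mimicked with a toy codes/levels model to show where the two NA columns come from. This is only a sketch: using `-1` as the "missing" code and the names `MISSING` and `level_label` are assumptions made for illustration, not the pandas or R internals.

```python
import math
from collections import Counter

levels = [1, 2, 3, float("nan")]  # NaN is itself one of the levels (position 3)
MISSING = -1                       # assumed sentinel code for "no value"
codes = [MISSING, 1, 2, 3]         # first entry is missing, last entry uses the NaN level


def level_label(level):
    """Print a NaN level the way R prints an NA level."""
    return "<NA>" if isinstance(level, float) and math.isnan(level) else str(level)


counts = Counter(codes)
# Like table(f): one count per level, including the NaN level.
table = [(level_label(lv), counts.get(i, 0)) for i, lv in enumerate(levels)]
# Like summary(f): the same counts plus a separate tally of missing entries.
summary = table + [("NA's", counts.get(MISSING, 0))]

print(table)
print(summary)
```

The level named `<NA>` and the missing-value tally are two different things here, which is why both show up with a count of 1.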
@@ -120,9 +120,9 @@ class Categorical(PandasObject):

Attributes
----------
levels : ndarray
levels : Index
why don't u say both of these are index-like (very confusing before and now)
`cat.levels` returns an `Index`, but `cat.codes` returns a (readonly) `numpy.array`. `codes : index` implies certain methods, which are not present!?
So I think this is correct now...
they are both Indexes (at least in the current impl) - they don't necessarily have to be but they are
Not sure what makes it "index-like", but currently `cat._levels` and the public accessor `cat.levels` are an index, while `cat._codes` and the public accessor `cat.codes` are a `np.ndarray`:
In[47]: type(pd.Categorical([1]).codes)
Out[47]: numpy.ndarray
In[48]: type(pd.Categorical([1])._codes)
Out[48]: numpy.ndarray
In[49]: type(pd.Categorical([1])._levels)
Out[49]: pandas.core.index.Int64Index
Internally, `_codes` is mutable (at least in `__setitem__()`), but `_levels` is always replaced, so this fits the ndarray vs index split.
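A toy model of the split described in this comment, assuming nothing about the real pandas internals beyond what is said here: levels are an immutable sequence that is only replaced as a whole, while codes are a mutable array that item assignment updates in place. `ToyCodes` is hypothetical and not the pandas class.

```python
class ToyCodes:
    """Hypothetical stand-in for a categorical: immutable levels, mutable codes."""

    def __init__(self, values):
        # Levels: an immutable tuple, only ever replaced wholesale.
        self._levels = tuple(sorted(set(values)))
        # Codes: mutable positions into the levels.
        self._codes = [self._levels.index(v) for v in values]

    @property
    def levels(self):
        return self._levels

    @property
    def codes(self):
        # Hand out a read-only copy so callers cannot mutate the internal list.
        return tuple(self._codes)

    def __setitem__(self, key, value):
        # Only the code at `key` changes; the levels stay untouched.
        self._codes[key] = self._levels.index(value)

    def to_list(self):
        return [self._levels[c] for c in self._codes]


cat = ToyCodes(["b", "a", "c", "a"])
cat[0] = "c"
print(cat.levels, cat.codes, cat.to_list())
```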
I want to fix the issue with nan first
# assignment step.
# tuple are list_like but com.isnull(<tuple>) will return a single bool,
# which then raises an AttributeError: 'bool' object has no attribute 'any'
has_null = (com.is_list_like(values) and not isinstance(values, tuple)
BTW: did I use the right method to test for lists? I had expected that `if com.is_list_like(a): np.something(a)` would always work, but not if `a` is a tuple :-(
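One way to avoid special-casing tuples is to materialise any list-like into a list before the null check. A stdlib-only sketch under that assumption; `is_list_like` here is a simplified stand-in for `com.is_list_like`, and `has_null` is a hypothetical helper, not the code in this PR:

```python
import math


def is_list_like(obj):
    """Simplified stand-in for com.is_list_like: iterable, but not a string."""
    return hasattr(obj, "__iter__") and not isinstance(obj, (str, bytes))


def has_null(values):
    """Return True if any element of a list-like (tuples included) is None or NaN."""
    if not is_list_like(values):
        return False
    # Convert to a list first so lists, tuples and generators are treated alike.
    return any(
        v is None or (isinstance(v, float) and math.isnan(v)) for v in list(values)
    )


print(has_null([1, float("nan")]), has_null((1, float("nan"))), has_null((1, 2)), has_null("abc"))
```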
let me have a look ; you are jumping through too many hoops here
@JanSchulz relating to your comment from #7207
[The following is just about from my perspective, the above has way too many methods visible (and some relevant ones are not even visible): when you have a Series of type Only declaring the above 4 methods/properties as API (and only exposing them) will mean that we can switch to a numpy category implementation (which I expect to happen) much more easily without breaking user code and it makes for a nicer tab completion...
numpy cat implementation is pie-in-the-sky. if it happens great. but don't hold your breath. And if it actually does, and we actually switch to it (a big IF, why would it really be any different/better?), so we make a change, so what. what methods are relevant but not visible? what about yes they are available (and delegate from series). but as a user wouldn't I expect to see what I can do with the object?
From my perspective (using pandas for some not-so-advanced data munging and statistics) I never noticed how numpy arrays are wrapped until I worked on Categorical, so I also never tried to work with I'm also a fan of only promoting to API as few methods/properties as possible to have some room for future developments, but that's probably not as problematic in Python as it is in Java.
ok, I am on-board with this then. this is actually really easy, just provide a new It was a bit non-trivial with datetime, because ipython searches all of the base classes (which complicates things). Here the base class is pretty trivial so it's not an issue (and they ignore anything that starts with '_'). and need a test as well (see what I did for the .dt tests)
But that has a problem too: Looking at the
oh, I c what you want to do, like a ok then go for it, should be straightforward.
I'm currently trying my idea with the "generate the dt accessors" and if that works I will base the cat access on that.
I already did it. https://github.com/pydata/pandas/pull/7953/files
Just saw it. Damn, you are too fast! :-) If
In the normal constructor `ordered=True` is only assumed if the levels are given or the values are sortable (which is most of the cases), but in `from_codes(...)` we can't assume this so the default should be `False`.
s.values is the underlying Categorical object, s.cat will be changed to only expose the API methods/properties.
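A rough sketch of the "only expose the API" idea: an accessor that forwards a whitelist of names to the wrapped object and reports only those names for tab completion. Everything here is hypothetical (`ToyCat`, `CatAccessor`, the `_api` tuple, and treating `take_nd` as non-API); it is not how `s.cat` is implemented in pandas.

```python
class ToyCat:
    """Hypothetical object with a mix of API and internal methods."""

    def __init__(self, levels):
        self.levels = list(levels)
        self.ordered = True

    def reorder_levels(self, new_levels):
        self.levels = list(new_levels)

    def take_nd(self, indexer):  # treated as non-API in this sketch
        return [self.levels[i] for i in indexer]


class CatAccessor:
    """Hypothetical accessor that forwards only whitelisted names."""

    _api = ("levels", "ordered", "reorder_levels")

    def __init__(self, categorical):
        self._categorical = categorical

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. for delegated names.
        if name in self._api:
            return getattr(self._categorical, name)
        raise AttributeError(name)

    def __dir__(self):
        # Tab completion lists just the delegated API.
        return list(self._api)


acc = CatAccessor(ToyCat(["a", "b"]))
acc.reorder_levels(["b", "a"])
print(acc.levels, dir(acc), hasattr(acc, "take_nd"))
```

The underlying object stays fully available (as `s.values` is in the description above), while the accessor keeps the visible surface small.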
….array Categorical can only be compared to another Categorical with the same levels and the same ordering or to a scalar value. If the Categorical has no order defined (cat.ordered == False), only equal (and not equal) are defined.
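The comparison rules in this commit message can be sketched with a toy class. This is an illustration under assumptions, not the pandas implementation: `ComparableCat` is hypothetical, comparisons return plain lists of booleans, and a scalar is required to be one of the levels.

```python
class ComparableCat:
    """Hypothetical categorical that enforces the comparison rules above."""

    def __init__(self, values, levels, ordered):
        self.levels = tuple(levels)
        self.ordered = ordered
        self.codes = [self.levels.index(v) for v in values]

    def _other_codes(self, other):
        if isinstance(other, ComparableCat):
            # Another categorical must share levels and ordering.
            if other.levels != self.levels or other.ordered != self.ordered:
                raise TypeError("levels and ordering must match")
            return other.codes
        # Scalar: assumed to be one of the levels; broadcast its code.
        if other not in self.levels:
            raise TypeError("scalar is not one of the levels")
        return [self.levels.index(other)] * len(self.codes)

    def __eq__(self, other):
        return [a == b for a, b in zip(self.codes, self._other_codes(other))]

    def __ne__(self, other):
        return [a != b for a, b in zip(self.codes, self._other_codes(other))]

    def __lt__(self, other):
        # Ordering comparisons need a defined order.
        if not self.ordered:
            raise TypeError("unordered Categorical supports only == and !=")
        return [a < b for a, b in zip(self.codes, self._other_codes(other))]


sizes = ComparableCat(["s", "l", "m"], levels=["s", "m", "l"], ordered=True)
colors = ComparableCat(["red", "blue"], levels=["blue", "red"], ordered=False)
print(sizes < "m", colors == "red")
try:
    colors < "red"
except TypeError as exc:
    print(exc)
```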
@jreback So, I hope I got every unittest regression... If you (or anybody else :-) ) have any comments, I will address them tomorrow, I'm off to bed... :-)
@JanSchulz haha, ok will take a look
@@ -547,7 +553,7 @@ the Categorical back to a numpy array, so levels and order information is not pr
Categorical.__array__

To create compatibility with `pandas.Series` and `numpy` arrays, the following (non-API) methods
are also introduced.
what does this mean?
non-API = they may change, so don't rely on them in production code
https://github.com/jreback/pandas/tree/cats added a couple of commits to fix docs / reorg isnull. otherwise looks ok. give a look and I will rebase and merge
I'll have to fix this when we merge: https://travis-ci.org/jreback/pandas/jobs/32332841
closing in favor of #8006
Argh, this was so nicely put into different commits :-(
doesn't matter it has to get squashed anyhow. (that's just the convention)
Why the squash? I always thought that's because of the problems with rebasing (which I did in my branch via patch export and apply). I think all commits pass the unittests and only have one logical step, so they are no "work in progress" commits.
It's just simpler when looking at the log. It could be a couple, but these are all interrelated so just easier.
I've picked your changes (damn squash :-) ), but not the Also squashed the whole together into one commit which lists the logical commits (i.e. the CLN is part of another commit) Please pull https://github.com/JanSchulz/pandas/tree/categorical_fixups
It seems that this PR is not updated anymore, should I open another PR? Or can you reopen this one?
hmm, why don't you open a new one. not sure why it's not updating.
Some fixups for Categoricals.
Fixes: #3678