CLN/DOC/TST: Categorical fixups (GH7768) #8007

Merged
merged 1 commit into from
Aug 19, 2014
29 changes: 28 additions & 1 deletion doc/source/10min.rst
@@ -66,7 +66,8 @@ Creating a ``DataFrame`` by passing a dict of objects that can be converted to s
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : 'foo' })
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo' })
df2

Having specific :ref:`dtypes <basics.dtypes>`
@@ -635,6 +636,32 @@ the quarter end:
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts.head()

Categoricals
------------

Since version 0.15, pandas can include categorical data in a ``DataFrame``. For full docs, see the
:ref:`Categorical introduction <categorical>` and the :ref:`API documentation <api.categorical>` .

.. ipython:: python

df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

# convert the raw grades to a categorical
df["grade"] = pd.Categorical(df["raw_grade"])

# Alternative: df["grade"] = df["raw_grade"].astype("category")
df["grade"]

# Rename the levels
df["grade"].cat.levels = ["very good", "good", "very bad"]

# Reorder the levels and simultaneously add the missing levels
df["grade"].cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
df.sort("grade")
df.groupby("grade").size()
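
For readers on a current pandas release, the example above can be sketched roughly as follows. This is not part of the PR: it assumes the later rename of levels to categories (``rename_categories``, ``set_categories``) and ``sort_values`` in place of ``df.sort``. ``observed=False`` is passed explicitly so that empty categories still appear in the group sizes.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ["a", "b", "b", "a", "a", "e"]})

# Convert the raw grades to a categorical column (categories: a, b, e).
df["grade"] = df["raw_grade"].astype("category")

# Rename the categories (assumed modern spelling of "rename the levels").
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

# Reorder the categories and add the missing ones in a single step.
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)

# Sorting follows the category order, not the lexical order of the labels.
sorted_df = df.sort_values("grade")

# Group sizes include the empty categories "bad" and "medium".
counts = df.groupby("grade", observed=False).size()
print(sorted_df)
print(counts)
```

Sorting places the single "very bad" row first because the category order, not the alphabet, drives the comparison.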



Plotting
--------
36 changes: 9 additions & 27 deletions doc/source/api.rst
@@ -521,51 +521,33 @@ Categorical
.. currentmodule:: pandas.core.categorical

If the Series is of dtype ``category``, ``Series.cat`` can be used to access the underlying
``Categorical``. This data type is similar to the otherwise underlying numpy array
and has the following usable methods and properties (all available as
``Series.cat.<method_or_property>``).

``Categorical``. This accessor is similar to ``Series.dt`` or ``Series.str`` and has the
following usable methods and properties (all available as ``Series.cat.<method_or_property>``).

.. autosummary::
:toctree: generated/

Categorical
Categorical.from_codes
Categorical.levels
Categorical.ordered
Categorical.reorder_levels
Categorical.remove_unused_levels
Categorical.min
Categorical.max
Categorical.mode
Categorical.describe

``np.asarray(categorical)`` works by implementing the array interface. Be aware, that this converts
the Categorical back to a numpy array, so levels and order information is not preserved!
The following methods are considered API when using ``Categorical`` directly:

.. autosummary::
:toctree: generated/

Categorical.__array__
Categorical
Categorical.from_codes
Categorical.codes

To create compatibility with `pandas.Series` and `numpy` arrays, the following (non-API) methods
are also introduced.
``np.asarray(categorical)`` works by implementing the array interface. Be aware, that this converts
the Categorical back to a numpy array, so levels and order information is not preserved!

.. autosummary::
:toctree: generated/

Categorical.from_array
Categorical.get_values
Categorical.copy
Categorical.dtype
Categorical.ndim
Categorical.sort
Categorical.equals
Categorical.unique
Categorical.order
Categorical.argsort
Categorical.fillna

Categorical.__array__
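
The three entry points listed as API here (the constructor, ``from_codes`` and ``codes``) and the array-interface caveat can be illustrated with a small sketch. It assumes a current pandas release, where the ``levels`` keyword used elsewhere in this PR is spelled ``categories``; the label values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Build a Categorical from integer codes plus an explicit category list.
cat = pd.Categorical.from_codes(
    [0, 1, 0, 2], categories=["low", "mid", "high"], ordered=True
)

# The integer codes are the positions into the category list.
codes = cat.codes

# np.asarray goes through Categorical.__array__ and returns a plain
# NumPy array: the category list and the ordering flag are gone.
arr = np.asarray(cat)

print(codes)
print(arr)
```

The resulting array holds only the values, so any code that needs the ordering has to keep the ``Categorical`` itself around.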

Plotting
~~~~~~~~
111 changes: 93 additions & 18 deletions doc/source/categorical.rst
@@ -90,6 +90,7 @@ By using some special functions:
df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
df.head(10)

See :ref:`documentation <reshaping.tile.cut>` for :func:`~pandas.cut`.
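
A self-contained sketch of the ``pd.cut`` call shown in this hunk follows. The definition of ``labels`` and the contents of ``df`` lie outside the diff context, so the ones below are assumptions chosen only so that the snippet runs.

```python
import pandas as pd

# Hypothetical input data; the real frame is defined above the hunk.
df = pd.DataFrame({"value": [3, 15, 27, 99]})

# Hypothetical labels: one per bin, ten bins in total.
labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]

# right=False makes each bin closed on the left: [0, 10), [10, 20), ...
df["group"] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)

print(df)
print(df["group"].dtype)
```

The new column has the ``category`` dtype, and the bins keep their natural order.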

`Categoricals` have a specific ``category`` :ref:`dtype <basics.dtypes>`:

@@ -331,6 +332,57 @@ Operations

The following operations are possible with categorical data:

Comparing `Categoricals` with other objects is possible in two cases:
Member:

for code, you should use double backticks (````) instead of a single one to get it rendered as code.

BTW, I think it is possible to set sphinx to render single backticks the same as double ones if we would want that.

Contributor Author:

currently this is done as "single quotes for names/types" and "double quotes for code". As such the above Categoricals is consistent.

Is that rule not right? If so most of the rst has to change.

Member:

I should check it, but I think in most of our docs, also double quotes are used for names/types, or no quotes at all (eg Series and DataFrame are often put in double quotes a few times, but then for the rest just used as ordinary text.).


* comparing a `Categorical` to another `Categorical`, when `levels` and `ordered` are the same, or
* comparing a `Categorical` to a scalar.

All other comparisons will raise a TypeError.

.. ipython:: python

cat = pd.Series(pd.Categorical([1,2,3], levels=[3,2,1]))
cat_base = pd.Series(pd.Categorical([2,2,2], levels=[3,2,1]))
cat_base2 = pd.Series(pd.Categorical([2,2,2]))

cat
cat_base
cat_base2

Comparing to a categorical with the same levels and ordering or to a scalar works:

.. ipython:: python

cat > cat_base
cat > 2

This doesn't work because the levels are not the same:

.. ipython:: python

try:
    cat > cat_base2
except TypeError as e:
    print("TypeError: " + str(e))

.. note::

Comparisons with `Series`, `np.array` or a `Categorical` with different levels or ordering
will raise a `TypeError` because custom level ordering would result in two valid results:
one taking the ordering into account and one without. If you want to compare a `Categorical`
with such a type, you need to be explicit and convert the `Categorical` to values:

.. ipython:: python

base = np.array([1,2,3])

try:
    cat > base
except TypeError as e:
    print("TypeError: " + str(e))

np.asarray(cat) > base
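
The comparison rules above can be restated as a runnable sketch for a current pandas release. Two things are assumptions relative to the PR text: the ``levels`` keyword is spelled ``categories``, and ``ordered=True`` is passed because later releases only allow ``>`` and ``<`` on ordered categoricals.

```python
import numpy as np
import pandas as pd

# Custom order 3 < 2 < 1, as in the example above.
cat = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))
cat_base = pd.Series(pd.Categorical([2, 2, 2], categories=[3, 2, 1], ordered=True))
cat_base2 = pd.Series(pd.Categorical([2, 2, 2], ordered=True))

# Same categories and ordering: the comparison uses the custom order.
same_levels = cat > cat_base

# Comparing to a scalar that is one of the categories also works.
scalar = cat > 2

# Different categories: the comparison is refused.
try:
    cat > cat_base2
    raised = False
except TypeError as e:
    raised = True
    print("TypeError: " + str(e))

# Being explicit: compare the plain values, ignoring the custom order.
values_cmp = np.asarray(cat) > np.array([1, 2, 3])
```

With the custom order only the value ``1`` ranks above ``2``, which is why the categorical comparison and the plain value comparison disagree.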

Getting the minimum and maximum, if the categorical is ordered:

.. ipython:: python
@@ -489,34 +541,38 @@ but the levels of these `Categoricals` need to be the same:

.. ipython:: python

cat = pd.Categorical(["a","b"], levels=["a","b"])
vals = [1,2]
df = pd.DataFrame({"cats":cat, "vals":vals})
res = pd.concat([df,df])
res
res.dtypes
cat = pd.Categorical(["a","b"], levels=["a","b"])
vals = [1,2]
df = pd.DataFrame({"cats":cat, "vals":vals})
res = pd.concat([df,df])
res
res.dtypes

df_different = df.copy()
df_different["cats"].cat.levels = ["a","b","c"]
In this case the levels are not the same and so an error is raised:

try:
pd.concat([df,df])
except ValueError as e:
print("ValueError: " + str(e))
.. ipython:: python

df_different = df.copy()
df_different["cats"].cat.levels = ["a","b","c"]
try:
    pd.concat([df,df_different])
except ValueError as e:
    print("ValueError: " + str(e))

The same applies to ``df.append(df)``.
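
The ``ValueError`` described above reflects pandas at the time of this PR. As a hedged sketch for current releases, where to the best of our knowledge concatenating mismatched categories no longer raises but silently drops the ``category`` dtype, the following assumes the ``categories`` spelling and uses ``pandas.api.types.union_categoricals`` as one possible way to keep the dtype; neither appears in the PR itself.

```python
import pandas as pd
from pandas.api.types import union_categoricals

cat = pd.Categorical(["a", "b"], categories=["a", "b"])
df = pd.DataFrame({"cats": cat, "vals": [1, 2]})

# Identical categories: the result column stays categorical.
same = pd.concat([df, df])

# A copy whose column has one extra category.
df_different = df.copy()
df_different["cats"] = df_different["cats"].cat.set_categories(["a", "b", "c"])

# Mismatched categories: the result column is no longer categorical.
mixed = pd.concat([df, df_different])

same_is_cat = isinstance(same["cats"].dtype, pd.CategoricalDtype)
mixed_is_cat = isinstance(mixed["cats"].dtype, pd.CategoricalDtype)

# Combining the two columns explicitly keeps a categorical result.
combined = union_categoricals([df["cats"], df_different["cats"]])
print(same_is_cat, mixed_is_cat, list(combined.categories))
```

Checking the dtype after a concat is a cheap guard against losing the categorical information by accident.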

Getting Data In/Out
-------------------

Writing data (`Series`, `Frames`) to a HDF store that contains a ``category`` dtype will currently raise ``NotImplementedError``.
Writing data (`Series`, `Frames`) to a HDF store that contains a ``category`` dtype will currently
raise ``NotImplementedError``.

Writing to a CSV file will convert the data, effectively removing any information about the
`Categorical` (levels and ordering). So if you read back the CSV file you have to convert the
relevant columns back to `category` and assign the right levels and level ordering.
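
A minimal round-trip sketch of the CSV behaviour described above, assuming a current pandas release (``io.StringIO`` instead of ``pandas.compat.StringIO``, and ``pd.CategoricalDtype`` to restore categories and ordering in one step). The column name and category list are illustrative only.

```python
import io

import pandas as pd

cat_type = pd.CategoricalDtype(categories=["b", "a", "c"], ordered=True)
df = pd.DataFrame({"cats": pd.Series(["a", "b", "b", "a"], dtype=cat_type)})

# Write to an in-memory CSV and read it back.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df2 = pd.read_csv(buf)

# The values survive, but the category dtype, the unused category "c"
# and the ordering do not.
lost = not isinstance(df2["cats"].dtype, pd.CategoricalDtype)

# Convert back explicitly, supplying categories and ordering again.
restored = df2["cats"].astype(cat_type)
print(lost)
print(restored.cat.categories)
```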

.. ipython:: python
:suppress:
:suppress:

from pandas.compat import StringIO

Expand Down Expand Up @@ -548,7 +604,7 @@ default not included in computations. See the :ref:`Missing Data section
<missing_data>`

There are two ways a `np.nan` can be represented in `Categorical`: either the value is not
available or `np.nan` is a valid level.
available ("missing value") or `np.nan` is a valid level.

.. ipython:: python

@@ -560,9 +616,25 @@ available or `np.nan` is a valid level.
s2.cat.levels = [1,2,np.nan]
s2
# three levels, np.nan included
# Note: as int arrays can't hold NaN the levels were converted to float
# Note: as int arrays can't hold NaN the levels were converted to object
s2.cat.levels

.. note::
Missing value methods like ``isnull`` and ``fillna`` will take both missing values as well as
`np.nan` levels into account:

.. ipython:: python

c = pd.Categorical(["a","b",np.nan])
c.levels = ["a","b",np.nan]
# will be inserted as a NA level:
c[0] = np.nan
s = pd.Series(c)
s
pd.isnull(s)
s.fillna("a")
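
For comparison, a sketch of the missing-value methods on a current pandas release. It covers only the "missing value" case: later releases no longer accept ``np.nan`` as a category, so the NA-level part of the example above has no direct modern equivalent. ``isna`` is the newer spelling of ``isnull``.

```python
import numpy as np
import pandas as pd

# np.nan is stored as a missing value, not as a category.
s = pd.Series(pd.Categorical(["a", "b", np.nan]))

# Detect the missing entry.
mask = s.isna()

# Fill it with a value that is already one of the categories.
filled = s.fillna("a")

print(list(s.cat.categories))
print(mask.tolist())
print(filled.tolist())
```

The fill value has to be an existing category; a new label would first need to be added to the category list.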


Gotchas
-------

@@ -579,15 +651,18 @@ object and not as a low level `numpy` array dtype. This leads to some problems.
try:
np.dtype("category")
except TypeError as e:
print("TypeError: " + str(e))
print("TypeError: " + str(e))

dtype = pd.Categorical(["a"]).dtype
try:
np.dtype(dtype)
except TypeError as e:
print("TypeError: " + str(e))

# dtype comparisons work:
Dtype comparisons work:

.. ipython:: python

dtype == np.str_
np.str_ == dtype
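
The dtype gotcha can be condensed into one self-contained sketch. The helper ``numpy_accepts`` is a hypothetical name introduced only for this illustration; the rest uses the calls shown above.

```python
import numpy as np
import pandas as pd

dtype = pd.Categorical(["a"]).dtype


def numpy_accepts(spec):
    """Return True if NumPy can interpret ``spec`` as a dtype (hypothetical helper)."""
    try:
        np.dtype(spec)
        return True
    except TypeError:
        return False


# NumPy knows neither the string name nor the pandas dtype object.
string_ok = numpy_accepts("category")
object_ok = numpy_accepts(dtype)

# Comparisons with the dtype object itself do work.
is_category = dtype == "category"
is_str = dtype == np.str_

print(string_ok, object_ok, is_category, is_str)
```

Code that dispatches on dtype is therefore safer comparing against the string ``"category"`` than calling ``np.dtype`` on the column's dtype.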

7 changes: 7 additions & 0 deletions doc/source/reshaping.rst
@@ -505,3 +505,10 @@ handling of NaN:

pd.factorize(x, sort=True)
np.unique(x, return_inverse=True)[::-1]

.. note::
If you just want to handle one column as a categorical variable (like R's factor),
you can use ``df["cat_col"] = pd.Categorical(df["col"])`` or
``df["cat_col"] = df["col"].astype("category")``. For full docs on :class:`~pandas.Categorical`,
see the :ref:`Categorical introduction <categorical>` and the
:ref:`API documentation <api.categorical>`. This feature was introduced in version 0.15.
5 changes: 3 additions & 2 deletions doc/source/v0.15.0.txt
@@ -283,9 +283,10 @@ Categoricals in Series/DataFrame

:class:`~pandas.Categorical` can now be included in `Series` and `DataFrames` and gained new
methods to manipulate. Thanks to Jan Schultz for much of this API/implementation. (:issue:`3943`, :issue:`5313`, :issue:`5314`,
:issue:`7444`, :issue:`7839`, :issue:`7848`, :issue:`7864`, :issue:`7914`).
:issue:`7444`, :issue:`7839`, :issue:`7848`, :issue:`7864`, :issue:`7914`, :issue:`7768`, :issue:`8006`, :issue:`3678`).

For full docs, see the :ref:`Categorical introduction <categorical>` and the :ref:`API documentation <api.categorical>`.
For full docs, see the :ref:`Categorical introduction <categorical>` and the
:ref:`API documentation <api.categorical>`.

.. ipython:: python
