Categorical fixups #7768

Closed
wants to merge 10 commits into from
29 changes: 28 additions & 1 deletion doc/source/10min.rst
@@ -66,7 +66,8 @@ Creating a ``DataFrame`` by passing a dict of objects that can be converted to s
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : 'foo' })
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo' })
df2

Having specific :ref:`dtypes <basics.dtypes>`
@@ -635,6 +636,32 @@ the quarter end:
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts.head()

Categoricals
------------

Since version 0.15, pandas can include categorical data in a `DataFrame`. For full docs, see the
:ref:`Categorical introduction <categorical>` and the :ref:`API documentation <api.categorical>` .

.. ipython:: python

df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

# convert the raw grades to a categorical
df["grade"] = pd.Categorical(df["raw_grade"])

# Alternative: df["grade"] = df["raw_grade"].astype("category")
df["grade"]

# Rename the levels
df["grade"].cat.levels = ["very good", "good", "very bad"]

# Reorder the levels and simultaneously add the missing levels
df["grade"].cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
df.sort("grade")
df.groupby("grade").size()
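For readers trying this against a later pandas release, here is a rough sketch of the same walkthrough. It assumes the names used in released versions (`rename_categories`, `set_categories`, `sort_values`, and the `observed` keyword of `groupby`), none of which appear in this diff; the `levels` spelling above is what the PR itself used.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ["a", "b", "b", "a", "a", "e"]})

# Convert the raw grades to a categorical column.
df["grade"] = df["raw_grade"].astype("category")

# Rename the categories (assumed later-release equivalent of assigning .cat.levels).
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

# Reorder the categories and add the missing ones in a single step.
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)

# Sorting follows the category order, not the lexical order of the labels.
sorted_df = df.sort_values("grade")

# Grouping with observed=False also reports the empty categories.
counts = df.groupby("grade", observed=False).size()
print(sorted_df)
print(counts)
```

The point of the sketch is the last two steps: both sorting and grouping use the order given to `set_categories`, and categories without any rows still show up in the group sizes.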



Plotting
--------
11 changes: 8 additions & 3 deletions doc/source/api.rst
@@ -528,11 +528,17 @@ and has the following usable methods and properties (all available as
:toctree: generated/

Categorical
Categorical.from_codes
Categorical.levels
Categorical.ordered
Categorical.reorder_levels
Categorical.remove_unused_levels

The following methods are considered API when using ``Categorical`` directly:

.. autosummary::
:toctree: generated/

Categorical.from_codes
Categorical.min
Categorical.max
Categorical.mode
@@ -547,7 +553,7 @@ the Categorical back to a numpy array, so levels and order information is not pr
Categorical.__array__

To create compatibility with `pandas.Series` and `numpy` arrays, the following (non-API) methods
are also introduced.
are also introduced and available when ``Categorical`` is used directly.

.. autosummary::
:toctree: generated/
@@ -564,7 +570,6 @@ are also introduced.
Categorical.argsort
Categorical.fillna


Plotting
~~~~~~~~
.. currentmodule:: pandas
45 changes: 43 additions & 2 deletions doc/source/categorical.rst
@@ -90,6 +90,7 @@ By using some special functions:
df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
df.head(10)

See :ref:`documentation <reshaping.tile.cut>` for :func:`~pandas.cut`.

`Categoricals` have a specific ``category`` :ref:`dtype <basics.dtypes>`:

@@ -331,6 +332,45 @@ Operations

The following operations are possible with categorical data:

Comparing `Categoricals` with other objects is possible in two cases:
* comparing a `Categorical` to another `Categorical`, when `levels` and `ordered` are the same or
* comparing a `Categorical` to a scalar.
All other comparisons will raise a TypeError.
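The two permitted cases and the failing one can be sketched as follows. This is only an illustration under stated assumptions: it uses the `categories=` and `ordered=True` keywords from later pandas releases instead of the `levels=` keyword shown in this diff, and `cat_other` is a hypothetical name.

```python
import pandas as pd

# Same values as in the diff, with an explicit custom order 3 < 2 < 1.
cat = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))
cat_base = pd.Series(pd.Categorical([2, 2, 2], categories=[3, 2, 1], ordered=True))

# Hypothetical third series whose categories differ (only [2]).
cat_other = pd.Series(pd.Categorical([2, 2, 2], ordered=True))

# Case 1: both sides share categories and ordering.
same_levels = (cat > cat_base).tolist()

# Case 2: comparison against a scalar that is one of the categories.
vs_scalar = (cat > 2).tolist()

# Anything else: differing categories are rejected with a TypeError.
try:
    cat > cat_other
    mismatch_error = None
except TypeError as e:
    mismatch_error = str(e)

print(same_levels, vs_scalar, mismatch_error)
```

Because the order is 3 < 2 < 1, only the first element (the value 1) compares as greater than 2 in both permitted cases.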

.. ipython:: python

cat = pd.Series(pd.Categorical([1,2,3], levels=[3,2,1]))
cat_base = pd.Series(pd.Categorical([2,2,2], levels=[3,2,1]))
Contributor

show the cats after they are created

Contributor Author

done

cat_base2 = pd.Series(pd.Categorical([2,2,2]))

cat > cat_base

# This doesn't work because the levels are not the same
try:
Contributor

there is a way to do this in the docs (showing an exception); can also do a code block

Contributor Author

I haven't found a way to do that. Just letting the exception happen results in long stacktraces and I don't like codeblocks, where the exception message has to be manually inserted (and maintained). Maybe that would be a nice PR for the ipython directive....

Contributor

yep that's fine (I bet there is a way with :okexcept: though)

Contributor Author

I looked at the sphinx extension source and don't think there is a way without modifying it. `:okexcept:` basically only prevents sphinx from writing the exception to stdout.

A :nostacktrace: option would be nice...

Contributor

maybe we can create a small function and put it in utils for this purpose (basically what you are doing)

Contributor Author

something like

with no_stacktrace():
   a < cat
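A minimal sketch of what such a helper could look like, using only the standard library. The name `no_stacktrace` comes from the comment above; its signature and behaviour here are assumptions, not something that exists in pandas or in the IPython Sphinx directive.

```python
import contextlib


@contextlib.contextmanager
def no_stacktrace(*exc_types):
    """Run a block and, if it raises, print only 'ExceptionName: message'.

    Hypothetical helper: with no arguments it handles any Exception,
    otherwise only the exception types passed in.
    """
    if not exc_types:
        exc_types = (Exception,)
    try:
        yield
    except exc_types as e:
        # Handling the exception here suppresses the traceback entirely.
        print("{}: {}".format(type(e).__name__, e))


# Usage: an int cannot be ordered against a str, so this raises TypeError,
# and only the one-line summary is printed.
with no_stacktrace(TypeError):
    1 < "a"
```

Restricting the handled types keeps unexpected errors visible, which matters in documentation builds where a silently swallowed exception would hide a real regression.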

cat > cat_base2
except TypeError as e:
print("TypeError: " + str(e))

cat > 2
Contributor

put a comment above (e.g. comparison vs. scalar)

Contributor Author

done


.. note::

Comparisons with `Series`, `np.array` or a `Categorical` with different levels or ordering
will raise a `TypeError` because custom level ordering would result in two valid results:
one taking the ordering into account and one without. If you want to compare a `Categorical`
with such a type, you need to be explicit and convert the `Categorical` to values:

.. ipython:: python

base = np.array([1,2,3])

try:
cat > base
except TypeError as e:
print("TypeError: " + str(e))

np.asarray(cat) > base
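The "two valid results" mentioned in the note can be made concrete with a small sketch. As an assumption, it again uses the `categories=` and `ordered=True` keywords from later pandas releases in place of `levels=`.

```python
import numpy as np
import pandas as pd

# Custom order 3 < 2 < 1, as in the examples above.
cat = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))

# Result 1: the comparison honours the custom category order.
by_category_order = (cat > 2).tolist()

# Result 2: converting to plain values first falls back to numeric order.
by_value = (np.asarray(cat) > 2).tolist()

print(by_category_order)
print(by_value)
```

The two answers disagree for the same data, which is why the comparison against a plain array is refused unless the conversion is spelled out.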

Getting the minimum and maximum, if the categorical is ordered:

.. ipython:: python
@@ -509,7 +549,8 @@ The same applies to ``df.append(df)``.
Getting Data In/Out
-------------------

Writing data (`Series`, `Frames`) to a HDF store that contains a ``category`` dtype will currently raise ``NotImplementedError``.
Writing data (`Series`, `Frames`) to a HDF store that contains a ``category`` dtype will currently
raise ``NotImplementedError``.

Writing to a CSV file will convert the data, effectively removing any information about the
`Categorical` (levels and ordering). So if you read back the CSV file you have to convert the
@@ -579,7 +620,7 @@ object and not as a low level `numpy` array dtype. This leads to some problems.
try:
np.dtype("category")
except TypeError as e:
print("TypeError: " + str(e))
print("TypeError: " + str(e))

dtype = pd.Categorical(["a"]).dtype
try:
7 changes: 7 additions & 0 deletions doc/source/reshaping.rst
@@ -503,3 +503,10 @@ handling of NaN:

pd.factorize(x, sort=True)
np.unique(x, return_inverse=True)[::-1]

.. note::
If you just want to handle one column as a categorical variable (like R's factor),
you can use ``df["cat_col"] = pd.Categorical(df["col"])`` or
``df["cat_col"] = df["col"].astype("category")``. For full docs on :class:`~pandas.Categorical`,
see the :ref:`Categorical introduction <categorical>` and the
:ref:`API documentation <api.categorical>`. This feature was introduced in version 0.15.
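As a quick check that the two spellings in the note are interchangeable, here is a small sketch. The frame and the column names `cat_a` and `cat_b` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col": ["a", "b", "b", "a"]})

# Both forms from the note above; each infers the categories from the data.
df["cat_a"] = pd.Categorical(df["col"])
df["cat_b"] = df["col"].astype("category")

print(df.dtypes)
print(list(df["cat_a"].cat.categories))
```

Both columns end up with the ``category`` dtype and the same inferred categories, so the choice between them is mostly a matter of style.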
3 changes: 2 additions & 1 deletion doc/source/v0.15.0.txt
@@ -225,7 +225,8 @@ Categoricals in Series/DataFrame
methods to manipulate. Thanks to Jan Schultz for much of this API/implementation. (:issue:`3943`, :issue:`5313`, :issue:`5314`,
:issue:`7444`, :issue:`7839`, :issue:`7848`, :issue:`7864`, :issue:`7914`).

For full docs, see the :ref:`Categorical introduction <categorical>` and the :ref:`API documentation <api.categorical>`.
For full docs, see the :ref:`Categorical introduction <categorical>` and the
:ref:`API documentation <api.categorical>`.

.. ipython:: python
