What's new in 1.3.0 (??)
------------------------

These are the changes in pandas 1.3.0. See :ref:`release` for a full changelog including other versions of pandas.

{{ header }}

.. warning::

   When reading new Excel 2007+ (``.xlsx``) files, the default argument ``engine=None`` to :func:`~pandas.read_excel` will now result in using the openpyxl engine in all cases when the option :attr:`io.excel.xlsx.reader` is set to ``"auto"``. Previously, some cases would use the xlrd engine instead. See :ref:`What's new 1.2.0 <whatsnew_120>` for background on this change.

Enhancements
~~~~~~~~~~~~

Custom HTTP(s) headers when reading csv or json files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When reading from a remote URL that is not handled by fsspec (i.e. HTTP and HTTPS), the dictionary passed to ``storage_options`` will be used to create the headers included in the request. This can be used to control the ``User-Agent`` header or send other custom headers (:issue:`36688`). For example:

.. ipython:: python

    headers = {"User-Agent": "pandas"}
    df = pd.read_csv(
        "https://download.bls.gov/pub/time.series/cu/cu.item",
        sep="\t",
        storage_options=headers
    )

Read and write XML documents
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We added I/O support to read and render shallow versions of XML documents with :func:`pandas.read_xml` and :meth:`DataFrame.to_xml`. With lxml as the parser, both XPath 1.0 and XSLT 1.0 are available (:issue:`27554`).

.. code-block:: ipython

   In [1]: xml = """<?xml version='1.0' encoding='utf-8'?>
      ...: <data>
      ...:  <row>
      ...:     <shape>square</shape>
      ...:     <degrees>360</degrees>
      ...:     <sides>4.0</sides>
      ...:  </row>
      ...:  <row>
      ...:     <shape>circle</shape>
      ...:     <degrees>360</degrees>
      ...:     <sides/>
      ...:  </row>
      ...:  <row>
      ...:     <shape>triangle</shape>
      ...:     <degrees>180</degrees>
      ...:     <sides>3.0</sides>
      ...:  </row>
      ...:  </data>"""

   In [2]: df = pd.read_xml(xml)

   In [3]: df
   Out[3]:
         shape  degrees  sides
   0    square      360    4.0
   1    circle      360    NaN
   2  triangle      180    3.0

   In [4]: df.to_xml()
   Out[4]:
   <?xml version='1.0' encoding='utf-8'?>
   <data>
     <row>
       <index>0</index>
       <shape>square</shape>
       <degrees>360</degrees>
       <sides>4.0</sides>
     </row>
     <row>
       <index>1</index>
       <shape>circle</shape>
       <degrees>360</degrees>
       <sides/>
     </row>
     <row>
       <index>2</index>
       <shape>triangle</shape>
       <degrees>180</degrees>
       <sides>3.0</sides>
     </row>
   </data>

For more, see :ref:`io.xml` in the user guide on IO tools.

Styler Upgrades
^^^^^^^^^^^^^^^

We provided some focused development on :class:`.Styler`, including altering methods to accept more universal CSS language for arguments, such as ``'color:red;'`` instead of ``[('color', 'red')]`` (:issue:`39564`). This is also added to the built-in methods to allow custom CSS highlighting instead of default background coloring (:issue:`40242`). Enhancements to other built-in methods include extending the :meth:`.Styler.background_gradient` method to shade elements based on a given gradient map and not be restricted only to values in the DataFrame (:issue:`39930`, :issue:`22727`, :issue:`28901`). Additional built-in methods such as :meth:`.Styler.highlight_between`, :meth:`.Styler.highlight_quantile` and :meth:`.Styler.text_gradient` have been added (:issue:`39821`, :issue:`40926`, :issue:`41098`).

:meth:`.Styler.apply` now consistently accepts functions with ``ndarray`` output, allowing more flexible development of UDFs when ``axis`` is ``None``, ``0`` or ``1`` (:issue:`39393`).

:meth:`.Styler.set_tooltips` is a new method that allows adding on-hover tooltips to enhance interactive displays (:issue:`35643`). :meth:`.Styler.set_td_classes`, which was recently introduced in v1.2.0 (:issue:`36159`) to allow adding specific CSS classes to data cells, has been made as performant as :meth:`.Styler.apply` and :meth:`.Styler.applymap` (:issue:`40453`), if not more performant in some cases. The overall performance of HTML render times has been considerably improved to match :meth:`DataFrame.to_html` (:issue:`39952`, :issue:`37792`, :issue:`40425`).

:meth:`.Styler.format` has had upgrades to easily format missing data, control precision, and perform HTML escaping (:issue:`40437`, :issue:`40134`). There have been numerous other bug fixes to properly format HTML and eliminate some inconsistencies (:issue:`39942`, :issue:`40356`, :issue:`39807`, :issue:`39889`, :issue:`39627`).

:class:`.Styler` is now compatible with a non-unique index or columns; most features are fully compatible, with the remainder made only partially compatible (:issue:`41269`). One also has greater control of the display through separate sparsification of the index or columns, using the new ``styler`` options context (:issue:`41142`).

We have added an extension to allow LaTeX styling as an alternative to CSS styling, and a method :meth:`.Styler.to_latex` which renders the necessary LaTeX format including built-up styles. An additional file output function :meth:`Styler.to_html` has been added for convenience (:issue:`40312`).

Documentation has also seen major revisions in light of new features (:issue:`39720`, :issue:`39317`, :issue:`40493`).

DataFrame constructor honors ``copy=False`` with dict
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When passing a dictionary to :class:`DataFrame` with ``copy=False``, a copy will no longer be made (:issue:`32960`).

.. ipython:: python

    arr = np.array([1, 2, 3])
    df = pd.DataFrame({"A": arr, "B": arr.copy()}, copy=False)
    df

``df["A"]`` remains a view on ``arr``:

.. ipython:: python

    arr[0] = 0
    assert df.iloc[0, 0] == 0

The default behavior when not passing ``copy`` will remain unchanged, i.e. a copy will be made.
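
A minimal sketch of that unchanged default (no ``copy`` argument passed), where the frame does not share data with the source array:

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3])
df = pd.DataFrame({"A": arr})  # default: a copy of arr is made

arr[0] = 0         # mutating the source array...
df.iloc[0, 0]      # ...does not affect the frame; still 1
```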

Centered Datetime-Like Rolling Windows
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When performing rolling calculations on :class:`DataFrame` and :class:`Series` objects with a datetime-like index, a centered datetime-like window can now be used (:issue:`38780`). For example:

.. ipython:: python

    df = pd.DataFrame(
        {"A": [0, 1, 2, 3, 4]}, index=pd.date_range("2020", periods=5, freq="1D")
    )
    df
    df.rolling("2D", center=True).mean()


Other enhancements
^^^^^^^^^^^^^^^^^^

Notable bug fixes
~~~~~~~~~~~~~~~~~

These are bug fixes that might have notable behavior changes.

``Categorical.unique`` now always maintains same dtype as original
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Previously, when calling :meth:`~Categorical.unique` with categorical data, unused categories in the new array would be removed, meaning that the dtype of the new array would be different from the original if some categories were not present in the unique array (:issue:`18291`).

As an example of this, given:

.. ipython:: python

        dtype = pd.CategoricalDtype(['bad', 'neutral', 'good'], ordered=True)
        cat = pd.Categorical(['good', 'good', 'bad', 'bad'], dtype=dtype)
        original = pd.Series(cat)
        unique = original.unique()

*pandas < 1.3.0*:

.. code-block:: ipython

   In [1]: unique
   ['good', 'bad']
   Categories (2, object): ['bad' < 'good']

   In [2]: original.dtype == unique.dtype
   False

*pandas >= 1.3.0*:

.. ipython:: python

        unique
        original.dtype == unique.dtype

:meth:`~pandas.DataFrame.combine_first` will now preserve dtypes (:issue:`7509`)

.. ipython:: python

   df1 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=[0, 1, 2])
   df1
   df2 = pd.DataFrame({"B": [4, 5, 6], "C": [1, 2, 3]}, index=[2, 3, 4])
   df2
   combined = df1.combine_first(df2)

*pandas 1.2.x*:

.. code-block:: ipython

   In [1]: combined.dtypes
   Out[1]:
   A    float64
   B    float64
   C    float64
   dtype: object

*pandas 1.3.0*:

.. ipython:: python

   combined.dtypes

Group by methods ``agg`` and ``transform`` no longer change return dtype for callables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Previously the methods :meth:`.DataFrameGroupBy.aggregate`, :meth:`.SeriesGroupBy.aggregate`, :meth:`.DataFrameGroupBy.transform`, and :meth:`.SeriesGroupBy.transform` might cast the result dtype when the argument ``func`` is callable, possibly leading to undesirable results (:issue:`21240`). The cast would occur if the result is numeric and casting back to the input dtype does not change any values as measured by ``np.allclose``. Now no such casting occurs.

.. ipython:: python

    df = pd.DataFrame({'key': [1, 1], 'a': [True, False], 'b': [True, True]})
    df

*pandas 1.2.x*:

.. code-block:: ipython

   In [5]: df.groupby('key').agg(lambda x: x.sum())
   Out[5]:
           a  b
   key
   1    True  2

*pandas 1.3.0*:

.. ipython:: python

    df.groupby('key').agg(lambda x: x.sum())

Previously, these methods could result in different dtypes depending on the input values. Now, these methods will always return a float dtype. (:issue:`41137`)

.. ipython:: python

    df = pd.DataFrame({'a': [True], 'b': [1], 'c': [1.0]})

*pandas 1.2.x*:

.. code-block:: ipython

   In [5]: df.groupby(df.index).mean()
   Out[5]:
           a  b    c
   0    True  1  1.0

*pandas 1.3.0*:

.. ipython:: python

    df.groupby(df.index).mean()

Try operating inplace when setting values with ``loc`` and ``iloc``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When setting an entire column using ``loc`` or ``iloc``, pandas will try to insert the values into the existing data rather than create an entirely new array.

.. ipython:: python

   df = pd.DataFrame(range(3), columns=["A"], dtype="float64")
   values = df.values
   new = np.array([5, 6, 7], dtype="int64")
   df.loc[[0, 1, 2], "A"] = new

In both the new and old behavior, the data in ``values`` is overwritten, but in the old behavior the dtype of ``df["A"]`` changed to ``int64``.

*pandas 1.2.x*:

.. code-block:: ipython

   In [1]: df.dtypes
   Out[1]:
   A    int64
   dtype: object

   In [2]: np.shares_memory(df["A"].values, new)
   Out[2]: False

   In [3]: np.shares_memory(df["A"].values, values)
   Out[3]: False

In pandas 1.3.0, ``df`` continues to share data with ``values``:

*pandas 1.3.0*:

.. ipython:: python

   df.dtypes
   np.shares_memory(df["A"].values, new)
   np.shares_memory(df["A"].values, values)


Never Operate Inplace When Setting ``frame[keys] = values``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When setting multiple columns using ``frame[keys] = values``, new arrays will replace the pre-existing arrays for these keys, which will not be over-written (:issue:`39510`). As a result, the columns will retain the dtype(s) of ``values``, never casting to the dtypes of the existing arrays.

.. ipython:: python

   df = pd.DataFrame(range(3), columns=["A"], dtype="float64")
   df[["A"]] = 5

In the old behavior, ``5`` was cast to ``float64`` and inserted into the existing array backing ``df``:

*pandas 1.2.x*:

.. code-block:: ipython

   In [1]: df.dtypes
   Out[1]:
   A    float64
   dtype: object

In the new behavior, we get a new array, and retain an integer-dtyped ``5``:

*pandas 1.3.0*:

.. ipython:: python

   df.dtypes


Consistent Casting With Setting Into Boolean Series
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Setting non-boolean values into a :class:`Series` with ``dtype=bool`` now consistently casts to ``dtype=object`` (:issue:`38709`).

.. ipython:: python

   orig = pd.Series([True, False])
   ser = orig.copy()
   ser.iloc[1] = np.nan
   ser2 = orig.copy()
   ser2.iloc[1] = 2.0

*pandas 1.2.x*:

.. code-block:: ipython

   In [1]: ser
   Out[1]:
   0    1.0
   1    NaN
   dtype: float64

   In [2]: ser2
   Out[2]:
   0    True
   1     2.0
   dtype: object

*pandas 1.3.0*:

.. ipython:: python

   ser
   ser2


``GroupBy.rolling`` no longer returns grouped-by column in values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The group-by column will now be dropped from the result of a ``groupby.rolling`` operation (:issue:`32262`).

.. ipython:: python

    df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
    df

*Previous behavior*:

.. code-block:: ipython

   In [1]: df.groupby("A").rolling(2).sum()
   Out[1]:
          A    B
   A
   1 0  NaN  NaN
     1  2.0  1.0
   2 2  NaN  NaN
   3 3  NaN  NaN

*New behavior*:

.. ipython:: python

    df.groupby("A").rolling(2).sum()

Removed artificial truncation in rolling variance and standard deviation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:meth:`core.window.Rolling.std` and :meth:`core.window.Rolling.var` will no longer artificially truncate results that are less than ~1e-8 and ~1e-15 respectively to zero (:issue:`37051`, :issue:`40448`, :issue:`39872`).

However, floating point artifacts may now exist in the results when rolling over larger values.

.. ipython:: python

   s = pd.Series([7, 5, 5, 5])
   s.rolling(3).var()
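
One way to limit such artifacts (a sketch of a user-side workaround, not part of the pandas API change) is to de-mean the series first, since variance is invariant to a constant shift:

```python
import pandas as pd

s = pd.Series([7, 5, 5, 5]) + 1e9  # large values, small spread

# Subtracting a constant leaves the variance unchanged but keeps the
# intermediate sums small, reducing floating point error.
demeaned_var = (s - s.mean()).rolling(3).var()
```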

``GroupBy.rolling`` with ``MultiIndex`` no longer drops levels in the result
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`core.window.rolling.RollingGroupby` will no longer drop levels of a :class:`DataFrame` with a :class:`MultiIndex` in the result. This can lead to a perceived duplication of levels in the resulting :class:`MultiIndex`, but this change restores the behavior that was present in version 1.1.3 (:issue:`38787`, :issue:`38523`).

.. ipython:: python

   index = pd.MultiIndex.from_tuples([('idx1', 'idx2')], names=['label1', 'label2'])
   df = pd.DataFrame({'a': [1], 'b': [2]}, index=index)
   df

*Previous behavior*:

.. code-block:: ipython

   In [1]: df.groupby('label1').rolling(1).sum()
   Out[1]:
             a    b
   label1
   idx1    1.0  2.0

*New behavior*:

.. ipython:: python

    df.groupby('label1').rolling(1).sum()


Increased minimum versions for dependencies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some minimum supported versions of dependencies were updated. If installed, we now require:

+-----------------+-----------------+----------+---------+
| Package         | Minimum Version | Required | Changed |
+=================+=================+==========+=========+
| numpy           | 1.17.3          | X        | X       |
+-----------------+-----------------+----------+---------+
| pytz            | 2017.3          | X        |         |
+-----------------+-----------------+----------+---------+
| python-dateutil | 2.7.3           | X        |         |
+-----------------+-----------------+----------+---------+
| bottleneck      | 1.2.1           |          |         |
+-----------------+-----------------+----------+---------+
| numexpr         | 2.7.0           |          | X       |
+-----------------+-----------------+----------+---------+
| pytest (dev)    | 6.0             |          | X       |
+-----------------+-----------------+----------+---------+
| mypy (dev)      | 0.800           |          | X       |
+-----------------+-----------------+----------+---------+
| setuptools      | 38.6.0          |          | X       |
+-----------------+-----------------+----------+---------+

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

+----------------+-----------------+---------+
| Package        | Minimum Version | Changed |
+================+=================+=========+
| beautifulsoup4 | 4.6.0           |         |
+----------------+-----------------+---------+
| fastparquet    | 0.4.0           | X       |
+----------------+-----------------+---------+
| fsspec         | 0.7.4           |         |
+----------------+-----------------+---------+
| gcsfs          | 0.6.0           |         |
+----------------+-----------------+---------+
| lxml           | 4.3.0           |         |
+----------------+-----------------+---------+
| matplotlib     | 2.2.3           |         |
+----------------+-----------------+---------+
| numba          | 0.46.0          |         |
+----------------+-----------------+---------+
| openpyxl       | 3.0.0           | X       |
+----------------+-----------------+---------+
| pyarrow        | 0.17.0          | X       |
+----------------+-----------------+---------+
| pymysql        | 0.8.1           | X       |
+----------------+-----------------+---------+
| pytables       | 3.5.1           |         |
+----------------+-----------------+---------+
| s3fs           | 0.4.0           |         |
+----------------+-----------------+---------+
| scipy          | 1.2.0           |         |
+----------------+-----------------+---------+
| sqlalchemy     | 1.3.0           | X       |
+----------------+-----------------+---------+
| tabulate       | 0.8.7           | X       |
+----------------+-----------------+---------+
| xarray         | 0.12.0          |         |
+----------------+-----------------+---------+
| xlrd           | 1.2.0           |         |
+----------------+-----------------+---------+
| xlsxwriter     | 1.0.2           |         |
+----------------+-----------------+---------+
| xlwt           | 1.3.0           |         |
+----------------+-----------------+---------+
| pandas-gbq     | 0.12.0          |         |
+----------------+-----------------+---------+

See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.

Other API changes
~~~~~~~~~~~~~~~~~

- Partially initialized :class:`CategoricalDtype` objects (i.e. those with ``categories=None``) will no longer compare as equal to fully initialized dtype objects
- Accessing ``_constructor_expanddim`` on a :class:`DataFrame` and ``_constructor_sliced`` on a :class:`Series` now raise an ``AttributeError``. Previously a ``NotImplementedError`` was raised (:issue:`38782`)
- Added new ``engine`` and ``**engine_kwargs`` parameters to :meth:`DataFrame.to_sql` to support other future "SQL engines". Currently we still only use SQLAlchemy under the hood, but more engines are planned to be supported such as turbodbc (:issue:`36893`)

Build
~~~~~

- Documentation in ``.pptx`` and ``.pdf`` formats is no longer included in wheels or source distributions (:issue:`30741`)

Deprecations
~~~~~~~~~~~~

Deprecated Dropping Nuisance Columns in DataFrame Reductions and DataFrameGroupBy Operations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When calling a reduction (``.min``, ``.max``, ``.sum``, ...) on a :class:`DataFrame` with ``numeric_only=None`` (the default), columns on which the reduction raises ``TypeError`` are silently ignored and dropped from the result.

This behavior is deprecated. In a future version, the TypeError will be raised, and users will need to select only valid columns before calling the function.

For example:

.. ipython:: python

   df = pd.DataFrame({"A": [1, 2, 3, 4], "B": pd.date_range("2016-01-01", periods=4)})
   df

*Old behavior*:

.. code-block:: ipython

   In [3]: df.prod()
   Out[3]:
   A    24
   dtype: int64

*Future behavior*:

.. code-block:: ipython

   In [4]: df.prod()
   ...
   TypeError: 'DatetimeArray' does not implement reduction 'prod'

   In [5]: df[["A"]].prod()
   Out[5]:
   A    24
   dtype: int64

Similarly, when applying a function to :class:`DataFrameGroupBy`, columns on which the function raises TypeError are currently silently ignored and dropped from the result.

This behavior is deprecated. In a future version, the TypeError will be raised, and users will need to select only valid columns before calling the function.

For example:

.. ipython:: python

   df = pd.DataFrame({"A": [1, 2, 3, 4], "B": pd.date_range("2016-01-01", periods=4)})
   gb = df.groupby([1, 1, 2, 2])

*Old behavior*:

.. code-block:: ipython

   In [4]: gb.prod(numeric_only=False)
   Out[4]:
       A
   1   2
   2  12

*Future behavior*:

.. code-block:: ipython

   In [5]: gb.prod(numeric_only=False)
   ...
   TypeError: datetime64 type does not support prod operations

   In [6]: gb[["A"]].prod(numeric_only=False)
   Out[6]:
       A
   1   2
   2  12

Performance improvements
~~~~~~~~~~~~~~~~~~~~~~~~

Bug fixes
~~~~~~~~~

Categorical
^^^^^^^^^^^

Datetimelike
^^^^^^^^^^^^

Timedelta
^^^^^^^^^

Timezones
^^^^^^^^^

- Bug in different ``tzinfo`` objects representing UTC not being treated as equivalent (:issue:`39216`)
- Bug in ``dateutil.tz.gettz("UTC")`` not being recognized as equivalent to other UTC-representing tzinfos (:issue:`39276`)
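
As an illustration of the two fixes above (a sketch assuming pandas >= 1.3), Series localized with different UTC ``tzinfo`` objects now compare as having equal dtypes:

```python
from datetime import timezone

import dateutil.tz
import pandas as pd

idx = pd.to_datetime(["2021-01-01"])
s1 = pd.Series(idx).dt.tz_localize(dateutil.tz.gettz("UTC"))
s2 = pd.Series(idx).dt.tz_localize(timezone.utc)

# The differing UTC tzinfo objects are now treated as equivalent.
s1.dtype == s2.dtype
```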

Numeric
^^^^^^^

Conversion
^^^^^^^^^^

Strings
^^^^^^^

Interval
^^^^^^^^

Indexing
^^^^^^^^

Missing
^^^^^^^

MultiIndex
^^^^^^^^^^

I/O
^^^

Period
^^^^^^

Plotting
^^^^^^^^

Groupby/resample/rolling
^^^^^^^^^^^^^^^^^^^^^^^^

Reshaping
^^^^^^^^^

Sparse
^^^^^^

ExtensionArray
^^^^^^^^^^^^^^

Styler
^^^^^^

Other
^^^^^

Contributors
~~~~~~~~~~~~