REGR: Row series are broken after applying to_datetime() #36785

Closed · krassowski opened this issue Oct 1, 2020 · 8 comments · Fixed by #38272
Labels: Apply (Apply, Aggregate, Transform, Map), Regression (Functionality that used to work in a prior pandas version)


krassowski commented Oct 1, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

from io import StringIO
from pandas import read_csv, to_datetime, options

df = read_csv(StringIO("""\
,A,B,C,D,E,F
P0,,2020-10-01 08:00:00+00:00,,,,2020-10-16 00:01:00+00:00
"""), index_col=0)

# works
options.display.max_rows = 6
df.apply(lambda d: to_datetime(d, utc=True), axis=0).apply(lambda x: str(x), axis=1)

# raises
options.display.max_rows = 5
df.apply(lambda d: to_datetime(d, utc=True), axis=0).apply(lambda x: str(x), axis=1)

# raises
df.apply(lambda d: to_datetime(d, utc=True), axis=0).apply(lambda x: x.dropna().apply(lambda y: getattr(y, 'year')), axis=1)
# in v 1.0.4 would return:
#        B     F
# P0  2020  2020

Problem description

Assigning the output of pd.to_datetime to a column of a DataFrame, although not demonstrated in the documentation, is a popular use of this helper function. In previous versions of pandas (1.0.x) it was possible to convert multiple columns with to_datetime in combination with DataFrame.apply. This still works in 1.1.2 and on master:

>>> df.apply(lambda d: to_datetime(d, utc=True), axis=0).dtypes
A    datetime64[ns, UTC]
B    datetime64[ns, UTC]
            ...         
E    datetime64[ns, UTC]
F    datetime64[ns, UTC]
Length: 6, dtype: object

However, the row Series in subsequent apply operations are broken for certain operations. For example, trying to print them raises: TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid. This was not the case in pandas 1.0.4.

X in Y
     10 
     11 # raises
---> 12 df.apply(lambda d: to_datetime(d, utc=True), axis=0).apply(lambda x: str(x), axis=1)

/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
   7545             kwds=kwds,
   7546         )
-> 7547         return op.get_result()
   7548 
   7549     def applymap(self, func) -> "DataFrame":

/pandas/core/apply.py in get_result(self)
    178             return self.apply_raw()
    179 
--> 180         return self.apply_standard()
    181 
    182     def apply_empty_result(self):

/pandas/core/apply.py in apply_standard(self)
    253 
    254     def apply_standard(self):
--> 255         results, res_index = self.apply_series_generator()
    256 
    257         # wrap results

/pandas/core/apply.py in apply_series_generator(self)
    282                 for i, v in enumerate(series_gen):
    283                     # ignore SettingWithCopy here in case the user mutates
--> 284                     results[i] = self.f(v)
    285                     if isinstance(results[i], ABCSeries):
    286                         # If we have a view on v, we need to make a copy because

X in <lambda>(x)
     10 df.apply(lambda x: str(x), axis=1)
     11 # raises
---> 12 df.apply(lambda d: to_datetime(d, utc=True), axis=0).apply(lambda x: str(x), axis=1)

/pandas/core/series.py in __repr__(self)
   1313         show_dimensions = get_option("display.show_dimensions")
   1314 
-> 1315         self.to_string(
   1316             buf=buf,
   1317             name=self.name,

/pandas/core/series.py in to_string(self, buf, na_rep, float_format, header, index, length, dtype, name, max_rows, min_rows)
   1372             String representation of Series if ``buf=None``, otherwise None.
   1373         """
-> 1374         formatter = fmt.SeriesFormatter(
   1375             self,
   1376             name=name,

/pandas/io/formats/format.py in __init__(self, series, buf, length, header, index, na_rep, name, float_format, dtype, max_rows, min_rows)
    259         self.adj = _get_adjustment()
    260 
--> 261         self._chk_truncate()
    262 
    263     def _chk_truncate(self) -> None:

/pandas/io/formats/format.py in _chk_truncate(self)
    283             else:
    284                 row_num = max_rows // 2
--> 285                 series = concat((series.iloc[:row_num], series.iloc[-row_num:]))
    286             self.tr_row_num = row_num
    287         else:

/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    272     ValueError: Indexes have overlapping values: ['a']
    273     """
--> 274     op = _Concatenator(
    275         objs,
    276         axis=axis,

/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    357                     "only Series and DataFrame objs are valid"
    358                 )
--> 359                 raise TypeError(msg)
    360 
    361             # consolidate

TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid

Expected Output

Should not raise.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2a7d332
python : 3.8.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-48-generic
Version : #52-Ubuntu SMP Thu Sep 10 10:58:49 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.1.2
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.2.3
setuptools : 41.2.0
Cython : None
pytest : 5.3.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.49.0

krassowski added the Bug and Needs Triage labels on Oct 1, 2020
krassowski changed the title from "BUG:" to "BUG: Series representation does not work after to_datetime()" on Oct 1, 2020
krassowski changed the title from "BUG: Series representation does not work after to_datetime()" to "BUG: Row series are broken after applying to_datetime()" on Oct 1, 2020
krassowski (Author) commented:

I added another manifestation of the underlying issue:

df.apply(lambda d: to_datetime(d, utc=True), axis=0).apply(lambda x: x.dropna().apply(lambda y: getattr(y, 'year')), axis=1)

no longer works, even though it used to work in 1.0.4, where it produced:

       B     F
P0  2020  2020

while on 1.1.3/master it raises: AttributeError: 'numpy.ndarray' object has no attribute 'apply'. At the heart of the problem, the data seems to be converted to numpy arrays in a way that pandas does not handle well.


krassowski commented Oct 1, 2020

A workaround:

import pandas as pd

def workaround(dates):
    # rebuild the row as a plain Series so downstream Series methods work again
    return pd.Series(list(dates), index=dates.index)

df.apply(lambda d: to_datetime(d, utc=True), axis=0).apply(lambda x: workaround(x).dropna().apply(lambda y: getattr(y, 'year')), axis=1)

Interestingly:

df.apply(lambda d: to_datetime(d, utc=True), axis=0).apply(lambda x: print(type(x)), axis=1)
<class 'pandas.core.series.Series'>
df.apply(lambda d: to_datetime(d, utc=True), axis=0).apply(lambda x: print(type(x.dropna())), axis=1)
<class 'numpy.ndarray'>

This is a pure manifestation of the underlying issue and at odds with the documentation, which says that Series.dropna will return a new Series with missing values removed (but we get a numpy.ndarray instead!).
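
For contrast, a standalone Series with the same dtype behaves as documented; it is only the row Series produced by apply that misbehaves (minimal check, independent of the example above):

import pandas as pd

s = pd.Series([pd.Timestamp("2020-10-01", tz="UTC"), pd.NaT])
print(type(s.dropna()))  # <class 'pandas.core.series.Series'>, as documented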


phofl commented Oct 4, 2020

Hi, thanks for your report.

This was introduced by 91802a9
cc @jbrockmendel

The bug occurs in

blk.values = arr

The array has object dtype and changes the dtype of the BlockManager.
Would using sanitize_array be a good solution here? It was previously used to convert the array to the correct dtype. Tests seem to pass with this change.
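
Sketched very roughly, the idea would be something like the following (a hypothetical sketch, not the actual patch; blk and arr are the block and row array from the snippet above):

from pandas.core.construction import sanitize_array

def coerce_values_back(blk, arr):
    # hypothetical sketch: coerce arr back to the block's original dtype
    # before reassigning, so an object ndarray cannot change the
    # BlockManager's dtype
    blk.values = sanitize_array(arr, None, dtype=blk.values.dtype)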

phofl added the Apply and Regression labels and removed the Bug and Needs Triage labels on Oct 4, 2020
phofl added this to the 1.1.4 milestone on Oct 4, 2020

jreback commented Oct 4, 2020

this is extremely strange to do, and unless you can demonstrate that this is a real case, -1 on any change

why are you not just directly using to_datetime? what is the point of apply here at all?
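
For reference, the direct conversion without a lambda would look like this (a sketch using the df from the original example; it may not cover every use case the reporter had in mind):

import pandas as pd

# convert every column directly; apply passes utc=True through to to_datetime
converted = df.apply(pd.to_datetime, utc=True)

# or convert individual columns explicitly
df["B"] = pd.to_datetime(df["B"], utc=True)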

jreback removed this from the 1.1.4 milestone on Oct 4, 2020

phofl commented Oct 4, 2020

@jreback

The first apply is not the problem. The following

df = pd.DataFrame(
    {
        "A": [pd.to_datetime("20130101", utc=True)]
    }
)
result = df.apply(lambda x: x, axis=1)

raises

Traceback (most recent call last):
  File "/home/developer/.config/JetBrains/PyCharm2020.2/scratches/scratch_5.py", line 274, in <module>
    result = df.apply(lambda x: x, axis=1)
  File "/home/developer/PycharmProjects/pandas/pandas/core/frame.py", line 7569, in apply
    return op.get_result()
  File "/home/developer/PycharmProjects/pandas/pandas/core/apply.py", line 180, in get_result
    return self.apply_standard()
  File "/home/developer/PycharmProjects/pandas/pandas/core/apply.py", line 271, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/home/developer/PycharmProjects/pandas/pandas/core/apply.py", line 304, in apply_series_generator
    results[i] = results[i].copy(deep=False)
  File "/home/developer/PycharmProjects/pandas/pandas/core/generic.py", line 5897, in copy
    data = self._mgr.copy(deep=deep)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/managers.py", line 768, in copy
    res = self.apply("copy", deep=deep)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/managers.py", line 397, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/blocks.py", line 696, in copy
    return self.make_block_same_class(values, ndim=self.ndim)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/blocks.py", line 263, in make_block_same_class
    return type(self)(values, placement=placement, ndim=ndim)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/blocks.py", line 1612, in __init__
    values = self._maybe_coerce_values(values)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/blocks.py", line 2241, in _maybe_coerce_values
    values = self._holder(values)
  File "/home/developer/PycharmProjects/pandas/pandas/core/arrays/datetimes.py", line 254, in __init__
    raise ValueError(
ValueError: The dtype of 'values' is incorrect. Must be 'datetime64[ns]'. Got object instead.

Process finished with exit code 1

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Dec 3, 2020
simonjayhawkins (Member) commented:

on 1.0.5

>>> pd.__version__
'1.0.5'
>>> df = pd.DataFrame({"A": [pd.to_datetime("20130101", utc=True)]})
>>> df.loc[0]
A   2013-01-01 00:00:00+00:00
Name: 0, dtype: datetime64[ns, UTC]
>>>
>>> def func(x):
...     print(x)
...     return x
...
>>> res = df.apply(func, axis=1)
A   2013-01-01 00:00:00+00:00
Name: 0, dtype: datetime64[ns, UTC]
>>> res
                          A
0 2013-01-01 00:00:00+00:00
>>> res.A
0   2013-01-01 00:00:00+00:00
Name: A, dtype: datetime64[ns, UTC]

on 1.1.4

>>> pd.__version__
'1.1.4'
>>> df = pd.DataFrame({"A": [pd.to_datetime("20130101", utc=True)]})
>>> df.loc[0]
A   2013-01-01 00:00:00+00:00
Name: 0, dtype: datetime64[ns, UTC]
>>>
>>> def func(x):
...     print(x)
...     return x
...
>>> res = df.apply(func, axis=1)
A    2013-01-01 00:00:00+00:00
Name: 0, dtype: object
Traceback (most recent call last):
...
ValueError: The dtype of 'values' is incorrect. Must be 'datetime64[ns]'. Got object instead.

So it appears that the dtype of the Series passed to the user-supplied function has changed.

simonjayhawkins (Member) commented:

The array has dtype object and changes the dtype of the BlockManager.

I think the fix should ensure that what is passed to the user function is unchanged from 1.0.5 to avoid other regressions being reported.
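
In other words, something like the following check should keep passing (a hypothetical sketch of the expected behaviour, using the 1.0.5 output above as the target):

import pandas as pd

def check_row_dtype_preserved():
    df = pd.DataFrame({"A": [pd.to_datetime("20130101", utc=True)]})
    seen = []
    df.apply(lambda x: seen.append(x.dtype) or x, axis=1)
    # on 1.0.5 the row Series keeps its dtype; on 1.1.x it arrives as object
    assert seen[0] == "datetime64[ns, UTC]"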

simonjayhawkins changed the title from "BUG: Row series are broken after applying to_datetime()" to "REGR: Row series are broken after applying to_datetime()" on Dec 3, 2020
jbrockmendel (Member) commented:

Looks like the problem is inside FrameColumnApply.series_generator; need to check for a dtype match when setting blk.values.
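
Roughly sketched, such a guard could look like this (hypothetical, not necessarily the change that landed in #38272):

def maybe_set_block_values(blk, arr):
    # hypothetical guard: only write arr back onto the block when its dtype
    # still matches, so an object ndarray cannot silently replace a
    # datetime64[ns, UTC] block
    if arr.dtype == blk.values.dtype:
        blk.values = arr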
