REGR: Row series are broken after applying to_datetime() #36785

Closed · krassowski opened this issue Oct 1, 2020 · 8 comments · Fixed by #38272
Labels: Apply (Apply, Aggregate, Transform, Map), Regression (Functionality that used to work in a prior pandas version)


krassowski commented Oct 1, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

from io import StringIO
from pandas import read_csv, to_datetime, options

df = read_csv(StringIO("""\
,A,B,C,D,E,F
P0,,2020-10-01 08:00:00+00:00,,,,2020-10-16 00:01:00+00:00
"""), index_col=0)

# works
options.display.max_rows = 6
df.apply(lambda d: to_datetime(d, utc=True), axis=0).apply(lambda x: str(x), axis=1)

# raises
options.display.max_rows = 5
df.apply(lambda d: to_datetime(d, utc=True), axis=0).apply(lambda x: str(x), axis=1)

# raises
df.apply(lambda d: to_datetime(d, utc=True), axis=0).apply(lambda x: x.dropna().apply(lambda y: getattr(y, 'year')), axis=1)
# in v 1.0.4 would return:
#        B     F
# P0  2020  2020

Problem description

Assigning the output of pd.to_datetime to a column of a DataFrame, although not demonstrated in the documentation, is a popular use of this helper function. In previous versions of pandas (1.0.x) it was possible to convert multiple columns with to_datetime in combination with DataFrame.apply. This still works in 1.1.2 and on master:

>>> df.apply(lambda d: to_datetime(d, utc=True), axis=0).dtypes
A    datetime64[ns, UTC]
B    datetime64[ns, UTC]
            ...         
E    datetime64[ns, UTC]
F    datetime64[ns, UTC]
Length: 6, dtype: object

However, the row Series in subsequent apply operations are broken for certain operations. For example, trying to print them raises: TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid. This was not the case in pandas 1.0.4.

X in Y
     10 
     11 # raises
---> 12 df.apply(lambda d: to_datetime(d, utc=True), axis=0).apply(lambda x: str(x), axis=1)

/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
   7545             kwds=kwds,
   7546         )
-> 7547         return op.get_result()
   7548 
   7549     def applymap(self, func) -> "DataFrame":

/pandas/core/apply.py in get_result(self)
    178             return self.apply_raw()
    179 
--> 180         return self.apply_standard()
    181 
    182     def apply_empty_result(self):

/pandas/core/apply.py in apply_standard(self)
    253 
    254     def apply_standard(self):
--> 255         results, res_index = self.apply_series_generator()
    256 
    257         # wrap results

/pandas/core/apply.py in apply_series_generator(self)
    282                 for i, v in enumerate(series_gen):
    283                     # ignore SettingWithCopy here in case the user mutates
--> 284                     results[i] = self.f(v)
    285                     if isinstance(results[i], ABCSeries):
    286                         # If we have a view on v, we need to make a copy because

X in <lambda>(x)
     10 df.apply(lambda x: str(x), axis=1)
     11 # raises
---> 12 df.apply(lambda d: to_datetime(d, utc=True), axis=0).apply(lambda x: str(x), axis=1)

/pandas/core/series.py in __repr__(self)
   1313         show_dimensions = get_option("display.show_dimensions")
   1314 
-> 1315         self.to_string(
   1316             buf=buf,
   1317             name=self.name,

/pandas/core/series.py in to_string(self, buf, na_rep, float_format, header, index, length, dtype, name, max_rows, min_rows)
   1372             String representation of Series if ``buf=None``, otherwise None.
   1373         """
-> 1374         formatter = fmt.SeriesFormatter(
   1375             self,
   1376             name=name,

/pandas/io/formats/format.py in __init__(self, series, buf, length, header, index, na_rep, name, float_format, dtype, max_rows, min_rows)
    259         self.adj = _get_adjustment()
    260 
--> 261         self._chk_truncate()
    262 
    263     def _chk_truncate(self) -> None:

/pandas/io/formats/format.py in _chk_truncate(self)
    283             else:
    284                 row_num = max_rows // 2
--> 285                 series = concat((series.iloc[:row_num], series.iloc[-row_num:]))
    286             self.tr_row_num = row_num
    287         else:

/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    272     ValueError: Indexes have overlapping values: ['a']
    273     """
--> 274     op = _Concatenator(
    275         objs,
    276         axis=axis,

/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    357                     "only Series and DataFrame objs are valid"
    358                 )
--> 359                 raise TypeError(msg)
    360 
    361             # consolidate

TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid

Expected Output

Should not raise.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2a7d332
python : 3.8.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-48-generic
Version : #52-Ubuntu SMP Thu Sep 10 10:58:49 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.1.2
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.2.3
setuptools : 41.2.0
Cython : None
pytest : 5.3.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.49.0

krassowski added the Bug and Needs Triage labels on Oct 1, 2020
krassowski changed the title from "BUG:" to "BUG: Series representation does not work after to_datetime()" on Oct 1, 2020
krassowski changed the title from "BUG: Series representation does not work after to_datetime()" to "BUG: Row series are broken after applying to_datetime()" on Oct 1, 2020
krassowski (Author) commented:

I added another manifestation of the underlying issue:

df.apply(lambda d: to_datetime(d, utc=True), axis=0).apply(lambda x: x.dropna().apply(lambda y: getattr(y, 'year')), axis=1)

no longer works, even though it used to work in 1.0.4, where it produced:

       B     F
P0  2020  2020

while on 1.1.3/master it raises: AttributeError: 'numpy.ndarray' object has no attribute 'apply'. At the heart of the problem, the data seems to be converted to numpy arrays in a way that pandas does not handle well.


krassowski commented Oct 1, 2020

A workaround:

import pandas as pd

def workaround(dates):
    # rebuild the row as a plain Series so downstream Series methods work again
    return pd.Series(list(dates), index=dates.index)

df.apply(lambda d: to_datetime(d, utc=True), axis=0).apply(lambda x: workaround(x).dropna().apply(lambda y: getattr(y, 'year')), axis=1)

Interestingly:

df.apply(lambda d: to_datetime(d, utc=True), axis=0).apply(lambda x: print(type(x)), axis=1)
<class 'pandas.core.series.Series'>
df.apply(lambda d: to_datetime(d, utc=True), axis=0).apply(lambda x: print(type(x.dropna())), axis=1)
<class 'numpy.ndarray'>

This is a pure manifestation of the underlying issue and at odds with the documentation, which says that Series.dropna will return a new Series with missing values removed (but we get a numpy.ndarray instead!).
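
For contrast, a standalone Series with the same dtype behaves as documented; it is only the row Series produced by apply that misbehaves (minimal check, independent of the example above):

import pandas as pd

s = pd.Series([pd.Timestamp("2020-10-01", tz="UTC"), pd.NaT])
print(type(s.dropna()))  # <class 'pandas.core.series.Series'>, as documented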


phofl commented Oct 4, 2020

Hi, thanks for your report.

This was introduced by 91802a9
cc @jbrockmendel

The bug occurs in

blk.values = arr

The array has object dtype and changes the dtype of the BlockManager.
Would using sanitize_array be a good solution here? It was previously used to convert the array to the correct dtype. Tests seem to pass with this change.
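
Sketched very roughly, the idea would be something like the following (a hypothetical sketch, not the actual patch; blk and arr are the block and row array from the snippet above):

from pandas.core.construction import sanitize_array

def coerce_values_back(blk, arr):
    # hypothetical sketch: coerce arr back to the block's original dtype
    # before reassigning, so an object ndarray cannot change the
    # BlockManager's dtype
    blk.values = sanitize_array(arr, None, dtype=blk.values.dtype)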

phofl added the Apply and Regression labels and removed the Bug and Needs Triage labels on Oct 4, 2020
phofl added this to the 1.1.4 milestone on Oct 4, 2020

jreback commented Oct 4, 2020

this is extremely strange to do, and unless you can demonstrate that this is a real case, -1 on any change

why are you not just directly using to_datetime? what is the point of apply here at all?
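
For reference, the direct conversion without a lambda would look like this (a sketch using the df from the original example; it may not cover every use case the reporter had in mind):

import pandas as pd

# convert every column directly; apply passes utc=True through to to_datetime
converted = df.apply(pd.to_datetime, utc=True)

# or convert individual columns explicitly
df["B"] = pd.to_datetime(df["B"], utc=True)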

jreback removed this from the 1.1.4 milestone on Oct 4, 2020

phofl commented Oct 4, 2020

@jreback

The first apply is not the problem. The following

df = pd.DataFrame(
    {
        "A": [pd.to_datetime("20130101", utc=True)]
    }
)
result = df.apply(lambda x: x, axis=1)

raises

Traceback (most recent call last):
  File "/home/developer/.config/JetBrains/PyCharm2020.2/scratches/scratch_5.py", line 274, in <module>
    result = df.apply(lambda x: x, axis=1)
  File "/home/developer/PycharmProjects/pandas/pandas/core/frame.py", line 7569, in apply
    return op.get_result()
  File "/home/developer/PycharmProjects/pandas/pandas/core/apply.py", line 180, in get_result
    return self.apply_standard()
  File "/home/developer/PycharmProjects/pandas/pandas/core/apply.py", line 271, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/home/developer/PycharmProjects/pandas/pandas/core/apply.py", line 304, in apply_series_generator
    results[i] = results[i].copy(deep=False)
  File "/home/developer/PycharmProjects/pandas/pandas/core/generic.py", line 5897, in copy
    data = self._mgr.copy(deep=deep)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/managers.py", line 768, in copy
    res = self.apply("copy", deep=deep)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/managers.py", line 397, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/blocks.py", line 696, in copy
    return self.make_block_same_class(values, ndim=self.ndim)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/blocks.py", line 263, in make_block_same_class
    return type(self)(values, placement=placement, ndim=ndim)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/blocks.py", line 1612, in __init__
    values = self._maybe_coerce_values(values)
  File "/home/developer/PycharmProjects/pandas/pandas/core/internals/blocks.py", line 2241, in _maybe_coerce_values
    values = self._holder(values)
  File "/home/developer/PycharmProjects/pandas/pandas/core/arrays/datetimes.py", line 254, in __init__
    raise ValueError(
ValueError: The dtype of 'values' is incorrect. Must be 'datetime64[ns]'. Got object instead.

Process finished with exit code 1

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Dec 3, 2020
simonjayhawkins (Member) commented:

on 1.0.5

>>> pd.__version__
'1.0.5'
>>> df = pd.DataFrame({"A": [pd.to_datetime("20130101", utc=True)]})
>>> df.loc[0]
A   2013-01-01 00:00:00+00:00
Name: 0, dtype: datetime64[ns, UTC]
>>>
>>> def func(x):
...     print(x)
...     return x
...
>>> res = df.apply(func, axis=1)
A   2013-01-01 00:00:00+00:00
Name: 0, dtype: datetime64[ns, UTC]
>>> res
                          A
0 2013-01-01 00:00:00+00:00
>>> res.A
0   2013-01-01 00:00:00+00:00
Name: A, dtype: datetime64[ns, UTC]

on 1.1.4

>>> pd.__version__
'1.1.4'
>>> df = pd.DataFrame({"A": [pd.to_datetime("20130101", utc=True)]})
>>> df.loc[0]
A   2013-01-01 00:00:00+00:00
Name: 0, dtype: datetime64[ns, UTC]
>>>
>>> def func(x):
...     print(x)
...     return x
...
>>> res = df.apply(func, axis=1)
A    2013-01-01 00:00:00+00:00
Name: 0, dtype: object
Traceback (most recent call last):
...
ValueError: The dtype of 'values' is incorrect. Must be 'datetime64[ns]'. Got object instead.

So it appears that the dtype of the Series passed to the user-supplied function has changed.

simonjayhawkins (Member) commented:

The array has dtype object and changes the dtype of the BlockManager.

I think the fix should ensure that what is passed to the user function is unchanged from 1.0.5 to avoid other regressions being reported.
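
In other words, something like the following check should keep passing (a hypothetical sketch of the expected behaviour, using the 1.0.5 output above as the target):

import pandas as pd

def check_row_dtype_preserved():
    df = pd.DataFrame({"A": [pd.to_datetime("20130101", utc=True)]})
    seen = []
    df.apply(lambda x: seen.append(x.dtype) or x, axis=1)
    # on 1.0.5 the row Series keeps its dtype; on 1.1.x it arrives as object
    assert seen[0] == "datetime64[ns, UTC]"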

simonjayhawkins changed the title from "BUG: Row series are broken after applying to_datetime()" to "REGR: Row series are broken after applying to_datetime()" on Dec 3, 2020
jbrockmendel (Member) commented:

Looks like the problem is inside FrameColumnApply.series_generator; need to check for a dtype match when setting blk.values.
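
Roughly sketched, such a guard could look like this (hypothetical, not necessarily the change that landed in #38272):

def maybe_set_block_values(blk, arr):
    # hypothetical guard: only write arr back onto the block when its dtype
    # still matches, so an object ndarray cannot silently replace a
    # datetime64[ns, UTC] block
    if arr.dtype == blk.values.dtype:
        blk.values = arr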
