
numpy.datetime64 casted to Timestamp when added to DataFrame #17183

Closed
strazdinsg opened this issue Aug 6, 2017 · 7 comments
Labels
Datetime (Datetime data dtype) · Dtype Conversions (Unexpected or buggy dtype conversions) · Duplicate Report (Duplicate issue or pull request)

Comments


strazdinsg commented Aug 6, 2017

import numpy
import pandas as pd

# We create a list of strings. 
time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250',
                '2017-06-30T13:51:16.452', '2017-06-30T13:51:16.659']
# Then we create a time array, rounded to 10ms (actually floored, 
# not rounded), everything seems to be fine here.
rounded_time = numpy.array(time_str_arr, dtype="datetime64[10ms]")
rounded_time 

# Then we create a Pandas DataFrame and assign the time array as a 
# column to it. The datetime64 content is destroyed.
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df = df.assign(wrong_time=rounded_time)
df

# The output I get:
#    one two wrong_time
# a   1.0 1.0 1974-10-01 18:11:07.585
# b   2.0 2.0 1974-10-01 18:11:07.625
# c   3.0 3.0 1974-10-01 18:11:07.645
# d   NaN 4.0 1974-10-01 18:11:07.665

# (the timestamps are wrong)

Problem description

I have a list of timestamps with millisecond accuracy, encoded as strings. I round them to 10 ms resolution, and that goes well. The bug appears when I add the rounded timestamps to a DataFrame as a new column: the datetime64 values are completely corrupted. My suspicion is that numpy.datetime64 is converted to some other datatype inside the DataFrame.assign() method. It should maintain the same type.

Output of pd.show_versions()

INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added the Datetime Datetime data dtype label Aug 6, 2017

gfyoung commented Aug 6, 2017

@strazdinsg : Thanks for reporting this! For starters, I should point out that datetime64[10ms] is an unusual frequency from our point of view. We largely operate in datetime64[ns]. Thus, I suspect something went funny when we encountered that frequency.

In fact, if you pass in dtype="datetime64[ns]", everything looks OK (though not rounded as you had hoped). What I would suggest (as a workaround) is to recast your timestamps to datetime64[ns] before assigning the column.

Given the unusual nature of the frequency, I'm hesitant to classify this as a bug, though I think we should handle such input better than silently "breaking" like this.
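As a sketch of that workaround (variable names like `safe_time` and the column name `ok_time` are illustrative, not from the original report), recasting to nanosecond resolution before the assignment should keep the values intact on a recent pandas:

```python
import numpy as np
import pandas as pd

time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250']

# Parsing into datetime64[10ms] floors the values to 10 ms, as in the report.
rounded_time = np.array(time_str_arr, dtype="datetime64[10ms]")

# Recast to the nanosecond unit pandas expects before assigning.
safe_time = rounded_time.astype("datetime64[ns]")

df = pd.DataFrame({'one': [1.0, 2.0]})
df = df.assign(ok_time=safe_time)
```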


jreback commented Aug 6, 2017

there is an open issue about this already


jreback commented Aug 6, 2017

#7996

much more idiomatic to simply do:

In [16]: # We create a list of strings. 
    ...: time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250',
    ...:                 '2017-06-30T13:51:16.452', '2017-06-30T13:51:16.659']
    ...: 

In [17]: pd.to_datetime(time_str_arr).floor('10ms')
Out[17]: DatetimeIndex(['2017-06-30 13:51:15.850000', '2017-06-30 13:51:16.250000', '2017-06-30 13:51:16.450000', '2017-06-30 13:51:16.650000'], dtype='datetime64[ns]', freq=None)

NumPy in general is not very friendly to non-standard datetimes. That said, pandas should actually convert these non-ns dtypes.
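Putting that together with the reporter's original DataFrame (the column name `right_time` is illustrative), the floored DatetimeIndex can be assigned directly and survives intact:

```python
import pandas as pd

time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250',
                '2017-06-30T13:51:16.452', '2017-06-30T13:51:16.659']

# Parse at pandas' native nanosecond precision, then floor to 10 ms.
floored = pd.to_datetime(time_str_arr).floor('10ms')

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# assign() aligns an array-like positionally with the DataFrame's index.
df = df.assign(right_time=floored)
```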

@jreback closed this as completed Aug 6, 2017
@jreback added the Dtype Conversions and Duplicate Report labels Aug 6, 2017
@jreback added this to the No action milestone Aug 6, 2017
@strazdinsg (Author)

Thanks for an elegant solution @jreback !


abhimanyu3-zz commented Aug 30, 2018

Can we convert this to a general timestamp? I am not able to do that, and when I apply groupby on this timestamp it doesn't work. I am also unable to sort my columns, and I have millions of observations like 2018-08-26T14:05:31.000Z.
The reason I am trying to sort is that I want a time series. I just want the time to be 14:05:31, a regular pandas datetime, on which I can perform groupby or sort as needed.

I will really appreciate your help @jreback


jeybrahms commented Jul 30, 2019

This issue is not limited to exotic types; it applies to datetime64[D] as well, and it has nothing to do with formatting. The problem appears to be that the resolution changes to ns when a NumPy array is stored in a DataFrame or Series. Here is an example:

make [D] array

In [194]: x = np.array(['2018-01-03'], dtype='datetime64[D]')

put it into pd.Series and extract the values

In [195]: y = pd.Series(x).values

note the dtype has changed to [ns]

In [196]: y.dtype
Out[196]: dtype('<M8[ns]')

It turns out the x and y arrays are not interchangeable: their raw integer payloads are in different units. This creates subtle bugs in Cython code.

In [197]: x.view('int') == y.view('int')
Out[197]: array([False])
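A small sketch of why the view('int') comparison fails (relying only on documented NumPy behavior; the exact unit pandas stores may vary by version): the datetime values still compare equal, because NumPy promotes both sides to a common unit, but the raw integer payloads count different units.

```python
import numpy as np
import pandas as pd

x = np.array(['2018-01-03'], dtype='datetime64[D]')
y = pd.Series(x).values  # pandas re-stores this at a supported finer unit

# Value comparison promotes both sides to a common unit, so it still holds.
same_values = bool((x == y).all())

# The raw payloads differ: days since epoch vs. a much finer unit.
same_payload = bool((x.view('int64') == y.view('int64')).all())
```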

@TomAugspurger (Contributor)

@jeybrahms pandas only supports nanosecond-precision datetimes currently.
