
numpy.datetime64 casted to Timestamp when added to DataFrame #17183

Closed
strazdinsg opened this issue Aug 6, 2017 · 7 comments
Labels
Datetime (Datetime data dtype) · Dtype Conversions (Unexpected or buggy dtype conversions) · Duplicate Report (Duplicate issue or pull request)

Comments


strazdinsg commented Aug 6, 2017

import numpy
import pandas as pd

# We create a list of strings. 
time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250',
                '2017-06-30T13:51:16.452', '2017-06-30T13:51:16.659']
# Then we create a time array, rounded to 10ms (actually floored, 
# not rounded), everything seems to be fine here.
rounded_time = numpy.array(time_str_arr, dtype="datetime64[10ms]")
rounded_time 

# Then we create a Pandas DataFrame and assign the time array as a 
# column to it. The datetime64 content is destroyed.
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df = df.assign(wrong_time=rounded_time)
df

# The output I get:
#    one two wrong_time
# a   1.0 1.0 1974-10-01 18:11:07.585
# b   2.0 2.0 1974-10-01 18:11:07.625
# c   3.0 3.0 1974-10-01 18:11:07.645
# d   NaN 4.0 1974-10-01 18:11:07.665

# (the timestamps are wrong)

Problem description

I have a list of timestamps with millisecond accuracy, encoded as strings. I round them to 10 ms resolution, and that goes well. The bug appears when I add the rounded timestamps to a DataFrame as a new column: the datetime64 values are completely corrupted. My suspicion is that numpy.datetime64 is converted to some other datatype inside the DataFrame.assign() method. It should maintain the same type.

Output of pd.show_versions()

INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added the Datetime Datetime data dtype label Aug 6, 2017

gfyoung commented Aug 6, 2017

@strazdinsg : Thanks for reporting this! For starters, I should point out that datetime64[10ms] is an unusual frequency from our point of view. We largely operate in datetime64[ns]. Thus, I suspect something went funny when we encountered that frequency.

In fact, if you pass in dtype="datetime64[ns]", everything looks OK (though not rounded as you had hoped). What I would suggest (as a workaround) is to recast your timestamps to datetime64[ns] before assigning the column.

Given the unusual nature of the frequency, I'm hesitant to classify this as a bug, though I think we should handle such input better than silently "breaking" like this.
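As a sketch of that workaround (variable names like `safe_time` and the column name `ok_time` are illustrative, not from the original report), recasting to nanosecond resolution before the assignment should keep the values intact on a recent pandas:

```python
import numpy as np
import pandas as pd

time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250']

# Parsing into datetime64[10ms] floors the values to 10 ms, as in the report.
rounded_time = np.array(time_str_arr, dtype="datetime64[10ms]")

# Recast to the nanosecond unit pandas expects before assigning.
safe_time = rounded_time.astype("datetime64[ns]")

df = pd.DataFrame({'one': [1.0, 2.0]})
df = df.assign(ok_time=safe_time)
```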


jreback commented Aug 6, 2017

there is an open issue about this already


jreback commented Aug 6, 2017

#7996

much more idiomatic to simply do:

In [16]: # We create a list of strings. 
    ...: time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250',
    ...:                 '2017-06-30T13:51:16.452', '2017-06-30T13:51:16.659']
    ...: 

In [17]: pd.to_datetime(time_str_arr).floor('10ms')
Out[17]: DatetimeIndex(['2017-06-30 13:51:15.850000', '2017-06-30 13:51:16.250000', '2017-06-30 13:51:16.450000', '2017-06-30 13:51:16.650000'], dtype='datetime64[ns]', freq=None)

NumPy in general is not very friendly to non-standard datetimes. That said, pandas should actually convert these non-ns dtypes.
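Putting that together with the reporter's original DataFrame (the column name `right_time` is illustrative), the floored DatetimeIndex can be assigned directly and survives intact:

```python
import pandas as pd

time_str_arr = ['2017-06-30T13:51:15.854', '2017-06-30T13:51:16.250',
                '2017-06-30T13:51:16.452', '2017-06-30T13:51:16.659']

# Parse at pandas' native nanosecond precision, then floor to 10 ms.
floored = pd.to_datetime(time_str_arr).floor('10ms')

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# assign() aligns an array-like positionally with the DataFrame's index.
df = df.assign(right_time=floored)
```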

@jreback closed this as completed Aug 6, 2017
@jreback added the Dtype Conversions and Duplicate Report labels Aug 6, 2017
@jreback added this to the No action milestone Aug 6, 2017
@strazdinsg (Author)

Thanks for an elegant solution @jreback !


abhimanyu3-zz commented Aug 30, 2018

Can we convert this to a general timestamp? I am not able to do that, and when I apply groupby on this timestamp it doesn't work. I am also unable to sort my columns, and I have millions of observations like 2018-08-26T14:05:31.000Z.
The reason I am trying to sort is that I want a time series. I just want the time to be 14:05:31, a regular pandas datetime, on which I can perform groupby or sort as needed.

I will really appreciate your help @jreback


jeybrahms commented Jul 30, 2019

This issue is not limited to exotic types; it applies to datetime64[D] as well, and it has nothing to do with formatting. The problem appears to be that the resolution changes to ns when a NumPy array is stored in a DataFrame or Series. Here is an example:

make [D] array

In [194]: x = np.array(['2018-01-03'], dtype='datetime64[D]')

put it into pd.Series and extract the values

In [195]: y = pd.Series(x).values

note the dtype has changed to [ns]

In [196]: y.dtype
Out[196]: dtype('<M8[ns]')

It turns out the x and y arrays are not interchangeable: their raw integer payloads are in different units. This creates subtle bugs in Cython code.

In [197]: x.view('int') == y.view('int')
Out[197]: array([False])
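A small sketch of why the view('int') comparison fails (relying only on documented NumPy behavior; the exact unit pandas stores may vary by version): the datetime values still compare equal, because NumPy promotes both sides to a common unit, but the raw integer payloads count different units.

```python
import numpy as np
import pandas as pd

x = np.array(['2018-01-03'], dtype='datetime64[D]')
y = pd.Series(x).values  # pandas re-stores this at a supported finer unit

# Value comparison promotes both sides to a common unit, so it still holds.
same_values = bool((x == y).all())

# The raw payloads differ: days since epoch vs. a much finer unit.
same_payload = bool((x.view('int64') == y.view('int64')).all())
```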

@TomAugspurger (Contributor)

@jeybrahms pandas only supports nanosecond-precision datetimes currently.
