BUG: reset_index of level on a MultiIndex with NaT converts to np.nan #11479

Closed
emsems opened this issue Oct 30, 2015 · 5 comments · Fixed by #41389
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

@emsems

emsems commented Oct 30, 2015

Not sure if this is already known, but I couldn't find any open issues for it.
It's similar to this closed one:
#10388

Using pandas 0.17 and numpy 1.10.1.

Code to reproduce the issue:

import pandas as pd
from pandas import DataFrame
import numpy as np

idx = np.arange(0, 10)  # could have an NaN?
tstamp = pd.date_range('201507010000', freq='h', periods=10).values
df = DataFrame({'id': idx, 'tstamp': tstamp, 'a': list('abcdefghij')})
df.loc[3, 'tstamp'] = pd.NaT

# print() with a single argument runs on both Python 2 and Python 3
print('without timezone:')
try:
    a = df.set_index(['id', 'tstamp']).reset_index('tstamp')
    print('a works')
except Exception as e:
    print('a fails: %s: %s' % (e.__class__.__name__, e))
try:
    b = df.set_index(['id', 'tstamp']).reset_index('tstamp').reset_index('id')
    print('b works')
except Exception as e:
    print('b fails: %s: %s' % (e.__class__.__name__, e))
try:
    c = df.set_index(['id', 'tstamp']).reset_index()
    print('c works')
except Exception as e:
    print('c fails: %s: %s' % (e.__class__.__name__, e))
try:
    d = df.set_index(['id', 'tstamp']).reset_index('id')
    print('d works')
except Exception as e:
    print('d fails: %s: %s' % (e.__class__.__name__, e))
print('with timezone:')
# same four cases again, now with a timezone-aware timestamp column
df['tstamp'] = pd.DatetimeIndex(df['tstamp']).tz_localize('Europe/Berlin')
try:
    a = df.set_index(['id', 'tstamp']).reset_index('tstamp')
    print('a works')
except Exception as e:
    print('a fails: %s: %s' % (e.__class__.__name__, e))
try:
    b = df.set_index(['id', 'tstamp']).reset_index('tstamp').reset_index('id')
    print('b works')
except Exception as e:
    print('b fails: %s: %s' % (e.__class__.__name__, e))
try:
    c = df.set_index(['id', 'tstamp']).reset_index()
    print('c works')
except Exception as e:
    print('c fails: %s: %s' % (e.__class__.__name__, e))
try:
    d = df.set_index(['id', 'tstamp']).reset_index('id')
    print('d works')
except Exception as e:
    print('d fails: %s: %s' % (e.__class__.__name__, e))

Output:
without timezone:
a works
b works
c works
d fails: ValueError: Could not convert object to NumPy datetime
with timezone:
a fails: TypeError: data type not understood
b fails: TypeError: data type not understood
c fails: TypeError: data type not understood
d fails: ValueError: Could not convert object to NumPy datetime

Can you suggest any workaround?

@jreback
Contributor

jreback commented Oct 30, 2015

a lot of these are actually fixed by #11343 as we will be catching these errors in putmask.
cc @sinhrks

all of that said, you generally do not want NaNs of any kind in a MultiIndex; they make your index non-performant and simply hard to work with w.r.t. indexing.
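
One way to follow that advice, as a minimal sketch rather than a recommendation from the maintainers: index only by the unique, NaN-free `id` and keep the timestamp as an ordinary column, selecting on it with a boolean mask. The variable names `by_id`, `mask` and `selected` are illustrative assumptions, not part of the original report.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the report, with one missing timestamp.
tstamp = pd.date_range("2015-07-01 00:00", freq="h", periods=10)
df = pd.DataFrame({"id": np.arange(10), "tstamp": tstamp, "a": list("abcdefghij")})
df.loc[3, "tstamp"] = pd.NaT

# Index by the unique key only; the timestamp stays a regular column,
# so no NaT ever ends up inside a MultiIndex.
by_id = df.set_index("id")

# Time-based selection with a boolean mask. Comparisons against NaT are
# False, so the row with the missing timestamp is left out.
mask = by_id["tstamp"].between(
    pd.Timestamp("2015-07-01 02:00"), pd.Timestamp("2015-07-01 05:00")
)
selected = by_id[mask]
print(selected)
```

This trades the convenience of timestamp-based `.loc` lookups for an index that contains no missing values.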

what are you actually trying to do?

@jreback jreback added Bug Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex Difficulty Intermediate labels Oct 30, 2015
@jreback jreback added this to the Next Major Release milestone Oct 30, 2015
@emsems
Author

emsems commented Nov 2, 2015

Hi, thanks for picking this up!

Yeah, I understand it’s not ideal to have NaN in the index. However, I wrote a class (actually a number of classes) to do my data evaluation, which needs a DataFrame of a defined structure in its .data property. Data points always have an index and mostly a timestamp as well. The index is unique; the timestamp is not always unique, but mostly. Therefore, I figured it would be generally good to have them both in a MultiIndex (because if there is timestamp information, I usually use it to select data).
In some evaluations I prefer, for simplicity, to have only one of the two in the index, hence the drop_index, which is used internally in a method of one of my classes.
I hope this makes some sense without all the context...

@mroeschke
Member

These examples all work on master. Could use tests:

In [189]: import pandas as pd
     ...: from pandas import DataFrame
     ...: import numpy as np
     ...:
     ...: idx = np.arange(0, 10)  # could have an NaN?
     ...: tstamp = pd.date_range('201507010000', freq='h', periods=10).values
     ...: df = DataFrame({'id': idx, 'tstamp': tstamp, 'a': list('abcdefghij')})
     ...: df.loc[3, 'tstamp'] = pd.NaT

In [190]: a = df.set_index(['id', 'tstamp']).reset_index('tstamp')

In [191]: b = df.set_index(['id', 'tstamp']).reset_index('tstamp').reset_index('id')

In [192]: c =  df.set_index(['id', 'tstamp']).reset_index()

In [194]: d =  df.set_index(['id', 'tstamp']).reset_index('id')

In [195]: df['tstamp'] = pd.DatetimeIndex(df['tstamp']).tz_localize('Europe/Berlin')

In [196]: a = df.set_index(['id', 'tstamp']).reset_index('tstamp')

In [197]: b = df.set_index(['id', 'tstamp']).reset_index('tstamp').reset_index('id')

In [198]: c =  df.set_index(['id', 'tstamp']).reset_index()

In [199]: d =  df.set_index(['id', 'tstamp']).reset_index('id')

In [200]: pd.__version__
Out[200]: '0.26.0.dev0+593.g9d45934af'

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Difficulty Intermediate Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex labels Oct 21, 2019
@mroeschke
Member

Actually, the last cases are not fixed yet, as the NaT gets transformed to NaN:

In [24]: d
Out[24]:
                           id  a
tstamp
2015-07-01 00:00:00+02:00   0  a
2015-07-01 01:00:00+02:00   1  b
2015-07-01 02:00:00+02:00   2  c
NaN                         3  d
2015-07-01 04:00:00+02:00   4  e
2015-07-01 05:00:00+02:00   5  f
2015-07-01 06:00:00+02:00   6  g
2015-07-01 07:00:00+02:00   7  h
2015-07-01 08:00:00+02:00   8  i
2015-07-01 09:00:00+02:00   9  j
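
On an affected version, one possible workaround is to convert the resulting index back with `pd.to_datetime`. The sketch below assumes the index came out as object dtype with a float NaN in place of NaT, as the output above suggests; `broken` is a hand-built, hypothetical stand-in for that index, not the actual result of `reset_index`.

```python
import numpy as np
import pandas as pd

good = pd.date_range("2015-07-01", freq="h", periods=3, tz="Europe/Berlin")

# Hypothetical stand-in for the affected index: object dtype,
# tz-aware Timestamps, and np.nan where NaT should be.
broken = pd.Index([good[0], good[1], np.nan], dtype=object, name="tstamp")

# pd.to_datetime parses the objects again and maps the NaN to NaT.
repaired = pd.to_datetime(broken)
print(repaired)
```

Assigning the repaired index back (for example `d.index = repaired`) would restore a datetime index on the frame.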

@mroeschke mroeschke added MultiIndex Datetime Datetime data dtype and removed Needs Tests Unit test(s) needed to prevent regressions good first issue labels Jan 4, 2020
@mroeschke mroeschke changed the title from "Various Exceptions with NaT in MultiIndex and reset_index()" to "BUG: reset_index of level on a MultiIndex with NaT converts to np.nan" Mar 31, 2020
@mroeschke mroeschke added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Mar 31, 2020
@mroeschke
Member

The last case appears to work now

In [8]: import pandas as pd
   ...: from pandas import DataFrame
   ...: import numpy as np
   ...:
   ...: idx = np.arange(0, 10)  # could have an NaN?
   ...: tstamp = pd.date_range('201507010000', freq='h', periods=10).values
   ...: df = DataFrame({'id': idx, 'tstamp': tstamp, 'a': list('abcdefghij')})
   ...: df.loc[3, 'tstamp'] = pd.NaT

In [9]: df.set_index(['id', 'tstamp']).reset_index('id')
Out[9]:
                     id  a
tstamp
2015-07-01 00:00:00   0  a
2015-07-01 01:00:00   1  b
2015-07-01 02:00:00   2  c
NaT                   3  d
2015-07-01 04:00:00   4  e
2015-07-01 05:00:00   5  f
2015-07-01 06:00:00   6  g
2015-07-01 07:00:00   7  h
2015-07-01 08:00:00   8  i
2015-07-01 09:00:00   9  j

In [10]: pd.__version__
Out[10]: '1.3.0.dev0+1368.gccb90d60c6'
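
Since the issue is labelled as needing tests, a regression check along these lines might serve as a starting point. It is only a sketch: the helper name `check_reset_index_keeps_nat` and its `tz` parameter are assumptions for illustration, and the test merged in #41389 may look different.

```python
import numpy as np
import pandas as pd
import pandas.testing as tm


def check_reset_index_keeps_nat(tz=None):
    """Resetting the 'id' level should leave a DatetimeIndex that still holds NaT."""
    tstamp = pd.date_range("2015-07-01", freq="h", periods=4, tz=tz)
    df = pd.DataFrame({"id": np.arange(4), "tstamp": tstamp, "a": list("abcd")})
    df.loc[3, "tstamp"] = pd.NaT

    result = df.set_index(["id", "tstamp"]).reset_index("id")

    # The remaining index should equal the original timestamp column,
    # including its dtype, its name and the missing value.
    expected_index = pd.DatetimeIndex(df["tstamp"], name="tstamp")
    tm.assert_index_equal(result.index, expected_index)
    assert result.index.isna().tolist() == [False, False, False, True]
    return result


# Exercise both the tz-naive and the tz-aware variant from the report.
check_reset_index_keeps_nat()
check_reset_index_keeps_nat(tz="Europe/Berlin")
```

In a pytest suite the `tz` argument could be supplied via `pytest.mark.parametrize` instead of the two explicit calls.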

@mroeschke mroeschke removed Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Apr 20, 2021
@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed MultiIndex Datetime Datetime data dtype labels Apr 20, 2021
@simonjayhawkins simonjayhawkins modified the milestones: Contributions Welcome, 1.3 May 9, 2021