Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

import dataframe using set_dataframe, TypeError when column dtype: 'Int64' (with <NA> and integer) #405

Closed
keyapi opened this issue Feb 20, 2020 · 8 comments

Comments

@keyapi
Copy link

keyapi commented Feb 20, 2020

In pandas dataframe, dtype of a column with NaN+integer would be changed into 'float', in cases of must-keep-integer, it's useful to replace dtype:'float' with dtype:'Int64', see: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html

I use wks.set_dataframe(df, (1,1), nan="") to transfer a dataframe to a gsheet worksheet, but the pandas column with dtype:'float' (mixture of NaN and integer) could only be transfered as 'float' in a gsheet, but what i wanted for this column is only ""+integer, not float.

I also tried changing the dtype of the column first into 'Int64', e.g. df.column = df.column.astype('Int64') and then set_dataframe(), but got a "TypeError: <U1 cannot be converted to an IntegerDtype"

solution 1:
new parameter in set_dataframe() to import float as int

solution 2:
new parameter in set_dataframe() to deal with dytpe: "Int64", take < > like a "NaN"

@keyapi keyapi changed the title import dataframe using set_dataframe, TypeError when column dtype: 'Int64' (with NaN and integer) import dataframe using set_dataframe, TypeError when column dtype: 'Int64' (with <NA> and integer) Feb 20, 2020
@nithinmurali
Copy link
Owner

Can you please provide a small example to reproduce this?

@keyapi
Copy link
Author

keyapi commented Feb 23, 2020

Can you please provide a small example to reproduce this?

example snippet, set_dataframe() cloumn with 2 dtype: float / Int64, nullable integer

https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html

import numpy as np
import pandas as pd
import pygsheets

pygsheets authorize and worksheet

gc = pygsheets.authorize()
wks = gc.open_by_key().sheet1

try 1: a dataframe with 2 columns, 'col2' with a NaN

d = {'col1': [1, 2], 'col2': [3, np.nan]}
df = pd.DataFrame(d)
print(df.dtypes)
col1 int64
col2 float64

print(df)

col1 col2
0 1 3.0
1 2 NaN

wks.set_dataframe(df, (1,1), nan="")

worksheet, 'col2' will keep float and "", "" is good, but 3.0 is not what i want

col1 col2
0 1 3.0
1 2

try 2: make a copy and change dtype of 'col2' to 'Int64' (not int64)

df_Int64 = df.copy()
df_Int64['col2'] = df_Int64['col2'].astype('Int64')
print(df_Int64.dtypes)
col1 int64
col2 Int64

print(df_Int64) # 'col2' will keep integer 3 and NaN, both are what i want

col1 col2
0 1 3
1 2 NaN

wks.set_dataframe(df_Int64, (1,1), nan="")

TypeError: <U1 cannot be converted to an IntegerDtype

@gosuto-inzasheru
Copy link
Contributor

gosuto-inzasheru commented Feb 26, 2020

Workaround for this is to call set_dataframe() with nan=pd.NA. It will prevent to attempt to cast the pd.NA values to string.

@gosuto-inzasheru
Copy link
Contributor

Problem is that pandas v1.0.0 introduced a new dtype; Int64 (opposed to int64). This dtype allows for <NA> in an integer column whereas previously this was only possible with dtype object.

Calling .fillna() on this new dtype is what is causing said error; an Int64 column cannot contain the NaN string that is provided by the kwargs by default.

df = df.fillna(nan)

My suggestion would be to drop the nan= feature completely and let .astype('unicode') handle the casting.

If you want I can pick up this issue and submit a PR.

@keyapi
Copy link
Author

keyapi commented Feb 27, 2020

Workaround for this is to call set_dataframe() with nan=pd.NA. It will prevent to attempt to cast the pd.NA values to string.

do you mean wks.set_dataframe(df_Int64, (1,1), nan=np.nan) ?
i tried, got no Error, gesheet shows

col1 col2
1 3
2 nan

but nan is not what i expected for this column on gsheet, better kept as empty like below.

col1 col2
1 3
2

Problem is that pandas v1.0.0 introduced a new dtype; Int64

That dtype Int64 should be since pandas 0.24.0 (Jan 2019), i tried it last year

pygsheets/pygsheets/worksheet.py

Line 1303 in 8a74911

df = df.fillna(nan)

i think the gerneral problem: .fillna() + Int64 lies in pandas.

My suggestion would be to drop the nan= feature completely and let .astype('unicode') handle the casting.

i think the option nan= for .set_dataframe() is still convenient for simple user like me, who just transfers DataFrame to gsheets.

your sugesstion .astype('unicode) is a good option, would be better just to change the column with dtype:'Int64', but i tried pandas 0.25 and 1.0.1, both can still not distinguish dtypes int64 and 'Int64'
print(df_Int64.dtypes) will give 'col1' as int64 and col2 as Int64(not int64)
but print([col for col in list(df_Int64.columns) if df_Int64[col].dtypes=='Int64']) will give a list ['col1', 'col2'], wtf... still pandas problem

@ldsalomone
Copy link

ldsalomone commented Mar 6, 2020

This isn't a problem within pygsheets, but instead a problem with pandas: pandas-dev/pandas#25288

A fix can be done by calling astype('object') before .fillna().

d = {'col1': [1, 2], 'col2': [3, np.nan]}
df = pd.DataFrame(d)

df.col2 = df.col2.astype('Int64')
print(df)
print(df.col2.dtype)
print(type(df.col2[0]))
print(type(df.col2[1]))

df = df.astype('object').fillna('')
print(df)
print(df.col2.dtype)
print(type(df.col2[0]))
print(type(df.col2[1]))

This is also a non-issue because the first version of the function worked (without the Int62 dtype) and once uploaded to google sheets, just tell google sheets to only display 0 decimal point for your whole table.

@gosuto-inzasheru
Copy link
Contributor

@keyapi are you still experiencing problems? Latest version should have it fixed.

@nithinmurali
Copy link
Owner

Fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants