BUG: Unexpected behaviour when reading large text files with mixed datatypes #3866

Closed
martingoodson opened this issue Jun 12, 2013 · 5 comments · Fixed by #4991
Labels: Bug · IO CSV (read_csv, to_csv) · IO Data (IO issues that don't fit into a more specific label)

Comments

@martingoodson

read_csv gives unexpected behaviour with large files if a column contains both strings and integers, e.g.:

>>> import pandas
>>> from pandas import DataFrame, read_csv
>>> df = DataFrame({'colA': range(500000-1) + ['apple', 'pear'] + range(500000-1)})
>>> len(set(df.colA))
500001

>>> df.to_csv('testpandas2.txt')
>>> df2=read_csv('testpandas2.txt')
>>> len(set(df2.colA))
762143

>>> pandas.__version__
'0.11.0'

It seems some of the integers are parsed as integers and others as strings.

>>> list(set(df2.colA))[-10:]
['282248', '282249', '282240', '282241', '282242', '15679', '282244', '282245', '282246', '282247']
>>> list(set(df2.colA))[:10]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
@jreback
Contributor

jreback commented Jun 12, 2013

In [1]: df2=read_csv('testpandas2.txt',index_col=0)

In [2]: df2
Out[2]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 1 columns):
colA    1000000  non-null values
dtypes: object(1)

In [3]: from collections import Counter

In [4]: Counter(df2.colA.apply(lambda x: type(x)))
Out[4]: Counter({<type 'int'>: 737856, <type 'str'>: 262144})

So the way parsing works (when you don't specify a specific dtype) is that on a particular column you loop over all dtypes and try to convert to an actual type; if something breaks you go to the next dtype. The data is modified in-place, so the rows before the strings are converted to integers; when the parser hits the strings the conversion stops and the column is marked object.

so the end result is the correct dtype.

you essentially want downcasting back to strings for object dtype; easy enough, specify object as the dtype for this column (see the sketch after this comment).

If you want this automatic I think we'd have to provide an option to do it, because that would be inefficient parsing-speed-wise, as you have to copy the column for every dtype you try

can you explain why this actually matters?
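A minimal sketch of the workaround jreback describes, assuming the testpandas2.txt file written above; dtype is the standard read_csv parameter, and pinning colA to str keeps every value a string on the round trip:

import pandas as pd

# Pin colA's dtype so no run of rows gets int-converted.
df2 = pd.read_csv('testpandas2.txt', index_col=0, dtype={'colA': str})

print(df2['colA'].map(type).value_counts())  # a single type: str
print(df2['colA'].nunique())                 # 500001, matching the original frame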

@martingoodson
Author

I'm not sure I understand. Why aren't there 500K integers and 500K+2 strings, if everything after the first string is encountered gets parsed as a string?

This matters because if you try to aggregate using the object-type column as a key, the results will be incorrect: you get twice as many keys as you actually intended. Thus even trying to find the number of unique keys in a table, a fairly basic task, will not work as expected.
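To make the aggregation point concrete, a hypothetical toy example (not from the thread): the int 7 and the string '7' print identically in a CSV dump but hash as distinct keys once the column holds mixed types:

import pandas as pd

# 7 (int) and '7' (str) look the same on disk but are distinct keys
# in a mixed-type object column.
s = pd.Series([7, '7', 8, '8'], dtype=object)
print(s.nunique())  # 4, though the CSV only ever showed 7 and 8

# sort=False avoids comparing int with str when ordering group keys.
df = pd.DataFrame({'key': [7, '7'], 'val': [1, 1]})
print(df.groupby('key', sort=False)['val'].sum())  # two groups, not one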


@jreback
Contributor

jreback commented Jun 12, 2013

@wesm pls take a look

so the int conversion stops at 262144, which is exactly 2**18 (2**16 * 4)... weird, must be something odd going on
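That power-of-two boundary is consistent with chunked type inference in the C parser. In later pandas releases the documented mitigations are low_memory=False (one inference pass over the whole column) or an explicit dtype; that the 0.11.0 parser chunked at exactly 262144 rows is an inference from the number above, not something stated in this thread:

import pandas as pd

# low_memory=False makes read_csv infer each column's dtype over the
# whole file at once instead of chunk by chunk, so colA comes back as
# one consistent column rather than a mixed int/str one.
df2 = pd.read_csv('testpandas2.txt', index_col=0, low_memory=False)
print(df2['colA'].map(type).value_counts())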

@jreback
Contributor

jreback commented Jul 10, 2013

I can repro, but the fix is eluding me :)

@amcpherson
Contributor

amcpherson commented Dec 2, 2024

@jreback Did this issue get fixed? This is a very common source of bugs in code written by my developers, and there's a reason for that: pandas is doing something unexpected. If datatype inference fails, it should not fail silently and produce a mixed-datatype column; it should fail with an exception.
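pandas still infers silently here, but the loud failure asked for above can be approximated today: read the column as strings, then convert explicitly with pd.to_numeric(errors='raise') so a non-numeric value surfaces as an exception. A minimal sketch, reusing the file and column names from this thread:

import pandas as pd

df = pd.read_csv('testpandas2.txt', index_col=0, dtype={'colA': str})

# errors='raise' turns a silently mixed column into an explicit failure:
# to_numeric raises ValueError when it hits 'apple' or 'pear'.
try:
    df['colA'] = pd.to_numeric(df['colA'], errors='raise')
except ValueError as exc:
    print('dtype conversion failed loudly:', exc)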
