BUG: Unexpected behaviour when reading large text files with mixed datatypes #3866

Closed
martingoodson opened this issue Jun 12, 2013 · 5 comments · Fixed by #4991
Labels: Bug · IO CSV (read_csv, to_csv) · IO Data (IO issues that don't fit into a more specific label)

Comments

@martingoodson

read_csv gives unexpected behaviour with large files if a column contains both strings and integers, e.g.:

>>> import pandas
>>> from pandas import DataFrame, read_csv
>>> df = DataFrame({'colA': range(500000-1) + ['apple', 'pear'] + range(500000-1)})
>>> len(set(df.colA))
500001

>>> df.to_csv('testpandas2.txt')
>>> df2=read_csv('testpandas2.txt')
>>> len(set(df2.colA))
762143

>>> pandas.__version__
'0.11.0'

It seems some of the integers are parsed as integers and others as strings.

>>> list(set(df2.colA))[-10:]
['282248', '282249', '282240', '282241', '282242', '15679', '282244', '282245', '282246', '282247']
>>> list(set(df2.colA))[:10]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
@jreback
Contributor

jreback commented Jun 12, 2013

In [1]: df2=read_csv('testpandas2.txt',index_col=0)

In [2]: df2
Out[2]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 1 columns):
colA    1000000  non-null values
dtypes: object(1)

In [3]: from collections import Counter

In [4]: Counter(df2.colA.apply(lambda x: type(x)))
Out[4]: Counter({<type 'int'>: 737856, <type 'str'>: 262144})

So the way parsing works (when you don't specify a specific dtype) is that on a particular column you loop over all dtypes and try to convert to an actual type; if something breaks you go to the next dtype. The data is modified in-place, so the rows before the strings are converted to integers; when the parser hits the strings the conversion stops and the column is marked object.

so the end result is the correct dtype.

you essentially want downcasting back to strings for object dtype; easy enough, specify object as the dtype for this column (see the sketch after this comment).

If you want this automatic I think we'd have to provide an option to do it, because that would be inefficient parsing-speed-wise, as you have to copy the column for every dtype you try

can you explain why this actually matters?
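A minimal sketch of the workaround jreback describes, assuming the testpandas2.txt file written above; dtype is the standard read_csv parameter, and pinning colA to str keeps every value a string on the round trip:

import pandas as pd

# Pin colA's dtype so no run of rows gets int-converted.
df2 = pd.read_csv('testpandas2.txt', index_col=0, dtype={'colA': str})

print(df2['colA'].map(type).value_counts())  # a single type: str
print(df2['colA'].nunique())                 # 500001, matching the original frame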

@martingoodson
Author

I'm not sure I understand. Why aren't there 500K integers and 500K+2 strings, if everything after the first string is encountered gets parsed as a string?

This matters because if you try to aggregate using the object-type column as a key, the results will be incorrect: you get twice as many keys as you actually intended. Thus even trying to find the number of unique keys in a table, a fairly basic task, will not work as expected.
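To make the aggregation point concrete, a hypothetical toy example (not from the thread): the int 7 and the string '7' print identically in a CSV dump but hash as distinct keys once the column holds mixed types:

import pandas as pd

# 7 (int) and '7' (str) look the same on disk but are distinct keys
# in a mixed-type object column.
s = pd.Series([7, '7', 8, '8'], dtype=object)
print(s.nunique())  # 4, though the CSV only ever showed 7 and 8

# sort=False avoids comparing int with str when ordering group keys.
df = pd.DataFrame({'key': [7, '7'], 'val': [1, 1]})
print(df.groupby('key', sort=False)['val'].sum())  # two groups, not one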


@jreback
Contributor

jreback commented Jun 12, 2013

@wesm pls take a look

so the int conversion stops at 262144, which is exactly 2**18 (2**16 * 4)... weird, must be something odd going on
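That power-of-two boundary is consistent with chunked type inference in the C parser. In later pandas releases the documented mitigations are low_memory=False (one inference pass over the whole column) or an explicit dtype; that the 0.11.0 parser chunked at exactly 262144 rows is an inference from the number above, not something stated in this thread:

import pandas as pd

# low_memory=False makes read_csv infer each column's dtype over the
# whole file at once instead of chunk by chunk, so colA comes back as
# one consistent column rather than a mixed int/str one.
df2 = pd.read_csv('testpandas2.txt', index_col=0, low_memory=False)
print(df2['colA'].map(type).value_counts())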

@jreback
Contributor

jreback commented Jul 10, 2013

I can repro, but the fix is eluding me :)

@amcpherson
Contributor

amcpherson commented Dec 2, 2024

@jreback Did this issue get fixed? This is a very common source of bugs in code written by my developers, and there's a reason for that: pandas is doing something unexpected. If datatype inference fails, it should not fail silently and produce a mixed-datatype column; it should fail with an exception.
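pandas still infers silently here, but the loud failure asked for above can be approximated today: read the column as strings, then convert explicitly with pd.to_numeric(errors='raise') so a non-numeric value surfaces as an exception. A minimal sketch, reusing the file and column names from this thread:

import pandas as pd

df = pd.read_csv('testpandas2.txt', index_col=0, dtype={'colA': str})

# errors='raise' turns a silently mixed column into an explicit failure:
# to_numeric raises ValueError when it hits 'apple' or 'pear'.
try:
    df['colA'] = pd.to_numeric(df['colA'], errors='raise')
except ValueError as exc:
    print('dtype conversion failed loudly:', exc)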
