
Conflict b/w skiprows and default quotechar kwargs to pandas.read_table #14459

Closed
rahulporuri opened this issue Oct 20, 2016 · 14 comments
Labels
Bug · IO CSV (read_csv, to_csv) · Regression (functionality that used to work in a prior pandas version)
Milestone
0.19.1

Comments

@rahulporuri

rahulporuri commented Oct 20, 2016

A small, complete example of the issue

While trying to open a data file similar to:

a
b"
c
d
e"
a,b,c
1,2,3
4,5,6
7,8,9

I expect the following code:

python -c "import pandas; df = pandas.read_table('quoted_data.csv', skiprows=5, sep=','); print df"

Expected Output

   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

Observed Output

Empty DataFrame
Columns: [7,8,9]
Index: []

Further Insight

python -c "import pandas; df = pandas.read_table('quoted_data.csv', skiprows=2, sep=','); print df"

surprisingly works. Also,

python -c "import pandas; df = pandas.read_table('quoted_data.csv', skiprows=5, quotechar='?', sep=','); print df"

works as well.

The behavior changed between pandas 0.18.0 and 0.18.1. We suspect the changes made in #12900 are causing this.

Note that the difference between the skiprows value that works (2) and the one that doesn't (5) equals the number of lines in the file between the quote characters.

Apologies for the noise if this has already been reported or is being addressed.

Output of pd.show_versions()

## INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 16.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 23.1.0
Cython: 0.24
numpy: 1.10.4
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: None
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.6.0
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

@pankajp
Contributor

pankajp commented Oct 20, 2016

Some additional notes:

  • The python engine works as expected:
$ python -c "import pandas; df = pandas.read_table('quoted_data.csv', engine='python', skiprows=5, sep=','); print df"
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
  • Commit 5688d27 is the one that changed this behavior for the C engine, as found by git bisect

@jorisvandenbossche added the IO CSV (read_csv, to_csv) label Oct 20, 2016
@jorisvandenbossche
Member

The difference in behaviour between the python and c engines is not good. But the question is which of the two you want.

cc @gfyoung

I suppose newlines in quotes should normally only be regarded as part of the string if the quotes are 'valid'. I mean, b" is not a valid quoted string, and if it occurs in a normal row it is interpreted as the string 'b"', not as the start of a multiline string.

@gfyoung
Member

gfyoung commented Oct 21, 2016

@rahulporuri : Thanks for bringing up this issue! This is not buggy behaviour but rather expected. You get an empty DataFrame because that multi-line quote is considered a single field value. Thus, the five rows you are skipping are 'a', 'b\nc\nd\ne', a,b,c, 1,2,3, and 4,5,6. Once you are inside the quotation marks, anything goes.

Your "surprising" results behave as expected too. The first two rows are 'a' and 'b\nc\nd\ne', and those are skipped, leaving the DataFrame you wanted. The reason your next example "behaves" is that you specified the quote character to be '?'. Now 'b\nc\nd\ne' is no longer interpreted as a single field but rather as multiple lines, because " is no longer a quotation-mark character.
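The quotechar workaround can be reproduced without a file; a minimal self-contained sketch (the data from quoted_data.csv is inlined, and read_csv stands in for read_table with sep=','):

```python
from io import StringIO

import pandas as pd

# Same layout as quoted_data.csv from the report, inlined for a
# self-contained repro.
data = 'a\nb"\nc\nd\ne"\na,b,c\n1,2,3\n4,5,6\n7,8,9\n'

# With quotechar='?', the '"' characters are plain data, so each of the
# five junk lines counts as one physical row and skiprows=5 lands on the
# real header line.
df = pd.read_csv(StringIO(data), skiprows=5, quotechar="?")
print(df)
```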

The Python behaviour is out of our control, because Python's csv library is the foundation of that parser, and when reading the data you provided it parses incorrectly IMO by "ignoring" quotation marks (i.e. it treats the \n inside the quotation marks as line terminators). While it is good that you bring up this inconsistency, I wonder whether it is worthwhile to "rectify" in the long term, if there is any good way to do so. I think we should eventually deprecate the Python parser, because the C parser is objectively superior in both performance and general correctness.

@jorisvandenbossche : While I don't believe there is a real issue to fix, I'm not entirely sure what would be best to do given my explanation above.

@pankajp
Contributor

pankajp commented Oct 25, 2016

I think @jorisvandenbossche's suggestion is reasonable and expected: a quoted field should have quotes at the beginning and end of the field. The example here is artificial, but it has real-world relevance, since ' and " are used in quite a few places to indicate feet and inches.
The python engine's behavior seems appropriate.

@jorisvandenbossche
Member

@gfyoung to illustrate the difference in how quotes are interpreted in skipped rows vs the data rows:

newline in quoted strings ("a\nb") -> regarded as one row

s1 = """a,b,c
1,"a
b",3
2,"c",4"""

In [2]: pd.read_csv(StringIO(s1))
Out[2]: 
   a     b  c
0  1  a\nb  3
1  2     c  4

if you have a similar case in the to-skip header rows ("further\nskip") -> also interpreted as one row -> skiprows=2 instead of 3

s2 = """toskip
"further
skip"
a,b,c
1,"a
b",3
2,"c",4"""

In [4]: pd.read_csv(StringIO(s2), skiprows=2)
Out[4]: 
   a     b  c
0  1  a\nb  3
1  2     c  4

but if the quote is not 'valid' (a"\nb"), it is not interpreted as one row

s3 = """a,b,c
1,a"
b",3
2,"c",4"""

In [7]: pd.read_csv(StringIO(s3))
Out[7]: 
    a   b    c
0   1  a"  NaN
1  b"   3  NaN
2   2   c  4.0

if you have a similar construct in the to-skip header rows (further"\nskip"), it is not regarded as separate rows but as one row, as I still have to use skiprows=2 instead of skiprows=3:

s4 = """toskip
further"
skip"
a,b,c
1,a"
b",3
2,c,4"""

In [9]: pd.read_csv(StringIO(s4), skiprows=2)
Out[9]: 
    a   b    c
0   1  a"  NaN
1  b"   3  NaN
2   2   c  4.0

I am not sure whether we have some kind of definition of what a 'valid' quote is, but in any case there is some inconsistency here, which possibly caused an unintended change in the skiprows behaviour.

@gfyoung
Member

gfyoung commented Oct 25, 2016

@jorisvandenbossche : Hmmm... so I think that s3 is a bug in your examples. The reason is that we do not go back to parsing a quoted field once we are inside the field (i.e. your 'a' character means that the quote becomes in-field). It is not a matter of valid and invalid quotation marks. That can be patched.
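For comparison, Python's stdlib csv module applies the same in-field rule: a quote that appears after a field has started is literal data, not the start of a quoted field (a small sketch):

```python
import csv
from io import StringIO

# A quote at the start of a field opens quoting; a quote after the field
# has started is kept verbatim as data.
rows = list(csv.reader(StringIO('1,a"d\n2,d"d"d\n')))
print(rows)  # [['1', 'a"d'], ['2', 'd"d"d']]
```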

@rahulporuri : Imagine your field value is a multi-line quote. Would you want Python to butcher it?

@jorisvandenbossche
Member

@gfyoung If s3 would be a bug, what would you expect from this:

s = """a,b
1,a"d
2,d"d"d"""

In [14]: pd.read_csv(StringIO(s))
Out[14]: 
   a      b
0  1    a"d
1  2  d"d"d

as these are both examples where quotation marks are not interpreted as starting quotes

@jorisvandenbossche
Member

jorisvandenbossche commented Oct 25, 2016

The reason is that we do not go back to parsing as a quoted field once we are inside the field (i.e. your 'a' character means that the quote becomes in-field).

Yes (that's what I meant with 'invalid' quote, though maybe that's not a good name), so indeed, because the field has already started, the quotation mark is not regarded as the start of a quote. But I don't understand why you say this would be a bug, as you also explain that we deliberately do not switch to quoted-field parsing once we are inside the field.

So why not follow the same reasoning for the header lines? If the line does not start with a quotation mark, you are already 'in-field'.

@gfyoung
Member

gfyoung commented Oct 25, 2016

@jorisvandenbossche : Fair point. Now that I think about it, we could go either way on this:

  1. We consider inline quotation marks as quotation marks to parse.
  2. We consider inline quotation marks as data.

Which one do you think has more use-cases?

@jorisvandenbossche
Member

Given that it is the current behaviour of both the python and c engine, I would go with option 2.
And apart from that, option 2 also makes the most sense to me, I think.

@jorisvandenbossche
Member

But if we think option 2 is the right way, that means that the s3 example from above is not a bug (and that the current behaviour for data rows is the desired one).
But that leaves the question of how to interpret quotes on skipped lines. I would say that we should treat them the same as other rows.
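For what it's worth, in pandas releases that include the eventual fix (0.19.1+), skipped lines do appear to be tokenized like data rows: two junk lines with in-field quotes count as two rows, so skiprows=2 reaches the real header. A minimal sketch, assuming that behavior:

```python
from io import StringIO

import pandas as pd

# Two junk lines, each with an in-field quote; under option 2 they are
# two separate rows, so skiprows=2 skips exactly both of them.
data = 'junk"\nmore"\na,b\n1,2\n3,4\n'
df = pd.read_csv(StringIO(data), skiprows=2)
print(df)
```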

@gfyoung
Member

gfyoung commented Oct 25, 2016

@jorisvandenbossche : Okay, but I suspect we're going to take a major performance hit if we have to differentiate between "quoted fields" and "in-field quotes". For example, what happens if your skipped row has multiple quoted fields in a single row?

I tackled this issue before with the C parser when I implemented quotation mark parsing in skipped rows. Right now, whenever we see a quotation mark, we just let anything and everything pass through.

@jorisvandenbossche
Member

But didn't we already take the performance hit when we started parsing the skipped rows?
Or is there still a large difference between parsing the skipped rows just for quotes and parsing them for both delimiters and quotes?

@gfyoung
Member

gfyoung commented Oct 26, 2016

@jorisvandenbossche : Yes, we did. Now that I think about it, as long as we just check for delimiters (and no other parsing), we should be okay. We might need a couple of extra states in tokenize_bytes, I think.
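The rule the thread converges on (option 2: a quote opens a quoted field only at field start, anywhere else it is data) can be sketched as a tiny logical-row splitter. This is purely an illustration of the semantics, not the actual tokenize_bytes implementation, and it ignores doubled-quote ("") escapes:

```python
# Sketch of option 2 semantics: split text into logical rows, where a
# newline inside a properly opened quoted field does NOT end the row,
# but a quote appearing mid-field is plain data.
def split_logical_rows(text, delim=",", quote='"'):
    rows, buf = [], []
    in_quote = False     # inside a quoted field opened at field start
    field_start = True   # the next character begins a new field
    for ch in text:
        if in_quote:
            buf.append(ch)
            if ch == quote:
                in_quote = False          # closing quote
        elif ch == quote and field_start:
            in_quote = True               # opening quote at field start
            buf.append(ch)
            field_start = False
        elif ch == delim:
            buf.append(ch)
            field_start = True
        elif ch == "\n":
            rows.append("".join(buf))     # newline outside quotes ends a row
            buf = []
            field_start = True
        else:
            buf.append(ch)
            field_start = False
    if buf:
        rows.append("".join(buf))
    return rows

# The quoted multiline header from s2 counts as ONE logical row...
print(split_logical_rows('toskip\n"further\nskip"\n'))
# ...while the in-field quote from s4 yields SEPARATE rows.
print(split_logical_rows('toskip\nfurther"\nskip"\n'))
```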

gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 27, 2016
@jorisvandenbossche added this to the 0.19.1 milestone Oct 27, 2016
@jorisvandenbossche added the Regression (functionality that used to work in a prior pandas version) label Oct 27, 2016
gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 27, 2016
gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 28, 2016
gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 30, 2016
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this issue Nov 2, 2016