
Conflict b/w skiprows and default quotechar kwargs to pandas.read_table #14459

Closed
rahulporuri opened this issue Oct 20, 2016 · 14 comments
Labels
Bug · IO CSV (read_csv, to_csv) · Regression (functionality that used to work in a prior pandas version)
Milestone
0.19.1

Comments

@rahulporuri

rahulporuri commented Oct 20, 2016

A small, complete example of the issue

While trying to open a data file similar to:

a
b"
c
d
e"
a,b,c
1,2,3
4,5,6
7,8,9

I expect the following code:

python -c "import pandas; df = pandas.read_table('quoted_data.csv', skiprows=5, sep=','); print df"

Expected Output

   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

Observed Output

Empty DataFrame
Columns: [7,8,9]
Index: []

Further Insight

python -c "import pandas; df = pandas.read_table('quoted_data.csv', skiprows=2, sep=','); print df"

surprisingly works. Also,

python -c "import pandas; df = pandas.read_table('quoted_data.csv', skiprows=5, quotechar='?', sep=','); print df"

works as well.

The behavior changed between pandas 0.18.0 and 0.18.1. We suspect the changes made in #12900 are causing this.

Note that the difference between the skiprows value that works (2) and the one that doesn't (5) equals the number of lines in the file between the quote characters.

Apologies for the noise if this has already been reported or is being addressed.

Output of pd.show_versions()

## INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 16.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 23.1.0
Cython: 0.24
numpy: 1.10.4
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: None
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.6.0
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

@pankajp
Contributor

pankajp commented Oct 20, 2016

Some additional notes:

  • The python engine works as expected:
$ python -c "import pandas; df = pandas.read_table('quoted_data.csv', engine='python', skiprows=5, sep=','); print df"
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
  • Commit 5688d27 is the one that changed this behavior for the C engine, as found by git bisect

@jorisvandenbossche added the IO CSV (read_csv, to_csv) label Oct 20, 2016
@jorisvandenbossche
Member

The difference in behaviour between the python and c engines is not good. But the question is which of the two you want.

cc @gfyoung

I suppose newlines in quotes should normally only be regarded as part of the string if the quotes are 'valid'. I mean, b" is not a valid quoted string, and if it occurs in a normal row it is interpreted as the string 'b"', not as the start of a multiline string.

@gfyoung
Member

gfyoung commented Oct 21, 2016

@rahulporuri : Thanks for bringing up this issue! This is not buggy behaviour but rather expected. You get an empty DataFrame because that multi-line quote is considered a single field value. Thus, the five rows you are skipping are 'a', 'b\nc\nd\ne', a,b,c, 1,2,3, and 4,5,6. Once you are inside the quotation marks, anything goes.

Your "surprising" results behave as expected too. The first two rows are 'a' and 'b\nc\nd\ne', and those are skipped, leaving the DataFrame you wanted. The reason your next example "behaves" is that you specified the quote character to be '?'. Now 'b\nc\nd\ne' is no longer interpreted as a single field but rather as multiple lines, because " is no longer a quotation-mark character.
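The quotechar workaround can be reproduced without a file; a minimal self-contained sketch (the data from quoted_data.csv is inlined, and read_csv stands in for read_table with sep=','):

```python
from io import StringIO

import pandas as pd

# Same layout as quoted_data.csv from the report, inlined for a
# self-contained repro.
data = 'a\nb"\nc\nd\ne"\na,b,c\n1,2,3\n4,5,6\n7,8,9\n'

# With quotechar='?', the '"' characters are plain data, so each of the
# five junk lines counts as one physical row and skiprows=5 lands on the
# real header line.
df = pd.read_csv(StringIO(data), skiprows=5, quotechar="?")
print(df)
```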

The Python behaviour is out of our control, because Python's csv library is the foundation of that parser, and when reading the data you provided it parses incorrectly IMO by "ignoring" quotation marks (i.e. it treats the \n inside the quotation marks as line terminators). While it is good that you bring up this inconsistency, I wonder whether it is worthwhile to "rectify" in the long term, if there is any good way to do so. I think we should eventually deprecate the Python parser, because the C parser is objectively superior in both performance and general correctness.

@jorisvandenbossche : While I don't believe there is a real issue to fix, I'm not entirely sure what would be best to do given my explanation above.

@pankajp
Contributor

pankajp commented Oct 25, 2016

I think @jorisvandenbossche's suggestion is reasonable and expected: a quoted field should have quotes at the beginning and end of the field. The example here is artificial, but it has real-world relevance, since ' and " are used in quite a few places to indicate feet and inches.
The python engine's behavior seems appropriate.

@jorisvandenbossche
Member

@gfyoung to illustrate the difference in how quotes are interpreted in skipped rows vs the data rows:

newline in quoted strings ("a\nb") -> regarded as one row

s1 = """a,b,c
1,"a
b",3
2,"c",4"""

In [2]: pd.read_csv(StringIO(s1))
Out[2]: 
   a     b  c
0  1  a\nb  3
1  2     c  4

if you have a similar case in the to-skip header rows ("further\nskip") -> also interpreted as one row -> skiprows=2 instead of 3

s2 = """toskip
"further
skip"
a,b,c
1,"a
b",3
2,"c",4"""

In [4]: pd.read_csv(StringIO(s2), skiprows=2)
Out[4]: 
   a     b  c
0  1  a\nb  3
1  2     c  4

but if the quote is not 'valid' (a"\nb"), it is not interpreted as one row

s3 = """a,b,c
1,a"
b",3
2,"c",4"""

In [7]: pd.read_csv(StringIO(s3))
Out[7]: 
    a   b    c
0   1  a"  NaN
1  b"   3  NaN
2   2   c  4.0

if you have a similar construct in the to-skip header rows (further"\nskip"), it is not regarded as separate rows but as one row, as I still have to use skiprows=2 instead of skiprows=3:

s4 = """toskip
further"
skip"
a,b,c
1,a"
b",3
2,c,4"""

In [9]: pd.read_csv(StringIO(s4), skiprows=2)
Out[9]: 
    a   b    c
0   1  a"  NaN
1  b"   3  NaN
2   2   c  4.0

I am not sure whether we have some kind of definition of what a 'valid' quote is, but in any case there is some inconsistency here, which possibly caused an unintended change in the skiprows behaviour.

@gfyoung
Member

gfyoung commented Oct 25, 2016

@jorisvandenbossche : Hmmm... so I think that s3 is a bug in your examples. The reason is that we do not go back to parsing a quoted field once we are inside the field (i.e. your 'a' character means that the quote becomes in-field). It is not a matter of valid and invalid quotation marks. That can be patched.
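For comparison, Python's stdlib csv module applies the same in-field rule: a quote that appears after a field has started is literal data, not the start of a quoted field (a small sketch):

```python
import csv
from io import StringIO

# A quote at the start of a field opens quoting; a quote after the field
# has started is kept verbatim as data.
rows = list(csv.reader(StringIO('1,a"d\n2,d"d"d\n')))
print(rows)  # [['1', 'a"d'], ['2', 'd"d"d']]
```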

@rahulporuri : Imagine your field value is a multi-line quote. Would you want Python to butcher it?

@jorisvandenbossche
Member

@gfyoung If s3 would be a bug, what would you expect from this:

s = """a,b
1,a"d
2,d"d"d"""

In [14]: pd.read_csv(StringIO(s))
Out[14]: 
   a      b
0  1    a"d
1  2  d"d"d

as these are both examples where quotation marks are not interpreted as starting quotes

@jorisvandenbossche
Member

jorisvandenbossche commented Oct 25, 2016

The reason is that we do not go back to parsing as a quoted field once we are inside the field (i.e. your 'a' character means that the quote becomes in-field).

Yes (that's what I meant with 'invalid' quote, though maybe that's not a good name), so indeed, because the field has already started, the quotation mark is not regarded as the start of a quote. But I don't understand why you say this would be a bug, as you also explain that we deliberately do not switch to quoted-field parsing once we are inside the field.

So why not follow the same reasoning for the header lines? If the line does not start with a quotation mark, you are already 'in-field'.

@gfyoung
Member

gfyoung commented Oct 25, 2016

@jorisvandenbossche : Fair point. Now that I think about it, we could go either way on this:

  1. We consider inline quotation marks as quotation marks to parse.
  2. We consider inline quotation marks as data.

Which one do you think has more use-cases?

@jorisvandenbossche
Member

Given that it is the current behaviour of both the python and c engine, I would go with option 2.
And apart from that, option 2 also makes the most sense to me, I think.

@jorisvandenbossche
Member

But if we think option 2 is the right way, that means that the s3 example from above is not a bug (and that the current behaviour for data rows is the desired one).
But that leaves the question of how to interpret quotes on skipped lines. I would say that we should treat them the same as other rows.
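For what it's worth, in pandas releases that include the eventual fix (0.19.1+), skipped lines do appear to be tokenized like data rows: two junk lines with in-field quotes count as two rows, so skiprows=2 reaches the real header. A minimal sketch, assuming that behavior:

```python
from io import StringIO

import pandas as pd

# Two junk lines, each with an in-field quote; under option 2 they are
# two separate rows, so skiprows=2 skips exactly both of them.
data = 'junk"\nmore"\na,b\n1,2\n3,4\n'
df = pd.read_csv(StringIO(data), skiprows=2)
print(df)
```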

@gfyoung
Member

gfyoung commented Oct 25, 2016

@jorisvandenbossche : Okay, but I suspect we're going to take a major performance hit if we have to differentiate between "quoted fields" and "in-field quotes". For example, what happens if your skipped row has multiple quoted fields in a single row?

I tackled this issue before with the C parser when I implemented quotation mark parsing in skipped rows. Right now, whenever we see a quotation mark, we just let anything and everything pass through.

@jorisvandenbossche
Member

But didn't we already take the performance hit when we started parsing the skipped rows?
Or is there still a large difference between parsing the skipped rows just for quotes and parsing them for both delimiters and quotes?

@gfyoung
Member

gfyoung commented Oct 26, 2016

@jorisvandenbossche : Yes, we did. Now that I think about it, as long as we just check for delimiters (and no other parsing), we should be okay. We might need a couple of extra states in tokenize_bytes, I think.
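The rule the thread converges on (option 2: a quote opens a quoted field only at field start, anywhere else it is data) can be sketched as a tiny logical-row splitter. This is purely an illustration of the semantics, not the actual tokenize_bytes implementation, and it ignores doubled-quote ("") escapes:

```python
# Sketch of option 2 semantics: split text into logical rows, where a
# newline inside a properly opened quoted field does NOT end the row,
# but a quote appearing mid-field is plain data.
def split_logical_rows(text, delim=",", quote='"'):
    rows, buf = [], []
    in_quote = False     # inside a quoted field opened at field start
    field_start = True   # the next character begins a new field
    for ch in text:
        if in_quote:
            buf.append(ch)
            if ch == quote:
                in_quote = False          # closing quote
        elif ch == quote and field_start:
            in_quote = True               # opening quote at field start
            buf.append(ch)
            field_start = False
        elif ch == delim:
            buf.append(ch)
            field_start = True
        elif ch == "\n":
            rows.append("".join(buf))     # newline outside quotes ends a row
            buf = []
            field_start = True
        else:
            buf.append(ch)
            field_start = False
    if buf:
        rows.append("".join(buf))
    return rows

# The quoted multiline header from s2 counts as ONE logical row...
print(split_logical_rows('toskip\n"further\nskip"\n'))
# ...while the in-field quote from s4 yields SEPARATE rows.
print(split_logical_rows('toskip\nfurther"\nskip"\n'))
```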

gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 27, 2016
@jorisvandenbossche added this to the 0.19.1 milestone Oct 27, 2016
@jorisvandenbossche added the Regression (functionality that used to work in a prior pandas version) label Oct 27, 2016
gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 27, 2016
gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 28, 2016
gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 30, 2016
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this issue Nov 2, 2016