Conflict b/w skiprows and default quotechar kwargs to pandas.read_table #14459
Comments
Some additional notes:
|
The difference in behaviour between the python and c engines is not good. But the question is which of the two behaviours you want. cc @gfyoung I suppose normally newlines in quotes should only be regarded as part of the string if the quotes are 'valid'. I mean, |
@rahulporuri : Thanks for bringing up this issue! This is not a bug but rather expected behaviour. The reason you get an empty DataFrame is that the multi-line quote is considered to be a single field value. Thus, the five rows you are skipping are
Your "surprising" results behave as expected too. The first two rows are
The Python behaviour is out of our control because the
@jorisvandenbossche : While I don't believe there is a real issue to fix, I am not entirely sure what would be best to do given my explanation above. |
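To make the "single field value" point concrete, here is a minimal sketch using Python's standard csv module rather than pandas itself. The sample data is made up for illustration and is not the file from the original report.

```python
import csv
import io

# Hypothetical sample data (not the file from the report): five physical
# lines, where the quoted field spanning lines 2-4 contains embedded newlines.
data = 'a,b\n"line one\nline two\nline three",1\nx,2\n'

physical_lines = data.splitlines()
records = list(csv.reader(io.StringIO(data)))

print(len(physical_lines))  # count of physical lines in the text
print(len(records))         # count of logical records seen by the parser
print(records[1])           # the record holding the multi-line field
```

The physical line count and the logical record count differ, which is why a skiprows value chosen by counting lines in a text editor can skip more than intended.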
I think @jorisvandenbossche's suggestion is reasonable and expected: a quoted field should have quotes at the beginning and end of the field. The example here is artificial but has uses in real-world data with |
@gfyoung to illustrate the difference in how quotes are interpreted in skipped rows vs the data rows: newline in quoted strings (
if you have a similar case in the header rows to skip (
but if the quote is not 'valid' (
if you have a similar construct in the header rows to skip (
I am not sure if we have some kind of definition of what a 'valid' quote is, but in any case there is some inconsistency here, which possibly caused an unintended change in the |
@jorisvandenbossche : Hmmm...so I think that @rahulporuri : Imagine your field value is a multi-line quote. Would you want Python to butcher it? |
@gfyoung If
as these are both examples where quotation marks are not interpreted as starting quotes |
Yes (that's what I meant by 'invalid' quote, but maybe not a good name), so indeed because the field has already started, the quotation mark is not regarded as the start of a quote. But I don't understand why you say this would be a bug, as you also explain that we deliberately do not treat it as a quoted field once we are inside the field. So why not follow the same reasoning for the header lines? If the line does not start with a quotation mark, you are already 'in-field' |
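As a small illustration of the 'in-field' reasoning, the sketch below again uses Python's standard csv module (not pandas) on made-up data. Each line has a quotation mark in the middle of an unquoted field, so the mark is kept as a literal character and the newline still ends the record.

```python
import csv
import io

# Hypothetical data: the quotation marks appear after the field has already
# started, so they do not open a quoted field.
data = 'x"y,1\nz"w,2\n'

rows = list(csv.reader(io.StringIO(data)))

for row in rows:
    print(row)
```

Here the two physical lines stay two separate records, in contrast to the case where a field begins with a quotation mark.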
@jorisvandenbossche : Fair point. Now that I think about it, we could go either way on this:
Which one do you think has more use-cases? |
Given that it is the current behaviour of both the python and c engine, I would go with option 2. |
But if we think option 2 is the right way, that means that the |
@jorisvandenbossche : Okay, but I suspect we're going to take a major performance hit if we have to differentiate between "quoted fields" and "in-field quotes". For example, what happens if your skipped row has multiple quoted fields in a single row? I tackled this issue before with the C parser when I implemented quotation mark parsing in skipped rows. Right now, whenever we see a quotation mark, we just let anything and everything pass through. |
But didn't we get the performance hit already when we started parsing the skipped rows? |
@jorisvandenbossche : Yes, we did. Now that I think about it, as long as we just check for delimiters (and no other parsing), we could be okay. We might need a couple of other states I think in |
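A rough sketch of what "just check for delimiters" with a few extra states could look like is given below. This is not the pandas C tokenizer: the function name skip_rows, the state names, and the sample strings are all hypothetical, and only a newline character is treated as a line terminator.

```python
def skip_rows(text, nrows, delimiter=",", quotechar='"'):
    """Hypothetical helper: return the index just past `nrows` logical rows.

    A quotation mark opens a quoted field only at the start of a field;
    inside an unquoted field it is treated as an ordinary character.
    """
    START_FIELD, IN_FIELD, IN_QUOTED, QUOTE_IN_QUOTED = range(4)
    state = START_FIELD
    skipped = 0
    for i, ch in enumerate(text):
        if skipped == nrows:
            return i
        if state == START_FIELD:
            if ch == quotechar:
                state = IN_QUOTED        # quote at field start: quoted field
            elif ch == "\n":
                skipped += 1             # empty field ends the row
            elif ch != delimiter:
                state = IN_FIELD
        elif state == IN_FIELD:
            if ch == delimiter:
                state = START_FIELD
            elif ch == "\n":
                skipped += 1
                state = START_FIELD
            # a quotechar here is an in-field quote: ignore it
        elif state == IN_QUOTED:
            if ch == quotechar:
                state = QUOTE_IN_QUOTED  # closing quote or doubled quote
            # newlines inside a quoted field do not end the row
        else:  # QUOTE_IN_QUOTED
            if ch == quotechar:
                state = IN_QUOTED        # doubled quote, still quoted
            elif ch == delimiter:
                state = START_FIELD
            elif ch == "\n":
                skipped += 1
                state = START_FIELD
            else:
                state = IN_FIELD
    return len(text)


# In-field quotes: each physical line counts as one row.
in_field = 'x"y\nz"w\na,b\n1,2\n'
rest_in_field = in_field[skip_rows(in_field, 2):]

# Quoted field at field start: the embedded newline does not end the row.
quoted = '"one\ntwo",h\na,b\n'
rest_quoted = quoted[skip_rows(quoted, 1):]

print(repr(rest_in_field))
print(repr(rest_quoted))
```

The only per-character work is comparing against the delimiter, the quote character, and the newline, which is the kind of limited checking the comment above has in mind.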
… lines (pandas-dev#14514) Closes pandas-devgh-14459. (cherry picked from commit b088112)
A small, complete example of the issue
while trying to open a data file similar to
i expect the following code
Expected Output
Observed Output
Further Insight
surprisingly works. also,
works
The behavior changed between pandas 0.18.0 and 0.18.1. We suspect changes made in #12900 to be causing this.
Note that the difference between the skiprows value that works (2) and the one that doesn't (5) is the same as the number of lines in the file between quote chars.
Apologies for the noise if this has already been reported or is being addressed.
Output of pd.show_versions()
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 16.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 23.1.0
Cython: 0.24
numpy: 1.10.4
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: None
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.6.0
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None