xlrd: in2csv fails to load codepage 21010 xls files #859

Closed
acook opened this issue Jul 14, 2017 · 8 comments · Fixed by #861

Comments

@acook

acook commented Jul 14, 2017

This is a common issue in CSV parsers/Excel exporters apparently:

It seems to be generated from MacOS versions of Excel.

Basically, when encountering codepage 21010 it should interpret it as codepage 1200 (AKA UTF-16le).

Ideally this would be handled programmatically. However, even passing in --encoding utf-16le (or other variations) seems to have no effect on in2csv, so it might be ignoring the encoding argument?
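For illustration, the remapping described above (treat codepage 21010 as codepage 1200) could be sketched roughly as follows. This is not xlrd's or csvkit's actual code: the table contents and the function name `codec_for_codepage` are hypothetical, and only a handful of codepages are listed.

```python
import codecs

# Hypothetical alias table: codepages that should be read as another codepage.
CODEPAGE_ALIASES = {21010: 1200}

# Small, illustrative subset of Windows codepage numbers and Python codec names.
CODEPAGE_TO_CODEC = {
    1200: "utf_16_le",
    1252: "cp1252",
    10000: "mac_roman",
    65001: "utf_8",
}


def codec_for_codepage(codepage):
    """Return the canonical Python codec name for a codepage number.

    Raises LookupError if Python has no codec for the (aliased) codepage.
    """
    codepage = CODEPAGE_ALIASES.get(codepage, codepage)
    # Fall back to the "cpNNNN" naming convention for unlisted codepages.
    name = CODEPAGE_TO_CODEC.get(codepage, "cp%d" % codepage)
    return codecs.lookup(name).name


print(codec_for_codepage(21010))
```

With an alias like this in place, the lookup for 21010 resolves to the same codec as 1200 instead of raising.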

@jpmckinney
Member

Sample file from linked issues: https://www.dropbox.com/s/fubuqla710n64iz/passbookInq.xls

@jpmckinney
Member

jpmckinney commented Jul 14, 2017

(That file 404s.) @acook Can you provide a file that produces the error? Otherwise impossible to test. My macOS version of Excel (v15.36) doesn't create files like this.

@acook
Author

acook commented Jul 14, 2017

Ah I didn't realize the other file was missing, my mistake. Hmm, looks like GitHub refuses to allow me to upload an XLS, even though they allow XLSX, so I had to wrap it in a zip.

Further research on my end indicates the codepage 21010 XLS files may be generated by some other tool, not Mac Excel as I first assumed (since the test file I received came from a Mac user and I've had other similar issues with Mac Excel recently).

(incidentally, the info in the spreadsheet is all fake generated data)

@jpmckinney
Member

ERROR *** codepage 21010 -> encoding 'unknown_codepage_21010' -> LookupError: unknown encoding: unknown_codepage_21010
Traceback (most recent call last):
  File "/Users/james/.pyenv/versions/csvkit2/bin/in2csv", line 11, in <module>
    load_entry_point('csvkit==1.0.3', 'console_scripts', 'in2csv')()
  File "/Users/james/.pyenv/versions/2.7.13/envs/csvkit2/lib/python2.7/site-packages/csvkit/utilities/in2csv.py", line 183, in launch_new_instance
    utility.run()
  File "/Users/james/.pyenv/versions/2.7.13/envs/csvkit2/lib/python2.7/site-packages/csvkit/cli.py", line 114, in run
    self.main()
  File "/Users/james/.pyenv/versions/2.7.13/envs/csvkit2/lib/python2.7/site-packages/csvkit/utilities/in2csv.py", line 145, in main
    table = agate.Table.from_xls(self.input_file, sheet=self.args.sheet, **kwargs)
  File "/Users/james/.pyenv/versions/2.7.13/envs/csvkit2/src/agateexcel/agateexcel/table_xls.py", line 30, in from_xls
    book = xlrd.open_workbook(file_contents=path.read())
  File "/Users/james/.pyenv/versions/2.7.13/envs/csvkit2/lib/python2.7/site-packages/xlrd/__init__.py", line 441, in open_workbook
    ragged_rows=ragged_rows,
  File "/Users/james/.pyenv/versions/2.7.13/envs/csvkit2/lib/python2.7/site-packages/xlrd/book.py", line 116, in open_workbook_xls
    bk.parse_globals()
  File "/Users/james/.pyenv/versions/2.7.13/envs/csvkit2/lib/python2.7/site-packages/xlrd/book.py", line 1170, in parse_globals
    self.handle_codepage(data)
  File "/Users/james/.pyenv/versions/2.7.13/envs/csvkit2/lib/python2.7/site-packages/xlrd/book.py", line 794, in handle_codepage
    self.derive_encoding()
  File "/Users/james/.pyenv/versions/2.7.13/envs/csvkit2/lib/python2.7/site-packages/xlrd/book.py", line 775, in derive_encoding
    _unused = unicode(b'trial', self.encoding)
LookupError: unknown encoding: unknown_codepage_21010

So it's an upstream bug in xlrd (which we use to read XLS files), and there are existing issues:

https://github.com/python-excel/xlrd/issues/218
https://github.com/python-excel/xlrd/issues/111
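The last frame of the traceback fails because Python's codec registry has no codec named unknown_codepage_21010. A minimal sketch of the same check in isolation, using only the standard library (the helper name `is_known_encoding` is made up for this example):

```python
import codecs


def is_known_encoding(name):
    """Return True if Python's codec registry can resolve this encoding name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


# The name xlrd derived for codepage 21010, per the error message above.
print(is_known_encoding("unknown_codepage_21010"))
# The encoding the file should apparently be read with.
print(is_known_encoding("utf_16_le"))
```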

@jpmckinney jpmckinney changed the title in2csv fails to load codepage 21010 xls files xlrd: in2csv fails to load codepage 21010 xls files Jul 14, 2017
@acook
Author

acook commented Jul 14, 2017

It did seem to originate in xlrd, but I thought there might be a fix at a higher level, especially given that the xlrd maintainer doesn't seem interested in fixing it.

Why does in2csv ignore the --encoding param? Is that also due to xlrd?

@jpmckinney
Member

The encoding param isn't relevant to Excel-to-CSV conversion - only CSV-to-CSV conversion.

@acook
Author

acook commented Jul 14, 2017

Ah bummer. I think xlrd does allow the encoding to be specified, which may mitigate this issue. If the xlrd maintainers get back to me, I can see what can be done to fix the problem at that level. Otherwise, looking into a different backend for in2csv might be a possibility.

@jpmckinney
Member

I added an --encoding-xls option that passes a value to xlrd's encoding_override. If you run in2csv --encoding-xls utf-8 test.xls, it should work (but you need to use csvkit HEAD and then install agate-excel HEAD on top of it).
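For anyone calling xlrd directly, a rough sketch of the same idea follows. It assumes xlrd is installed and relies on the encoding_override argument of xlrd.open_workbook; the helper names `xlrd_kwargs` and `open_xls` are hypothetical and this is not the csvkit or agate-excel implementation.

```python
def xlrd_kwargs(encoding_xls=None):
    """Build keyword arguments for xlrd.open_workbook (hypothetical helper)."""
    kwargs = {}
    if encoding_xls:
        # Tells xlrd to use this encoding instead of the file's CODEPAGE record.
        kwargs["encoding_override"] = encoding_xls
    return kwargs


def open_xls(path, encoding_xls=None):
    """Open an XLS file, optionally overriding the encoding xlrd derives."""
    import xlrd  # third-party; imported lazily so the helper above works without it

    with open(path, "rb") as f:
        return xlrd.open_workbook(file_contents=f.read(), **xlrd_kwargs(encoding_xls))


print(xlrd_kwargs("utf-8"))
```

Something like open_xls("test.xls", encoding_xls="utf-8") would then skip the failing codepage lookup, assuming the override behaves as described above.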
