Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: The XLS_SIGNATURE is too restrictive #41225

Closed
geoffrey-eisenbarth opened this issue Apr 29, 2021 · 4 comments · Fixed by #41321
Closed

ENH: The XLS_SIGNATURE is too restrictive #41225

geoffrey-eisenbarth opened this issue Apr 29, 2021 · 4 comments · Fixed by #41321
Labels
Enhancement IO Excel read_excel, to_excel
Milestone

Comments

@geoffrey-eisenbarth
Copy link
Contributor

geoffrey-eisenbarth commented Apr 29, 2021

Pandas 1.2.4 fails to open XLS files generated by Lotus 1-2-3 because they have a different header than the one expected in XLS_SIGNATURE (io.excel._base line 1001 and then 1050).

I know I'm likely to be the only person in the world with this issue, but my boss still uses Lotus 1-2-3 🙄

I'm willing to submit a pull request with my fix if people are okay with it. The fix basically involves changing XLS_SIGNATURE to a list:

XLS_SIGNATURES = [
  b"\xD0\xCF\x11\xE0\xA1\xB1\xA1\xE1",
  b"\t\x04\x06",
]
ZIP_SIGNATURE = b"PK\x03\x04"
PEEK_SIZE = max(map(len, XLS_SIGNATURES + [ZIP_SIGNATURE]))

and then later:

if any(peek.startswith(signature) for signature in XLS_SIGNATURES):
  return "xls"

I don't know if the second XLS_SIGNATURES value is "good," but it works. I can attach an XLS that was exported by 1-2-3 if it would be beneficial to others, or if they'd be able to help me find a better second value for XLS_SIGNATURES. I tried looking through the byte-code for anything similar to the "DOCFILE" XLS signature, but didn't see anything.

@rhshadrach
Copy link
Member

Thanks for the report! An example sheet would be helpful to further diagnose.

@rhshadrach rhshadrach added IO Excel read_excel, to_excel Enhancement labels Apr 30, 2021
@geoffrey-eisenbarth
Copy link
Contributor Author

geoffrey-eisenbarth commented May 3, 2021

@rhshadrach See attached. If you have any suggestions I'd be happy to put together a PR.

PANDAS.XLS

For future reference: OpenOffice has made available the various BIFF files for testing.

@rhshadrach
Copy link
Member

Thanks @geoffrey-eisenbarth - documentation is hard to find, but I think this is a BIFF4 format:

http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=681&strPageToDisplay=signatures

This agrees with your code: b"\x09\x04\x06\x00\x00\x00" (note I'm replacing your usage of \t). What you suggested would be a welcome PR.

However, it seems likely to me there are other file signatures that will still go unrecognized. One of the shortfalls of the current pandas implementation is that if the user specifies engine but we can't verify the file is correct, we will raise without just trying the engine.

Instead, I think inspect_excel_format should return None if a format cannot be determined instead of raising. Then, we should raise if engine is None with a message to the effect "Excel file format cannot be determined, you must specify an engine manually", and otherwise attempt to use the engine the user specified.

Would you be interested in putting up a PR for this @geoffrey-eisenbarth?

@geoffrey-eisenbarth
Copy link
Contributor Author

@rhshadrach I agree with your suggested route, and I'd love to submit the PR. Hoping to have time tomorrow to dig more into the code and reference your suggestion with inspect_excel_format. Should be a fast PR.

Thanks for your help!

@rhshadrach rhshadrach added this to the 1.3 milestone May 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants