-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: loosen XLS signature #41321
ENH: loosen XLS signature #41321
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add tests
Also have to add whatsnew, but could start with tests
@phofl Thanks for the heads up. I didn't set up the pandas dev environment properly so I'm going to try to take care of that today and address the failing tests. |
@phofl @rhshadrach I think I've got it mostly squared away, however the Azure pipeline is throwing an error because it can't import |
xlrd is not installed in every ci, you should probably import this in your test runtime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some cosmetic comments
doc/source/whatsnew/v1.2.5.rst
Outdated
@@ -34,8 +34,7 @@ Bug fixes | |||
|
|||
Other | |||
~~~~~ | |||
|
|||
- | |||
- Loosen XLS signatures used in :func:`read_excel` to determine if the `xlrd` engine should be used (:issue:`41225`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move to 1.3, I think we have a section for IO excel or something like that
pandas/io/excel/_base.py
Outdated
|
||
|
||
@doc(storage_options=_shared_docs["storage_options"]) | ||
def inspect_excel_format( | ||
content_or_path: FilePathOrBuffer, | ||
storage_options: StorageOptions = None, | ||
) -> str: | ||
) -> Optional[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
str | None instead of optional
Just to clarify, do you mean the import should be moved to inside the testing class method? Will the test be bypassed if xlrd is not installed in the ci? |
Yes into the if engine == xlrd |
pandas/io/excel/_base.py
Outdated
|
||
|
||
@doc(storage_options=_shared_docs["storage_options"]) | ||
def inspect_excel_format( | ||
content_or_path: FilePathOrBuffer, | ||
storage_options: StorageOptions = None, | ||
) -> str: | ||
) -> Union[str, None]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please str | None, this is the new mechanism to type this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, thanks for the heads up: I looked around in the typing documentation and didn't see this feature. Will adjust now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see https://docs.python.org/3.10/whatsnew/3.10.html#pep-604-new-type-union-operator if you are interested
@phofl I have updated the existing related tests, should I also write new ones that verify that Thanks for your help and patience! |
Please add new tests, if the old tests don't cover everything. Since you've added a bunch of new signatures we need test for every one of them |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Looks really good, some minor requests below. Agree with @phofl on the need for tests.
doc/source/whatsnew/v1.3.0.rst
Outdated
@@ -831,6 +831,7 @@ I/O | |||
- Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`) | |||
- Bug in :func:`read_orc` always raising ``AttributeError`` (:issue:`40918`) | |||
- Bug in the conversion from pyarrow to pandas (e.g. for reading Parquet) with nullable dtypes and a pyarrow array whose data buffer size is not a multiple of dtype size (:issue:`40896`) | |||
- Loosen XLS signatures used in :func:`read_excel` to determine if the `xlrd` engine should be used (:issue:`41225`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The usage of signatures is internal to the implementation; can you change to something like "Bug in read_excel not detecting older xls formats". If you'd like to call this an enhancement (I don't have an opinion either way), then this line should be moved to the enhancement section above.
Also, I think a separate line should be added addressing the bugfix here that we would raise when engine is specified but the excel format could not be determined.
pandas/io/excel/_base.py
Outdated
str | ||
Format of file. | ||
str or None | ||
Format of file (if it can be determined) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove parentheses - the phrase is essential to the correctness of the statement and not supplemental. Also, period at the end.
Thanks for the feedback! Hoping to add the tests today. I imagine it's considered bad form to write test that rely on downloading outside resources (specifically the various Excel files hosted at https://www.openoffice.org/sc/testdocs/). If that's the case, should the tests just use |
Yes, creating the tests without external files would be perfect |
I added a test, but I'm afraid it's probably not the best. I haven't been able to find blank BIFF2, BIFF3, BIFF4, BIFF5 files in order to observe their header and body bytes to make more accurate If Any advice or suggestions would be greatly appreciated. |
pandas/io/excel/_base.py
Outdated
"biff5": b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", | ||
"biff8": b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the biff version matter? I think no - in which case can you make this a tuple with comments, e.g.
BIFF_SIGNATURES = (
...
b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", # biff5 and biff8
)
ZIP_SIGNATURE = ...
pandas/tests/io/excel/test_xlrd.py
Outdated
import io | ||
import struct | ||
|
||
headers = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use pytest.mark.parametrize instead
pandas/tests/io/excel/test_xlrd.py
Outdated
def test_read_old_xls_files(): | ||
# GH 41226 | ||
import io | ||
import struct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move imports to top of file
pandas/tests/io/excel/test_xlrd.py
Outdated
for file_format, header in headers.items(): | ||
f = io.BytesIO(header + body) | ||
with pytest.raises(struct.error, match="unpack requires a buffer "): | ||
# If struct.error is raised, file has passed xlrd's filetype checks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just test inspect_excel_format is giving us xls for these
doc/source/whatsnew/v1.3.0.rst
Outdated
@@ -831,6 +830,7 @@ I/O | |||
- Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`) | |||
- Bug in :func:`read_orc` always raising ``AttributeError`` (:issue:`40918`) | |||
- Bug in the conversion from pyarrow to pandas (e.g. for reading Parquet) with nullable dtypes and a pyarrow array whose data buffer size is not a multiple of dtype size (:issue:`40896`) | |||
- :func:`read_excel` now raises the specified engine's exception for incorrect file types if the excel format cannot be determined by pandas (:issue:`41225`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the more significant (and great!) change here is that we are trying the engine even when we can't detect the file type when the user specifies it. Something like
Bug in read_excel would raise an error when pandas could not determine the file type, even when user specified the
engine
argument
pandas/tests/io/excel/test_xlrd.py
Outdated
"biff4": b"\x09\x04\x06\x00\x00\x00\x10\x00", | ||
"biff5": b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", | ||
} | ||
body = b"\x00" * 16 + b"\x3e\x00\x03\x00\xfe\xff" # Required for biff5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is required for biff5, then do we want to add that to the signatures in _base?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon more research, the BIFF5 (and ongoing) filetypes are actually Compound Binary Files (CFB, which have the "DOCFILE" hex header). This US Library of Congress page shows that the CFB header requires extra bits to eventually specify the endianness of the file, which is the error xlrd
was giving me if I didn't include the extra body
variable referenced above. However, this is no longer necessary since we're testing whether inspect_excel_format
returns xls
or not.
raise ValueError( | ||
f"Your version of xlrd is {xlrd_version}. In xlrd >= 2.0, " | ||
f"only the xls format is supported. Install openpyxl instead." | ||
) | ||
elif ext != "xls": | ||
elif ext and ext != "xls": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if engine="xlrd"
and ext=None
and is not updated via inspect_excel_format
due to a bad format, what happens here? It seems the error is pushed down to self._engines[engine](self._io, storage_options)
if this is genuine oversight then a test might be worth it for this case, and an errormessage, otherwise can ignore..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@attack68 thanks for looking over this! My understanding of the desired behavior was to go ahead and try reading it with the xlrd
engine even if pandas can't determine the filetype (i.e., ext=None
). I suppose I could add an else
case to the if statement you're referencing that produces a warning?
Any input is appreciated, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe.. not sure what is the best solution really. just wanted to make sure the case was considered... someone else might have a good idea or just leave it with that which you stated!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the user is specifying the engine explicitly, I do not think we should raise an error/warning. They have already instructed what they want to do.
@phofl @rhshadrach I believe I have addressed all the concerns (please let me know if you're still expecting anything). Should I click the "re-request review" buttons for your two requests? Thanks! |
Could you merge master? I'll leave the final go to @rhshadrach |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, minor requests on the whatsnew.
doc/source/whatsnew/v1.3.0.rst
Outdated
@@ -843,6 +843,8 @@ I/O | |||
- Bug in :func:`read_csv` and :func:`read_excel` not respecting dtype for duplicated column name when ``mangle_dupe_cols`` is set to ``True`` (:issue:`35211`) | |||
- Bug in :func:`read_csv` and :func:`read_table` misinterpreting arguments when ``sys.setprofile`` had been previously called (:issue:`41069`) | |||
- Bug in the conversion from pyarrow to pandas (e.g. for reading Parquet) with nullable dtypes and a pyarrow array whose data buffer size is not a multiple of dtype size (:issue:`40896`) | |||
- Bug in read_excel would raise an error when pandas could not determine the file type, even when user specified the ``engine`` argument (:issue:`41225`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:func:`read_excel`
doc/source/whatsnew/v1.3.0.rst
Outdated
@@ -197,7 +197,7 @@ Other enhancements | |||
- Improved integer type mapping from pandas to SQLAlchemy when using :meth:`DataFrame.to_sql` (:issue:`35076`) | |||
- :func:`to_numeric` now supports downcasting of nullable ``ExtensionDtype`` objects (:issue:`33013`) | |||
- Add support for dict-like names in :class:`MultiIndex.set_names` and :class:`MultiIndex.rename` (:issue:`20421`) | |||
- :func:`pandas.read_excel` can now auto detect .xlsb files (:issue:`35416`) | |||
- :func:`pandas.read_excel` can now auto detect .xlsb files (:issue:`35416`) and older .xls files (:issue:`41225`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(:issue:35416
, :issue:41225
) at the end of the sentence
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Thanks @geoffrey-eisenbarth! |
@rhshadrach thanks for your guidance! |
Added ability to check for multiple XLS signatures according to testing files available at Spreadsheet Project. Also defer raising an error when the engine is specified but the file signature does not match one of the values in
SIGNATURES
: this allows a user to attempt to specify and use an engine even if the passed file doesn't match one of the provided values inSIGNATURES
.This is my first PR for this project, so please let me know if more is expected (writing tests, writing a whatsnew entry, etc). I have run Flake8 on the code and successfully opened BIFF2 through BIFF8 files with this method. Thanks!