Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: loosen XLS signature #41321

Merged
merged 19 commits into from
May 21, 2021
Merged

ENH: loosen XLS signature #41321

merged 19 commits into from
May 21, 2021

Conversation

geoffrey-eisenbarth
Copy link
Contributor

@geoffrey-eisenbarth geoffrey-eisenbarth commented May 4, 2021

Added ability to check for multiple XLS signatures according to testing files available at Spreadsheet Project. Also defer raising an error when the engine is specified but the file signature does not match one of the values in SIGNATURES: this allows a user to attempt to specify and use an engine even if the passed file doesn't match one of the provided values in SIGNATURES.

This is my first PR for this project, so please let me know if more is expected (writing tests, writing a whatsnew entry, etc). I have run Flake8 on the code and successfully opened BIFF2 through BIFF8 files with this method. Thanks!

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add tests

Also have to add whatsnew, but could start with tests

@jreback jreback changed the title Fix #41225. ENH: loosen XLS signature May 5, 2021
@jreback jreback added Enhancement IO Excel read_excel, to_excel labels May 5, 2021
@geoffrey-eisenbarth
Copy link
Contributor Author

@phofl Thanks for the heads up. I didn't set up the pandas dev environment properly so I'm going to try to take care of that today and address the failing tests.

@geoffrey-eisenbarth
Copy link
Contributor Author

@phofl @rhshadrach I think I've got it mostly squared away, however the Azure pipeline is throwing an error because it can't import xlrd (in order to catch XLRDError). Do you have a suggestion for how to resolve this? Thanks!

@phofl
Copy link
Member

phofl commented May 5, 2021

xlrd is not installed in every ci, you should probably import this in your test runtime

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some cosmetic comments

@@ -34,8 +34,7 @@ Bug fixes

Other
~~~~~

-
- Loosen XLS signatures used in :func:`read_excel` to determine if the `xlrd` engine should be used (:issue:`41225`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move to 1.3, I think we have a section for IO excel or something like that



@doc(storage_options=_shared_docs["storage_options"])
def inspect_excel_format(
content_or_path: FilePathOrBuffer,
storage_options: StorageOptions = None,
) -> str:
) -> Optional[str]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

str | None instead of optional

@geoffrey-eisenbarth
Copy link
Contributor Author

xlrd is not installed in every ci, you should probably import this in your test runtime

Just to clarify, do you mean the import should be moved to inside the testing class method? Will the test be bypassed if xlrd is not installed in the ci?

@phofl
Copy link
Member

phofl commented May 5, 2021

Yes into the if engine == xlrd
If you look at the beginning of the file you will See that the engines are constructed based on the installed packages



@doc(storage_options=_shared_docs["storage_options"])
def inspect_excel_format(
content_or_path: FilePathOrBuffer,
storage_options: StorageOptions = None,
) -> str:
) -> Union[str, None]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please str | None, this is the new mechanism to type this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, thanks for the heads up: I looked around in the typing documentation and didn't see this feature. Will adjust now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@geoffrey-eisenbarth
Copy link
Contributor Author

@phofl I have updated the existing related tests, should I also write new ones that verify that read_excel will read older XLS files, or are we good to go?

Thanks for your help and patience!

@phofl
Copy link
Member

phofl commented May 5, 2021

Please add new tests, if the old tests don't cover everything. Since you've added a bunch of new signatures we need test for every one of them

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Looks really good, some minor requests below. Agree with @phofl on the need for tests.

@@ -831,6 +831,7 @@ I/O
- Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`)
- Bug in :func:`read_orc` always raising ``AttributeError`` (:issue:`40918`)
- Bug in the conversion from pyarrow to pandas (e.g. for reading Parquet) with nullable dtypes and a pyarrow array whose data buffer size is not a multiple of dtype size (:issue:`40896`)
- Loosen XLS signatures used in :func:`read_excel` to determine if the `xlrd` engine should be used (:issue:`41225`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usage of signatures is internal to the implementation; can you change to something like "Bug in read_excel not detecting older xls formats". If you'd like to call this an enhancement (I don't have an opinion either way), then this line should be moved to the enhancement section above.

Also, I think a separate line should be added addressing the bugfix here that we would raise when engine is specified but the excel format could not be determined.

str
Format of file.
str or None
Format of file (if it can be determined)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove parentheses - the phrase is essential to the correctness of the statement and not supplemental. Also, period at the end.

@geoffrey-eisenbarth
Copy link
Contributor Author

Thanks for the PR! Looks really good, some minor requests below. Agree with @phofl on the need for tests.

Thanks for the feedback! Hoping to add the tests today. I imagine it's considered bad form to write test that rely on downloading outside resources (specifically the various Excel files hosted at https://www.openoffice.org/sc/testdocs/). If that's the case, should the tests just use io.BytesIO objects that are created using the values in the SIGNATURES dictionary?

@phofl
Copy link
Member

phofl commented May 6, 2021

Yes, creating the tests without external files would be perfect

@geoffrey-eisenbarth
Copy link
Contributor Author

geoffrey-eisenbarth commented May 6, 2021

I added a test, but I'm afraid it's probably not the best. I haven't been able to find blank BIFF2, BIFF3, BIFF4, BIFF5 files in order to observe their header and body bytes to make more accurate io.BytesIO streams to test against. I have examined files from the Spreadsheet Project's Test Documents, but since those files are not empty (and I imagine some of the first few bytes describe how long the document is, etc) I'm not sure it's the best solution.

If xlrd throws the struct.error exception, then the file has passed xlrd's type inspection and has been determined to be a valid BIFF file (otherwise it would throw a custom BIFF exception). I imagine the reason the struct.error gets thrown is because the length of the BytesIO does not match the length specified in the header/body bytes (although this is just speculation).

Any advice or suggestions would be greatly appreciated.

Comment on lines 1021 to 1022
"biff5": b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1",
"biff8": b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the biff version matter? I think no - in which case can you make this a tuple with comments, e.g.

BIFF_SIGNATURES = (
    ...
    b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1",  # biff5 and biff8
)
ZIP_SIGNATURE = ...

import io
import struct

headers = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use pytest.mark.parametrize instead

def test_read_old_xls_files():
# GH 41226
import io
import struct
Copy link
Member

@rhshadrach rhshadrach May 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move imports to top of file

for file_format, header in headers.items():
f = io.BytesIO(header + body)
with pytest.raises(struct.error, match="unpack requires a buffer "):
# If struct.error is raised, file has passed xlrd's filetype checks
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just test inspect_excel_format is giving us xls for these

@@ -831,6 +830,7 @@ I/O
- Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`)
- Bug in :func:`read_orc` always raising ``AttributeError`` (:issue:`40918`)
- Bug in the conversion from pyarrow to pandas (e.g. for reading Parquet) with nullable dtypes and a pyarrow array whose data buffer size is not a multiple of dtype size (:issue:`40896`)
- :func:`read_excel` now raises the specified engine's exception for incorrect file types if the excel format cannot be determined by pandas (:issue:`41225`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the more significant (and great!) change here is that we are trying the engine even when we can't detect the file type when the user specifies it. Something like

Bug in read_excel would raise an error when pandas could not determine the file type, even when user specified the engine argument

"biff4": b"\x09\x04\x06\x00\x00\x00\x10\x00",
"biff5": b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",
}
body = b"\x00" * 16 + b"\x3e\x00\x03\x00\xfe\xff" # Required for biff5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is required for biff5, then do we want to add that to the signatures in _base?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon more research, the BIFF5 (and ongoing) filetypes are actually Compound Binary Files (CFB, which have the "DOCFILE" hex header). This US Library of Congress page shows that the CFB header requires extra bits to eventually specify the endianness of the file, which is the error xlrd was giving me if I didn't include the extra body variable referenced above. However, this is no longer necessary since we're testing whether inspect_excel_format returns xls or not.

raise ValueError(
f"Your version of xlrd is {xlrd_version}. In xlrd >= 2.0, "
f"only the xls format is supported. Install openpyxl instead."
)
elif ext != "xls":
elif ext and ext != "xls":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if engine="xlrd" and ext=None and is not updated via inspect_excel_format due to a bad format, what happens here? It seems the error is pushed down to self._engines[engine](self._io, storage_options)

if this is genuine oversight then a test might be worth it for this case, and an errormessage, otherwise can ignore..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@attack68 thanks for looking over this! My understanding of the desired behavior was to go ahead and try reading it with the xlrd engine even if pandas can't determine the filetype (i.e., ext=None). I suppose I could add an else case to the if statement you're referencing that produces a warning?

Any input is appreciated, thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe.. not sure what is the best solution really. just wanted to make sure the case was considered... someone else might have a good idea or just leave it with that which you stated!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user is specifying the engine explicitly, I do not think we should raise an error/warning. They have already instructed what they want to do.

@geoffrey-eisenbarth
Copy link
Contributor Author

@phofl @rhshadrach I believe I have addressed all the concerns (please let me know if you're still expecting anything). Should I click the "re-request review" buttons for your two requests? Thanks!

@phofl
Copy link
Member

phofl commented May 13, 2021

Could you merge master? I'll leave the final go to @rhshadrach

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, minor requests on the whatsnew.

@@ -843,6 +843,8 @@ I/O
- Bug in :func:`read_csv` and :func:`read_excel` not respecting dtype for duplicated column name when ``mangle_dupe_cols`` is set to ``True`` (:issue:`35211`)
- Bug in :func:`read_csv` and :func:`read_table` misinterpreting arguments when ``sys.setprofile`` had been previously called (:issue:`41069`)
- Bug in the conversion from pyarrow to pandas (e.g. for reading Parquet) with nullable dtypes and a pyarrow array whose data buffer size is not a multiple of dtype size (:issue:`40896`)
- Bug in read_excel would raise an error when pandas could not determine the file type, even when user specified the ``engine`` argument (:issue:`41225`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:func:`read_excel`

@@ -197,7 +197,7 @@ Other enhancements
- Improved integer type mapping from pandas to SQLAlchemy when using :meth:`DataFrame.to_sql` (:issue:`35076`)
- :func:`to_numeric` now supports downcasting of nullable ``ExtensionDtype`` objects (:issue:`33013`)
- Add support for dict-like names in :class:`MultiIndex.set_names` and :class:`MultiIndex.rename` (:issue:`20421`)
- :func:`pandas.read_excel` can now auto detect .xlsb files (:issue:`35416`)
- :func:`pandas.read_excel` can now auto detect .xlsb files (:issue:`35416`) and older .xls files (:issue:`41225`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(:issue:35416, :issue:41225) at the end of the sentence

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@geoffrey-eisenbarth geoffrey-eisenbarth requested a review from phofl May 15, 2021 02:27
@rhshadrach rhshadrach added this to the 1.3 milestone May 21, 2021
@rhshadrach rhshadrach merged commit b3e3352 into pandas-dev:master May 21, 2021
@rhshadrach
Copy link
Member

Thanks @geoffrey-eisenbarth!

@geoffrey-eisenbarth
Copy link
Contributor Author

@rhshadrach thanks for your guidance!

@geoffrey-eisenbarth geoffrey-eisenbarth deleted the excel-format-fixes branch May 21, 2021 01:47
TLouf pushed a commit to TLouf/pandas that referenced this pull request Jun 1, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: The XLS_SIGNATURE is too restrictive
5 participants