ENH: loosen XLS signature #41321

geoffrey-eisenbarth · 2021-05-04T21:04:13Z

closes ENH: The XLS_SIGNATURE is too restrictive #41225
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

Added ability to check for multiple XLS signatures according to testing files available at Spreadsheet Project. Also defer raising an error when the engine is specified but the file signature does not match one of the values in SIGNATURES: this allows a user to attempt to specify and use an engine even if the passed file doesn't match one of the provided values in SIGNATURES.

This is my first PR for this project, so please let me know if more is expected (writing tests, writing a whatsnew entry, etc). I have run Flake8 on the code and successfully opened BIFF2 through BIFF8 files with this method. Thanks!

phofl

Please add tests

Also have to add whatsnew, but could start with tests

geoffrey-eisenbarth · 2021-05-05T12:58:43Z

@phofl Thanks for the heads up. I didn't set up the pandas dev environment properly so I'm going to try to take care of that today and address the failing tests.

geoffrey-eisenbarth · 2021-05-05T17:14:28Z

@phofl @rhshadrach I think I've got it mostly squared away; however, the Azure pipeline is throwing an error because it can't import xlrd (which the test needs in order to catch XLRDError). Do you have a suggestion for how to resolve this? Thanks!

phofl · 2021-05-05T17:54:11Z

xlrd is not installed in every CI environment, so you should probably import it at test runtime

phofl

Some cosmetic comments

phofl · 2021-05-05T17:54:39Z

doc/source/whatsnew/v1.2.5.rst

@@ -34,8 +34,7 @@ Bug fixes

 Other
 ~~~~~
-
-
+- Loosen XLS signatures used in :func:`read_excel` to determine if the `xlrd` engine should be used (:issue:`41225`)


Please move to 1.3, I think we have a section for IO excel or something like that

phofl · 2021-05-05T17:54:53Z

pandas/io/excel/_base.py



 @doc(storage_options=_shared_docs["storage_options"])
 def inspect_excel_format(
    content_or_path: FilePathOrBuffer,
    storage_options: StorageOptions = None,
-) -> str:
+) -> Optional[str]:


Use str | None instead of Optional

geoffrey-eisenbarth · 2021-05-05T18:12:22Z

> xlrd is not installed in every CI environment, so you should probably import it at test runtime

Just to clarify, do you mean the import should be moved to inside the testing class method? Will the test be bypassed if xlrd is not installed in the ci?

phofl · 2021-05-05T18:47:52Z

Yes, move it into the if engine == xlrd branch.
If you look at the beginning of the file, you will see that the engines are constructed based on the installed packages
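
A minimal sketch of building an engine list from whatever is installed, assuming nothing about the actual test module: `installed_engines` and its default candidate names are illustrative, and only the standard library is used.

```python
import importlib.util


def installed_engines(candidates=("xlrd", "openpyxl", "odf", "pyxlsb")):
    """Hypothetical helper: keep only the packages that can be imported."""
    return [
        name for name in candidates
        if importlib.util.find_spec(name) is not None
    ]


# Engines missing from the current environment are simply left out,
# so code that depends on them never runs (and never imports them).
print(installed_engines())
```

Inside a test, `pytest.importorskip("xlrd")` achieves a similar effect: it returns the module when it is installed and skips the test otherwise.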

phofl · 2021-05-05T19:50:03Z

pandas/io/excel/_base.py



 @doc(storage_options=_shared_docs["storage_options"])
 def inspect_excel_format(
    content_or_path: FilePathOrBuffer,
    storage_options: StorageOptions = None,
-) -> str:
+) -> Union[str, None]:


Please use str | None; this is the new mechanism for typing this

Oh, thanks for the heads up: I looked around in the typing documentation and didn't see this feature. Will adjust now.

see https://docs.python.org/3.10/whatsnew/3.10.html#pep-604-new-type-union-operator if you are interested

geoffrey-eisenbarth · 2021-05-05T21:08:06Z

@phofl I have updated the existing related tests, should I also write new ones that verify that read_excel will read older XLS files, or are we good to go?

Thanks for your help and patience!

phofl · 2021-05-05T21:14:23Z

Please add new tests if the old tests don't cover everything. Since you've added a bunch of new signatures, we need a test for every one of them

rhshadrach

Thanks for the PR! Looks really good, some minor requests below. Agree with @phofl on the need for tests.

rhshadrach · 2021-05-06T00:57:32Z

doc/source/whatsnew/v1.3.0.rst

@@ -831,6 +831,7 @@ I/O
 - Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`)
 - Bug in :func:`read_orc` always raising ``AttributeError`` (:issue:`40918`)
 - Bug in the conversion from pyarrow to pandas (e.g. for reading Parquet) with nullable dtypes and a pyarrow array whose data buffer size is not a multiple of dtype size (:issue:`40896`)
+- Loosen XLS signatures used in :func:`read_excel` to determine if the `xlrd` engine should be used (:issue:`41225`)


The usage of signatures is internal to the implementation; can you change to something like "Bug in read_excel not detecting older xls formats". If you'd like to call this an enhancement (I don't have an opinion either way), then this line should be moved to the enhancement section above.

Also, I think a separate line should be added addressing the bugfix here that we would raise when engine is specified but the excel format could not be determined.

rhshadrach · 2021-05-06T01:06:16Z

pandas/io/excel/_base.py

-    str
-        Format of file.
+    str or None
+        Format of file (if it can be determined)


nit: remove parentheses - the phrase is essential to the correctness of the statement and not supplemental. Also, period at the end.

geoffrey-eisenbarth · 2021-05-06T13:03:05Z

> Thanks for the PR! Looks really good, some minor requests below. Agree with @phofl on the need for tests.

Thanks for the feedback! Hoping to add the tests today. I imagine it's considered bad form to write tests that rely on downloading outside resources (specifically the various Excel files hosted at https://www.openoffice.org/sc/testdocs/). If that's the case, should the tests just use io.BytesIO objects that are created using the values in the SIGNATURES dictionary?

phofl · 2021-05-06T13:37:38Z

Yes, creating the tests without external files would be perfect

geoffrey-eisenbarth · 2021-05-06T18:02:01Z

I added a test, but I'm afraid it's probably not the best. I haven't been able to find blank BIFF2, BIFF3, BIFF4, BIFF5 files in order to observe their header and body bytes to make more accurate io.BytesIO streams to test against. I have examined files from the Spreadsheet Project's Test Documents, but since those files are not empty (and I imagine some of the first few bytes describe how long the document is, etc) I'm not sure it's the best solution.

If xlrd throws the struct.error exception, then the file has passed xlrd's type inspection and has been determined to be a valid BIFF file (otherwise it would throw a custom BIFF exception). I imagine the reason the struct.error gets thrown is because the length of the BytesIO does not match the length specified in the header/body bytes (although this is just speculation).

Any advice or suggestions would be greatly appreciated.

rhshadrach · 2021-05-08T13:03:29Z

pandas/io/excel/_base.py

+    "biff5": b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1",
+    "biff8": b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1",


Does the biff version matter? I think no - in which case can you make this a tuple with comments, e.g.

    BIFF_SIGNATURES = (
        ...
        b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1",  # biff5 and biff8
    )
    ZIP_SIGNATURE = ...

rhshadrach · 2021-05-08T13:03:53Z

pandas/tests/io/excel/test_xlrd.py

+    import io
+    import struct
+
+    headers = {


Can you use pytest.mark.parametrize instead?

rhshadrach · 2021-05-08T13:04:04Z

pandas/tests/io/excel/test_xlrd.py

+def test_read_old_xls_files():
+    # GH 41226
+    import io
+    import struct


Move imports to top of file

rhshadrach · 2021-05-08T13:07:13Z

pandas/tests/io/excel/test_xlrd.py

+    for file_format, header in headers.items():
+        f = io.BytesIO(header + body)
+        with pytest.raises(struct.error, match="unpack requires a buffer "):
+            # If struct.error is raised, file has passed xlrd's filetype checks


Let's just test inspect_excel_format is giving us xls for these
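
One way the suggestions above (parametrize, module-level imports, asserting on the detected format rather than on an xlrd exception) could fit together is sketched below. `looks_like_xls` is a hypothetical stand-in for the real detection function, and only the two headers quoted in this thread are covered.

```python
import io

import pytest

XLS_HEADERS = {
    "biff4": b"\x09\x04\x06\x00\x00\x00\x10\x00",
    "biff5": b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",
}


def looks_like_xls(stream) -> bool:
    """Hypothetical stand-in: does the stream start with a known header?"""
    head = stream.read(8)
    stream.seek(0)
    return head in XLS_HEADERS.values()


@pytest.mark.parametrize(
    "header", list(XLS_HEADERS.values()), ids=list(XLS_HEADERS)
)
def test_header_detected_as_xls(header):
    # No external files: the stream is just the header plus zero padding.
    stream = io.BytesIO(header + b"\x00" * 16)
    assert looks_like_xls(stream)
```

Each signature becomes its own test case, so a failure report names the offending format instead of stopping at the first bad iteration of a loop.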

rhshadrach · 2021-05-08T13:12:21Z

doc/source/whatsnew/v1.3.0.rst

@@ -831,6 +830,7 @@ I/O
 - Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`)
 - Bug in :func:`read_orc` always raising ``AttributeError`` (:issue:`40918`)
 - Bug in the conversion from pyarrow to pandas (e.g. for reading Parquet) with nullable dtypes and a pyarrow array whose data buffer size is not a multiple of dtype size (:issue:`40896`)
+- :func:`read_excel` now raises the specified engine's exception for incorrect file types if the excel format cannot be determined by pandas (:issue:`41225`)


I think the more significant (and great!) change here is that we now try the engine the user specifies even when we can't detect the file type. Something like

Bug in read_excel would raise an error when pandas could not determine the file type, even when user specified the engine argument

rhshadrach · 2021-05-08T13:35:05Z

pandas/tests/io/excel/test_xlrd.py

+        "biff4": b"\x09\x04\x06\x00\x00\x00\x10\x00",
+        "biff5": b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",
+    }
+    body = b"\x00" * 16 + b"\x3e\x00\x03\x00\xfe\xff"  # Required for biff5


If this is required for biff5, then do we want to add that to the signatures in _base?

Upon more research, BIFF5 and later file types are actually Compound File Binary (CFB) files, which start with the "DOCFILE" hex header. This US Library of Congress page shows that the CFB header requires extra bytes that eventually specify the endianness of the file; their absence caused the error xlrd was giving me when I didn't include the extra body variable referenced above. However, this is no longer necessary since we're now testing whether inspect_excel_format returns xls.
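
To make the role of those extra bytes concrete, here is a small sketch that decodes the test header discussed above. The offsets assume the standard CFB layout (8-byte signature, 16-byte CLSID, then three little-endian 16-bit fields); the variable names are illustrative.

```python
import struct

CFB_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"
# The "body" from the test above: a zeroed CLSID, then three 16-bit fields.
body = b"\x00" * 16 + b"\x3e\x00\x03\x00\xfe\xff"
header = CFB_SIGNATURE + body

# Fields start at offset 24 (8 signature bytes + 16 CLSID bytes).
minor, major, byte_order = struct.unpack_from("<HHH", header, 24)

print(hex(minor), major, hex(byte_order))  # 0x3e 3 0xfffe
```

The value 0xFFFE is the byte-order mark indicating little-endian storage, which is the field a reader looks for before parsing the rest of the file.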

attack68 · 2021-05-11T12:53:19Z

pandas/io/excel/_base.py

                raise ValueError(
                    f"Your version of xlrd is {xlrd_version}. In xlrd >= 2.0, "
                    f"only the xls format is supported. Install openpyxl instead."
                )
-            elif ext != "xls":
+            elif ext and ext != "xls":


If engine="xlrd" and ext is None, and ext is not updated by inspect_excel_format because of a bad format, what happens here? It seems the error is pushed down to self._engines[engine](self._io, storage_options)

If this is a genuine oversight, then a test and an error message for this case might be worthwhile; otherwise, feel free to ignore.

@attack68 thanks for looking over this! My understanding of the desired behavior was to go ahead and try reading it with the xlrd engine even if pandas can't determine the filetype (i.e., ext=None). I suppose I could add an else case to the if statement you're referencing that produces a warning?

Any input is appreciated, thanks!

maybe.. not sure what is the best solution really. just wanted to make sure the case was considered... someone else might have a good idea or just leave it with that which you stated!

If the user is specifying the engine explicitly, I do not think we should raise an error/warning. They have already instructed what they want to do.

geoffrey-eisenbarth · 2021-05-13T13:35:32Z

@phofl @rhshadrach I believe I have addressed all the concerns (please let me know if you're still expecting anything). Should I click the "re-request review" buttons for your two requests? Thanks!

phofl · 2021-05-13T13:37:14Z

Could you merge master? I'll leave the final go to @rhshadrach

rhshadrach

lgtm, minor requests on the whatsnew.

rhshadrach · 2021-05-14T02:26:35Z

doc/source/whatsnew/v1.3.0.rst

@@ -843,6 +843,8 @@ I/O
 - Bug in :func:`read_csv` and :func:`read_excel` not respecting dtype for duplicated column name when ``mangle_dupe_cols`` is set to ``True`` (:issue:`35211`)
 - Bug in :func:`read_csv` and :func:`read_table` misinterpreting arguments when ``sys.setprofile`` had been previously called (:issue:`41069`)
 - Bug in the conversion from pyarrow to pandas (e.g. for reading Parquet) with nullable dtypes and a pyarrow array whose data buffer size is not a multiple of dtype size (:issue:`40896`)
+- Bug in read_excel would raise an error when pandas could not determine the file type, even when user specified the ``engine`` argument (:issue:`41225`)


:func:`read_excel`

rhshadrach · 2021-05-14T02:27:07Z

doc/source/whatsnew/v1.3.0.rst

@@ -197,7 +197,7 @@ Other enhancements
 - Improved integer type mapping from pandas to SQLAlchemy when using :meth:`DataFrame.to_sql` (:issue:`35076`)
 - :func:`to_numeric` now supports downcasting of nullable ``ExtensionDtype`` objects (:issue:`33013`)
 - Add support for dict-like names in :class:`MultiIndex.set_names` and :class:`MultiIndex.rename` (:issue:`20421`)
-- :func:`pandas.read_excel` can now auto detect .xlsb files (:issue:`35416`)
+- :func:`pandas.read_excel` can now auto detect .xlsb files (:issue:`35416`) and older .xls files (:issue:`41225`)


(:issue:`35416`, :issue:`41225`) at the end of the sentence

rhshadrach

lgtm

rhshadrach · 2021-05-21T01:44:30Z

Thanks @geoffrey-eisenbarth!

geoffrey-eisenbarth · 2021-05-21T01:47:00Z

@rhshadrach thanks for your guidance!

Fix #41225.

c4bf8fb

phofl requested changes May 4, 2021

jreback changed the title ~~Fix #41225.~~ ENH: loosen XLS signature May 5, 2021

jreback added Enhancement IO Excel read_excel, to_excel labels May 5, 2021

geoffrey-eisenbarth added 5 commits May 5, 2021 08:59

Adjust return type.

95bf325

Update tests.

416ddd9

Properly stylize strings.

44053bb

Correct expected exception message.

1b1b648

Add relevant whatsnew entry.

3c11c74

phofl reviewed May 5, 2021

Address requested cosmetic changes.

9d0e8fa

Import exception when needed.

3d79808

phofl reviewed May 5, 2021

Use new type hint mechanism for multiple output types.

cbfa563

rhshadrach requested changes May 6, 2021

Address minor documentation recommendations.

c113b20

Add tests.

f023c9d

rhshadrach requested changes May 8, 2021

Reword based on feedback.

955f679

geoffrey-eisenbarth added 2 commits May 10, 2021 14:31

Refactor to test whether inspect_excel_format returns xls.

d6206bc

Refactor to remove extraneous key information into comments.

c306f69

attack68 reviewed May 11, 2021

geoffrey-eisenbarth and others added 4 commits May 13, 2021 08:49

Merge branch 'master' into excel-format-fixes

c741f32

Address pre-commit.

8cc480e

Pre-commit fixes.

03d6393

Merge branch 'excel-format-fixes' of https://github.com/geoffrey-eise…

32b9d1d

…nbarth/pandas into excel-format-fixes

rhshadrach requested changes May 14, 2021

Address requested changes in the whatsnew.

9e9a8ff

geoffrey-eisenbarth requested a review from rhshadrach May 14, 2021 19:48

rhshadrach approved these changes May 15, 2021

geoffrey-eisenbarth requested a review from phofl May 15, 2021 02:27

rhshadrach added this to the 1.3 milestone May 21, 2021

rhshadrach merged commit b3e3352 into pandas-dev:master May 21, 2021

geoffrey-eisenbarth deleted the excel-format-fixes branch May 21, 2021 01:47

TLouf pushed a commit to TLouf/pandas that referenced this pull request Jun 1, 2021

ENH: loosen XLS signature (pandas-dev#41321)

9873fc9

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

ENH: loosen XLS signature (pandas-dev#41321)

483ee44
