ENH: loosen XLS signature #41321

Merged May 21, 2021 (19 commits)
Changes from 14 commits
5 changes: 3 additions & 2 deletions doc/source/whatsnew/v1.3.0.rst
@@ -195,7 +195,7 @@ Other enhancements
- Improved integer type mapping from pandas to SQLAlchemy when using :meth:`DataFrame.to_sql` (:issue:`35076`)
- :func:`to_numeric` now supports downcasting of nullable ``ExtensionDtype`` objects (:issue:`33013`)
- Add support for dict-like names in :class:`MultiIndex.set_names` and :class:`MultiIndex.rename` (:issue:`20421`)
- :func:`pandas.read_excel` can now auto detect .xlsb files (:issue:`35416`)
- :func:`pandas.read_excel` can now auto detect .xlsb files (:issue:`35416`) and older .xls files (:issue:`41225`)
Member

Put both references together at the end of the sentence: (:issue:`35416`, :issue:`41225`)

- :class:`pandas.ExcelWriter` now accepts an ``if_sheet_exists`` parameter to control the behaviour of append mode when writing to existing sheets (:issue:`40230`)
- :meth:`.Rolling.sum`, :meth:`.Expanding.sum`, :meth:`.Rolling.mean`, :meth:`.Expanding.mean`, :meth:`.ExponentialMovingWindow.mean`, :meth:`.Rolling.median`, :meth:`.Expanding.median`, :meth:`.Rolling.max`, :meth:`.Expanding.max`, :meth:`.Rolling.min`, and :meth:`.Expanding.min` now support ``Numba`` execution with the ``engine`` keyword (:issue:`38895`, :issue:`41267`)
- :meth:`DataFrame.apply` can now accept NumPy unary operators as strings, e.g. ``df.apply("sqrt")``, which was already the case for :meth:`Series.apply` (:issue:`39116`)
@@ -224,7 +224,6 @@ Other enhancements
- :meth:`.GroupBy.any` and :meth:`.GroupBy.all` return a ``BooleanDtype`` for columns with nullable data types (:issue:`33449`)
- Constructing a :class:`DataFrame` or :class:`Series` with the ``data`` argument being a Python iterable that is *not* a NumPy ``ndarray`` consisting of NumPy scalars will now result in a dtype with a precision the maximum of the NumPy scalars; this was already the case when ``data`` is a NumPy ``ndarray`` (:issue:`40908`)
- Add keyword ``sort`` to :func:`pivot_table` to allow non-sorting of the result (:issue:`39143`)
-

.. ---------------------------------------------------------------------------

@@ -831,6 +830,8 @@ I/O
- Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`)
- Bug in :func:`read_orc` always raising ``AttributeError`` (:issue:`40918`)
- Bug in the conversion from pyarrow to pandas (e.g. for reading Parquet) with nullable dtypes and a pyarrow array whose data buffer size is not a multiple of dtype size (:issue:`40896`)
- Bug in read_excel would raise an error when pandas could not determine the file type, even when user specified the ``engine`` argument (:issue:`41225`)
Member

:func:`read_excel`

-

Period
^^^^^^
30 changes: 20 additions & 10 deletions pandas/io/excel/_base.py
@@ -1014,16 +1014,21 @@ def close(self):
return content


XLS_SIGNATURE = b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"
XLS_SIGNATURES = (
b"\x09\x00\x04\x00\x07\x00\x10\x00", # BIFF2
b"\x09\x02\x06\x00\x00\x00\x10\x00", # BIFF3
b"\x09\x04\x06\x00\x00\x00\x10\x00", # BIFF4
b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", # Compound File Binary
)
ZIP_SIGNATURE = b"PK\x03\x04"
PEEK_SIZE = max(len(XLS_SIGNATURE), len(ZIP_SIGNATURE))
PEEK_SIZE = max(map(len, XLS_SIGNATURES + (ZIP_SIGNATURE, )))


@doc(storage_options=_shared_docs["storage_options"])
def inspect_excel_format(
content_or_path: FilePathOrBuffer,
storage_options: StorageOptions = None,
) -> str:
) -> str | None:
"""
Inspect the path or content of an excel file and get its format.

@@ -1037,8 +1042,8 @@ def inspect_excel_format(

Returns
-------
str
Format of file.
str or None
Format of file if it can be determined.

Raises
------
@@ -1063,10 +1068,10 @@ def inspect_excel_format(
peek = buf
stream.seek(0)

if peek.startswith(XLS_SIGNATURE):
if any(peek.startswith(sig) for sig in XLS_SIGNATURES):
return "xls"
elif not peek.startswith(ZIP_SIGNATURE):
raise ValueError("File is not a recognized excel file")
return None

# ZipFile typing is overly-strict
# https://github.com/python/typeshed/issues/4212
@@ -1174,8 +1179,12 @@ def __init__(
ext = inspect_excel_format(
content_or_path=path_or_buffer, storage_options=storage_options
)
if ext is None:
raise ValueError(
"Excel file format cannot be determined, you must specify "
"an engine manually."
)

# ext will always be valid, otherwise inspect_excel_format would raise
engine = config.get_option(f"io.excel.{ext}.reader", silent=True)
if engine == "auto":
engine = get_default_engine(ext, mode="reader")
@@ -1190,12 +1199,13 @@ def __init__(
path_or_buffer, storage_options=storage_options
)

if ext != "xls" and xlrd_version >= "2":
# Pass through if ext is None, otherwise check if ext valid for xlrd
if ext and ext != "xls" and xlrd_version >= "2":
raise ValueError(
f"Your version of xlrd is {xlrd_version}. In xlrd >= 2.0, "
f"only the xls format is supported. Install openpyxl instead."
)
elif ext != "xls":
elif ext and ext != "xls":
Contributor

If engine="xlrd" and ext is None (because inspect_excel_format could not identify the format), what happens here? It seems the error is pushed down to self._engines[engine](self._io, storage_options).

If this is a genuine oversight, then a test for this case, plus an error message, might be worth adding; otherwise this can be ignored.

Contributor Author

@attack68 thanks for looking over this! My understanding of the desired behavior was to go ahead and try reading it with the xlrd engine even if pandas can't determine the filetype (i.e., ext=None). I suppose I could add an else case to the if statement you're referencing that produces a warning?

Any input is appreciated, thanks!

Contributor

Maybe. I'm not sure what the best solution is; I just wanted to make sure the case was considered. Someone else might have a good idea, or we can leave it as you described.

Member

If the user is specifying the engine explicitly, I do not think we should raise an error/warning. They have already instructed what they want to do.
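
The behaviour agreed on in this thread can be summarised with a small sketch. This is not the pandas implementation: `resolve_engine` is a hypothetical helper, and the `defaults` mapping is an illustrative assumption standing in for the `io.excel.<ext>.reader` option lookup that the real code performs.

```python
def resolve_engine(ext, engine=None):
    """Hypothetical helper mirroring the control flow discussed above."""
    if engine is not None:
        # The user chose an engine explicitly: pass it through unchanged,
        # even when the file format could not be detected (ext is None).
        return engine
    if ext is None:
        # No engine given and no detectable format: nothing to fall back on.
        raise ValueError(
            "Excel file format cannot be determined, you must specify "
            "an engine manually."
        )
    # Assumed defaults for illustration only; pandas reads these from options.
    defaults = {"xls": "xlrd", "xlsx": "openpyxl", "xlsb": "pyxlsb", "ods": "odf"}
    return defaults[ext]
```

With this shape, an unreadable file combined with an explicit engine fails inside the engine itself (for xlrd, with an `XLRDError`), which matches the updated `test_corrupt_bytes_raises` expectations in this PR.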

caller = inspect.stack()[1]
if (
caller.filename.endswith(
14 changes: 12 additions & 2 deletions pandas/tests/io/excel/test_readers.py
@@ -718,9 +718,19 @@ def test_missing_file_raises(self, read_ext):

def test_corrupt_bytes_raises(self, read_ext, engine):
bad_stream = b"foo"
if engine is None or engine == "xlrd":
if engine is None:
error = ValueError
msg = "File is not a recognized excel file"
msg = (
"Excel file format cannot be determined, you must "
"specify an engine manually."
)
elif engine == "xlrd":
from xlrd import XLRDError
error = XLRDError
msg = (
"Unsupported format, or corrupt file: Expected BOF "
"record; found b'foo'"
)
else:
error = BadZipFile
msg = "File is not a zip file"
17 changes: 17 additions & 0 deletions pandas/tests/io/excel/test_xlrd.py
@@ -1,3 +1,4 @@
import io
import pytest

from pandas.compat._optional import import_optional_dependency
@@ -7,6 +8,7 @@
from pandas.tests.io.excel import xlrd_version

from pandas.io.excel import ExcelFile
from pandas.io.excel._base import inspect_excel_format

xlrd = pytest.importorskip("xlrd")
xlwt = pytest.importorskip("xlwt")
@@ -77,3 +79,18 @@ def test_read_excel_warning_with_xlsx_file(datapath):
else:
with tm.assert_produces_warning(None):
pd.read_excel(path, "Sheet1", engine=None)


@pytest.mark.parametrize(
"file_header",
[
b"\x09\x00\x04\x00\x07\x00\x10\x00",
b"\x09\x02\x06\x00\x00\x00\x10\x00",
b"\x09\x04\x06\x00\x00\x00\x10\x00",
b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",
]
)
def test_read_old_xls_files(file_header):
# GH 41226
f = io.BytesIO(file_header)
assert inspect_excel_format(f) == "xls"
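
For readers who want to try the signature check outside pandas, here is a minimal standalone sketch. The byte signatures are copied from the diff above; the function name `sniff_excel_family` and its "zip" return value are hypothetical, and unlike `inspect_excel_format` it does not open the ZIP archive to tell xlsx, xlsb, and ods apart. It also restores the stream to its starting offset rather than seeking to 0, which is an assumption of this sketch.

```python
import io

# Signatures copied from the diff above.
XLS_SIGNATURES = (
    b"\x09\x00\x04\x00\x07\x00\x10\x00",  # BIFF2
    b"\x09\x02\x06\x00\x00\x00\x10\x00",  # BIFF3
    b"\x09\x04\x06\x00\x00\x00\x10\x00",  # BIFF4
    b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1",  # Compound File Binary
)
ZIP_SIGNATURE = b"PK\x03\x04"
# Read just enough bytes to compare against the longest signature.
PEEK_SIZE = max(map(len, XLS_SIGNATURES + (ZIP_SIGNATURE,)))


def sniff_excel_family(stream):
    """Return "xls", "zip", or None from the first bytes of a binary stream."""
    start = stream.tell()
    peek = stream.read(PEEK_SIZE)
    stream.seek(start)  # leave the stream where we found it
    if any(peek.startswith(sig) for sig in XLS_SIGNATURES):
        return "xls"
    if peek.startswith(ZIP_SIGNATURE):
        return "zip"  # telling xlsx/xlsb/ods apart needs a look inside the archive
    return None


if __name__ == "__main__":
    print(sniff_excel_family(io.BytesIO(XLS_SIGNATURES[0])))
    print(sniff_excel_family(io.BytesIO(b"foo")))
```

Returning `None` instead of raising is the design change this PR makes: the caller decides whether an undetectable format is an error, which depends on whether an engine was supplied.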