ENH: loosen XLS signature #41321
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -195,7 +195,7 @@ Other enhancements | |
- Improved integer type mapping from pandas to SQLAlchemy when using :meth:`DataFrame.to_sql` (:issue:`35076`) | ||
- :func:`to_numeric` now supports downcasting of nullable ``ExtensionDtype`` objects (:issue:`33013`) | ||
- Add support for dict-like names in :class:`MultiIndex.set_names` and :class:`MultiIndex.rename` (:issue:`20421`) | ||
- :func:`pandas.read_excel` can now auto detect .xlsb files (:issue:`35416`) | ||
- :func:`pandas.read_excel` can now auto detect .xlsb files (:issue:`35416`) and older .xls files (:issue:`41225`) | ||
- :class:`pandas.ExcelWriter` now accepts an ``if_sheet_exists`` parameter to control the behaviour of append mode when writing to existing sheets (:issue:`40230`) | ||
- :meth:`.Rolling.sum`, :meth:`.Expanding.sum`, :meth:`.Rolling.mean`, :meth:`.Expanding.mean`, :meth:`.ExponentialMovingWindow.mean`, :meth:`.Rolling.median`, :meth:`.Expanding.median`, :meth:`.Rolling.max`, :meth:`.Expanding.max`, :meth:`.Rolling.min`, and :meth:`.Expanding.min` now support ``Numba`` execution with the ``engine`` keyword (:issue:`38895`, :issue:`41267`) | ||
- :meth:`DataFrame.apply` can now accept NumPy unary operators as strings, e.g. ``df.apply("sqrt")``, which was already the case for :meth:`Series.apply` (:issue:`39116`) | ||
|
@@ -224,7 +224,6 @@ Other enhancements | |
- :meth:`.GroupBy.any` and :meth:`.GroupBy.all` return a ``BooleanDtype`` for columns with nullable data types (:issue:`33449`) | ||
- Constructing a :class:`DataFrame` or :class:`Series` with the ``data`` argument being a Python iterable that is *not* a NumPy ``ndarray`` consisting of NumPy scalars will now result in a dtype with a precision the maximum of the NumPy scalars; this was already the case when ``data`` is a NumPy ``ndarray`` (:issue:`40908`) | ||
- Add keyword ``sort`` to :func:`pivot_table` to allow non-sorting of the result (:issue:`39143`) | ||
- | ||
|
||
.. --------------------------------------------------------------------------- | ||
|
||
|
@@ -831,6 +830,8 @@ I/O | |
- Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`) | ||
- Bug in :func:`read_orc` always raising ``AttributeError`` (:issue:`40918`) | ||
- Bug in the conversion from pyarrow to pandas (e.g. for reading Parquet) with nullable dtypes and a pyarrow array whose data buffer size is not a multiple of dtype size (:issue:`40896`) | ||
- Bug in read_excel would raise an error when pandas could not determine the file type, even when user specified the ``engine`` argument (:issue:`41225`) | ||
|
||
- | ||
|
||
Period | ||
^^^^^^ | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1014,16 +1014,21 @@ def close(self): | |
return content | ||
|
||
|
||
XLS_SIGNATURE = b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1" | ||
XLS_SIGNATURES = ( | ||
b"\x09\x00\x04\x00\x07\x00\x10\x00", # BIFF2 | ||
b"\x09\x02\x06\x00\x00\x00\x10\x00", # BIFF3 | ||
b"\x09\x04\x06\x00\x00\x00\x10\x00", # BIFF4 | ||
b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", # Compound File Binary | ||
) | ||
ZIP_SIGNATURE = b"PK\x03\x04" | ||
PEEK_SIZE = max(len(XLS_SIGNATURE), len(ZIP_SIGNATURE)) | ||
PEEK_SIZE = max(map(len, XLS_SIGNATURES + (ZIP_SIGNATURE, ))) | ||
|
||
|
||
@doc(storage_options=_shared_docs["storage_options"]) | ||
def inspect_excel_format( | ||
content_or_path: FilePathOrBuffer, | ||
storage_options: StorageOptions = None, | ||
) -> str: | ||
) -> str | None: | ||
""" | ||
Inspect the path or content of an excel file and get its format. | ||
|
||
|
@@ -1037,8 +1042,8 @@ def inspect_excel_format( | |
|
||
Returns | ||
------- | ||
str | ||
Format of file. | ||
str or None | ||
Format of file if it can be determined. | ||
|
||
Raises | ||
------ | ||
|
@@ -1063,10 +1068,10 @@ def inspect_excel_format( | |
peek = buf | ||
stream.seek(0) | ||
|
||
if peek.startswith(XLS_SIGNATURE): | ||
if any(peek.startswith(sig) for sig in XLS_SIGNATURES): | ||
return "xls" | ||
elif not peek.startswith(ZIP_SIGNATURE): | ||
raise ValueError("File is not a recognized excel file") | ||
return None | ||
|
||
# ZipFile typing is overly-strict | ||
# https://github.com/python/typeshed/issues/4212 | ||
|
@@ -1174,8 +1179,12 @@ def __init__( | |
ext = inspect_excel_format( | ||
content_or_path=path_or_buffer, storage_options=storage_options | ||
) | ||
if ext is None: | ||
raise ValueError( | ||
"Excel file format cannot be determined, you must specify " | ||
"an engine manually." | ||
) | ||
|
||
# ext will always be valid, otherwise inspect_excel_format would raise | ||
engine = config.get_option(f"io.excel.{ext}.reader", silent=True) | ||
if engine == "auto": | ||
engine = get_default_engine(ext, mode="reader") | ||
|
@@ -1190,12 +1199,13 @@ def __init__( | |
path_or_buffer, storage_options=storage_options | ||
) | ||
|
||
if ext != "xls" and xlrd_version >= "2": | ||
# Pass through if ext is None, otherwise check if ext valid for xlrd | ||
if ext and ext != "xls" and xlrd_version >= "2": | ||
raise ValueError( | ||
f"Your version of xlrd is {xlrd_version}. In xlrd >= 2.0, " | ||
f"only the xls format is supported. Install openpyxl instead." | ||
) | ||
elif ext != "xls": | ||
elif ext and ext != "xls": | ||
if if this is genuine oversight then a test might be worth it for this case, and an error message, otherwise can ignore..
@attack68 thanks for looking over this! My understanding of the desired behavior was to go ahead and try reading it with the Any input is appreciated, thanks!
maybe.. not sure what is the best solution really. just wanted to make sure the case was considered... someone else might have a good idea or just leave it with that which you stated!
If the user is specifying the engine explicitly, I do not think we should raise an error/warning. They have already instructed what they want to do.
|
||
caller = inspect.stack()[1] | ||
if ( | ||
caller.filename.endswith( | ||
|
(:issue:`35416`, :issue:`41225`) at the end of the sentence
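For readers who want to try the loosened signature check outside of pandas, here is a minimal standalone sketch. The signature constants and `PEEK_SIZE` are copied from the diff above; `sniff_format` is a hypothetical helper name, not a pandas function, and it only handles seekable binary streams. It also stops at the ZIP check and returns `"zip"`, whereas the diff suggests `inspect_excel_format` goes on to look inside the archive, which is omitted here.

```python
import io

# Signatures copied from the diff above.
XLS_SIGNATURES = (
    b"\x09\x00\x04\x00\x07\x00\x10\x00",  # BIFF2
    b"\x09\x02\x06\x00\x00\x00\x10\x00",  # BIFF3
    b"\x09\x04\x06\x00\x00\x00\x10\x00",  # BIFF4
    b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1",  # Compound File Binary
)
ZIP_SIGNATURE = b"PK\x03\x04"
PEEK_SIZE = max(map(len, XLS_SIGNATURES + (ZIP_SIGNATURE,)))


def sniff_format(stream):
    """Hypothetical helper: return "xls", "zip", or None for a seekable stream."""
    stream.seek(0)
    peek = stream.read(PEEK_SIZE)
    stream.seek(0)  # leave the stream where a reader expects it

    if any(peek.startswith(sig) for sig in XLS_SIGNATURES):
        return "xls"
    if peek.startswith(ZIP_SIGNATURE):
        # Assumption: a real implementation would inspect the archive here.
        return "zip"
    # Unknown format: return None instead of raising, so the caller can
    # decide whether an explicitly chosen engine should still be tried.
    return None


biff2 = io.BytesIO(b"\x09\x00\x04\x00\x07\x00\x10\x00" + b"\x00" * 16)
unknown = io.BytesIO(b"not an excel file")
print(sniff_format(biff2), sniff_format(unknown))
```

Returning `None` rather than raising mirrors the change in this PR: the caller raises only when no engine was specified, and otherwise passes the file through to the engine the user asked for.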