max() on empty LazyFrame returned by scan_delta() causes PanicException #19890

Closed
2 tasks done
ramonvermeulen opened this issue Nov 20, 2024 · 0 comments · Fixed by #19884
Labels: accepted (Ready for implementation), bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)


ramonvermeulen commented Nov 20, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

When I create a new delta table via Databricks with the following SQL:

CREATE TABLE IF NOT EXISTS does.not.matter (
  ts TIMESTAMP_NTZ,
  loc STRING
) USING delta
LOCATION 'gs://path/to/delta/table'
TBLPROPERTIES ('delta.enableDeletionVectors' = false);

And when I then try to get the max timestamp via Polars:

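import polars as pl

# self.storage_options: storage credentials from the surrounding class (not shown).
a = (
    pl.scan_delta(
        source="gs://path/to/delta/table",
        storage_options=self.storage_options,
    )
    .select(pl.col("ts").max())
    .collect()
)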

I get a PanicException.

When I then insert one record via:

INSERT INTO does.not.matter (ts, loc) VALUES ("2024-11-20 12:34:56", "A");

And try it again, it works fine.

print(a)
shape: (1, 1)
┌─────────────────────┐
│ ts                  │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2024-11-20 12:34:56 │
└─────────────────────┘

Also, when I try to do this on an entirely fresh LazyFrame (not created via scan_delta) it seems to work fine:

lf = pl.LazyFrame({"ts": pl.Series([], dtype=pl.Datetime)})
a = lf.select(pl.col("ts").max()).collect()
print(a)
shape: (1, 1)
┌──────────────┐
│ ts           │
│ ---          │
│ datetime[μs] │
╞══════════════╡
│ null         │
└──────────────┘

So it is probably related to how scan_delta() creates a LazyFrame? As a workaround I could check in code whether the table has any records first, but the PanicException does not seem like desired behavior to me.
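A minimal sketch of that workaround, assuming the panic is triggered by a Delta table that has no data files yet (the issue does not confirm this). The helper names aggregate_if_nonempty and max_ts_or_none are hypothetical. The guard itself is plain Python; the Polars and deltalake calls are kept inside the second function so the guard can be tried without a real table.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def aggregate_if_nonempty(
    data_files: Sequence[str], run_query: Callable[[], T]
) -> Optional[T]:
    """Run the query only if the table has at least one data file (hypothetical helper)."""
    if len(data_files) == 0:
        # Nothing to scan: skip the query entirely and report "no value".
        return None
    return run_query()


def max_ts_or_none(table_uri: str, storage_options: dict):
    """Hypothetical wrapper: max of the "ts" column, or None for a table with no files."""
    # Imported lazily so the guard above stays usable without these packages.
    import polars as pl
    from deltalake import DeltaTable

    # List the data files that make up the current table version.
    files = DeltaTable(table_uri, storage_options=storage_options).file_uris()
    return aggregate_if_nonempty(
        files,
        lambda: pl.scan_delta(table_uri, storage_options=storage_options)
        .select(pl.col("ts").max())
        .collect()
        .item(),
    )
```

With this guard an empty table yields None, which mirrors the null returned by the plain LazyFrame example; the query on the scan_delta() LazyFrame is never built or collected in that case.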

Log output

Traceback (most recent call last):
  File "/home/ramon/.pycharm_helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<input>", line 7, in <module>
  File "/home/ramon/projects/work/xxxx/.venv/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 2029, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value
thread '<unnamed>' panicked at /home/runner/xxxx/polars/polars/crates/polars-core/src/utils/mod.rs:743:34:
called `Option::unwrap()` on a `None` value

Issue description

An aggregation on an empty LazyFrame returned from scan_delta() raises a PanicException. I have only verified this for max(), but other aggregation functions probably behave the same way.

Expected behavior

I would expect the same result as in the second example, where I created my own LazyFrame: a single-row DataFrame containing null, rather than a PanicException. I first want to check whether this is indeed a bug and not desired functionality; most likely it is specific to scan_delta().

Installed versions

pl.show_versions()
--------Version info---------
Polars:              1.14.0
Index type:          UInt32
Platform:            Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Python:              3.12.7 (main, Oct  7 2024, 22:29:45) [GCC 13.2.0]
LTS CPU:             False
----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            0.21.0
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          2.36.0
great_tables         <not installed>
matplotlib           <not installed>
nest_asyncio         <not installed>
numpy                1.26.4
openpyxl             3.1.5
pandas               2.2.3
pyarrow              16.1.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

ramonvermeulen added the bug, needs triage, and python labels on Nov 20, 2024
ramonvermeulen changed the title from "max() on empty LazyFrame returned by scan_delta() causes Panic" to "max() on empty LazyFrame returned by scan_delta() causes PanicException" on Nov 20, 2024
c-peters added the accepted label on Nov 25, 2024
github-project-automation bot moved this to Ready in Backlog on Nov 25, 2024
c-peters moved this from Ready to Done in Backlog on Nov 25, 2024
3 participants