Panic when concatenating incompatible schemas in streaming engine #19528

Closed
2 tasks done
pomo-mondreganto opened this issue Oct 30, 2024 · 0 comments · Fixed by #21075
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

pomo-mondreganto commented Oct 30, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import pandas as pd
import polars as pl

df1 = pd.DataFrame([{"id": 123, "s": "abacaba"}])
df2 = pd.DataFrame([{"id": 1}]).merge(pd.DataFrame([{"id": 2, "s": "foobar"}]), on=["id"], how="left")

df1.to_parquet('1.parquet')
df2.to_parquet('2.parquet')

# df1 schema:
# message schema {
#   OPTIONAL INT64 id;
#   OPTIONAL BYTE_ARRAY s (STRING);
# }

# df2 schema (pandas infers it quite strangely, yes):
# message schema {
#   OPTIONAL INT64 id;
#   OPTIONAL INT32 s (UNKNOWN);
# }

lf1 = pl.scan_parquet('1.parquet')
lf2 = pl.scan_parquet('2.parquet')

# pl.concat([lf1, lf2]).collect(streaming=True) # doesn't panic
# pl.concat([lf2, lf1]).collect(streaming=False) # doesn't panic
print(pl.concat([lf2, lf1]).collect(streaming=True)) # panics

Log output

POLARS PREFETCH_SIZE: 20
RUN STREAMING PIPELINE
[union -> ordered_sink]
STREAMING CHUNK SIZE: 25000 rows
thread '<unnamed>' panicked at crates/polars-core/src/frame/mod.rs:950:36:
should not fail: SchemaMismatch(ErrString("type String is incompatible with expected type Null"))
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[20], line 1
----> 1 pl.concat([lf2, lf1]).collect(streaming=True)

File ~/miniconda3/envs/dirty3.12/lib/python3.12/site-packages/polars/lazyframe/frame.py:2034, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, engine, background, _eager, **_kwargs)
   2032 # Only for testing purposes
   2033 callback = _kwargs.get("post_opt_callback", callback)
-> 2034 return wrap_df(ldf.collect(callback))

PanicException: should not fail: SchemaMismatch(ErrString("type String is incompatible with expected type Null"))

Issue description

The use case is obviously wrong, but it shouldn't panic. The issue is more serious in Rust: the panic occurs inside the rayon thread pool, which just calls abort, so it's impossible to catch and handle it.

The issue happens only with streaming enabled; otherwise Polars reports a correct SchemaError: type String is incompatible with expected type Null.
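
On affected versions, one way to avoid the panic is to compare the input schemas before concatenating. The following is only a minimal sketch and not part of Polars: `find_schema_conflicts` is a hypothetical helper, and the plain dicts are stand-ins for the schemas of the two Parquet files above (with the dtype names taken from the error message).

```python
from collections.abc import Mapping, Sequence


def find_schema_conflicts(schemas: Sequence[Mapping[str, object]]) -> list[str]:
    """Compare every schema against the first one and describe any mismatches.

    Hypothetical helper for illustration; each schema maps column name -> dtype.
    """
    if not schemas:
        return []
    reference = schemas[0]
    conflicts = []
    for index, schema in enumerate(schemas[1:], start=1):
        # Different column names or order: report once and skip dtype checks.
        if list(schema) != list(reference):
            conflicts.append(
                f"input {index}: columns {list(schema)} != {list(reference)}"
            )
            continue
        for name, dtype in schema.items():
            if dtype != reference[name]:
                conflicts.append(
                    f"input {index}: column {name!r} is {dtype}, "
                    f"expected {reference[name]}"
                )
    return conflicts


# Stand-ins for the schemas of 2.parquet and 1.parquet (assumed dtype names).
schema_lf2 = {"id": "Int64", "s": "Null"}
schema_lf1 = {"id": "Int64", "s": "String"}

conflicts = find_schema_conflicts([schema_lf2, schema_lf1])
if conflicts:
    print("refusing to concat:", conflicts)
```

With real LazyFrames, the mappings could presumably be obtained from `lf.collect_schema()` for each input, and the check run before calling `pl.concat`; whether to raise, cast, or skip on a conflict is left to the caller.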

Expected behavior

It shouldn't panic; it should report the same SchemaError as the non-streaming engine.

Installed versions

--------Version info---------
Polars:              1.7.0
Index type:          UInt32
Platform:            macOS-14.5-arm64-arm-64bit
Python:              3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 08:01:35) [Clang 16.0.6 ]

----Optional dependencies----
adbc_driver_manager  1.0.0
altair               <not installed>
cloudpickle          3.0.0
connectorx           0.3.3
deltalake            <not installed>
fastexcel            0.10.4
fsspec               2023.12.2
gevent               24.2.1
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             <not installed>
pandas               2.2.2
pyarrow              16.1.0
pydantic             2.7.1
pyiceberg            0.6.1
sqlalchemy           2.0.30
torch                <not installed>
xlsx2csv             0.8.2
xlsxwriter           3.2.0
@pomo-mondreganto pomo-mondreganto added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Oct 30, 2024