
Can't write + read null Array type data #15130

Closed
akhilles opened this issue Mar 18, 2024 · 5 comments · Fixed by #20720

Labels: bug, needs triage, python

Comments

akhilles commented Mar 18, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

# Two fixed-size Array columns; Array_2 contains a null entry in the second row.
df = pl.DataFrame(
    [
        pl.Series("Array_1", [[1, 3], [2, 5]]),
        pl.Series("Array_2", [[1, 7, 3], None]),
    ],
    schema={
        "Array_1": pl.Array(pl.Int64, 2),
        "Array_2": pl.Array(pl.Int64, 3),
    },
)
print(df)

# Round-trip through Parquet; printing the re-read frame panics.
df.write_parquet("repro.parquet")
df = pl.read_parquet("repro.parquet")
print(df)

Log output

shape: (2, 2)
┌───────────────┬───────────────┐
│ Array_1       ┆ Array_2       │
│ ---           ┆ ---           │
│ array[i64, 2] ┆ array[i64, 3] │
╞═══════════════╪═══════════════╡
│ [1, 3]        ┆ [1, 7, 3]     │
│ [2, 5]        ┆ null          │
└───────────────┴───────────────┘
thread '<unnamed>' panicked at crates/polars-core/src/fmt.rs:513:13:
The column lengths in the DataFrame are not equal.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/akhilles/src/pl-repro/repro.py", line 16, in <module>
    print(df)
  File "/Users/akhilles/src/pl-repro/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py", line 1517, in __str__
    return self._df.as_str()
           ^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: The column lengths in the DataFrame are not equal.

Issue description

Null Array entries are dropped when the Parquet file is written.
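
A possible interim workaround (an untested sketch, not from the original report; it assumes that List-typed columns round-trip nulls through Parquet correctly and that the List-to-Array cast is supported on this version) is to cast Array columns to List before writing and back after reading:

import polars as pl

# Hypothetical workaround: store the fixed-size array as a variable-length
# List on disk, which is assumed (not verified here) to preserve nulls.
df.with_columns(pl.col("Array_2").cast(pl.List(pl.Int64))).write_parquet(
    "repro.parquet"
)
df = pl.read_parquet("repro.parquet").with_columns(
    # Restore the fixed-size Array dtype after reading.
    pl.col("Array_2").cast(pl.Array(pl.Int64, 3))
)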

Expected behavior

The dataframe should be identical after writing to and reading from parquet.
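
Stated as a check (a sketch reusing df from the reproducible example above; polars.testing.assert_frame_equal is Polars' standard frame-equality helper):

import polars as pl
from polars.testing import assert_frame_equal

# The round trip should be lossless: same shape, dtypes, and null entries.
df.write_parquet("repro.parquet")
assert_frame_equal(pl.read_parquet("repro.parquet"), df)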

Installed versions

--------Version info---------
Polars:               0.20.15
Index type:           UInt32
Platform:             macOS-14.4-arm64-arm-64bit
Python:               3.12.2 (main, Mar  6 2024, 16:17:39) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              15.0.1
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
akhilles added the bug, needs triage, and python labels on Mar 18, 2024

akhilles commented Mar 18, 2024

I'm not able to reproduce this issue with JSON write+read:

import polars as pl

pl.show_versions()
df = pl.DataFrame(
    [
        pl.Series("Array_1", [[1, 3], [2, 5]]),
        pl.Series("Array_2", [[1, 7, 3], None]),
    ],
    schema={
        "Array_1": pl.Array(pl.Int64, 2),
        "Array_2": pl.Array(pl.Int64, 3),
    },
)
print(df)
df.write_json("repro.json")
df = pl.read_json("repro.json")
print(df)

Output:

shape: (2, 2)
┌───────────────┬───────────────┐
│ Array_1       ┆ Array_2       │
│ ---           ┆ ---           │
│ array[i64, 2] ┆ array[i64, 3] │
╞═══════════════╪═══════════════╡
│ [1, 3]        ┆ [1, 7, 3]     │
│ [2, 5]        ┆ null          │
└───────────────┴───────────────┘
shape: (2, 2)
┌───────────────┬───────────────┐
│ Array_1       ┆ Array_2       │
│ ---           ┆ ---           │
│ array[i64, 2] ┆ array[i64, 3] │
╞═══════════════╪═══════════════╡
│ [1, 3]        ┆ [1, 7, 3]     │
│ [2, 5]        ┆ null          │
└───────────────┴───────────────┘

akhilles commented

With only a single column, there is no panic; instead, the null entry is silently dropped:

shape: (2, 1)
┌───────────────┐
│ Array_1       │
│ ---           │
│ array[i64, 2] │
╞═══════════════╡
│ [1, 3]        │
│ null          │
└───────────────┘
shape: (1, 1)
┌───────────────┐
│ Array_1       │
│ ---           │
│ array[i64, 2] │
╞═══════════════╡
│ [1, 3]        │
└───────────────┘
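
One way to confirm the row is lost at write time rather than read time is to inspect the file's metadata directly (a sketch using pyarrow, which the repro environment already has installed):

import pyarrow.parquet as pq

# The file was written from a 2-row DataFrame; if the null Array entry is
# dropped during the write, the Parquet metadata will report only 1 row.
print(pq.read_metadata("repro.parquet").num_rows)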

akhilles commented

This appears to be a more fundamental limitation of Arrow + Parquet: apache/arrow#24425. Until this is supported, I think failing the write_parquet operation when nulls are present would be slightly better behavior than silently dropping rows.
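
A user-side version of that fail-fast behavior might look like the following (a sketch; write_parquet_strict is a hypothetical helper, not a Polars API):

import polars as pl

def write_parquet_strict(df: pl.DataFrame, path: str) -> None:
    # Hypothetical guard: refuse to write while null Array entries would be
    # silently dropped by the Parquet writer.
    for name, dtype in df.schema.items():
        if isinstance(dtype, pl.Array) and df[name].null_count() > 0:
            msg = f"column {name!r} has null Array entries; write would drop them"
            raise ValueError(msg)
    df.write_parquet(path)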

loicmagne commented

I had the same issue. If this is a fundamental limitation of the pl.Array type, I think it should be properly mentioned in the docs.

lukemanley (Contributor) commented

This is fixed and tested via test_parquet_array_dtype_nulls.
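
A round-trip test in that spirit might look like the following (a sketch, not the actual test_parquet_array_dtype_nulls from the Polars test suite):

import io

import polars as pl
from polars.testing import assert_frame_equal

def test_array_dtype_null_roundtrip() -> None:
    # A null entry in a fixed-size Array column must survive the round trip.
    df = pl.DataFrame(
        {"a": [[1, 2], None]},
        schema={"a": pl.Array(pl.Int64, 2)},
    )
    buf = io.BytesIO()
    df.write_parquet(buf)
    buf.seek(0)
    assert_frame_equal(pl.read_parquet(buf), df)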
