
Can't write + read null Array type data #15130

Closed
akhilles opened this issue Mar 18, 2024 · 5 comments · Fixed by #20720

Labels: bug, needs triage, python

Comments

akhilles commented Mar 18, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

# Two fixed-size Array columns; Array_2 contains a null entry in the second row.
df = pl.DataFrame(
    [
        pl.Series("Array_1", [[1, 3], [2, 5]]),
        pl.Series("Array_2", [[1, 7, 3], None]),
    ],
    schema={
        "Array_1": pl.Array(pl.Int64, 2),
        "Array_2": pl.Array(pl.Int64, 3),
    },
)
print(df)

# Round-trip through Parquet; printing the re-read frame panics.
df.write_parquet("repro.parquet")
df = pl.read_parquet("repro.parquet")
print(df)

Log output

shape: (2, 2)
┌───────────────┬───────────────┐
│ Array_1       ┆ Array_2       │
│ ---           ┆ ---           │
│ array[i64, 2] ┆ array[i64, 3] │
╞═══════════════╪═══════════════╡
│ [1, 3]        ┆ [1, 7, 3]     │
│ [2, 5]        ┆ null          │
└───────────────┴───────────────┘
thread '<unnamed>' panicked at crates/polars-core/src/fmt.rs:513:13:
The column lengths in the DataFrame are not equal.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/akhilles/src/pl-repro/repro.py", line 16, in <module>
    print(df)
  File "/Users/akhilles/src/pl-repro/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py", line 1517, in __str__
    return self._df.as_str()
           ^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: The column lengths in the DataFrame are not equal.

Issue description

Null Array entries are dropped when the Parquet file is written.
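
A possible interim workaround (an untested sketch, not from the original report; it assumes that List-typed columns round-trip nulls through Parquet correctly and that the List-to-Array cast is supported on this version) is to cast Array columns to List before writing and back after reading:

import polars as pl

# Hypothetical workaround: store the fixed-size array as a variable-length
# List on disk, which is assumed (not verified here) to preserve nulls.
df.with_columns(pl.col("Array_2").cast(pl.List(pl.Int64))).write_parquet(
    "repro.parquet"
)
df = pl.read_parquet("repro.parquet").with_columns(
    # Restore the fixed-size Array dtype after reading.
    pl.col("Array_2").cast(pl.Array(pl.Int64, 3))
)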

Expected behavior

The dataframe should be identical after writing to and reading from parquet.
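
Stated as a check (a sketch reusing df from the reproducible example above; polars.testing.assert_frame_equal is Polars' standard frame-equality helper):

import polars as pl
from polars.testing import assert_frame_equal

# The round trip should be lossless: same shape, dtypes, and null entries.
df.write_parquet("repro.parquet")
assert_frame_equal(pl.read_parquet("repro.parquet"), df)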

Installed versions

--------Version info---------
Polars:               0.20.15
Index type:           UInt32
Platform:             macOS-14.4-arm64-arm-64bit
Python:               3.12.2 (main, Mar  6 2024, 16:17:39) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              15.0.1
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
akhilles added the bug, needs triage, and python labels on Mar 18, 2024

akhilles commented Mar 18, 2024

I'm not able to reproduce this issue with JSON write+read:

import polars as pl

pl.show_versions()
df = pl.DataFrame(
    [
        pl.Series("Array_1", [[1, 3], [2, 5]]),
        pl.Series("Array_2", [[1, 7, 3], None]),
    ],
    schema={
        "Array_1": pl.Array(pl.Int64, 2),
        "Array_2": pl.Array(pl.Int64, 3),
    },
)
print(df)
df.write_json("repro.json")
df = pl.read_json("repro.json")
print(df)

Output:

shape: (2, 2)
┌───────────────┬───────────────┐
│ Array_1       ┆ Array_2       │
│ ---           ┆ ---           │
│ array[i64, 2] ┆ array[i64, 3] │
╞═══════════════╪═══════════════╡
│ [1, 3]        ┆ [1, 7, 3]     │
│ [2, 5]        ┆ null          │
└───────────────┴───────────────┘
shape: (2, 2)
┌───────────────┬───────────────┐
│ Array_1       ┆ Array_2       │
│ ---           ┆ ---           │
│ array[i64, 2] ┆ array[i64, 3] │
╞═══════════════╪═══════════════╡
│ [1, 3]        ┆ [1, 7, 3]     │
│ [2, 5]        ┆ null          │
└───────────────┴───────────────┘

akhilles commented

With only a single column, there is no panic; instead, the null entry is silently dropped:

shape: (2, 1)
┌───────────────┐
│ Array_1       │
│ ---           │
│ array[i64, 2] │
╞═══════════════╡
│ [1, 3]        │
│ null          │
└───────────────┘
shape: (1, 1)
┌───────────────┐
│ Array_1       │
│ ---           │
│ array[i64, 2] │
╞═══════════════╡
│ [1, 3]        │
└───────────────┘
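
One way to confirm the row is lost at write time rather than read time is to inspect the file's metadata directly (a sketch using pyarrow, which the repro environment already has installed):

import pyarrow.parquet as pq

# The file was written from a 2-row DataFrame; if the null Array entry is
# dropped during the write, the Parquet metadata will report only 1 row.
print(pq.read_metadata("repro.parquet").num_rows)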

akhilles commented

This appears to be a more fundamental limitation of Arrow + Parquet: apache/arrow#24425. Until this is supported, I think failing the write_parquet operation when nulls are present would be slightly better behavior than silently dropping rows.
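
A user-side version of that fail-fast behavior might look like the following (a sketch; write_parquet_strict is a hypothetical helper, not a Polars API):

import polars as pl

def write_parquet_strict(df: pl.DataFrame, path: str) -> None:
    # Hypothetical guard: refuse to write while null Array entries would be
    # silently dropped by the Parquet writer.
    for name, dtype in df.schema.items():
        if isinstance(dtype, pl.Array) and df[name].null_count() > 0:
            msg = f"column {name!r} has null Array entries; write would drop them"
            raise ValueError(msg)
    df.write_parquet(path)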

loicmagne commented

I had the same issue. If this is a fundamental limitation of the pl.Array type, I think it should be properly mentioned in the docs.

lukemanley (Contributor) commented

This is fixed and tested via test_parquet_array_dtype_nulls.
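
A round-trip test in that spirit might look like the following (a sketch, not the actual test_parquet_array_dtype_nulls from the Polars test suite):

import io

import polars as pl
from polars.testing import assert_frame_equal

def test_array_dtype_null_roundtrip() -> None:
    # A null entry in a fixed-size Array column must survive the round trip.
    df = pl.DataFrame(
        {"a": [[1, 2], None]},
        schema={"a": pl.Array(pl.Int64, 2)},
    )
    buf = io.BytesIO()
    df.write_parquet(buf)
    buf.seek(0)
    assert_frame_equal(pl.read_parquet(buf), df)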
