Struct with decimals not read properly in parquet #16692

Closed
2 tasks done
theelderbeever opened this issue Jun 3, 2024 · 8 comments · Fixed by #20720
Labels
bug (Something isn't working), P-medium (Priority: medium), python (Related to Python Polars)

Comments

@theelderbeever

theelderbeever commented Jun 3, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

I marked this as a Python bug since that is where I encountered it; however, I would expect the same bug to exist in Rust.

Reproducible example

This is the smallest reproducible example I could find. Removing ANY row or field, or unnesting the top-level struct, makes the write succeed.

import polars as pl

data = [
    {
        "plan": {
            "metadata": {"a": 1},
            "tiers": [
                {
                    "unit_amount_decimal": "0.0001",
                }
            ],
        }
    },
    {
        "plan": {
            "metadata": {"a": 1},
            "tiers": [
                # {
                #     "unit_amount_decimal": "0",
                # },
                {
                    "unit_amount_decimal": "0.0001",
                },
            ],
        }
    },
]

pl.DataFrame(data).write_parquet("test.parquet")

Table

plan { struct[2] }
{{1},[{"0.0001"}]}
{{1},[{"0.0001"}]}

Log output

❯ RUST_BACKTRACE=1 POLARS_VERBOSE=1 python notebooks/test.py
thread '<unnamed>' panicked at crates/polars-arrow/src/array/struct_/mod.rs:117:52:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("The children must have an equal number of values.\n                         However, the values at index 1 have a length of 3, which is different from values at index 0, 0."))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/taylorbeever/git/quiknode-labs/billing/billing-platform-pipelines/notebooks/test.py", line 35, in <module>
    pl.DataFrame(data).write_parquet("test.parquet")
  File "/Users/taylorbeever/.pyenv/versions/billing-platform-pipelines/lib/python3.11/site-packages/polars/dataframe/frame.py", line 3292, in write_parquet
    self._df.write_parquet(
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("The children must have an equal number of values.\n                         However, the values at index 1 have a length of 3, which is different from values at index 0, 0."))

Issue description

I am attempting to write a parquet file of data that I fetched from the Stripe API. The API's JSON response is extremely nested. Writing the data structure in the example fails because the struct's children have differing numbers of values. If use_pyarrow=True is set, the write succeeds.

From trial and error, the failure seems to specifically require a column that is a struct containing both a struct field and a list field. Values nested deeper than col.struct.{struct,list} don't appear to affect the outcome, and the write still fails even when the list is empty.
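
The panic message refers to an invariant of struct arrays: every child field must hold the same number of values. The following is a minimal pure-Python sketch of that check, not the actual polars-arrow code; the function name `check_struct_children` and the sample child names are hypothetical.

```python
def check_struct_children(children: dict[str, list]) -> int:
    """Return the common length of all children, or raise if they differ."""
    lengths = {name: len(values) for name, values in children.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(
            f"The children must have an equal number of values, got {lengths}"
        )
    # A struct with no children is treated as length 0 in this sketch.
    return next(iter(lengths.values()), 0)


# Hypothetical children mirroring the lengths reported in the log (0 and 3).
mismatched = {"child_0": [], "child_1": [1, 2, 3]}
try:
    check_struct_children(mismatched)
except ValueError as exc:
    print(f"rejected: {exc}")
```

In the failing write, one child apparently ended up with 0 values while its sibling had 3, which is the situation this check rejects.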

Expected behavior

The DataFrame should write to parquet successfully.

Installed versions

--------Version info---------
Polars:               0.20.30
Index type:           UInt32
Platform:             macOS-14.4.1-arm64-arm-64bit
Python:               3.11.8 (main, Apr 27 2024, 07:50:56) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:  1.0.0
cloudpickle:          2.2.1
connectorx:           0.3.3
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2023.12.2
gevent:               <not installed>
hvplot:               0.9.2
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             2.5.3
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.29
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@theelderbeever added the bug, needs triage, and python labels on Jun 3, 2024
@theelderbeever
Author

Also, writing with use_pyarrow=True and reading the file back yields an equivalent DataFrame:

df = pl.DataFrame(data)
df.write_parquet("test.parquet", use_pyarrow=True)
df == pl.read_parquet("test.parquet")
plan { bool }
true
true

@cmdlineluser
Contributor

This is fixed in 0.20.31

@theelderbeever
Author

@cmdlineluser I completely didn't catch that there was a release just 2 days ago... Just upgraded.

@theelderbeever
Author

theelderbeever commented Jun 3, 2024

@cmdlineluser Reads are still broken when the struct fields inside the list are a mix of Decimal AND some other type.

from decimal import Decimal

import polars as pl

print(pl.__version__)

pl.Config.activate_decimals(True)

df = pl.DataFrame(
    [
        {
            "tiers": [
                {
                    "in_tier": 10.0,
                    "overage_cents": Decimal("0E-12"),
                },
                {
                    "in_tier": 0.0,
                    "overage_cents": Decimal("0E-12"),
                },
            ]
        },
        {
            "tiers": [
                {
                    "in_tier": 10.0,
                    "overage_cents": Decimal("0.001000000000"),
                }
            ]
        },
    ]
)

print(df.schema)

df.write_parquet("tiers.parquet")
pl.read_parquet("tiers.parquet")

@theelderbeever
Author

theelderbeever commented Jun 3, 2024

Additionally, the decimal values inside the struct are either not written to the file or not read back from it: the DataFrame read back below has no rows. Writing with use_pyarrow=True stores the decimal values correctly.

from decimal import Decimal

import polars as pl

print(pl.__version__)

pl.Config.activate_decimals(True)

df = pl.DataFrame(
    [
        {
            "tiers": [
                {
                    # "in_tier": 10.0,
                    "overage_cents": Decimal("0E-12"),
                },
                {
                    # "in_tier": 0.0,
                    "overage_cents": Decimal("0E-12"),
                },
            ]
        },
        {
            "tiers": [
                {
                    # "in_tier": 10.0,
                    "overage_cents": Decimal("0.001000000000"),
                }
            ]
        },
    ]
)

print(df.schema)

print(df)

df.write_parquet("tiers.parquet")
print(pl.read_parquet("tiers.parquet"))

"""
0.20.31
OrderedDict([('tiers', List(Struct({'overage_cents': Decimal(precision=None, scale=12)})))])
| tiers                                |
| ---                                  |
| list[struct[1]]                      |
|--------------------------------------|
| [{0.000000000000}, {0.000000000000}] |
| [{0.001000000000}]                   |

| tiers           |
| ---             |
| list[struct[1]] |
|-----------------|
"""

@cmdlineluser
Contributor

D'oh - apologies.

Just for reference, the previous report was

(But wasn't decimal related.)

@theelderbeever
Author

@cmdlineluser no worries. Want me to open a separate issue for decimals specifically?

@ritchie46 changed the title from '"The children must have an equal number of values" error when writing parquet with nested values and nulls' to 'Struct with decimals not read properly' on Jun 4, 2024
@ritchie46 added the P-medium label and removed the needs triage label on Jun 4, 2024
@github-project-automation (bot) moved this to Ready in Backlog on Jun 4, 2024
@ritchie46 changed the title from 'Struct with decimals not read properly' to 'Struct with decimals not read properly in parquet' on Jun 4, 2024
@lukemanley
Contributor

lukemanley commented Dec 27, 2024

All of the examples above now seem to work on main and are covered by tests, e.g. test_nested_decimal.


4 participants