write_parquet encoding no longer recognized by PBI Service parquet connector from Polars 1.5.0 onwards #18819
Comments
My guess is that this has to do with the Boolean Hybrid-RLE encoding.
Yes, this is it exactly. When I drop my boolean columns from my parquet file in the latest Polars, PBI Service refreshes the file successfully.
It seems that the service only supports older parquet formats/encodings. For now you can circumvent the issue by writing via pyarrow, which allows you to select different encodings. This is something we could also support to a limited extent.
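A minimal sketch of that pyarrow route (data and file name are hypothetical; `column_encoding` can only be set with dictionary encoding turned off):

```python
import polars as pl
import pyarrow.parquet as pq

# Hypothetical data; "flag" stands in for the boolean column.
df = pl.DataFrame({"id": [1, 2, 3], "flag": [True, False, True]})

# Round-trip through Arrow and force the widely supported PLAIN
# encoding; pyarrow requires use_dictionary=False for this.
pq.write_table(
    df.to_arrow(),
    "data.parquet",
    use_dictionary=False,
    column_encoding="PLAIN",
)
```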
Writing with pyarrow worked in the meantime. I tried creating an issue with the PowerBI team, but it got caught by their triaging vendor, who claimed it had to do with Polars and not Power BI, so they wouldn't escalate it to the product team and recommended I downgrade instead.
I came across the same error. Here's how I worked around it:
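One plausible shape for such a workaround, assuming it casts the boolean columns to integers so write_parquet never emits the boolean-specific encoding (column names hypothetical):

```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "flag": [True, False, True]})

# Cast every Boolean column to UInt8 before writing; integer columns
# get encodings the PBI Service reader understands.
df.with_columns(pl.col(pl.Boolean).cast(pl.UInt8)).write_parquet("data.parquet")
```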
I ended up using the following, as I wanted to keep booleans in their proper data type:
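A sketch of what that could look like, assuming the pyarrow route with the zstd settings mentioned just below (file name hypothetical):

```python
import polars as pl
import pyarrow.parquet as pq

df = pl.DataFrame({"id": [1, 2, 3], "flag": [True, False, True]})

# pyarrow's writer keeps the Boolean column in its proper type while
# producing encodings the PBI Service connector can read.
pq.write_table(
    df.to_arrow(),
    "data.parquet",
    compression="zstd",
    compression_level=22,
)
```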
In fact, my pyarrow file compressed better than the Polars one when I tested writing with compression='zstd' and compression_level=22 in both at the time. Hopefully Polars adds a way to stay compatible with PBI Service when booleans are in the dataset, because based on my awful experience with their "support team" I can almost guarantee they will not fix it on their end.
This same issue breaks Amazon Redshift import from parquet files. Working around it for now with pyarrow.
This is not really ready yet unless we have compatibility profiles. Fixes pola-rs#18819.
Checks
Reproducible example
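A minimal example of the write path in question (data and file name hypothetical; any DataFrame containing a Boolean column reproduces it):

```python
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "flag": [True, False, True]})

# On Polars >= 1.5.0 the boolean column is written with an encoding
# that the PBI Service parquet connector rejects as "Unknown encoding type."
df.write_parquet("data.parquet")
```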
Log output
No response
Issue description
Just to explain the setup a bit:
The parquet file gets written to a network drive. A report published to the PBI Service connects to this parquet file through an on-premises gateway.
Refreshing works on a local copy of the PBI file, but through the PBI Service specifically it now gives an error:
Data source error: {"error":{"code":"DM_GWPipeline_Gateway_MashupDataAccessError","pbi.error":{"code":"DM_GWPipeline_Gateway_MashupDataAccessError","parameters":{},"details":[{"code":"DM_ErrorDetailNameCode_UnderlyingErrorCode","detail":{"type":1,"value":"-2147467259"}},{"code":"DM_ErrorDetailNameCode_UnderlyingErrorMessage","detail":{"type":1,"value":"Parquet: class parquet::ParquetException (message: 'Unknown encoding type.')"}},{"code":"DM_ErrorDetailNameCode_UnderlyingHResult","detail":{"type":1,"value":"-2147467259"}},{"code":"Microsoft.Data.Mashup.ValueError.Reason","detail":{"type":1,"value":"DataFormat.Error"}}],"exceptionCulprit":1}}}
This refreshes fine locally; the problem is the PBI Service specifically. I tested generating my parquet files version by version from Polars 1.2 up to current, and the errors start with Polars 1.5.0's write_parquet specifically.
I believe something changed in the write_parquet output that makes it incompatible with the PBI Service's parquet connector in newer versions. I have compared the schema and the metadata, and they are identical between the old and new output.
Expected behavior
As nothing has changed in my schema or metadata, the files should refresh; instead, it seems that write_parquet's encoding is not recognized by the PBI Service from 1.5.0 onwards.
Installed versions