Mixing RLE_DICTIONARY and other column encodings in pyarrow parquet #43442

Open
bkief opened this issue Jul 27, 2024 · 1 comment

Contributor

bkief commented Jul 27, 2024

Describe the bug, including details regarding any error messages, version, and platform.

The ValueError at

        'DELTA_BYTE_ARRAY': ParquetEncoding_DELTA_BYTE_ARRAY,
        'RLE_DICTIONARY': 'dict',
        'PLAIN_DICTIONARY': 'dict',
    }.get(encoding_name, None)
    if enc is None:
        raise ValueError(f"Unsupported column encoding: {encoding_name!r}")
    elif enc == 'dict':
        raise ValueError(f"{encoding_name!r} is already used by default.")
is raised whenever any column is explicitly assigned a dictionary encoding. This makes it impossible to mix a dictionary-encoded column with a column that uses something like DELTA_BINARY_PACKED. I understand the check is there to avoid duplicating use_dictionary. Would it be okay to move this ValueError to the calling function instead? Does anything at the C++ level prevent this?
elif isinstance(column_encoding, str):
    props.encoding(encoding_enum_from_name(column_encoding))
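
To make the proposal concrete, here is a rough pure-Python sketch of what moving the check into the caller could look like. It is not the actual pyarrow code: the encoding table uses plain strings as stand-ins for the ParquetEncoding_* enum values, and encoding_from_name and resolve_column_encodings are hypothetical names invented for this illustration. The lookup only rejects unknown names, and the caller decides what a dictionary encoding means for a given column.

```python
# Stand-ins for the Cython enum values (ParquetEncoding_*); illustration only.
_ENCODINGS = {
    "PLAIN": "PLAIN",
    "DELTA_BINARY_PACKED": "DELTA_BINARY_PACKED",
    "DELTA_BYTE_ARRAY": "DELTA_BYTE_ARRAY",
    "RLE_DICTIONARY": "dict",
    "PLAIN_DICTIONARY": "dict",
}


def encoding_from_name(encoding_name):
    """Hypothetical lookup: only unknown names are rejected here."""
    enc = _ENCODINGS.get(encoding_name)
    if enc is None:
        raise ValueError(f"Unsupported column encoding: {encoding_name!r}")
    return enc


def resolve_column_encodings(column_encoding):
    """Hypothetical caller that owns the dictionary-encoding check.

    Returns (dictionary_columns, per_column_encodings, global_encoding).
    """
    if isinstance(column_encoding, str):
        enc = encoding_from_name(column_encoding)
        if enc == "dict":
            # A single global dictionary encoding duplicates use_dictionary,
            # so the original error is kept for this case.
            raise ValueError(f"{column_encoding!r} is already used by default.")
        return [], {}, enc

    dictionary_columns = []
    per_column = {}
    for column, name in column_encoding.items():
        enc = encoding_from_name(name)
        if enc == "dict":
            # Per-column request: route it to dictionary encoding instead of raising.
            dictionary_columns.append(column)
        else:
            per_column[column] = enc
    return dictionary_columns, per_column, None


result = resolve_column_encodings(
    {"a": "RLE_DICTIONARY", "b": "DELTA_BINARY_PACKED"}
)
print(result)  # (['a'], {'b': 'DELTA_BINARY_PACKED'}, None)
```

In this sketch the columns in dictionary_columns would be handled the same way as a per-column use_dictionary list, while the rest go through props.encoding. Whether the C++ writer properties accept that combination is exactly the open question above.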

Component(s)

Parquet, Python

Member

mapleFU commented Jul 27, 2024

I didn't fully catch what you mean. Do you mean:

  1. Use dictionary encoding by default
  2. Use Delta as the fallback encoding: if the dictionary grows too large, a row group falls back to Delta

Am I right?
