Mixing RLE_DICTIONARY and other column encodings in pyarrow parquet #43442

bkief · 2024-07-27T04:41:30Z

Describe the bug, including details regarding any error messages, version, and platform.

The ValueError at

Lines 1360 to 1367 in aaeff72

    
        'DELTA_BYTE_ARRAY': ParquetEncoding_DELTA_BYTE_ARRAY,
        'RLE_DICTIONARY': 'dict',
        'PLAIN_DICTIONARY': 'dict',
    }.get(encoding_name, None)
    if enc is None:
        raise ValueError(f"Unsupported column encoding: {encoding_name!r}")
    elif enc == 'dict':
        raise ValueError(f"{encoding_name!r} is already used by default.")

is raised whenever any column is given a dictionary encoding through column_encoding. This makes it impossible to mix a dictionary-encoded column with a column that uses something like DELTA_BINARY_PACKED. I understand the check is there to avoid duplicating use_dictionary. Would it be okay to move this ValueError to the calling function instead? Does anything at the C++ level prevent this?

arrow/python/pyarrow/_parquet.pyx

Lines 1971 to 1972 in aaeff72

    
    elif isinstance(column_encoding, str):
        props.encoding(encoding_enum_from_name(column_encoding))
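
To make the proposal concrete, here is a minimal standalone sketch of what moving the check into the caller could look like. It is not pyarrow's code: `lookup_encoding` and `plan_column_encodings` are hypothetical names, the encodings are plain strings rather than the Cython enum values, and whether the C++ writer would accept the resulting mix is exactly the open question above.

```python
# Standalone sketch; none of these names are pyarrow API.
DICT_ENCODINGS = {"RLE_DICTIONARY", "PLAIN_DICTIONARY"}
OTHER_ENCODINGS = {"PLAIN", "DELTA_BINARY_PACKED", "DELTA_BYTE_ARRAY"}


def lookup_encoding(encoding_name):
    """Hypothetical lookup: map a name to an encoding without rejecting
    dictionary names (they come back as the 'dict' marker)."""
    if encoding_name in DICT_ENCODINGS:
        return "dict"
    if encoding_name in OTHER_ENCODINGS:
        return encoding_name
    raise ValueError(f"Unsupported column encoding: {encoding_name!r}")


def plan_column_encodings(column_encoding):
    """Hypothetical caller: returns (dict_columns, per_column, default).

    A single string applies to every column, so a dictionary name there
    would only duplicate use_dictionary and is still rejected. In a
    per-column mapping, dictionary names are collected instead of raising.
    """
    if isinstance(column_encoding, str):
        enc = lookup_encoding(column_encoding)
        if enc == "dict":
            raise ValueError(
                f"{column_encoding!r} is already used by default.")
        return set(), {}, enc

    dict_columns, per_column = set(), {}
    for column, name in column_encoding.items():
        enc = lookup_encoding(name)
        if enc == "dict":
            dict_columns.add(column)   # would enable dictionary for this column
        else:
            per_column[column] = enc   # would set an explicit encoding
    return dict_columns, per_column, None


if __name__ == "__main__":
    print(plan_column_encodings(
        {"id": "DELTA_BINARY_PACKED", "name": "RLE_DICTIONARY"}))
```

In this sketch the per-column mapping is the only place where dictionary and non-dictionary encodings can be mixed; the string form keeps the existing error.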

Component(s)

Parquet, Python


mapleFU · 2024-07-27T06:12:43Z

I didn't fully catch what you mean. Do you mean:

Use dictionary encoding by default.
Use Delta as the fallback encoding: if the dictionary gets too large, a row group falls back to Delta.

Am I right?

bkief added the Type: bug label Jul 27, 2024

github-actions bot added Component: Parquet Component: Python labels Jul 27, 2024
