Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exportFlattenedVector does not support nested encoding and non-scalar types #9821

Closed
rui-mo opened this issue May 15, 2024 · 5 comments
Closed
Labels
enhancement New feature or request

Comments

@rui-mo
Copy link
Collaborator

rui-mo commented May 15, 2024

Description

In 9830814, flattenDictionary and flattenConstant are set as true for Parquet write, which relies on Bridge to convert Velox vector as Arrow array. When VectorFuzzer generates nested dictionary-encoded vector or non-scalar types, exporting to Arrow fails at below checks.

VELOX_CHECK(
vec.valueVector() == nullptr || vec.wrappedVector()->isFlatEncoding(),
"An unsupported nested encoding was found.");
VELOX_CHECK(vec.isScalar(), "Flattening is only supported for scalar types.");
VELOX_DYNAMIC_SCALAR_TYPE_DISPATCH(
flattenAndExport, vec.typeKind(), vec, rows, options, out, pool, holder);

@mbasmanova
Copy link
Contributor

CC: @Yuhta

@rui-mo Does this imply that ParquetWriter cannot create files for tables with columns of type array/map/struct?

@rui-mo
Copy link
Collaborator Author

rui-mo commented May 16, 2024

@mbasmanova I think only when the vector is dictionary-encoded, we cannot create Parquet for tables with complex types. If not, they are supported as below in Bridge.

case VectorEncoding::Simple::ROW:
exportRows(
*vec.asUnchecked<RowVector>(), rows, options, out, pool, *holder);
break;
case VectorEncoding::Simple::ARRAY:
exportArrays(
*vec.asUnchecked<ArrayVector>(), rows, options, out, pool, *holder);
break;
case VectorEncoding::Simple::MAP:
exportMaps(
*vec.asUnchecked<MapVector>(), rows, options, out, pool, *holder);
break;
case VectorEncoding::Simple::DICTIONARY:
options.flattenDictionary
? exportFlattenedVector(vec, rows, options, out, pool, *holder)
: exportDictionary(vec, rows, options, out, pool, *holder);
break;

@mbasmanova
Copy link
Contributor

@rui-mo If Parquet writer can handle all types, but only flat encodings, then we can simply flatten data before writing to Parquet in the Fuzzer.

@rui-mo
Copy link
Collaborator Author

rui-mo commented May 16, 2024

@mbasmanova Got it. I will try as you suggested. Thanks.

@rui-mo
Copy link
Collaborator Author

rui-mo commented Jul 2, 2024

This issue could be resolved by flattening data before writing into Parquet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants