Implement parsing for ParquetWriterOptions #4693

Closed
4 tasks
alamb opened this issue Aug 13, 2023 · 6 comments
Labels
enhancement (Any new improvement worthy of an entry in the changelog), good first issue (Good for newcomers), parquet (Changes to the parquet crate)

Comments

@alamb
Contributor

alamb commented Aug 13, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We are implementing configurable parquet writing in DataFusion.

We want to allow users to specify the parquet writing options (like compression) via a string like:

set parquet.writer_version = 2.0
set parquet.compression = zstd(5)

Describe the solution you'd like
Implement FromStr for the following structures, with some tests.

Each of these can be done via a separate PR

The basic code can probably be ported from DataFusion, with some unit tests added: https://github.com/apache/arrow-datafusion/blob/ed85abbb878ef3d60e43797376cb9a40955cd89a/datafusion/core/src/datasource/file_format/parquet.rs#L13

Bonus points for good error messages that give example values (like "Invalid encoding. Valid values: plain_dictionary, rle, etc.").
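
To make the "good error messages" idea concrete, here is a minimal sketch of parsing a value such as zstd(5). The `Compression` enum below is a hypothetical stand-in defined only for this example (it is not the parquet crate's type), and the set of accepted names and the error wording are assumptions.

```rust
use std::str::FromStr;

/// Hypothetical stand-in for a compression setting; NOT the parquet crate's type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Compression {
    Uncompressed,
    Snappy,
    Zstd(i32),
}

impl FromStr for Compression {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Normalize so that matching is case insensitive.
        let lower = s.trim().to_ascii_lowercase();
        // Split "zstd(5)" into the codec name and an optional level.
        let (name, level) = match lower.split_once('(') {
            Some((name, rest)) => {
                let level = rest
                    .strip_suffix(')')
                    .ok_or_else(|| format!("Invalid compression '{s}': missing ')'"))?;
                (name, Some(level))
            }
            None => (lower.as_str(), None),
        };
        match (name, level) {
            ("uncompressed", None) => Ok(Compression::Uncompressed),
            ("snappy", None) => Ok(Compression::Snappy),
            ("zstd", Some(level)) => level
                .trim()
                .parse::<i32>()
                .map(Compression::Zstd)
                .map_err(|e| format!("Invalid zstd level '{level}': {e}")),
            // Anything else gets an error that lists example values.
            _ => Err(format!(
                "Invalid compression '{s}'. Valid values: uncompressed, snappy, zstd(level), e.g. zstd(5)"
            )),
        }
    }
}
```

With this sketch, "zstd(5)".parse::<Compression>() yields Ok(Compression::Zstd(5)), while an unknown name returns an Err whose message lists the accepted values. In this sketch a bare "zstd" without a level is rejected; whether a default level should apply is a design choice left open here.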

Describe alternatives you've considered

Additional context

@devinjdangelo implemented parsing for these in apache/datafusion#7244. However, I think these features could be more generally useful to others.

alamb added the good first issue, parquet, and enhancement labels Aug 13, 2023
alamb added a commit to apache/datafusion that referenced this issue Aug 13, 2023
yjshen pushed a commit to apache/datafusion that referenced this issue Aug 13, 2023
@fansehep
Contributor

It looks like you want the end result to be something like this:

let writer_properties = WriterProperties::from_str("
  set parquet.writer_version = 2.0
  set parquet.compression = zstd(5)").unwrap();

I want to try it. Do we really need the 'set' prefix and parquet.*?

@alamb
Contributor Author

alamb commented Oct 10, 2023

I want to try it.

Hi @fansehep -- thank you.

Do we really need the 'set' prefix and parquet.*?

No. I am sorry for the confusion; the set ... terminology is from DataFusion. This ticket in arrow-rs only covers parsing the values.

So, for example, implementing FromStr for https://docs.rs/parquet/45.0.0/parquet/format/struct.Encoding.html would allow something like this:

// `parse` is provided by the FromStr implementation
let encoding: Encoding = "PLAIN".parse().unwrap();
// parse again; matching is case insensitive
let encoding: Encoding = "plain".parse().unwrap();
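
For anyone picking this up, a case-insensitive FromStr along those lines might look roughly like the sketch below. The `Encoding` enum here is a hypothetical three-variant stand-in so the example is self-contained; the real parquet type has more variants, and the accepted spellings and error text are assumptions.

```rust
use std::str::FromStr;

/// Hypothetical stand-in for the parquet `Encoding` type (only three variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Plain,
    PlainDictionary,
    Rle,
}

impl FromStr for Encoding {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Lowercase the input so "PLAIN" and "plain" are treated the same.
        match s.trim().to_ascii_lowercase().as_str() {
            "plain" => Ok(Encoding::Plain),
            "plain_dictionary" => Ok(Encoding::PlainDictionary),
            "rle" => Ok(Encoding::Rle),
            // Report the bad input together with the accepted values.
            other => Err(format!(
                "Invalid encoding '{other}'. Valid values: plain, plain_dictionary, rle"
            )),
        }
    }
}
```

Because `str::parse` delegates to FromStr, both "PLAIN".parse::<Encoding>() and "plain".parse::<Encoding>() return Ok(Encoding::Plain) with this sketch, and an unknown value returns an Err listing the valid names.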

@fansehep
Contributor

So, for example, implementing FromStr for https://docs.rs/parquet/45.0.0/parquet/format/struct.Encoding.html would allow something like this:

// `parse` is provided by the FromStr implementation
let encoding: Encoding = "PLAIN".parse().unwrap();
// parse again; matching is case insensitive
let encoding: Encoding = "plain".parse().unwrap();

Thanks for your help. 😃

@alamb
Contributor Author

alamb commented Oct 10, 2023

Thanks for your help. 😃

Thank YOU!

@raulcd
Member

raulcd commented Nov 28, 2024

This one seems to be finished by:

@alamb is there anything missing from what you were expecting? Can we close it?

@alamb
Contributor Author

alamb commented Dec 3, 2024

I agree this looks reasonable to me.

Thanks @raulcd and @fansehep

alamb closed this as completed Dec 3, 2024