Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet: Clean up Parquet generic and internal readers #12102

Merged
merged 7 commits into from
Jan 28, 2025

Conversation

rdblue
Copy link
Contributor

@rdblue rdblue commented Jan 25, 2025

This is a refactor that cleans up a few issues I noticed while reviewing #11904 and while working on Parquet variant readers.

  • Updates INT and UINT handling to reject unsupported unsigned types (like UINT64)
  • Adds more factory methods to ParquetValueReaders to keep implementations private
  • Removes unused types argument from ParquetValueReader.StructReader subclasses. Some accessible uses had to be left, but most are removed. This allowed removing types from the reader builders as well.
  • Removes unnecessary TimeUnit param from factory methods added in Parquet: Add readers and writers for the internal object model #11904
  • Removes unnecessary implementations of setPageSource
  • Adds imports for LogicalTypeAnnotation classes to cut down on multi-line method signatures

Changes are grouped into separate commits for easier review.

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is great, thanks for splitting into multiple commits, that made it easy to review. We'll need to update revAPI with the new ParquetValueReader.createStructReader interface which doesn't take in explicit types

Comment on lines 90 to 91
protected abstract ParquetValueReader<T> createStructReader(
List<Type> types, List<ParquetValueReader<?>> fieldReaders, Types.StructType structType);
List<ParquetValueReader<?>> fieldReaders, Types.StructType structType);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RevAPI is failing due to the new abstract method that implementations will need to implement (which I understand the rationale, the previous types argument were not used). I think we'll need to add the breaking change to revAPI

./gradlew :iceberg-parquet:revapiAcceptBreak --justification "Implementations of ParquetValueReader.createStructReader should not have to pass in explicit types" \
          --code "java.method.abstractMethodAdded" \
          --new "method org.apache.iceberg.parquet.ParquetValueReader<T> org.apache.iceberg.data.parquet.BaseParquetReaders<T>::createStructReader(java.util.List<org.apache.iceberg.parquet.ParquetValueReader<?>>, org.apache.iceberg.types.Types.StructType)"

@ajantha-bhat
Copy link
Member

Thanks for cleaning this part of the code. I can work on moving parquet to core (already had a PR long back) after this to avoid conflicts for this PR.

@amogh-jahagirdar
Copy link
Contributor

Thanks @rdblue and @ajantha-bhat for reviewing!

@amogh-jahagirdar amogh-jahagirdar merged commit 8e456ae into apache:main Jan 28, 2025
46 checks passed
// this will throw an UnsupportedOperationException
return Optional.empty();
}
Preconditions.checkArgument(
Copy link
Member

@ajantha-bhat ajantha-bhat Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We wanted to throw the exception when it is signed right?

Now it throws when it is unsigned. check should be inverted?

Copy link
Member

@ajantha-bhat ajantha-bhat Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, got it. Based on the error message, we should not read unsigned. So, previous check was wrong!

While fixing nits, we actually fixed the check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have lot of gaps in test coverage for parquet readers like milli-time and timestamp reading, int96 reading, all kinds of signed, unsigned reads. I will create a ticket to follow this up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants