feat: Add existing parquet files #960
base: main
Conversation
Thanks @jonathanc-n for this PR. This seems to duplicate the fast_append API; is there anything missing in that API?
@liurenjie1024 I believe FastAppend is for directly appending
I have concerns about putting this into the transaction API. It seems that what's necessary is to build a
This API is different from FastAppend:
In iceberg-python, it's an API in the transaction: https://github.com/apache/iceberg-python/blob/dd175aadfdf03df707bed37008f217258a916369/pyiceberg/table/__init__.py#L671.
Yes, we should verify the schema. This is also done in iceberg-python: https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L2431.
Yes, I will put this out for review after I add some schema validation; it should be out tomorrow.
@ZENOTME Do you think the API should be kept in transactions or somewhere else?
It's OK to keep it in the transaction, since it's a safe operation if we have schema validation, and that's what iceberg-python does. Are there any other concerns about this? @liurenjie1024
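For concreteness, a rough sketch of the transaction-level shape under discussion. All three helpers are hypothetical stand-ins (the point is only the order of operations: validate, convert, then append), and the fast-append calls follow iceberg-rust's transaction API at the time of this PR:

```rust
use iceberg::spec::DataFile;
use iceberg::transaction::Transaction;
use iceberg::Result;
use parquet::file::metadata::ParquetMetaData;

// Hypothetical helpers: fetch and parse a footer, validate it, convert it.
async fn read_parquet_footer(_path: &str) -> Result<ParquetMetaData> {
    todo!("read the file footer via FileIO")
}
fn check_schema_compatibility(_meta: &ParquetMetaData) -> Result<()> {
    todo!("compare the file schema against the table schema")
}
fn to_data_file(_meta: &ParquetMetaData) -> Result<DataFile> {
    todo!("derive statistics and build a DataFile")
}

// Validate each file, convert it to a DataFile, then fast-append the batch.
async fn add_parquet_files(tx: Transaction<'_>, paths: Vec<String>) -> Result<Transaction<'_>> {
    let mut files = Vec::with_capacity(paths.len());
    for path in &paths {
        let meta = read_parquet_footer(path).await?;
        check_schema_compatibility(&meta)?;
        files.push(to_data_file(&meta)?);
    }
    let mut append = tx.fast_append(None, vec![])?;
    append.add_data_files(files)?;
    append.apply().await
}
```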
PyIceberg and Iceberg-Java are a bit different: where PyIceberg is used by end users, Iceberg-Java is often embedded in a query engine. I think this is the reason why it isn't part of the transaction API there. Spark does have an `add_files` procedure. To successfully add files to a table, I think three things are essential:
@Fokko I was looking to do the name mapping and more metrics in another PR. Would you rather I include it in this one?
@jonathanc-n Let's do that in a separate PR 👍 |
For metadata retrieval, it seems I can use
Thanks for your investigation @jonathanc-n! But it seems the parsed metadata contains enough information (such as statistics).
I think the parsed format (the in-memory representation) may be friendlier for us to extract information from (actually, I will convert the thrift format to the parsed format). So I think we should refine `to_data_file_builder` to take the parsed format. 🤔
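For reference, a minimal sketch of the read side (a hypothetical local file path stands in for FileIO here): the parquet crate's reader already hands back the parsed `ParquetMetaData`, statistics included:

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() {
    // Hypothetical local path; in iceberg-rust the bytes would come through FileIO.
    let file = File::open("data/00000-0-example.parquet").expect("open file");
    let reader = SerializedFileReader::new(file).expect("parse footer");
    let meta = reader.metadata(); // &ParquetMetaData: the parsed, in-memory form
    println!("row groups: {}", meta.num_row_groups());
    for rg in meta.row_groups() {
        for col in rg.columns() {
            // Per-column statistics are already decoded here.
            println!("{} -> {:?}", col.column_path(), col.statistics());
        }
    }
}
```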
Applying the name mapping relies on SchemaVisitorWithPartner, which will be introduced in #731. Maybe we can skip schemas without field IDs for now and support name-mapping application after #731 is complete.
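The field-id check could be as simple as the sketch below, assuming the `PARQUET:field_id` metadata key, which is the convention iceberg-rust's Arrow schema conversion uses:

```rust
use arrow_schema::Schema as ArrowSchema;

// Convention used by iceberg-rust's Arrow <-> Iceberg schema conversion.
const PARQUET_FIELD_ID_META_KEY: &str = "PARQUET:field_id";

/// Returns true only if every top-level field carries an Iceberg field id.
fn has_field_ids(schema: &ArrowSchema) -> bool {
    schema
        .fields()
        .iter()
        .all(|f| f.metadata().contains_key(PARQUET_FIELD_ID_META_KEY))
}
```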
I have changed the data file builder and reimplemented the original. I couldn't change the parameter passed to
I think iceberg-rust's position is more like iceberg-java, since there are already SQL engines written in Rust (DataFusion, Databend, Polars, etc.), and integrating them with iceberg-rust makes things easier. I think it's a good idea to have methods to convert a parquet file to a data file, including the work mentioned by @Fokko. But it's better to leave the other things to the query engines, since they involve IO and parallelism management.
So in summary, for this feature, what we would like to provide is a
What do you think, @liurenjie1024 @jonathanc-n @Fokko?
@ZENOTME Yes, that sounds fine to me. This PR contains the parquet-files-to-data-files conversion. However, I plan on making a follow-up PR with the schema compatibility check.
This sounds reasonable to me. After we have this function, we could extend
crates/iceberg/src/transaction.rs
Outdated
"Partition field should only be a primitive type.", | ||
) | ||
})?; | ||
Ok(Some(primitive_type.type_to_literal())) |
Looking at the `type_to_literal` function, I don't think this is correct. We don't want to set default values in the partition spec. I think there are two options:
- Don't allow appending to partitioned tables for now and add logic to infer the partitions in a later PR (a minimal guard for this is sketched after this list).
- Mark the data file as unpartitioned.
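A minimal guard for the first option might look like this; the accessor names follow iceberg-rust's `Table`/`TableMetadata` but may differ slightly across versions:

```rust
use iceberg::table::Table;
use iceberg::{Error, ErrorKind, Result};

fn ensure_unpartitioned(table: &Table) -> Result<()> {
    // Reject partitioned tables for now; partition inference is a later PR.
    if !table.metadata().default_partition_spec().fields().is_empty() {
        return Err(Error::new(
            ErrorKind::FeatureUnsupported,
            "adding existing files to partitioned tables is not supported yet",
        ));
    }
    Ok(())
}
```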
Yeah, this makes sense. We can go with the first option and add a TODO on the check for whether a partition exists.
Thanks @jonathanc-n for this PR. Generally I'm fine with the current direction, but I have some suggestions about code organization.
crates/iceberg/src/transaction.rs
Outdated
```rust
}

/// `ParquetMetadata` to data file builder
pub fn parquet_to_data_file_builder(
```
I have some suggestions for this method:
- It duplicates a lot of the parquet file writer; we should reuse that code.
- This method should not be part of `Transaction`; it should live in the parquet module.
I think I mentioned the problem with reusing the function here earlier. The writer returns the raw metadata, which is what the original `ParquetWriter::to_data_file_builder` takes. In this case we are doing a read with the `ArrowFileReader`, which can only return the parsed metadata. This could be fixed if there were some conversion function from raw to parsed, but I haven't been able to find one 🤔
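For what it's worth, one possible raw-to-parsed conversion can be assembled from the parquet crate's public constructors. This is a rough sketch: column orders are skipped, and the constructor signatures are from memory, so treat it as a starting point rather than a confirmed API:

```rust
use std::sync::Arc;

use parquet::errors::Result;
use parquet::file::metadata::{
    FileMetaData as ParsedFileMetaData, ParquetMetaData, RowGroupMetaData,
};
use parquet::format::FileMetaData as ThriftFileMetaData;
use parquet::schema::types::{from_thrift, SchemaDescriptor};

fn thrift_to_parsed(t: ThriftFileMetaData) -> Result<ParquetMetaData> {
    // Rebuild the parsed schema from the flat thrift schema elements.
    let schema = from_thrift(&t.schema)?;
    let schema_descr = Arc::new(SchemaDescriptor::new(schema));

    // Convert each thrift row group into its parsed counterpart.
    let row_groups = t
        .row_groups
        .into_iter()
        .map(|rg| RowGroupMetaData::from_thrift(schema_descr.clone(), rg))
        .collect::<Result<Vec<_>>>()?;

    let file_meta = ParsedFileMetaData::new(
        t.version,
        t.num_rows,
        t.created_by,
        t.key_value_metadata,
        schema_descr,
        None, // column orders omitted in this sketch
    );
    Ok(ParquetMetaData::new(file_meta, row_groups))
}
```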
Thanks @jonathanc-n, I just finished the first round of review and left some suggestions for improvement.
crates/iceberg/src/transaction.rs
Outdated
```rust
transaction: Transaction<'a>,
file_paths: Vec<String>,
```

Suggested change:

```rust
&mut self,
file_path: &str
```
As an API, I would suggest asking the user to provide one filename only, so that users can do it concurrently.
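Usage-wise, a per-file API lets callers own the fan-out, e.g. converting many files to data files concurrently before appending them in one commit. `parquet_file_to_data_file` below is a hypothetical stand-in for the per-file conversion:

```rust
use futures::future::try_join_all;
use iceberg::spec::DataFile;
use iceberg::Result;

// Hypothetical per-file conversion.
async fn parquet_file_to_data_file(_path: &str) -> Result<DataFile> {
    todo!("read the footer and build a DataFile")
}

// Fan out across files; each conversion is independent.
async fn convert_all(paths: &[String]) -> Result<Vec<DataFile>> {
    try_join_all(paths.iter().map(|p| parquet_file_to_data_file(p))).await
}
```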
I believe support for this should be added after #949; as of right now, duplicate actions are not allowed.
```rust
/// `ParquetMetadata` to data file builder
pub fn parquet_to_data_file_builder(
    schema: SchemaRef,
    metadata: Arc<ParquetMetaData>,
```
I don't get why we need to duplicate so much code with `ParquetWriter::to_data_file_builder`. How about simply extracting `FileMetaData` from `ParquetMetaData` and calling `to_data_file_builder`?
I'm a bit confused here. To clarify, the `FileMetaData` in `ParquetMetaData` is different from the thrift `FileMetaData`. If I were to convert the `ParquetMetaData` to that `FileMetaData`, I don't think the conversion overhead would be worth it.
You are right. The currently used `FileMetaData` is the auto-generated data structure from the thrift definition, while the `FileMetaData` I'm suggesting is an in-memory representation of it. I think we should use the in-memory representation in the original one.
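To spell out the two types (paths as in the `parquet` crate):

```rust
use parquet::file::metadata::{FileMetaData, ParquetMetaData}; // parsed, in-memory form
use parquet::format::FileMetaData as ThriftFileMetaData; // thrift-generated, returned by writers

// On the read path, the parsed form is already at hand: no thrift decoding needed.
fn parsed_view(meta: &ParquetMetaData) -> &FileMetaData {
    meta.file_metadata()
}
```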
Tracked in #1004
I can do that after the merge. Is there anything else that needs to be changed?
Completes #932.

Allows adding existing parquet files by using the parquet metadata to create `DataFile`s, then fast-appending them using the existing API.
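Putting it together, end-to-end usage could look roughly like this. `add_parquet_files` is a hypothetical stand-in for the new API; loading and committing follow iceberg-rust's existing catalog and transaction APIs:

```rust
use iceberg::transaction::Transaction;
use iceberg::{Catalog, Result, TableIdent};

// Hypothetical: convert the footers to DataFiles and fast-append them.
async fn add_parquet_files(tx: Transaction<'_>, _files: Vec<String>) -> Result<Transaction<'_>> {
    todo!("as sketched earlier in the thread")
}

async fn add_existing_files<C: Catalog>(
    catalog: &C,
    ident: &TableIdent,
    files: Vec<String>,
) -> Result<()> {
    let table = catalog.load_table(ident).await?;
    let tx = Transaction::new(&table);
    let tx = add_parquet_files(tx, files).await?;
    tx.commit(catalog).await?; // produces the updated table
    Ok(())
}
```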