feat: Add existing parquet files #960

Open
wants to merge 21 commits into main

Conversation

@jonathanc-n (Contributor)

Completes #932

Allows adding existing Parquet files by using the Parquet metadata to create DataFiles, which are then fast-appended using the existing API.
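
A rough usage sketch of the intended flow. The method name add_parquet_files and its exact signature are assumptions based on this description, and the paths are placeholders; the final API may differ.

```rust
// Rough sketch only: `add_parquet_files` and the file paths are illustrative,
// not the merged API. The idea: read each file's footer, build a DataFile,
// then fast-append and commit through the existing transaction flow.
let tx = Transaction::new(&table);
let tx = tx
    .add_parquet_files(vec![
        "s3://bucket/warehouse/tbl/data/part-00000.parquet".to_string(),
        "s3://bucket/warehouse/tbl/data/part-00001.parquet".to_string(),
    ])
    .await?;
let _table = tx.commit(&catalog).await?;
```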

@liurenjie1024 (Contributor) left a comment

Thanks @jonathanc-n for this PR. This seems to duplicate the fast_append API; is there anything missing in that API?

@jonathanc-n (Contributor, Author)

@liurenjie1024 I believe FastAppend is for directly appending DataFiles to the snapshot, whereas this PR takes existing Parquet file paths, parses the Parquet metadata, and converts it into DataFiles, which are then fast-appended. @ZENOTME Would you like to verify this?

@liurenjie1024 (Contributor)

> @liurenjie1024 I believe FastAppend is for directly appending DataFiles to the snapshot, whereas this PR takes existing Parquet file paths, parses the Parquet metadata, and converts it into DataFiles, which are then fast-appended. @ZENOTME Would you like to verify this?

I have concerns about putting this into the transaction API. It seems that what's necessary is to build a DataFile from an existing Parquet file; the user could then call fast append to add it. But this is typically a dangerous operation because the schema in the Parquet file is not verified. Is it possible to use FileWriter to write data to Parquet in your case?

jonathanc-n marked this pull request as draft on February 10, 2025, 04:12
@ZENOTME (Contributor) commented Feb 10, 2025

> @liurenjie1024 I believe FastAppend is for directly appending DataFiles to the snapshot, whereas this PR takes existing Parquet file paths, parses the Parquet metadata, and converts it into DataFiles, which are then fast-appended. @ZENOTME Would you like to verify this?

> I have concerns about putting this into the transaction API. It seems that what's necessary is to build a DataFile from an existing Parquet file; the user could then call fast append to add it. But this is typically a dangerous operation because the schema in the Parquet file is not verified. Is it possible to use FileWriter to write data to Parquet in your case?

This API is different from FastAppend:

  • FastAppend is used to append DataFiles.
  • This API extracts DataFiles from existing files and then appends them.

In iceberg-python, it's an API on the transaction: https://github.com/apache/iceberg-python/blob/dd175aadfdf03df707bed37008f217258a916369/pyiceberg/table/__init__.py#L671.
But interestingly, I can't seem to find it in iceberg-java. cc @Fokko

> But this is typically a dangerous operation because the schema in the Parquet file is not verified. Is it possible to use FileWriter to write data to Parquet in your case?

Yes, we should verify the schema. And it's also done in iceberg-python: https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L2431.

@jonathanc-n (Contributor, Author)

Yes, I will put this out for review after I add some schema validation. It should be out tomorrow.

@jonathanc-n (Contributor, Author)

@ZENOTME Do you think the API should be kept in transactions or somewhere else?

@ZENOTME (Contributor) commented Feb 10, 2025

> @ZENOTME Do you think the API should be kept in transactions or somewhere else?

It's OK to keep it in the transaction, since it's a safe operation if we have schema validation, and that's what iceberg-python does. Are there other concerns about this? @liurenjie1024

@Fokko (Contributor) commented Feb 10, 2025

PyIceberg and Iceberg-Java are a bit different. Where PyIceberg is used by end-users, Iceberg-Java is often embedded in a query engine. I think this is the reason why it isn't part of the transaction API. Spark does have an add_files procedure.

To successfully add files to a table, I think three things are essential:

  • Schema: As already mentioned, the schema should be either the same or compatible. I would start with the former to keep it simple and robust (a minimal version of that check is sketched after this list).
  • Name Mapping: Since the Parquet file probably doesn't contain field IDs for column tracking, we need to fall back on name mapping.
  • Metrics: When adding a file to the table, we should extract the upper/lower bounds, number of nulls, etc. from the Parquet footer and store them in the Iceberg metadata. This is important for Iceberg to keep its promise of efficient scans; without this information, the file would always be included when planning a query.
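
Not part of this PR, but a minimal sketch of the strict "same schema" check, assuming the comparison is done at the Arrow level (field-ID metadata and compatibility rules are ignored here):

```rust
use std::fs::File;
use std::sync::Arc;

use arrow_schema::Schema;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

/// Returns true when the Parquet file's Arrow schema is exactly equal to the
/// expected schema. Deliberately strict: no schema-evolution handling at all.
fn file_matches_schema(
    path: &str,
    expected: &Arc<Schema>,
) -> Result<bool, Box<dyn std::error::Error>> {
    let builder = ParquetRecordBatchReaderBuilder::try_new(File::open(path)?)?;
    // `schema()` is the Arrow schema derived from the Parquet footer.
    Ok(builder.schema() == expected)
}
```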

@jonathanc-n (Contributor, Author)

@Fokko I was looking to do the name mapping and more metrics in another PR. Would you rather I include them in this one?

@Fokko (Contributor) commented Feb 10, 2025

@jonathanc-n Let's do that in a separate PR 👍

@jonathanc-n (Contributor, Author) commented Feb 11, 2025

For metadata retrieval, it seems I can use ParquetWriter::to_data_file_builder to avoid duplicating more code. However, the file metadata it takes in is the thrift FileMetaData, while the ArrowFileReader provides the parsed metadata, which doesn't contain enough information. I was planning on submitting a PR to arrow-rs to allow ParquetMetaDataReader to return the thrift FileMetaData format. Are there any other alternatives to this? cc @ZENOTME @Fokko @liurenjie1024
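
For readers following along, these are the two metadata types being discussed (paths as in the arrow-rs parquet crate; a clarifying note, not code from this PR):

```rust
// Thrift-generated struct, returned by the Arrow writer when a file is closed;
// this is what ParquetWriter::to_data_file_builder currently consumes.
use parquet::format::FileMetaData as ThriftFileMetaData;

// Parsed, in-memory representation, which is what the read path
// (ArrowFileReader / the arrow reader builders) exposes.
use parquet::file::metadata::ParquetMetaData;
```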

@ZENOTME (Contributor) commented Feb 11, 2025

> while the ArrowFileReader provides the parsed metadata, which doesn't contain enough information

Thanks for your investigation @jonathanc-n! But it seems the parsed metadata does contain enough information (such as statistics).

> For metadata retrieval, it seems I can use ParquetWriter::to_data_file_builder to avoid duplicating more code.
> I was planning on submitting a PR to arrow-rs to allow ParquetMetaDataReader to return the thrift FileMetaData format.

I think the parsed format (the in-memory representation) may be friendlier for us to extract information from. (Actually, I would convert the thrift format to the parsed format anyway.) So I think we should refine to_data_file_builder to take the parsed format. 🤔
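
A small sketch (not from this PR) of what the parsed representation already exposes:

```rust
use parquet::file::metadata::ParquetMetaData;

/// Walks the parsed footer: the total row count and the per-column-chunk
/// statistics (null counts, min/max) are all reachable without the thrift structs.
fn dump_footer_stats(metadata: &ParquetMetaData) {
    let total_rows: i64 = metadata.row_groups().iter().map(|rg| rg.num_rows()).sum();
    println!("total rows: {total_rows}");

    for rg in metadata.row_groups() {
        for col in rg.columns() {
            // `statistics()` is None when the writer did not record stats.
            if let Some(stats) = col.statistics() {
                println!("column {}: {:?}", col.column_path(), stats);
            }
        }
    }
}
```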

@ZENOTME (Contributor) commented Feb 11, 2025

> @Fokko I was looking to do the name mapping and more metrics in another PR. Would you rather I include them in this one?

def pyarrow_to_schema(
    schema: pa.Schema, name_mapping: Optional[NameMapping] = None, downcast_ns_timestamp_to_us: bool = False
) -> Schema:
    has_ids = visit_pyarrow(schema, _HasIds())
    if has_ids:
        return visit_pyarrow(schema, _ConvertToIceberg(downcast_ns_timestamp_to_us=downcast_ns_timestamp_to_us))
    elif name_mapping is not None:
        schema_without_ids = _pyarrow_to_schema_without_ids(schema, downcast_ns_timestamp_to_us=downcast_ns_timestamp_to_us)
        return apply_name_mapping(schema_without_ids, name_mapping)
    else:
        raise ValueError(
            "Parquet file does not have field-ids and the Iceberg table does not have 'schema.name-mapping.default' defined"
        )

Applying the name mapping relies on SchemaVisitorWithPartner, which will be introduced in #731. Maybe we can skip schemas without field IDs for now and support applying the name mapping after #731 is complete.

jonathanc-n marked this pull request as ready for review on February 12, 2025, 01:29
@jonathanc-n (Contributor, Author) commented Feb 12, 2025

I have changed the data file builder and reimplemented the original. I couldn't change the parameter passed to to_data_file_builder, as the ParquetWriter returns the unparsed metadata. I can make a follow-up PR for metadata validation.

@liurenjie1024 (Contributor)

> PyIceberg and Iceberg-Java are a bit different. Where PyIceberg is used by end-users, Iceberg-Java is often embedded in a query engine.

I think iceberg-rust's position is more like iceberg-java's, since there are already SQL engines written in Rust (DataFusion, Databend, Polars, etc.), and integrating them with iceberg-rust makes things easier. I think it's a good idea to have methods to convert a Parquet file to a DataFile, including the work mentioned by @Fokko. But it's better to leave the rest to query engines, since it involves IO and parallelism management.

@ZENOTME (Contributor) commented Feb 12, 2025

> PyIceberg and Iceberg-Java are a bit different. Where PyIceberg is used by end-users, Iceberg-Java is often embedded in a query engine.

> I think iceberg-rust's position is more like iceberg-java's, since there are already SQL engines written in Rust (DataFusion, Databend, Polars, etc.), and integrating them with iceberg-rust makes things easier. I think it's a good idea to have methods to convert a Parquet file to a DataFile, including the work mentioned by @Fokko. But it's better to leave the rest to query engines, since it involves IO and parallelism management.

So in summary, for this feature, what we would like to provide is a parquet_files_to_data_files function in the arrow module (or a new parquet module). parquet_files_to_data_files actually does two things (a rough sketch of this split follows below):

  1. schema compatibility check
  2. metrics collection (we can extract a function data_file_statistics_from_parquet_metadata for this, which can be reused in the parquet file writer)

What do you think, @liurenjie1024 @jonathanc-n @Fokko?
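
A rough sketch of the proposed split. The two function names come from the comment above, but the signatures and the FileStats placeholder are illustrative assumptions, not the final iceberg-rust API (the real version would return DataFiles and take the table schema for the compatibility check):

```rust
use std::sync::Arc;

use parquet::file::metadata::ParquetMetaData;

/// Placeholder for the footer-derived pieces of a DataFile (illustrative only).
struct FileStats {
    record_count: u64,
    row_groups: usize,
}

/// Step 2: metrics collection, factored out so the parquet file writer can reuse it.
fn data_file_statistics_from_parquet_metadata(metadata: &ParquetMetaData) -> FileStats {
    FileStats {
        record_count: metadata
            .row_groups()
            .iter()
            .map(|rg| rg.num_rows() as u64)
            .sum(),
        row_groups: metadata.num_row_groups(),
    }
}

/// Steps 1 + 2 per file. The schema compatibility check is elided here; it would
/// compare each file's schema against the table schema before accepting the file.
fn parquet_files_to_data_files(
    files: &[(String, Arc<ParquetMetaData>)],
) -> Vec<(String, FileStats)> {
    files
        .iter()
        .map(|(path, md)| (path.clone(), data_file_statistics_from_parquet_metadata(md)))
        .collect()
}
```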

@jonathanc-n (Contributor, Author)

@ZENOTME Yes, that sounds fine to me. This PR contains the parquet-files-to-data-files conversion. However, I plan on making a follow-up PR with the schema compatibility check.

@liurenjie1024 (Contributor)

> So in summary, for this feature, what we would like to provide is a parquet_files_to_data_files function in the arrow module (or a new parquet module). parquet_files_to_data_files actually does two things:
>
> 1. schema compatibility check
> 2. metrics collection (we can extract a function data_file_statistics_from_parquet_metadata for this, which can be reused in the parquet file writer)

This sounds reasonable to me. After we have this function, we could extend FastAppendAction to add existing Parquet files, as sketched below.
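
A hypothetical sketch of that extension. fast_append and add_data_files exist on the transaction API today, but the argument lists shown here and the assumption that parquet_files_to_data_files yields ready DataFiles are illustrative only:

```rust
// Hypothetical glue code; names and argument lists are illustrative.
let data_files = parquet_files_to_data_files(&table_schema, &parquet_paths)?;

let mut append = tx.fast_append(None, vec![])?;
append.add_data_files(data_files)?;
let tx = append.apply().await?;
let _table = tx.commit(&catalog).await?;
```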

"Partition field should only be a primitive type.",
)
})?;
Ok(Some(primitive_type.type_to_literal()))
Contributor:

Looking at the type_to_literal function, I think this is correct. We don't want to set default values in the partition spec. I think there are two options:

Contributor (Author):

Yeah, this makes sense. We can go with the first option and add a TODO for the check on whether the partition exists.

@liurenjie1024 (Contributor) left a comment

Thanks @jonathanc-n for this PR. Generally I'm fine with the current direction, but I have some suggestions about code organization.

}

/// `ParquetMetadata` to data file builder
pub fn parquet_to_data_file_builder(
Contributor:

I have some suggestions for this method:

  1. It has a lot of duplication with the parquet file writer; we should reuse that code.
  2. This method should not be part of Transaction; it should live in the parquet module.

Contributor (Author):

I think I mentioned the problem with reusing the function here earlier: the writer returns the raw metadata, which is what the original ParquetWriter::to_data_file_builder takes. In this case we are doing a read with the ArrowFileReader, which can only return the parsed metadata. This could be fixed if there were a conversion function from raw to parsed, but I haven't been able to find one 🤔

@liurenjie1024 (Contributor) left a comment

Thanks @jonathanc-n, I just finished a first round of review and left some suggestions for improvement.

Comment on lines 257 to 258
transaction: Transaction<'a>,
file_paths: Vec<String>,
Contributor:

Suggested change:
- transaction: Transaction<'a>,
- file_paths: Vec<String>,
+ &mut self,
+ file_path: &str

As an API, I would suggest asking the user to provide only one filename, so that users can do this concurrently.

Contributor (Author):

I believe support for this should be added after #949; as of right now, duplicate actions are not allowed.

/// `ParquetMetadata` to data file builder
pub fn parquet_to_data_file_builder(
schema: SchemaRef,
metadata: Arc<ParquetMetaData>,
Contributor:

I don't get why we need to duplicate so much code from ParquetWriter::to_data_file_builder; how about simply extracting FileMetaData from ParquetMetaData and calling to_data_file_builder?

Contributor (Author):

I'm a bit confused here. To clarify, the FileMetaData inside ParquetMetaData is different from the thrift FileMetaData that to_data_file_builder takes.

If I were to convert the ParquetMetaData to the thrift FileMetaData, I do not think the conversion overhead would be worth it.

Contributor:

You are right: the currently used FileMetaData is the auto-generated data structure from the thrift definition, while the FileMetaData I'm suggesting is an in-memory representation of it. I think we should use the in-memory representation in the original one as well.

Contributor:

Tracked in #1004

Contributor (Author):

I can do that after the merge. Does anything else need to be changed?
