[HUDI-9147] Support HoodieFileGroupReader for Flink and use FileGroup reader in compaction #13078

cshuo · 2025-04-03T08:58:23Z

Change Logs

Implement the FileGroup reader for Flink.
Use the FileGroup reader in Flink compaction.

Impact

Improves the performance of Flink compaction.

Risk level (write none, low medium or high below)

medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

The config description must be updated if new configs are added or the default value of the configs are changed
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
ticket number here and follow the instruction to make
changes to the website.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

cshuo · 2025-04-04T00:31:56Z

cc @danny0405 PTAL, thks

danny0405 · 2025-04-04T02:30:38Z

...hudi-flink-client/src/main/java/org/apache/hudi/client/common/FlinkRowDataReaderContext.java

+
+  @Override
+  public String getRecordKey(RowData record, Schema schema) {
+    return Objects.toString(getValue(record, schema, RECORD_KEY_METADATA_FIELD));


Is the record key metadata field always present in the row data?

This simply follows the existing FG-reader-based compaction in HoodieCompactor:

hudi/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java

Line 152 in 45dedd8

&& config.populateMetaFields(); // Virtual key support by fg reader is not ready

That is, one prerequisite for FG-reader-based compaction is that populateMetaFields is enabled.
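
To make the prerequisite concrete, here is a minimal sketch of such a gate. The `Config` class and the `fileGroupReaderEnabled` flag are hypothetical stand-ins; only the `populateMetaFields` condition comes from the HoodieCompactor line quoted above.

```java
// Sketch only: Config is a hypothetical stand-in, not Hudi's HoodieWriteConfig.
final class CompactionGate {
  static final class Config {
    final boolean fileGroupReaderEnabled; // assumed flag, for illustration
    final boolean populateMetaFields;     // mirrors config.populateMetaFields()

    Config(boolean fileGroupReaderEnabled, boolean populateMetaFields) {
      this.fileGroupReaderEnabled = fileGroupReaderEnabled;
      this.populateMetaFields = populateMetaFields;
    }
  }

  // The FG reader path is taken only when meta fields are populated, because the
  // record key is read from the record key metadata field (no virtual key support yet).
  static boolean useFileGroupReader(Config config) {
    return config.fileGroupReaderEnabled && config.populateMetaFields;
  }

  public static void main(String[] args) {
    System.out.println(useFileGroupReader(new Config(true, true)));
    System.out.println(useFileGroupReader(new Config(true, false)));
  }
}
```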

danny0405 · 2025-04-04T03:00:15Z

...hudi-flink-client/src/main/java/org/apache/hudi/client/common/FlinkRowDataReaderContext.java

+        (String) metadataMap.get(INTERNAL_META_PARTITION_PATH));
+    // delete record
+    if (recordOption.isEmpty()) {
+      return new HoodieEmptyRecord<>(hoodieKey, HoodieRecord.HoodieRecordType.FLINK);


Do we need to set the ordering value correctly here?

Nice catch, will update.
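
A small sketch of why the ordering value matters for delete records, assuming event-time merging where the larger ordering value wins. `Versioned` and `merge` are hypothetical names for illustration, not Hudi classes.

```java
// Sketch only: a toy event-time merge, not Hudi's record merger.
final class DeleteOrderingSketch {
  static final class Versioned {
    final String value;  // null marks a delete
    final long ordering; // event-time ordering value

    Versioned(String value, long ordering) {
      this.value = value;
      this.ordering = ordering;
    }

    boolean isDelete() {
      return value == null;
    }
  }

  // The record with the higher ordering value wins; ties go to the incoming record.
  static Versioned merge(Versioned existing, Versioned incoming) {
    return incoming.ordering >= existing.ordering ? incoming : existing;
  }

  public static void main(String[] args) {
    Versioned insert = new Versioned("v1", 5L);
    // A delete built without its ordering value (defaulting to 0) loses to the older insert.
    System.out.println(merge(insert, new Versioned(null, 0L)).isDelete());
    // A delete that carries its real ordering value is applied.
    System.out.println(merge(insert, new Versioned(null, 7L)).isDelete());
  }
}
```

Under this assumption, an empty record created without an ordering value could be silently dropped during merging, which is the concern raised in the review comment.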

danny0405 · 2025-04-04T03:04:00Z

...hudi-flink-client/src/main/java/org/apache/hudi/client/common/FlinkRowDataReaderContext.java

+  public RowData seal(RowData rowData) {
+    if (rowDataSerializer == null) {
+      RowType requiredRowType = (RowType) AvroSchemaConverter.convertToDataType(getSchemaHandler().getRequiredSchema()).getLogicalType();
+      rowDataSerializer = new RowDataSerializer(requiredRowType);


Do we need to cache the serializer?

The serializer is not created per record; it is a member field of FlinkRowDataReaderContext and is created lazily on first use.
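
The lazy member-field pattern described here can be sketched as follows. `Serializer` and `ReaderContext` are simplified hypothetical stand-ins for RowDataSerializer and FlinkRowDataReaderContext, and the counter exists only to show that the serializer is built once.

```java
// Sketch only: simplified stand-ins, not the Flink or Hudi classes.
final class LazySerializerDemo {
  static final class Serializer {
    String copy(String row) {
      return new String(row); // stands in for a deep copy of the row
    }
  }

  static final class ReaderContext {
    private Serializer serializer; // member field: created once, reused for every record
    int serializersCreated = 0;    // illustration-only counter

    String seal(String row) {
      if (serializer == null) {
        serializer = new Serializer();
        serializersCreated++;
      }
      return serializer.copy(row);
    }
  }

  public static void main(String[] args) {
    ReaderContext context = new ReaderContext();
    for (int i = 0; i < 3; i++) {
      context.seal("row-" + i);
    }
    System.out.println(context.serializersCreated);
  }
}
```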

...ient/hudi-flink-client/src/main/java/org/apache/hudi/client/model/AbstractHoodieRowData.java

danny0405 · 2025-04-04T03:11:05Z

hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/model/HoodieFlinkRecord.java

+      }
+    }
+
+    AbstractHoodieRowData rowWithMetaFields = HoodieRowDataCreation.create(metaFields, data, withOperationField, withMetaFields);


Should we just store StringData for the AbstractHoodieRowData metadata fields?

We can do that for memory efficiency.

danny0405 · 2025-04-04T05:06:48Z

hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieReaderContext.java

@@ -309,7 +309,7 @@ public Map<String, Object> generateMetadataForRecord(
   * @param schema The Avro schema of the record.
   * @return A mapping containing the metadata.
   */
-  public Map<String, Object> generateMetadataForRecord(T record, Schema schema) {
+  public Map<String, Object> generateMetadataForRecord(T record, Schema schema, Option<String> orderingFieldName) {


Why this change?

orderingFieldName is added so that FlinkRowDataReaderContext can generate the ordering value.

By default, generateMetadataForRecord only generates the recordKey. The generated metadata map is then used to construct a HoodieFlinkRecord in constructHoodieRecord(Option<RowData> recordOption, Map<String, Object> metadataMap), where the ordering value is required.
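
A rough sketch of the idea, using a plain `Map` as the record and `java.util.Optional` in place of Hudi's `Option`. The map keys and the extra `recordKeyField` parameter are assumptions made for illustration and do not reflect the real method signature.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Sketch only: a toy version of metadata generation with an optional ordering field.
final class MetadataSketch {
  static final String KEY_RECORD_KEY = "record_key";   // hypothetical map key
  static final String KEY_ORDERING = "ordering_value"; // hypothetical map key

  static Map<String, Object> generateMetadataForRecord(
      Map<String, Object> record, String recordKeyField, Optional<String> orderingFieldName) {
    Map<String, Object> metadata = new HashMap<>();
    metadata.put(KEY_RECORD_KEY, String.valueOf(record.get(recordKeyField)));
    // Add the ordering value only when an ordering field is configured and present.
    orderingFieldName
        .map(record::get)
        .ifPresent(value -> metadata.put(KEY_ORDERING, value));
    return metadata;
  }

  public static void main(String[] args) {
    Map<String, Object> record = new HashMap<>();
    record.put("id", "k1");
    record.put("ts", 42L);
    System.out.println(generateMetadataForRecord(record, "id", Optional.of("ts")));
    System.out.println(generateMetadataForRecord(record, "id", Optional.empty()));
  }
}
```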

danny0405 · 2025-04-04T05:16:03Z

hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java

@@ -138,6 +139,15 @@ private FlinkOptions() {
          + "These merger impls will filter by record.merger.strategy. "
          + "Hudi will pick most efficient implementation to perform merging/combining of the records (during update, reading MOR table, etc)");

+  @AdvancedConfig
+  public static final ConfigOption<String> RECORD_MERGE_MODE = ConfigOptions


There is no need to add this option for Hoodie core options; all Hoodie options are applied automatically.

Currently, users configure the merging strategy through the payload.class option (the default is EventTimeAvroPayload, which gives event-time merging semantics), and that option is Avro-based.

Once FG-reader-based compaction is introduced, users should no longer rely on the legacy Avro payload config, so the merge mode configs should be exposed to let users choose the expected merging behavior.

By the way, the compatibility work for the payload config is also included in this PR.
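
One way to picture the compatibility handling is a resolver that prefers an explicitly configured merge mode and otherwise falls back to the legacy payload class. This is only a sketch under assumptions: the `MergeMode` enum and the fallback rules below are illustrative and are not taken from the PR.

```java
import java.util.Locale;

// Sketch only: illustrative fallback rules, not the PR's actual compatibility logic.
final class MergeModeSketch {
  enum MergeMode { EVENT_TIME_ORDERING, COMMIT_TIME_ORDERING, CUSTOM }

  static MergeMode resolve(String explicitMode, String legacyPayloadClass) {
    // An explicitly configured merge mode always wins.
    if (explicitMode != null && !explicitMode.trim().isEmpty()) {
      return MergeMode.valueOf(explicitMode.trim().toUpperCase(Locale.ROOT));
    }
    // The default payload (EventTimeAvroPayload) gives event-time semantics.
    if (legacyPayloadClass == null || legacyPayloadClass.endsWith("EventTimeAvroPayload")) {
      return MergeMode.EVENT_TIME_ORDERING;
    }
    // Assumption: any other legacy payload class is treated as custom merging.
    return MergeMode.CUSTOM;
  }

  public static void main(String[] args) {
    System.out.println(resolve(null, null));
    System.out.println(resolve("commit_time_ordering", "com.example.AnyPayload"));
    System.out.println(resolve(null, "com.example.MyPayload"));
  }
}
```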

...link-datasource/hudi-flink/src/main/java/org/apache/hudi/schema/FilebasedSchemaProvider.java

...datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java

danny0405 · 2025-04-04T05:39:21Z

...src/main/java/org/apache/hudi/table/action/compact/HoodieFlinkMergeOnReadTableCompactor.java

+                                   String instantTime,
+                                   Option<EngineBroadcastManager> broadcastManagerOpt) throws IOException {
+    Configuration conf = metaClient.getStorage().getConf().unwrapAs(Configuration.class);
+    FlinkRowDataReaderContext readerContext = new FlinkRowDataReaderContext(


It looks like there is no need to store the Flink conf in FlinkRowDataReaderContext. The InternalSchemaManager needs the Flink conf for two things:

deciding whether schema evolution is enabled: we can move this check into HoodieWriteConfig.isSchemaEvolutionEnabled;

generating the Hadoop conf from it, but the Hadoop conf is already available in FlinkRowDataReaderContext.

Besides the usages you mentioned, the Flink conf stored in FlinkRowDataReaderContext is also used to:

generate partition specs for FlinkParquetReader;

get the read.utc-timezone config to create the field converter (Flink value -> Avro value) in getOrderingValue.

Is there a recommended cleaner way to achieve these?

cshuo · 2025-04-05T14:55:25Z

Java CI / validate-bundles (scala-2.12, flink1.14, 1.10.0, spark3.3, spark3.3.4) failed because the hudi-flink 1.14 bundle runs in a Flink 1.15 Docker environment, which causes a class compatibility problem. I will look into how to fix this.

hudi-bot · 2025-04-06T03:17:43Z

CI report:

a7b5bf3 UNKNOWN
72eb82f UNKNOWN
8ff3056 UNKNOWN
84167aa Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

github-actions bot added the size:XL (PR with lines of changes > 1000) label on Apr 3, 2025

cshuo force-pushed the HUDI-9147 branch 6 times, most recently from d65a9ab to 1fb19df Compare April 3, 2025 13:53

danny0405 reviewed Apr 4, 2025

cshuo added 6 commits April 6, 2025 08:48

[HUDI-9147] Support HoodieFileGroupReader for Flink and use FileGroup…

05d6854

… reader in compaction

support flink compact using file group reader

e4af020

fix test update

support partial update record merger

cadccdc

keep legacy merge config compatible

3d31979

fix fg read schema evolution

f3769e3

fix comments

84167aa

cshuo force-pushed the HUDI-9147 branch from 57ff068 to 84167aa Compare April 6, 2025 01:00
