
[HUDI-9248] Unify code paths for all write operations about bulk_insert #13066

Open
wants to merge 3 commits into base: master
Conversation

TheR1sing3un
Member

Change Logs

refactor: Unify code paths for all bulk_insert operations to improve code readability and maintainability

  1. Unify code paths for all bulk_insert operations to improve code readability and maintainability
  2. Use `bucket_rescale` instead of `insert_overwrite` to distinguish bucket rescaling from a normal INSERT OVERWRITE

Impact

Improves code readability and maintainability.

Risk level (write none, low medium or high below)

none

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Apr 1, 2025
…ility and maintainability

1. Unify code paths for all bulk_Insert to improve code readability and maintainability
2. Using `bucket_rescale` instead of `insert_overwrite` to distinguish it from normal INSERT OVERWRITE

Signed-off-by: TheR1sing3un <[email protected]>
}

@Override
protected Map<String, List<String>> getPartitionToReplacedFileIds(HoodieData<WriteStatus> writeStatuses) {
Contributor

Isn't it better each executor has its own logic for this?

Member Author

Isn't it better each executor has its own logic for this?

This method can be retained for future use by new executors, but the logic of bulk_insert/insert_overwrite/insert_overwrite_table is fixed and can be consolidated into DataSourceInternalWriterHelper.
The execution and commit logic of the current bulk_insert and insert_overwrite/insert_overwrite_table are not uniform, but in fact the only theoretical difference between them is how the replaced file ids are computed. So I unified the execution and commit logic, treating only the computation of replaced file ids differently.
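The unified path described here can be sketched as a template method, where all operations share one execute/commit flow and only the replaced-file-id computation is overridden per operation. This is a minimal illustrative sketch; the class and method bodies are simplified and hypothetical, not the actual Hudi APIs:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one shared execute/commit path for
// bulk_insert / insert_overwrite / insert_overwrite_table, with only the
// computation of replaced file ids varying per write operation.
abstract class UnifiedWriteExecutor {

    // Shared by all three operations.
    public Map<String, List<String>> executeAndCommit(List<String> writeStatuses) {
        // ... shared write and commit logic would go here ...
        // The only per-operation difference: which existing file ids get replaced.
        return getPartitionToReplacedFileIds(writeStatuses);
    }

    protected abstract Map<String, List<String>> getPartitionToReplacedFileIds(
        List<String> writeStatuses);
}

// bulk_insert replaces no existing files.
class BulkInsertExecutor extends UnifiedWriteExecutor {
    @Override
    protected Map<String, List<String>> getPartitionToReplacedFileIds(List<String> writeStatuses) {
        return Collections.emptyMap();
    }
}

// insert_overwrite replaces all existing file ids in the touched partitions.
class InsertOverwriteExecutor extends UnifiedWriteExecutor {
    private final Map<String, List<String>> existingFileIdsByPartition;

    InsertOverwriteExecutor(Map<String, List<String>> existingFileIdsByPartition) {
        this.existingFileIdsByPartition = existingFileIdsByPartition;
    }

    @Override
    protected Map<String, List<String>> getPartitionToReplacedFileIds(List<String> writeStatuses) {
        return existingFileIdsByPartition;
    }
}
```

Under this sketch the shared path never branches on the operation type; each subclass supplies only its replace semantics.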

Member Author

Isn't it better each executor has its own logic for this?

I have retained this method; please review it again~

@TheR1sing3un TheR1sing3un requested a review from danny0405 April 1, 2025 05:22
1. keep `getPartitionToReplacedFileIds` for scalability

Signed-off-by: TheR1sing3un <[email protected]>
@TheR1sing3un
Member Author

@hudi-bot run azure

@@ -52,6 +52,12 @@ public class HoodieInternalConfig extends HoodieConfig {
.markAdvanced()
.withDocumentation("Inner configure to pass static partition paths to executors for SQL operations.");

public static final ConfigProperty<WriteOperationType> BULK_INSERT_WRITE_OPERATION_TYPE = ConfigProperty
Contributor

Can we eliminate it? It does not make sense to introduce a sub-type whose default value is itself.

Member Author

Can we eliminate it? It does not make sense to introduce a sub-type whose default value is itself.

Done~

…_TYPE

1. eliminate HoodieInternalConfig::BULK_INSERT_WRITE_OPERATION_TYPE

Signed-off-by: TheR1sing3un <[email protected]>
@hudi-bot

hudi-bot commented Apr 2, 2025

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure — re-run the last Azure build

@TheR1sing3un TheR1sing3un requested a review from danny0405 April 2, 2025 08:01
} catch (Exception ioe) {
throw new HoodieException(ioe.getMessage(), ioe);
} finally {
writeClient.close();
}
}

private Map<String, List<String>> getReplacedFileIds(List<WriteStatus> writeStatuses) {
Contributor

Sorry, I still think keeping getReplacedFileIds in each executor is cleaner and more reasonable.

Member Author

Sorry, I still think keeping getReplacedFileIds in each executor is cleaner and more reasonable.

Your suggestion is reasonable. However, since the current executor is only used for bulk_insert pre-preparation, with the rest of the writes and commits handed over to DataSourceInternalWriterHelper, executor::getReplacedFileIds is never called. Because the executor is not responsible for committing, the getReplacedFileIds method is not required. My original intention was just to converge the actual execution and commit code paths of bulk_insert/insert_overwrite/insert_overwrite_table, instead of the current situation where some use HoodieDatasetBulkInsertHelper and some use DataSourceInternalWriterHelper.
So do you have any good suggestions? Looking forward to your reply~

Contributor

is only used to do the bulk_insert pre-preparation, and then the rest of the writes and commits are handed over to DataSourceInternalWriterHelper

Can we move the writes and commits inside the executors?

Member Author

Can we move the writes and commits inside the executors?

Of course we can. Then we no longer need DataSourceInternalWriterHelper to perform writes and commits; all writes and commits are consolidated inside the executor. I'll make changes to follow that logic.
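Moving the writes and commits inside the executors could be sketched as a template method in which each executor owns the full prepare/write/commit lifecycle, so no external helper is needed. The names and steps below are simplified and hypothetical, not the actual Hudi classes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the executor owns the full lifecycle, so no external
// helper (such as DataSourceInternalWriterHelper) performs writes or commits.
abstract class SelfContainedExecutor {
    protected final List<String> steps = new ArrayList<>();

    // Template method: the whole operation runs inside the executor.
    public final List<String> execute() {
        preExecute();            // operation-specific preparation
        steps.add("write");      // shared write step
        steps.add("commit");     // shared commit step, using this executor's own replace logic
        return steps;
    }

    protected abstract void preExecute();
}

// Example subclass: a bucket-rescale operation might persist its hashing
// config before writing (mirroring the preExecute shown in the diff above).
class BucketRescaleExecutor extends SelfContainedExecutor {
    @Override
    protected void preExecute() {
        steps.add("save-hashing-config");
    }
}
```

With this shape, each executor carries its own replace logic and commit, avoiding a central switch or if-else over operation types.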

Contributor

+1 on "Can we move the writes and commits inside the executors?" It is better to let each executor take care of its own replace logic instead of unioning them together in a big switch or if-else.

@@ -73,10 +66,4 @@ protected void preExecute() {
ValidationUtils.checkArgument(res);
LOG.info("Finish to save hashing config " + hashingConfig);
}

@Override
protected Map<String, List<String>> getPartitionToReplacedFileIds(HoodieData<WriteStatus> writeStatuses) {
Contributor

We still need to override getPartitionToReplacedFileIds in DatasetBucketRescaleCommitActionExecutor, to get rid of the influence of the String staticOverwritePartition = config.getStringOrDefault(HoodieInternalConfig.STATIC_OVERWRITE_PARTITION_PATHS); parameter.

} catch (Exception ioe) {
throw new HoodieException(ioe.getMessage(), ioe);
} finally {
writeClient.close();
}
}

private Map<String, List<String>> getReplacedFileIds(List<WriteStatus> writeStatuses) {
Contributor

+1 on "Can we move the writes and commits inside the executors?" It is better to let each executor take care of its own replace logic instead of unioning them together in a big switch or if-else.

Labels
size:L PR with lines of changes in (300, 1000]
4 participants