[HUDI-9205] Introduce a representative file containing the estimated total size of file slice #13070

Open
wants to merge 6 commits into master from optimize_slice_estimation

Conversation

TheR1sing3un
Member

close to: #12139

When dealing with `HadoopFsRelation`, Spark merges `PartitionedFile`s based on data such as file size. At present, we directly use the base file or a random log file as the `PartitionedFile` of the `FileSlice`, so Spark cannot use representative data accurately when merging. Therefore, I estimate the size the entire `FileSlice` would have if it were converted into a parquet file. Using this size to represent the file slice gives Spark more accurate data to optimize with.
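
For illustration, here is a minimal sketch of the estimation idea (not the exact PR code; `estimateFileSliceSize` is a hypothetical name, and `logFileFraction` stands for the configured log-to-parquet compression ratio):

```scala
import org.apache.hudi.common.model.FileSlice

// Hedged sketch: estimate the parquet-equivalent size of a file slice.
// Log file bytes are scaled by a configurable fraction, since log blocks
// usually compress differently from the base parquet file.
def estimateFileSliceSize(fileSlice: FileSlice, logFileFraction: Double): Long = {
  // Total log file bytes, scaled to an expected parquet size.
  val scaledLogSize =
    Math.ceil(fileSlice.getLogFiles.mapToLong(_.getFileSize).sum() * logFileFraction).toLong
  // A base file's real size is used as-is; the slice may also have no base file at all.
  if (fileSlice.getBaseFile.isPresent) fileSlice.getBaseFile.get().getFileSize + scaledLogSize
  else scaledLogSize
}
```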

Change Logs

  1. Introduce a representative file containing the estimated total size of file slice

Impact

Reduces task skew when reading

Risk level (write none, low medium or high below)

low

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Apr 2, 2025
.getSparkPartitionedFileUtils.getPathFromPartitionedFile(file))
fileSliceMapping.getSlice(filegroupName) match {
case Some(fileSlice) if !isCount && (requiredSchema.nonEmpty || fileSlice.getLogFiles.findAny().isPresent) =>
Contributor

why remove the requiredSchema.nonEmpty check.

Member Author

> why remove the requiredSchema.nonEmpty check.

This check has not been removed (see the attached screenshot).

Member Author

> why remove the requiredSchema.nonEmpty check.

Hi, I have rolled back these changes. I kept the original logic as much as possible and only changed the logic for creating representative files.

@TheR1sing3un TheR1sing3un requested a review from danny0405 April 2, 2025 09:29
@TheR1sing3un
Member Author

@hudi-bot run azure

@@ -178,8 +183,8 @@ class HoodieFileGroupReaderBasedParquetFileFormat(tablePath: String,
internalSchemaOpt,
metaClient,
props,
file.start,
file.length,
0,
Contributor

can you elaborate this change?

Member Author

> can you elaborate this change?

These two arguments are passed to the file group reader to tell it the start location and length of the base file. They used to be taken directly from the `PartitionedFile`, because when the file slice has a base file, the actual base file size is used as the length of the `PartitionedFile`. Now the `PartitionedFile` is a representative file of the file slice, so its length no longer equals the actual base file length, and we need to read the actual length from the base file itself (see the attached screenshot).
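
A minimal, hedged sketch of that lookup (`fileSlice` is assumed to be the resolved `FileSlice`; the PR wires this into the file group reader construction):

```scala
// When the PartitionedFile only carries the representative (estimated) size,
// the real base file length has to come from the FileSlice, not from file.length.
val baseFileLength: Long =
  if (fileSlice.getBaseFile.isPresent) fileSlice.getBaseFile.get().getFileSize else 0L
```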

Contributor

then why the start is constant 0?

Member Author

> then why the start is constant 0?

Please see the latest commit; `start` is still `file.start`.

…size of file slice

1. Introduce a representative file containing the estimated total size of file slice

NOTE:
When dealing with `HadoopFsRelation`, Spark merges `PartitionedFile` based on data such as file size.
At present, we directly use the base file or a random log file as the `PartitionedFile` of the `FileSlice`.
As a result, Spark cannot use representative data accurately when merging.
Therefore, I estimate the size the entire `FileSlice` would have if it were converted into a parquet file.
Using this size to represent the file slice provides more accurate data for Spark to optimize with.

Signed-off-by: TheR1sing3un <[email protected]>
1. fix scala check style

Signed-off-by: TheR1sing3un <[email protected]>
…nd any log files

1. only read base file if file slice only has base file but not found any log files

Signed-off-by: TheR1sing3un <[email protected]>
…the representative file size calculation logic

1. Retain the original logic to the maximum extent and modify only the representative file size calculation logic

Signed-off-by: TheR1sing3un <[email protected]>
…he estimated proportions

1. Use the configuration in HoodieStorageConfig to calculate the estimated proportions

Signed-off-by: TheR1sing3un <[email protected]>
@TheR1sing3un TheR1sing3un force-pushed the optimize_slice_estimation branch from e522af2 to ad42cec Compare April 3, 2025 02:24
: logFileSize;
}

private static long convertLogFilesSizeToExpectedParquetSize(List<HoodieLogFile> hoodieLogFiles, double logFileFraction) {
Contributor

Log format can be Avro, Parquet, or HFile. Could this be generalized instead of being fixated on Parquet?

Member Author

> Log format can be Avro, Parquet, or HFile. Could this be generalized instead of being fixated on Parquet?

Now it is not fixed: `logFileFraction` comes from `HoodieStorageConfig#LOGFILE_TO_PARQUET_COMPRESSION_RATIO_FRACTION`, so users can change this value according to their own log format and the actual data pattern.
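
For example (hedged: `spark` and `basePath` are placeholders, and 0.35 is only an illustrative value), the fraction can be overridden as a read option, keyed off the config constant rather than a hardcoded string:

```scala
import org.apache.hudi.common.config.HoodieStorageConfig

// Override the log-to-parquet size ratio used when estimating file slice sizes.
val df = spark.read.format("hudi")
  .option(HoodieStorageConfig.LOGFILE_TO_PARQUET_COMPRESSION_RATIO_FRACTION.key(), "0.35")
  .load(basePath)
```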

* Get the total file size of a file slice in parquet format.
* For the log file, we need to convert its size to the estimated size in the parquet format in a certain proportion
*/
public static long getTotalFileSizeAsParquetFormat(FileSlice fileSlice, double logFileFraction) {
Contributor

Similar on the base file.

Member Author

> Similar on the base file.

done~

}).filter(slice => slice != null)
.map(fileInfo => new FileStatus(fileInfo.getLength, fileInfo.isDirectory, 0, fileInfo.getBlockSize,
fileInfo.getModificationTime, new Path(fileInfo.getPath.toUri)))
new FileStatus(estimationFileSize, fileInfo.isDirectory, 0, fileInfo.getBlockSize, fileInfo.getModificationTime, new Path(fileInfo.getPath.toUri))
Contributor

Does this affect reading the file because the file status is manipulated? Is there a different way of letting Spark know the file / partitioned file size estimation?

Member Author

> Does this affect reading the file because the file status is manipulated? Is there a different way of letting Spark know the file / partitioned file size estimation?

The actual read process is still controlled by Hudi code; Spark only uses this file status for parallelism-related optimization.
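
To make that parallelism effect concrete, here is a hedged, simplified illustration of the size-based packing Spark applies to `PartitionedFile`s (not Spark's actual `FilePartition` code, just the idea that sizes plus an open cost are binned up to a maximum partition size, so an underestimated slice size leads to over-packed, skewed tasks):

```scala
// Bin-pack (estimated) file sizes into read tasks, roughly the way Spark's
// size-based coalescing works: each file costs its size plus a fixed open cost,
// and a new task starts when the current one would overflow maxPartitionBytes.
def packBySize(fileSizes: Seq[Long], maxPartitionBytes: Long, openCostInBytes: Long): Seq[Seq[Long]] = {
  val tasks = scala.collection.mutable.ArrayBuffer(scala.collection.mutable.ArrayBuffer.empty[Long])
  var currentBytes = 0L
  fileSizes.sortBy(-_).foreach { size =>
    val cost = size + openCostInBytes
    if (tasks.last.nonEmpty && currentBytes + cost > maxPartitionBytes) {
      tasks += scala.collection.mutable.ArrayBuffer.empty[Long]
      currentBytes = 0L
    }
    tasks.last += size
    currentBytes += cost
  }
  tasks.map(_.toSeq).toSeq
}
```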

@@ -73,13 +74,13 @@ protected List<SmallFile> getSmallFiles(String partitionPath) {
if (smallFileSlice.getBaseFile().isPresent()) {
HoodieBaseFile baseFile = smallFileSlice.getBaseFile().get();
sf.location = new HoodieRecordLocation(baseFile.getCommitTime(), baseFile.getFileId());
sf.sizeBytes = getTotalFileSize(smallFileSlice);
sf.sizeBytes = FileSliceUtils.getTotalFileSizeAsParquetFormat(smallFileSlice, config.getLogFileToParquetCompressionRatio());
Contributor

Can be moved to class FileSlice, and keep the comment of method: convertLogFilesSizeToExpectedParquetSize.

Member Author

> Can be moved to class FileSlice, and keep the comment of method: convertLogFilesSizeToExpectedParquetSize.

done~

…FileSlice::getTotalFileSizeSimilarIOnBaseFile`

1. move `FileSliceUtils::getTotalFileSizeAsParquetFormat` to `FileSlice::getTotalFileSizeSimilarIOnBaseFile`

Signed-off-by: TheR1sing3un <[email protected]>
@TheR1sing3un
Member Author

@hudi-bot run azure

1 similar comment
@TheR1sing3un
Member Author

@hudi-bot run azure

@hudi-bot

hudi-bot commented Apr 7, 2025

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@TheR1sing3un TheR1sing3un requested review from yihua and danny0405 April 8, 2025 07:56
@TheR1sing3un
Member Author

@danny0405 @yihua Ready for review again~

if (slice.getBaseFile.isPresent) {
val logFileEstimationFraction = options.getOrElse(HoodieStorageConfig.LOGFILE_TO_PARQUET_COMPRESSION_RATIO_FRACTION.key(),
HoodieStorageConfig.LOGFILE_TO_PARQUET_COMPRESSION_RATIO_FRACTION.defaultValue()).toDouble
// 1. Generate a disguised representative file for each file slice, which spark uses to optimize rdd partition parallelism based on data such as file size
Contributor

disguised -> delegate?

// 1. Generate a disguised representative file for each file slice, which spark uses to optimize rdd partition parallelism based on data such as file size
// For file slice only has base file, we directly use the base file size as representative file size
// For file slice has log file, we estimate the representative file size based on the log file size and option(base file) size
val representFiles = fileSlices.map(slice => {
Contributor

delegateFiles

return getBaseFile().isPresent() ? getBaseFile().get().getFileSize() + logFileSize : logFileSize;
}

private long convertLogFilesSizeSimilarOnBaseFile(double logFileFraction) {
Contributor

still think the original method name is better.

Contributor

@danny0405 danny0405 left a comment


@TheR1sing3un Thanks for the contribution. Do we have a diagram to illustrate the file balance changes before/after the patch?

Labels
size:M PR with lines of changes in (100, 300]

4 participants