[HUDI-8990] Partition bucket index supports query pruning based on bucket id #13060

Merged: 16 commits merged into master on Apr 5, 2025

Conversation

zhangyue19921010 (Contributor):

Change Logs

Follow-up to PR #13017.
Enable Flink and Spark queries to perform bucket id pruning based on the partition bucket index.

Impact

Flink and Spark queries now perform bucket id pruning using the partition bucket index.

Risk level (write none, low, medium, or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Mar 31, 2025
@zhangyue19921010 (Contributor Author):

@hudi-bot run azure

@danny0405 danny0405 changed the title [HUDI-8990] Flink && Spark query adopt partition bucket index based buckid pruning [HUDI-8990] Partition bucket index supports query pruning based on bucket id Apr 1, 2025
} catch (Exception e) {
throw new HoodieException("Failed to get hashing config instant to load.", e);
}
}

private static Option<String> getHashingConfigInstantToLoadBeforeOrOn(List<String> hashingConfigInstants, String instant) {
Contributor:

We should always load the latest bucket config to comply with the latest bucket id mappings; otherwise the query would fail. We do not ensure snapshot isolation here, so the reader can be affected by the writer.

Contributor Author:

The method getHashingConfigInstantToLoadBeforeOrOn is designed to handle Time Travel scenarios. For example, suppose we have a sequence of commits and replace-commits:

C1, C2, R3 (a bucket rescale operation), C4, R5 (another bucket rescale).

If we perform a Time Travel query targeting commit C4 (i.e., specifiedQueryInstant = C4), maybe we should load the hashing configuration associated with R3 (the latest bucket rescale operation before or equal to C4).
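A minimal, hypothetical sketch of the selection logic described here (plain Java with illustrative names, not the actual Hudi method): walk the sorted hashing-config instants and keep the latest one that is not after the queried instant.

  import java.util.List;
  import java.util.Optional;

  class HashingConfigSelectionSketch {
    // Assumes hashingConfigInstants are timeline timestamps sorted in ascending order,
    // one per completed bucket rescale (for example R3 and R5 in the timeline above).
    static Optional<String> hashingConfigBeforeOrOn(List<String> hashingConfigInstants, String queryInstant) {
      Optional<String> result = Optional.empty();
      for (String candidate : hashingConfigInstants) {
        if (candidate.compareTo(queryInstant) <= 0) {
          result = Optional.of(candidate); // latest qualifying instant seen so far
        } else {
          break; // the list is sorted, so later candidates cannot qualify
        }
      }
      return result;
    }
  }

With the timeline above, a query pinned to C4 would resolve to R3's config, while a plain snapshot query would resolve to R5's.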

Contributor:

> Maybe we should load the hashing configuration associated with R3 (the latest bucket rescale operation before or equal to C4).

We cannot do that, because the data file layouts have already been changed.

Contributor Author (zhangyue19921010), Apr 4, 2025:

Yes, Danny. After a Bucket Rescale completes, the data layout changes. So consider Spark Time Travel, i.e. traveling to the snapshot view at a specific point in time (https://hudi.apache.org/docs/sql_queries#time-travel-query).

When executing a query like SELECT * FROM <table_name> TIMESTAMP AS OF <instant1> WHERE <filter_conditions>, Hudi initializes specifiedQueryTimestamp through HoodieBaseRelation:

  protected lazy val specifiedQueryTimestamp: Option[String] =
    optParams.get(DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT.key)
      .map(HoodieSqlCommonUtils.formatQueryInstant)

Hudi then resolves the schema and builds the fsView based on specifiedQueryTimestamp.

To construct the FsView, Hudi calls getLatestMergedFileSlicesBeforeOrOn(String partitionStr, String maxInstantTime), which travels the file system view to the specified version. At that point it is also necessary to load the hashing_config that was valid at that timestamp, so that the historical data layout is honored:

  protected def listLatestFileSlices(globPaths: Seq[StoragePath], partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[FileSlice] = {
    queryTimestamp match {
      case Some(ts) =>
        specifiedQueryTimestamp.foreach(t => validateTimestampAsOf(metaClient, t))

        val partitionDirs = if (globPaths.isEmpty) {
          fileIndex.listFiles(partitionFilters, dataFilters)
        } else {
          val inMemoryFileIndex = HoodieInMemoryFileIndex.create(sparkSession, globPaths)
          inMemoryFileIndex.listFiles(partitionFilters, dataFilters)
        }

        val fsView = new HoodieTableFileSystemView(
          metaClient, timeline, sparkAdapter.getSparkPartitionedFileUtils.toFileStatuses(partitionDirs)
            .map(fileStatus => HadoopFSUtils.convertToStoragePathInfo(fileStatus))
            .asJava)

        fsView.getPartitionPaths.asScala.flatMap { partitionPath =>
          val relativePath = getRelativePartitionPath(convertToStoragePath(basePath), partitionPath)
          fsView.getLatestMergedFileSlicesBeforeOrOn(relativePath, ts).iterator().asScala
        }.toSeq

      case _ => Seq()
    }
  }

Contributor Author (zhangyue19921010), Apr 4, 2025:

For example

DeltaCommit1 ==> Write C1_File1, C1_File2
Bucket-Rescale Commit2 ==> Write C2_File1, C2_File2, C2_File3(Replaced C1_File1, C1_File2)
DeltaCommit 3 ==> Write C3_File1_Log1
Bucket-Rescale Commit4 ==> Write C4_File1(Replaced C2_File1, C2_File2, C2_File3 and C3_File1_Log1)

For the SQL SELECT * FROM hudi_table TIMESTAMP AS OF <DeltaCommit 3>, we need to load the hashing config from Bucket-Rescale Commit2 instead of the latest one from Bucket-Rescale Commit4.
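To make that expected selection concrete, here is a small self-contained snippet; the instant names are placeholders standing in for real timeline timestamps, not Hudi API calls:

  import java.util.TreeSet;

  class TimeTravelHashingConfigExample {
    public static void main(String[] args) {
      // Hypothetical timeline timestamps: 001 = DeltaCommit1, 002 = Bucket-Rescale Commit2,
      // 003 = DeltaCommit 3, 004 = Bucket-Rescale Commit4.
      TreeSet<String> hashingConfigInstants = new TreeSet<>();
      hashingConfigInstants.add("002"); // hashing config written by Bucket-Rescale Commit2
      hashingConfigInstants.add("004"); // hashing config written by Bucket-Rescale Commit4

      // TIMESTAMP AS OF DeltaCommit 3 -> the config written at Commit2, not the latest one.
      System.out.println(hashingConfigInstants.floor("003")); // prints 002
      // A plain snapshot query (no time travel) -> the latest config, written at Commit4.
      System.out.println(hashingConfigInstants.last());       // prints 004
    }
  }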

Contributor Author:

Added a new Spark UT, test("Test BucketID Pruning With Partition Bucket Index").
Without this PR, it throws an exception:

Expected Array([1111,3333.0,3333,2021-01-05]), but got Array()
ScalaTestFailureLocation: org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase at (HoodieSparkSqlTestBase.scala:135)
org.scalatest.exceptions.TestFailedException: Expected Array([1111,3333.0,3333,2021-01-05]), but got Array()

With this PR but using the always-load-latest-hashing-config logic, it throws an exception:

Expected Array([1111,2222.0,2222,2021-01-05]), but got Array()
ScalaTestFailureLocation: org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase at (HoodieSparkSqlTestBase.scala:135)
org.scalatest.exceptions.TestFailedException: Expected Array([1111,2222.0,2222,2021-01-05]), but got Array()

this.path = path;
this.hadoopConf = HadoopConfigurations.getHadoopConf(conf);
this.tableExists = StreamerUtil.tableExists(path.toString(), hadoopConf);
this.metadataConfig = StreamerUtil.metadataConfig(conf);
this.colStatsProbe = isDataSkippingFeasible(conf.get(FlinkOptions.READ_DATA_SKIPPING_ENABLED)) ? colStatsProbe : null;
this.partitionPruner = partitionPruner;
this.dataBucket = dataBucket;
Contributor:

Can we still use the bucket id here?

Contributor Author:

evolve this dataBucket to a function

@@ -45,7 +45,7 @@ public class PrimaryKeyPruners {

public static final int BUCKET_ID_NO_PRUNING = -1;

- public static int getBucketId(List<ResolvedExpression> hashKeyFilters, Configuration conf) {
+ public static int getBucketFieldHashing(List<ResolvedExpression> hashKeyFilters, Configuration conf) {
Contributor:

Can we still return the bucket id? We can add a new param for the function:

  // the input of bucketIdFunc is the partition path
  getBucketId(List<ResolvedExpression> hashKeyFilters, Configuration conf, Function<String, Integer> bucketIdFunc)

Contributor Author (zhangyue19921010), Apr 2, 2025:

Sorry Danny, I didn't get this. Is it possible to get the full partition path and use it in this new bucketIdFunc at the original code's call site?

  @Override
  public Result applyFilters(List<ResolvedExpression> filters) {
    List<ResolvedExpression> simpleFilters = filterSimpleCallExpression(filters);
    Tuple2<List<ResolvedExpression>, List<ResolvedExpression>> splitFilters = splitExprByPartitionCall(simpleFilters, this.partitionKeys, this.tableRowType);
    this.predicates = ExpressionPredicates.fromExpression(splitFilters.f0);
    this.columnStatsProbe = ColumnStatsProbe.newInstance(splitFilters.f0);
    this.partitionPruner = createPartitionPruner(splitFilters.f1, columnStatsProbe);
    this.dataBucket = getDataBucket(splitFilters.f0);
    // refuse all the filters now
    return SupportsFilterPushDown.Result.of(new ArrayList<>(splitFilters.f1), new ArrayList<>(filters));
  }

What this PR does is compute the hashing value from the filters and pass it to getFilesInPartitions, then compute numBuckets, and finally compute the final bucket id as hashing value % numBuckets.

Contributor:

Maybe we evolve this dataBucket to a function (num_buckets_per_partition) -> (int)bucketId to make it somehow more flexible.
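A hedged sketch of that suggested shape (java.util.function.Function stands in for Hudi's Functions.Function1; the names are illustrative, and Math.floorMod merely keeps the example non-negative, Hudi's actual hashing may differ): the hash of the bucket-key literal is fixed by the filters, and the bucket id is only resolved per partition once that partition's bucket number is known from the hashing config.

  import java.util.Optional;
  import java.util.function.Function;

  class BucketIdFuncSketch {
    // bucketKeyHash: hash derived from the pushed-down equality filters on the bucket key.
    // Returned mapping: numBucketsOfThePartition -> bucketId.
    static Optional<Function<Integer, Integer>> bucketIdFunc(int bucketKeyHash) {
      return Optional.of(numBuckets -> Math.floorMod(bucketKeyHash, numBuckets));
    }

    public static void main(String[] args) {
      Function<Integer, Integer> func = bucketIdFunc(1234567).get();
      System.out.println(func.apply(8));  // bucket id if the partition has 8 buckets
      System.out.println(func.apply(16)); // may differ if the partition has 16 buckets
    }
  }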

Contributor Author:

Sure, changed

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.core.Is.is;

public class TestPartitionBucketPruning {
Contributor:

Should we move the tests into TestHoodieTableSource? Can we at least add an IT test for query result validation?

Contributor Author:

Changed, and also added an IT named tesQueryWithPartitionBucketIndexPruning to do query result validation.


@transient private lazy val bucketIndexSupport = if (isPartitionSimpleBucketIndex) {
val specifiedQueryInstant = options.get(DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT.key).map(HoodieSqlCommonUtils.formatQueryInstant)
new PartitionBucketIndexSupport(spark, metadataConfig, metaClient, specifiedQueryInstant)
Contributor:

Always query from the latest hash config, because there is no snapshot isolation (SI) between readers and writers.

Contributor Author:

Same as TimeTravel scenarios mentioned above.

*/
- public static String getLatestHashingConfigInstantToLoad(HoodieTableMetaClient metaClient) {
+ public static Option<String> getHashingConfigInstantToLoad(HoodieTableMetaClient metaClient, Option<String> instant) {
try {
List<String> allCommittedHashingConfig = getCommittedHashingConfigInstants(metaClient);
Contributor:

Can we assume there are always config files once the partition bucket index is enabled?

Contributor Author:

Yes, we can.
Currently there are two ways to use partition-level bucket indexing:

  1. Enabling during table creation via DDL: When enabled during table creation, this method initializes a 0000000000000.hashing_config file through the catalog.

  2. Upgrading existing table-level bucket indexes via the CALL command: For tables that already use table-level bucket indexing, invoking a CALL command triggers an upgrade process. This generates a replace-commit instant and initializes a corresponding .hashing_config file

String insertInto = "insert into " + catalogName + ".hudi.hoodie_sink select * from csv_source";
execInsertSql(tableEnv, insertInto);

List<Row> result1 = CollectionUtil.iterableToList(
Contributor:

use execSelectSql(TableEnvironment tEnv, String select)

Contributor Author:

done

@@ -69,7 +70,7 @@ public class FileIndex implements Serializable {
private final org.apache.hadoop.conf.Configuration hadoopConf;
private final PartitionPruners.PartitionPruner partitionPruner; // for partition pruning
private final ColumnStatsProbe colStatsProbe; // for probing column stats
- private final int dataBucketHashing; // for bucket pruning
+ private final Option<Functions.Function1<Integer, Integer>> dataBucket; // for bucket pruning
Contributor:

Can we use a Java function? And can we align the indentation of the comments:

  // for partition pruning
  // for probing column stats
  // for bucket pruning

Contributor:

dataBucket -> dataBucketFunc

Contributor Author:

changed

public static final int BUCKET_ID_NO_PRUNING = -1;

- public static int getBucketFieldHashing(List<ResolvedExpression> hashKeyFilters, Configuration conf) {
+ public static Option<Functions.Function1<Integer, Integer>> getBucketId(List<ResolvedExpression> hashKeyFilters, Configuration conf) {
Contributor:

getBucketId -> getBucketIdFunc

Contributor:

It looks like the return value is always non-empty; can we just return the function instead of the option?

@@ -158,7 +158,7 @@ public class HoodieTableSource implements
private List<Predicate> predicates;
private ColumnStatsProbe columnStatsProbe;
private PartitionPruners.PartitionPruner partitionPruner;
- private int dataBucketHashing;
+ private Option<Functions.Function1<Integer, Integer>> dataBucket;
Contributor:

dataBucket -> dataBucketFunc

Contributor Author:

changed

@@ -179,7 +180,7 @@ public HoodieTableSource(
@Nullable List<Predicate> predicates,
@Nullable ColumnStatsProbe columnStatsProbe,
@Nullable PartitionPruners.PartitionPruner partitionPruner,
- int dataBucketHashing,
+ Option<Functions.Function1<Integer, Integer>> dataBucket,
Contributor:

dataBucket -> dataBucketFunc

Contributor Author:

all changed

@VisibleForTesting
- public int getDataBucketHashing() {
-   return dataBucketHashing;
+ public Option<Functions.Function1<Integer, Integer>> getDataBucket() {
Contributor:

getDataBucketFunc

Contributor Author:

changed

if (!OptionsResolver.isBucketIndexType(conf) || dataFilters.isEmpty()) {
- return PrimaryKeyPruners.BUCKET_ID_NO_PRUNING;
+ return Option.empty();
Contributor Author:

> It looks like the return value is always non-empty; can we just return the function instead of the option?

Here getDataBucketFunc may return Option.empty().

}
Set<String> indexKeyFields = Arrays.stream(OptionsResolver.getIndexKeys(conf)).collect(Collectors.toSet());
List<ResolvedExpression> indexKeyFilters = dataFilters.stream().filter(expr -> ExpressionUtils.isEqualsLitExpr(expr, indexKeyFields)).collect(Collectors.toList());
if (!ExpressionUtils.isFilteringByAllFields(indexKeyFilters, indexKeyFields)) {
- return PrimaryKeyPruners.BUCKET_ID_NO_PRUNING;
+ return Option.empty();
Contributor Author:

> It looks like the return value is always non-empty; can we just return the function instead of the option?

Here getDataBucketFunc may also return Option.empty().
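Putting the two empty cases together, a condensed, hypothetical sketch of the control flow (not the exact Hudi signatures): the function stays empty unless the table uses a bucket index and equality filters pin every index key; only then is the hash wrapped into a per-partition mapping.

  import java.util.List;
  import java.util.Optional;
  import java.util.function.Function;

  class DataBucketFuncSketch {
    static Optional<Function<Integer, Integer>> getDataBucketFuncSketch(boolean isBucketIndexTable,
                                                                        List<String> pinnedIndexKeys,
                                                                        int indexKeyCount,
                                                                        int bucketKeyHash) {
      if (!isBucketIndexTable || pinnedIndexKeys.isEmpty()) {
        return Optional.empty(); // first empty case: not a bucket index table, or no usable equality filters
      }
      if (pinnedIndexKeys.size() < indexKeyCount) {
        return Optional.empty(); // second empty case: the filters do not pin every index key
      }
      return Optional.of(numBuckets -> Math.floorMod(bucketKeyHash, numBuckets));
    }
  }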

@danny0405 (Contributor):

Thanks for the contribution, I have made some refactoring with the given patch:

13060.patch.zip

@zhangyue19921010 (Contributor Author):

> Thanks for the contribution, I have made some refactoring with the given patch:
>
> 13060.patch.zip

Thanks for your help Danny, all changed.

return filesInPartition.stream().filter(fileInfo -> fileInfo.getPath().getName().contains(bucketIdStr));
}).collect(Collectors.toList());
} else {
allFiles = FSUtils.getFilesInPartitions(
Contributor:

We should use partition2Files to avoid redundant file fetching.

Contributor Author:

Oops, changed this.

@danny0405 (Contributor) commented Apr 5, 2025:

I have made another patch to fix the bucket pruning logging:
fix_the_bucket_pruning_logging.patch.zip

@hudi-bot commented Apr 5, 2025:

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@zhangyue19921010 (Contributor Author):

> I have made another patch to fix the bucket pruning logging: fix_the_bucket_pruning_logging.patch.zip

Done. Also CI passed.

@danny0405 danny0405 merged commit ff1cecf into master Apr 5, 2025
60 checks passed