
hudi0.13.1 #13021

Open · wants to merge 246 commits into master
Conversation

@hanleiit commented Mar 24, 2025

Change Logs

Flink SQL "select count(*) from table;" fails with "No value present in Option" (#13019)

  1. Flink version 1.14.6, Hudi version 0.13.1, Hadoop version 3.1.0
  2. Run the Flink SQL query: select count(*) from table;
  3. Intermittently fails with java.util.NoSuchElementException: No value present in Option
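
For context, the exception comes from calling get() on an empty org.apache.hudi.common.util.Option. A minimal standalone sketch of the failure mode (illustrative only, not the actual read path):

    import org.apache.hudi.common.util.Option;

    public class OptionGetRepro {
      public static void main(String[] args) {
        // An empty Option models a merge result that resolved to a delete.
        Option<String> merged = Option.empty();
        // Throws java.util.NoSuchElementException: No value present in Option
        System.out.println(merged.get());
      }
    }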

Impact

Hudi version 0.13.1

Update MergeOnReadInputFormat so the merge result is returned as an Option instead of unconditionally calling get() on it:

        // Before: get() throws java.util.NoSuchElementException when the
        // merger returns an empty Option (e.g. the merge resolves to a delete):
        //   Option<HoodieRecord> resultRecord = recordMerger.merge(hoodieAvroIndexedRecord, tableSchema, record, tableSchema, payloadProps).map(Pair::getLeft);
        //   return resultRecord.get().toIndexedRecord(tableSchema, new Properties());

        // After: propagate the Option so the caller can handle the empty case.
        Option<Pair<HoodieRecord, Schema>> mergeResult = recordMerger.merge(hoodieAvroIndexedRecord, tableSchema, record, tableSchema, payloadProps);
        Option<HoodieAvroIndexedRecord> resultRecord = mergeResult.map(p -> (HoodieAvroIndexedRecord) p.getLeft());
        return resultRecord;
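
With the method returning an Option, the empty case can be handled explicitly instead of crashing. A hedged sketch of the caller-side pattern (the helper and its names are illustrative, not the actual MergeOnReadInputFormat API):

    import java.util.function.Consumer;
    import org.apache.avro.generic.IndexedRecord;
    import org.apache.hudi.common.util.Option;

    public class MergeResultHandling {
      // Hypothetical helper: emit the merged record if present; an empty
      // Option means the merge resolved to a delete, so nothing is emitted.
      static void emitIfPresent(Option<IndexedRecord> merged, Consumer<IndexedRecord> emit) {
        if (merged.isPresent()) {
          emit.accept(merged.get());
        }
      }
    }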

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

yihua and others added 30 commits January 26, 2023 13:18
This commit adds the missing Apache License in some source files.
This commit fixes `scripts/release/validate_staged_release.sh` to skip checking `release/release_guide*` for "Binary Files Check" and "Licensing Check".
Recently we have seen more flakiness in our CI runs, so this takes a stab at fixing some of the most frequently flaky tests.

Tests that are fixed:
TestHoodieClientOnMergeOnReadStorage (testReadingMORTableWithoutBaseFile, testCompactionOnMORTable, testLogCompactionOnMORTable, testLogCompactionOnMORTableWithoutBaseFile)

Reasoning for flakiness:

We generate only 10 inserts in these tests, which does not guarantee that records land in all 3 partitions (HoodieTestDataGenerator).

Fixes:

HoodieTestDataGenerator was choosing a random partition from the list of partitions while generating insert records. Fixed that to do round robin (see the sketch below). Also bumped up the number of records inserted in some of the flaky tests from 10 to 100.
Fixed the respective MOR tests to disable small-file handling.
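
A minimal sketch of the round-robin idea, assuming a fixed array of partition paths (class and member names are illustrative, not the exact HoodieTestDataGenerator code):

    import java.util.concurrent.atomic.AtomicInteger;

    class RoundRobinPartitionPicker {
      private final String[] partitionPaths;
      private final AtomicInteger next = new AtomicInteger(0);

      RoundRobinPartitionPicker(String[] partitionPaths) {
        this.partitionPaths = partitionPaths;
      }

      // Cycles deterministically through all partitions, so even a small
      // batch of inserts (e.g. 10 records over 3 partitions) hits each one.
      String nextPartition() {
        return partitionPaths[Math.floorMod(next.getAndIncrement(), partitionPaths.length)];
      }
    }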
…om Metadata Table (apache#7642)

Most recently, trying to use the Metadata Table in the Bloom Index resulted in failures due to exhaustion of the S3 connection pool, no matter how (reasonably) big we set the pool size (we tested up to 3k connections).

This PR focuses on optimizing the Bloom Index lookup sequence in the case when it leverages the Bloom Filter partition in the Metadata Table. The premise of this change is based on the following observations:

Increasing the size of the batch of requests to the MT allows us to amortize the cost of processing it (the bigger the batch, the lower the per-record cost).

Having too few partitions in the Bloom Index path, however, starts to hurt parallelism when we actually probe individual files for whether they contain the target keys. The solution is to split these two steps into stages with drastically different parallelism levels: constrain parallelism when reading from the MT (tens of tasks) and keep it at the current level for probing individual files (hundreds of tasks).

The current way of partitioning records (relying on Spark's default partitioner) entailed that, with high likelihood, every Spark executor would open (and process) every file group of the MT Bloom Filter partition. To alleviate that, the same hashing algorithm used by the MT should be used to partition records into Spark's individual partitions, so that every task opens no more than one file group in the Bloom Filter partition of the MT.

To achieve that, the following changes in the Bloom Index sequence (leveraging the MT) are implemented:

Bloom Filter probing and actual file probing are split into two separate operations, so that the parallelism of each can be controlled individually.
Requests to the MT are replaced with batch API invocations.
A custom partitioner, AffineBloomIndexFileGroupPartitioner, is introduced; it repartitions the dataset of filenames with corresponding record keys in a way that is affine with the MT Bloom Filters' partitioning, allowing us to open no more than a single file group per Spark task (see the sketch below).
Additionally, this PR addresses some low-hanging performance optimizations that could considerably improve the Bloom Index lookup sequence, such as mapping file-comparison pairs to a PairRDD (where the key is the file name and the value is the record key) instead of an RDD, so that we can:

Do in-partition sorting by filename (to make sure we check all records within a file at once) within a single Spark partition instead of globally (reducing shuffling as well)
Avoid re-shuffling (by re-mapping from RDD to PairRDD later)
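
A hedged sketch of the affinity idea behind AffineBloomIndexFileGroupPartitioner: route each record key to the Spark partition owning the MT file group that the same hash would pick. The hash below is a stand-in; the real implementation must reuse the exact key-to-file-group hashing of the metadata table.

    import org.apache.spark.Partitioner;

    // Illustrative only: one Spark partition per bloom-filter file group.
    class AffineFileGroupPartitioner extends Partitioner {
      private final int fileGroupCount;

      AffineFileGroupPartitioner(int fileGroupCount) {
        this.fileGroupCount = fileGroupCount;
      }

      @Override
      public int numPartitions() {
        return fileGroupCount;
      }

      @Override
      public int getPartition(Object key) {
        // Stand-in hash; with the MT's own hashing, every task reads at
        // most one bloom-filter file group.
        return Math.floorMod(key.hashCode(), fileGroupCount);
      }
    }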
…#7476)

This change switches the default Write Executor to SIMPLE, i.e., one bypassing reliance on any kind of queue (either BoundedInMemory or Disruptor's).

This should considerably trim down on:

Runtime (compared to BIMQ)
Wasted compute (compared to BIMQ, Disruptor)

since it eliminates the unnecessary intermediary "staging" of records in a queue (for example, in Spark such in-memory enqueueing occurs at the ingress points, i.e. shuffling) and allows records to be written in one pass (even avoiding copies of the records in the future).
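
A minimal sketch contrasting the two paths described above (types and names are illustrative, not Hudi's executor API):

    import java.util.Iterator;
    import java.util.function.Consumer;

    public class SimpleExecutorSketch {
      // Queue-based executors stage every record in an intermediate buffer
      // that a consumer thread drains; the SIMPLE executor just pulls from
      // the source iterator and writes in the same pass.
      static <T> void simpleExecute(Iterator<T> records, Consumer<T> writeHandler) {
        while (records.hasNext()) {
          writeHandler.accept(records.next()); // no enqueue/dequeue, no extra copy
        }
      }
    }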
Fixing flaky parquet projection tests. Added a 10% margin for expected bytes from column projection.
Change logging mode names for CDC feature to

- op_key_only
- data_before
- data_before_after
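
For reference, these names are the values of the table-level CDC supplemental logging mode config. A hedged example of selecting one when writing (the config keys below are per the Hudi 0.13 docs; verify against your version):

    import java.util.HashMap;
    import java.util.Map;

    class CdcWriteOptions {
      static Map<String, String> cdcOptions() {
        Map<String, String> opts = new HashMap<>();
        opts.put("hoodie.table.cdc.enabled", "true");
        // One of: op_key_only, data_before, data_before_after
        opts.put("hoodie.table.cdc.supplemental.logging.mode", "data_before_after");
        return opts;
      }
    }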
…che#7759)

Updates the HoodieAvroRecordMerger to use the new precombine API instead of the deprecated one. This fixes issues with backwards compatibility with certain payloads.
We introduced a new way to scan log blocks in the LogRecordReader and had named its config "hoodie.log.record.reader.use.scanV2". Fixing the config name to be more elegant: "hoodie.optimized.log.blocks.scan.enable". Fixing the corresponding Metadata config as well.
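
A small example of opting into the renamed config (key name taken from the commit above; a plain Properties object stands in for your write config):

    import java.util.Properties;

    class LogScanConfig {
      static Properties optimizedScanProps() {
        Properties props = new Properties();
        // Renamed from "hoodie.log.record.reader.use.scanV2".
        props.setProperty("hoodie.optimized.log.blocks.scan.enable", "true");
        return props;
      }
    }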
Fix tests and artifact deployment for metaserver.
…7784)

Fixes deploy_staging_jars.sh to generate all hudi-utilities-slim-bundle jars.
Cleaning up some of the recently introduced configs:

Shortened the file-listing mode override for Spark's FileIndex
Fixed Disruptor's write buffer limit config
Scoped the CANONICALIZE_NULLABLE config to HoodieSparkSqlWriter
…ache#7790)

- Ensures that Hudi CLI commands which require launching Spark can be executed with hudi-cli-bundle
littleeleventhwolf and others added 16 commits May 17, 2023 11:07
apache#8631)

* Use correct zone id while calculating earliestTimeToRetain
* Use metaClient table config
…ition field (apache#7355)

* Partition query in hive3 returns null for Hive 3.x.
* Disable vectorized reader for spark 3.3.2 only
* Keep compile version to be Spark 3.3.1

---------

Co-authored-by: Rahil Chertara <[email protected]>
This commit adds bundle validation on Spark 3.3.2 in the GitHub Java CI to ensure compatibility after we fixed the compatibility issue in apache#8082.
There was a bug where delete records were assumed to be marked by "_hoodie_is_deleted"; however, custom CDC payloads use the "op" field to mark deletes. In such cases, the AWS DMS payload and the Debezium payload failed on deletes. This commit fixes the issue by adding a new API, isDeleteRecord(GenericRecord genericRecord), in BaseAvroPayload to allow a payload to implement custom logic indicating whether a record is a delete record.

Co-authored-by: Raymond Xu <[email protected]>
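
A hedged sketch of how a payload could use the new hook: override isDeleteRecord(GenericRecord) to treat an "op" field of "d" as a delete. Extending OverwriteWithLatestAvroPayload and the Debezium-style "d" value are assumptions for illustration; the exact base class and field conventions depend on your pipeline.

    import org.apache.avro.generic.GenericRecord;
    import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;

    public class OpFieldAwarePayload extends OverwriteWithLatestAvroPayload {

      public OpFieldAwarePayload(GenericRecord record, Comparable orderingVal) {
        super(record, orderingVal);
      }

      // The new extension point: declare a custom delete marker instead of
      // relying on the "_hoodie_is_deleted" field.
      @Override
      public boolean isDeleteRecord(GenericRecord record) {
        Object op = record.get("op");
        return op != null && "d".equals(op.toString());
      }
    }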
java.util.NoSuchElementException: No value present in Option
@hudi-bot

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@hanleiit (Author)

Intermittently occurring: java.util.NoSuchElementException: No value present in Option

@hanleiit changed the title from "0.13.1" to "hudi0.13.1" on Mar 25, 2025
@hanleiit (Author) left a comment

@danny0405 (Contributor)

Maybe this fix is what you need: #8935; see the issue: #8932.


@hanleiit (Author) commented Mar 26, 2025

Maybe this fix is what you need: #8935; see the issue: #8932.

I have read the fix you pointed me to, but the same bug still appears: get() is called on an empty Option.

[screenshot attached]

@danny0405 (Contributor)

@hanleiit Did you apply the new fix then, https://github.com/apache/hudi/pull/8935/files? Can you share with us the full error stack trace?

@hanleiit (Author)

@hanleiit Did you apply the new fix then, https://github.com/apache/hudi/pull/8935/files? Can you share with us the full error stack trace?

I didn't use the new fix https://github.com/apache/hudi/pull/8935/files.

@hanleiit (Author) commented Mar 27, 2025

@hanleiit Did you apply the new fix then, https://github.com/apache/hudi/pull/8935/files? Can you share with us the full error stack trace?

The new fix https://github.com/apache/hudi/pull/8935/files should be the same, but why was it not merged into Hudi version 0.13.1? That version's source code was not modified.

@danny0405 (Contributor)

but why was it not merged into Hudi version 0.13.1? That version's source code was not modified

Because the bug was reported after 0.13.1 was released, we put the fix in the 0.14.0 release.

@hanleiit (Author)

but why was it not merged into Hudi version 0.13.1? That version's source code was not modified

Because the bug was reported after 0.13.1 was released, we put the fix in the 0.14.0 release.

Thank you.

Labels
size:XL (PR with lines of changes > 1000)