
fix for message loss for issue 826 #827

Merged
merged 6 commits into confluentinc:master on Sep 23, 2024

Conversation

@sangreal (Contributor) commented Sep 2, 2024

Description
Fix for #826.
Explanation:
The root cause is the following sequence (see the sketch after this list):

  1. Currently the offset to be committed is derived from offsetHighestSeen, but this variable is updated as soon as a record is polled from Kafka, before it has been processed.
  2. The committer is independent of message processing, since it is a periodic committer inside BrokerPollSystem.
  3. Right before the work container is ready to be processed, the partition is revoked.
  4. The container is therefore regarded as stale. That in itself is correct behavior, but because its offset has already been committed, the new consumer will skip this record.
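
A minimal sketch of this race in plain Java, assuming illustrative names (PartitionStateSketch, offsetToCommit, RaceDemo) rather than the actual parallel-consumer classes:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical simplification of the partition state described above.
class PartitionStateSketch {
    // Bumped as soon as a record is polled, before any user function has run.
    private final AtomicLong offsetHighestSeen = new AtomicLong(-1);

    void onRecordPolled(long offset) { offsetHighestSeen.set(offset); }

    // The problematic rule from point 1: the periodic committer (pc-broker-poll)
    // derives the commit position from what was *seen*, not from what was *done*.
    long offsetToCommit() { return offsetHighestSeen.get() + 1; }
}

public class RaceDemo {
    public static void main(String[] args) {
        PartitionStateSketch state = new PartitionStateSketch();
        state.onRecordPolled(329888);            // record polled, not yet processed
        long committed = state.offsetToCommit(); // periodic commit fires: 329889
        // The partition is revoked here; the work container for 329888 is later
        // dropped as stale and never processed, yet the next owner of the
        // partition resumes from 329889.
        System.out.println("committed " + committed + "; offset 329888 is lost");
    }
}
```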

The issue is rare because all of the following conditions have to be met:

1. Before the work container is executed (selected as ready to run), the partition is revoked (close / rebalancing).
2. A commit happens right after the partition is revoked.
3. The revocation happens right after the isDirty check.

Check the doc: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentHashMap.html

Retrieval operations (including get) generally do not block, so may overlap with update operations (including put and remove). Retrievals reflect the results of the most recently completed update operations holding upon their onset.
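
The relevance here is that the periodic committer and the control thread share state: a retrieval never waits for a concurrent update to finish, so the committer can act on a view that the control thread is invalidating at that very moment. A minimal illustration, with made-up map contents:

```java
import java.util.concurrent.ConcurrentHashMap;

public class NonBlockingReadDemo {
    public static void main(String[] args) throws InterruptedException {
        ConcurrentHashMap<Long, String> workState = new ConcurrentHashMap<>();
        workState.put(329888L, "SEEN");

        // Control thread marks the record stale (e.g. after a partition revoke).
        Thread control = new Thread(() -> workState.put(329888L, "STALE"));

        // Committer thread reads without blocking; depending on timing it may
        // still observe "SEEN" and commit on the strength of that stale view.
        Thread committer = new Thread(() ->
                System.out.println("committer observed: " + workState.get(329888L)));

        control.start();
        committer.start();
        control.join();
        committer.join();
    }
}
```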

Repro log; the missed offset is 329888:


2024-08-31 00:08:50.791 +09:00 [pc-control] TRACE i.c.p.state.PartitionState -- Updating highest seen - was: 329887 now: 329888
2024-08-31 00:08:50.831 +09:00 [pc-broker-poll] DEBUG i.c.p.i.AbstractOffsetCommitter -- Will commit offsets for 3 partition(s): {order.message_loss_repro_test.1-5=OffsetAndMetadata{offset=329937, leaderEpoch=null, metadata=''}, order.message_loss_repro_test.1-4=OffsetAndMetadata{offset=333666, leaderEpoch=null, metadata=''}, order.message_loss_repro_test.1-3=OffsetAndMetadata{offset=329889, leaderEpoch=null, metadata=''}}
2024-08-31 00:08:50.831 +09:00 [pc-broker-poll] DEBUG i.c.p.i.AbstractParallelEoSStreamProcessor -- Partitions revoked [order.message_loss_repro_test.1-3, order.message_loss_repro_test.1-4, order.message_loss_repro_test.1-5], state: RUNNING
2024-08-31 00:08:50.832 +09:00 [pc-control] DEBUG i.c.p.state.PartitionState -- Epoch mismatch 0 vs -1 for record WorkContainer(tp:order.message_loss_repro_test.1-3:o:329888:k:46283). Skipping message - it's partition has already assigned to a different consumer.
2024-08-31 00:08:50.832 +09:00 [pc-control] DEBUG i.c.p.state.PartitionState -- Work is in queue with stale epoch or no longer assigned. Skipping.
2024-08-31 00:08:50.784 : the consumer receives the message
2024-08-31 00:08:50.791 : the offset to be committed is updated to 329888 (this is the faulty logic that causes the issue)
2024-08-31 00:08:50.831 : the offset is committed
2024-08-31 00:08:50.831 : the partition is revoked
2024-08-31 00:08:50.832 : when the message is picked up for processing, the partition has already been revoked, so 329888 is regarded as stale and is filtered out without being processed
Message loss then occurs, because the offset has already been committed; a sketch of the safer commit rule follows.
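
For contrast, here is a minimal sketch of that safer rule, advancing the commit position only across offsets whose processing has actually completed. This illustrates the idea only; it is not the actual patch in this PR, and the names (SafeCommitSketch, onRecordCompleted) are hypothetical:

```java
import java.util.concurrent.ConcurrentSkipListSet;

class SafeCommitSketch {
    private final ConcurrentSkipListSet<Long> completed = new ConcurrentSkipListSet<>();
    private long nextExpected; // first offset not yet known to be processed

    SafeCommitSketch(long startOffset) { this.nextExpected = startOffset; }

    void onRecordCompleted(long offset) { completed.add(offset); }

    // Commit only up to the highest contiguous run of completed offsets, so a
    // record that was merely seen (like 329888 above) can never be skipped.
    synchronized long offsetToCommit() {
        while (completed.remove(nextExpected)) {
            nextExpected++;
        }
        return nextExpected;
    }
}

public class SafeCommitDemo {
    public static void main(String[] args) {
        SafeCommitSketch state = new SafeCommitSketch(329888);
        // 329888 was polled but never processed, so nothing is marked complete:
        System.out.println(state.offsetToCommit()); // 329888 - the record is not skipped
        state.onRecordCompleted(329888);
        System.out.println(state.offsetToCommit()); // 329889 - safe to advance now
    }
}
```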

Checklist

  • Documentation (if applicable)
  • Changelog

@sangreal requested a review from a team as a code owner, September 2, 2024 07:45
@sangreal (Contributor, Author) commented Sep 4, 2024

Check the doc: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentHashMap.html

Retrieval operations (including get) generally do not block, so may overlap with update operations (including put and remove). Retrievals reflect the results of the most recently completed update operations holding upon their onset.

@sangreal (Contributor, Author)

@rkolesnev, sorry to bother you, but please take a look at the PR and let me know if anything needs to be done. Thanks a lot.

@rkolesnev (Contributor)

@sangreal - no bother at all - thanks for reminding me in fact. Yes it all LGTM now - will merge now and try to release shortly - this week.

@rkolesnev merged commit ffcafda into confluentinc:master on Sep 23, 2024
1 check passed
@flashmouse (Contributor)

Hi @rkolesnev, sorry to disturb; just wanted to know when a new release including this fix will be published...

@rkolesnev (Contributor)

Hi @flashmouse, @sangreal - working on it this week, sorry for the delay.

@rkolesnev (Contributor)

@flashmouse , @sangreal - finally released the 0.5.3.2 version which includes this PR.

@flashmouse (Contributor)

> @flashmouse , @sangreal - finally released the 0.5.3.2 version which includes this PR.

Thanks! We'll try it.
