
fix for message loss for issue 826 #827

Merged
merged 6 commits into confluentinc:master on Sep 23, 2024

Conversation

@sangreal (Contributor) commented Sep 2, 2024

Description
Fix for #826.
Explanation:
The root cause is the following sequence (see the sketch after this list):

  1. Currently the offset to be committed is derived from offsetHighestSeen, but this variable is updated as soon as a record is polled from Kafka, before it has been processed.
  2. The committer is independent of message processing, since it is a periodic committer inside BrokerPollSystem.
  3. Right before the work container is ready to be processed, the partition is revoked.
  4. The container is therefore regarded as stale. That in itself is correct behavior, but because its offset has already been committed, the new consumer will skip this record.
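
A minimal sketch of this race in plain Java, assuming illustrative names (PartitionStateSketch, offsetToCommit, RaceDemo) rather than the actual parallel-consumer classes:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical simplification of the partition state described above.
class PartitionStateSketch {
    // Bumped as soon as a record is polled, before any user function has run.
    private final AtomicLong offsetHighestSeen = new AtomicLong(-1);

    void onRecordPolled(long offset) { offsetHighestSeen.set(offset); }

    // The problematic rule from point 1: the periodic committer (pc-broker-poll)
    // derives the commit position from what was *seen*, not from what was *done*.
    long offsetToCommit() { return offsetHighestSeen.get() + 1; }
}

public class RaceDemo {
    public static void main(String[] args) {
        PartitionStateSketch state = new PartitionStateSketch();
        state.onRecordPolled(329888);            // record polled, not yet processed
        long committed = state.offsetToCommit(); // periodic commit fires: 329889
        // The partition is revoked here; the work container for 329888 is later
        // dropped as stale and never processed, yet the next owner of the
        // partition resumes from 329889.
        System.out.println("committed " + committed + "; offset 329888 is lost");
    }
}
```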

The issue is rare because all of the following conditions have to be met:

1. Before the work container is executed (selected as ready to run), the partition is revoked (close / rebalancing).
2. A commit happens right after the partition is revoked.
3. The revocation happens right after the isDirty check.

Check the doc: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentHashMap.html

Retrieval operations (including get) generally do not block, so may overlap with update operations (including put and remove). Retrievals reflect the results of the most recently completed update operations holding upon their onset.
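
The relevance here is that the periodic committer and the control thread share state: a retrieval never waits for a concurrent update to finish, so the committer can act on a view that the control thread is invalidating at that very moment. A minimal illustration, with made-up map contents:

```java
import java.util.concurrent.ConcurrentHashMap;

public class NonBlockingReadDemo {
    public static void main(String[] args) throws InterruptedException {
        ConcurrentHashMap<Long, String> workState = new ConcurrentHashMap<>();
        workState.put(329888L, "SEEN");

        // Control thread marks the record stale (e.g. after a partition revoke).
        Thread control = new Thread(() -> workState.put(329888L, "STALE"));

        // Committer thread reads without blocking; depending on timing it may
        // still observe "SEEN" and commit on the strength of that stale view.
        Thread committer = new Thread(() ->
                System.out.println("committer observed: " + workState.get(329888L)));

        control.start();
        committer.start();
        control.join();
        committer.join();
    }
}
```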

Repro log; the missed offset is 329888:


2024-08-31 00:08:50.791 +09:00 [pc-control] TRACE i.c.p.state.PartitionState -- Updating highest seen - was: 329887 now: 329888
2024-08-31 00:08:50.831 +09:00 [pc-broker-poll] DEBUG i.c.p.i.AbstractOffsetCommitter -- Will commit offsets for 3 partition(s): {order.message_loss_repro_test.1-5=OffsetAndMetadata{offset=329937, leaderEpoch=null, metadata=''}, order.message_loss_repro_test.1-4=OffsetAndMetadata{offset=333666, leaderEpoch=null, metadata=''}, order.message_loss_repro_test.1-3=OffsetAndMetadata{offset=329889, leaderEpoch=null, metadata=''}}
2024-08-31 00:08:50.831 +09:00 [pc-broker-poll] DEBUG i.c.p.i.AbstractParallelEoSStreamProcessor -- Partitions revoked [order.message_loss_repro_test.1-3, order.message_loss_repro_test.1-4, order.message_loss_repro_test.1-5], state: RUNNING
2024-08-31 00:08:50.832 +09:00 [pc-control] DEBUG i.c.p.state.PartitionState -- Epoch mismatch 0 vs -1 for record WorkContainer(tp:order.message_loss_repro_test.1-3:o:329888:k:46283). Skipping message - it's partition has already assigned to a different consumer.
2024-08-31 00:08:50.832 +09:00 [pc-control] DEBUG i.c.p.state.PartitionState -- Work is in queue with stale epoch or no longer assigned. Skipping.
2024-08-31 00:08:50.784 : the consumer receives the message
2024-08-31 00:08:50.791 : the offset to be committed is updated to 329888 (this is the faulty logic that causes the issue)
2024-08-31 00:08:50.831 : the offset is committed
2024-08-31 00:08:50.831 : the partition is revoked
2024-08-31 00:08:50.832 : when the message is picked up for processing, the partition has already been revoked, so 329888 is regarded as stale and is filtered out without being processed
Message loss then occurs, because the offset has already been committed; a sketch of the safer commit rule follows.
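
For contrast, here is a minimal sketch of that safer rule, advancing the commit position only across offsets whose processing has actually completed. This illustrates the idea only; it is not the actual patch in this PR, and the names (SafeCommitSketch, onRecordCompleted) are hypothetical:

```java
import java.util.concurrent.ConcurrentSkipListSet;

class SafeCommitSketch {
    private final ConcurrentSkipListSet<Long> completed = new ConcurrentSkipListSet<>();
    private long nextExpected; // first offset not yet known to be processed

    SafeCommitSketch(long startOffset) { this.nextExpected = startOffset; }

    void onRecordCompleted(long offset) { completed.add(offset); }

    // Commit only up to the highest contiguous run of completed offsets, so a
    // record that was merely seen (like 329888 above) can never be skipped.
    synchronized long offsetToCommit() {
        while (completed.remove(nextExpected)) {
            nextExpected++;
        }
        return nextExpected;
    }
}

public class SafeCommitDemo {
    public static void main(String[] args) {
        SafeCommitSketch state = new SafeCommitSketch(329888);
        // 329888 was polled but never processed, so nothing is marked complete:
        System.out.println(state.offsetToCommit()); // 329888 - the record is not skipped
        state.onRecordCompleted(329888);
        System.out.println(state.offsetToCommit()); // 329889 - safe to advance now
    }
}
```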

Checklist

  • Documentation (if applicable)
  • Changelog

@sangreal requested a review from a team as a code owner, September 2, 2024 07:45
@sangreal (Contributor, Author) commented Sep 4, 2024

Check the doc: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentHashMap.html

Retrieval operations (including get) generally do not block, so may overlap with update operations (including put and remove). Retrievals reflect the results of the most recently completed update operations holding upon their onset.

@sangreal (Contributor, Author)

@rkolesnev, sorry to bother you, but please take a look at the PR and let me know if anything needs to be done. Thanks a lot.

@rkolesnev (Contributor)

@sangreal - no bother at all - thanks for reminding me in fact. Yes it all LGTM now - will merge now and try to release shortly - this week.

@rkolesnev merged commit ffcafda into confluentinc:master on Sep 23, 2024
1 check passed
@flashmouse (Contributor)

Hi @rkolesnev, sorry to disturb; just wanted to know when a new release including this fix will be published...

@rkolesnev (Contributor)

Hi @flashmouse, @sangreal - working on it this week, sorry for the delay.

@rkolesnev (Contributor)

@flashmouse , @sangreal - finally released the 0.5.3.2 version which includes this PR.

@flashmouse (Contributor)

> @flashmouse , @sangreal - finally released the 0.5.3.2 version which includes this PR.

Thanks! We'll try it.
