Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Unordered consuming case in Key_Shared subscription #12885

Closed
Jason918 opened this issue Nov 19, 2021 · 3 comments · Fixed by #12890
Closed

[Bug] Unordered consuming case in Key_Shared subscription #12885

Jason918 opened this issue Nov 19, 2021 · 3 comments · Fixed by #12890
Assignees
Labels
release/2.8.3 release/2.9.1 type/bug The PR fixed a bug or issue reported a bug
Milestone

Comments

@Jason918
Copy link
Contributor

Jason918 commented Nov 19, 2021

Describe the bug
It's a previous flaky test in org.apache.pulsar.client.api.KeySharedSubscriptionTest#testRemoveFirstConsumer
See #11426

There is a race condition in PersistentDispatcherMultipleConsumers when previous client closed with unack messaages.

Detailed Explanation:
Here is the core steps of this case.

  1. produce 10 messages: m0~m9
  2. wait unitl c1 got all the 10 message.
  3. create c2 with same setting as c1.
  4. produce 10 messages: m10~m19
  5. c2 will not be able to receive any messages until c1 is done processing whatever he got prefetched.
  6. c1.close();
  7. Call c2.receive() now, It should receive m0. But there is some chance it will return m10.

The race condition happens after step6.
Just before step6, there is a retry loop going on for this consumer.

Thread 1:
1.1 PersistentDispatcherMultipleConsumers#readMoreEntries //synchronized
1.2     calculateToRead() return (10,xxx)
1.3     getMessagesToReplayNow() return m10-m14
1.4     havePendingReplayRead = true
1.5     asyncReplayEntriesInOrder(m10-m14) --> cursor.asyncReplayEntries() --> will callback readEntriesComplete async.

Thread 2:
2.1 PersistentDispatcherMultipleConsumers#readEntriesComplete //synchronized
2.2     havePendingReplayRead = false;
2.3     PersistentStickyKeyDispatcherMultipleConsumers#sendMessagesToConsumers(m10-m14)
2.4         //DEBUG Log: select consumer c1 with messages num 0, read type is Replay
            Can not send to c1 because it runs out of permits.
2.5         //DEBUG Log: select consumer c2 with messages num 0, read type is Replay
            Can not send to c2 because it should wait c1 to complete unacked messages.
2.6         Finally, no consumer to send message. So it schedule call readMoreEntries() in 100ms.

After we close c1, PersistentDispatcherMultipleConsumers#removeConsumer will be called like this.

Thread3
3.1PersistentDispatcherMultipleConsumers#removeConsumer
3.2        consumer.getPendingAcks().forEach : addMessageToReplay m0-m9
3.3        calls readMoreEntries.

If Thread3 happens after Thread2, everything is ok, and this is most of the real cases. As time between Thread1 and Thread2 is quite small.
But if it does happens in the order of Thread1 --> Thread3 --> Thread2. The disorder occurs.
In the readMoreEntries of 3.3 , calculateToRead return -1,-1 because havePendingReplayRead is true in 1.2 of Thread1. So the readMoreEntries will just returns without further actions.
Then in Thread2, previous reading ended, and we got m10-m14 to dispatch. now we can select consumer c2 because we have no more c1 for waiting. So the client c2 got m10 instead of m0.

To Reproduce
Steps to reproduce the behavior:

  1. It easier to reproduce this in local environment by changing the 100ms to 1ms here.
  2. Run Test on org.apache.pulsar.client.api.KeySharedSubscriptionTest#testRemoveFirstConsumer. It will hit the race condition in a few retries.

Expected behavior
As the case designed in testRemoveFirstConsumer. The consuming order should be right.

Screenshots
Nop

Desktop (please complete the following information):

  • OS: macOS

Additional context
NO

@Jason918 Jason918 added the type/bug The PR fixed a bug or issue reported a bug label Nov 19, 2021
@Jason918 Jason918 changed the title [] [Bug] Unordered consuming case in Key_Shared subscription Nov 19, 2021
Jason918 pushed a commit to Jason918/pulsar that referenced this issue Nov 19, 2021
Jason918 pushed a commit to Jason918/pulsar that referenced this issue Nov 19, 2021
@michaeljmarshall
Copy link
Member

@Jason918 - good catch! Do you plan to submit a patch for this bug?

@codelipenghui
Copy link
Contributor

@michaeljmarshall Already have a patch here #12890

@michaeljmarshall
Copy link
Member

@codelipenghui - oh, right. I don't know how I missed the GitHub note right about my comment. Thanks!

Jason918 added a commit to Jason918/pulsar that referenced this issue Nov 21, 2021
Jason918 added a commit to Jason918/pulsar that referenced this issue Nov 26, 2021
Jason918 added a commit to Jason918/pulsar that referenced this issue Nov 30, 2021
Jason918 added a commit to Jason918/pulsar that referenced this issue Nov 30, 2021
zeo1995 pushed a commit to zeo1995/pulsar that referenced this issue Dec 1, 2021
* up/master: (75 commits)
  [website][upgrade]feat: website upgrade / docs migration - 2.5.1 Get Started/Concepts and Architecture/Pulsar Schema (apache#13030)
  Fix environment variable assignment in startup scripts (apache#13025)
  update 2.8.x (apache#13029)
  [Doc] add tips for Pulsar tools (apache#13044)
  Suggest to use tlsPort instead of deprecated TlsEnable (apache#13039)
  Integration tests for function-worker rebalance and drain operations. (apache#13058)
  fix(functions): missing runtime set in GoInstanceConfig (apache#13031)
  [pulsar-admin] Add get-replicated-subscription-status command for topic (apache#12891)
  [Broker] Consider topics in pulsar/system namespace as system topics (apache#13050)
  Fix typo: correct sizeUint to sizeUnit (apache#13040)
  fix-12894 (apache#12896)
  Don't attempt to delete pending ack store unless transactions are enabled (apache#13041)
  [Perf] Evaluate the current protocol version once (apache#13045)
  Fix Issue apache#12885, Unordered consuming case in Key_Shared subscription (apache#12890)
  [broker]Optimize topicMaxMessageSize with topic local cache. (apache#12830)
  [PIP-105] Part-2 Support pluggable entry filter in Dispatcher (apache#12970)
  [website] Modify admin-api-topic.md document (apache#12996)
  add missed import (apache#13037)
  [metadata] Add RocksdbMetadataStore (apache#12776)
  [C] Add pulsar_client_subscribe_multi_topics and pulsar_client_subscribe_pattern (apache#12965)
  ...

# Conflicts:
#	site2/website-next/docusaurus.config.js
#	site2/website-next/versioned_sidebars/version-2.6.1-sidebars.json
#	site2/website-next/versions.json
fxbing pushed a commit to fxbing/pulsar that referenced this issue Dec 19, 2021
zymap pushed a commit that referenced this issue Dec 23, 2021
eolivelli pushed a commit to datastax/pulsar that referenced this issue Feb 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release/2.8.3 release/2.9.1 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants