Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] ManagedCursor: mark delete no callback when create meta-ledger fail #16841

Conversation

poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Jul 28, 2022

Fixes: #16711

Motivation

If the meta-ledger fails to be initialized when mark delete is executed, the callback of mark delete will not execute anymore. this case will occur in 1/1000 probability. You can reproduce it by doing this: "Run unit test ManagedCursorTest.markDeleteWithZKErrors 1000 times".

This problem also makes unit test ManagedCursorTest.markDeleteWithZKErrors flaky. #16711

When the problem occurs, the actual execution process is as follows:

Flow-1. The process we expect

Time cursor mark deleted meta thread
1 synchronized pendingMarkDeleteOps
2 check meta-ledger state
3 do create ledger
4 append to pending requests queue
4 synchronized pendingMarkDeleteOps( High light )
5 create ledger fail
6 loop pending requests, and fail callback

Flow-2. The process we did not expect. However, Thread cursor mark deleted and meta thread may be assigned to the same thread, the actual maybe execution process is as follows:

Time cursor mark deleted/meta thread
1 synchronized pendingMarkDeleteOps
2 check meta-ledger state
3 do create ledger
4 synchronized pendingMarkDeleteOps ( High light ) reentrant lock by the same thread, which we did not expect
5 create ledger fail
6 loop pending requests, and fail callback
7 append to pending requests queue
8 waiting callback...
9 waiting callback...
  • Each column means the individual threads.
  • Column Time is used only to indicate the order of each step, not the actual time.
  • The important steps are explained below:

step-4: If the ledger fails to be created, will trigger a "fail back" for the pending requests, and the requests that have not been queued will be ignored.

public void operationFailed(ManagedLedgerException exception) {
log.error("[{}][{}] Metadata ledger creation failed", ledger.getName(), name, exception);
synchronized (pendingMarkDeleteOps) {
while (!pendingMarkDeleteOps.isEmpty()) {
MarkDeleteEntry entry = pendingMarkDeleteOps.poll();
entry.callback.markDeleteFailed(exception, entry.ctx);
}

step-5 ( High light ): If the meta ledger needs to be created, create ledger will be triggered first and the current request will be put into the pending requests queue. It is possible that create ledger fail has been completed before the request is put into the queue, so this request will not get the callback anymore( see flow2 above ).

case NoLedger:
// We need to create a new ledger to write into
startCreatingNewMetadataLedger();
// fall through
case SwitchingLedger:
pendingMarkDeleteOps.add(mdEntry);
break;

Modifications

  • Selected plain ( High light )
    Makes put request into the pending queue executed before create ledger.

  • Rejected plan
    Makes put request into the pending queue and create ledger execute serially. Why rejected this plan? Because plan-1 fewer changes.

Documentation

Check the box below or label this PR directly.

Need to update docs?

  • doc-required
    (Your PR needs to update docs and you will update later)

  • doc-not-needed
    (Please explain why)

  • doc
    (Your PR contains doc changes)

  • doc-complete
    (Docs have been already added)

@poorbarcode
Copy link
Contributor Author

@315157973 @codelipenghui @rdhabalia Could you take a look.

@poorbarcode poorbarcode changed the title [improve][broker] ManagedCursor: mark delete fast fail when create meta-ledger fail [fix][broker] ManagedCursor: mark delete timeout when create meta-ledger fail Jul 28, 2022
@poorbarcode poorbarcode changed the title [fix][broker] ManagedCursor: mark delete timeout when create meta-ledger fail [fix][broker] ManagedCursor: mark delete no callback when create meta-ledger fail Jul 28, 2022
@Technoboy- Technoboy- added this to the 2.11.0 milestone Jul 28, 2022
@codelipenghui
Copy link
Contributor

step-5 ( High light ): If the meta ledger needs to be created, create ledger will be triggered first and the current request will be put into the pending requests queue. It is possible that step 4 has been completed before the request is put into the queue, so this request will not get the callback anymore.

We have synchronized (pendingMarkDeleteOps) for both create cursor failure and internalAsyncMarkDelete, if triggered the ledger creation means the left thread got the lock and the right thread can only get the lock after the mdEntry is added to the pendingMarkDeleteOps , no?

@poorbarcode
Copy link
Contributor Author

Hi @codelipenghui

We have synchronized (pendingMarkDeleteOps) for both create cursor failure and internalAsyncMarkDelete, if triggered the ledger creation means the left thread got the lock and the right thread can only get the lock after the mdEntry is added to the pendingMarkDeleteOps , no?

According to your tips, I found a new problem: see Flow-2 above. And I have modified the Motivation.

@codelipenghui @Technoboy- @mattisonchao Could you review this PR again?

@Technoboy- Technoboy- added type/bug The PR fixed a bug or issue reported a bug doc-not-needed Your PR changes do not impact docs release/2.7.6 release/2.8.4 release/2.10.2 release/2.9.4 labels Jul 28, 2022
Copy link
Contributor

@codelipenghui codelipenghui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

@keyboardbobo
Copy link

I also encountered a similar scenario here. The pendingMarkDeleteOps has accumulated tens of thousands of requests, and the cursor status has always been CreatingLedger, and the retention cannot be executed, resulting in the situation that the disk space cannot be reclaimed. I don't know if it's the same problem?

@poorbarcode poorbarcode force-pushed the fix/flaky/markDeleteConcurrentWithCreateLedger branch from 5d22abb to 9b86fb7 Compare July 28, 2022 13:47
@poorbarcode
Copy link
Contributor Author

poorbarcode commented Jul 28, 2022

@keyboardbobo

and the cursor status has always been CreatingLedger

In ManagedCursorImpl, there has not any stat names creatingLedger, I think you perhaps say ManagedCursorImpl.SwitchingLedger? Otherwise, maybe you say ManagedLedgerImpl.CreatingLedger?

I also encountered a similar scenario here. The pendingMarkDeleteOps has accumulated tens of thousands of requests, and the cursor status has always been CreatingLedger, and the retention cannot be executed, resulting in the situation that the disk space cannot be reclaimed. I don't know if it's the same problem?

  • If you want to say ManagedCursorImpl.SwitchingLedger. I am afraid not. The stat of ManagedCursor will change like this(even if create ledger failure, it will not stay at SwitchingLedger), and I'll keep checking the code to see if there's anything else:
NoLedger --> SwitchingLedger --> open ledger failure --> NoLedger --> SwitchingLedger .... loop
  • If you want to say ManagedLedgerImpl.CreatingLedger. Could you provide more information? It is best to submit an issue.

Copy link
Member

@mattisonchao mattisonchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

@poorbarcode
Copy link
Contributor Author

/pulsarbot rerun-failure-checks

@keyboardbobo
Copy link

keyboardbobo commented Jul 29, 2022

FB2D4A6C-5887-4FC4-932F-5E896F08B2F7
@poorbarcode
This is the result of the dump after the problem. As for what caused it, I am still analyzing it. If you have any ideas, please remind me

@poorbarcode
Copy link
Contributor Author

/pulsarbot rerun-failure-checks

@poorbarcode
Copy link
Contributor Author

Hi @keyboardbobo

If you have any ideas, please remind me

Thank you for the information,I have open a issue #16859, We can discuss there.

@Technoboy- Technoboy- merged commit 5faac76 into apache:master Jul 29, 2022
Jason918 pushed a commit that referenced this pull request Jul 29, 2022
@Jason918 Jason918 added cherry-picked/branch-2.7 Archived: 2.7 is end of life release/2.7.5 and removed release/2.7.6 labels Jul 29, 2022
BewareMyPower pushed a commit that referenced this pull request Aug 1, 2022
@BewareMyPower BewareMyPower added the cherry-picked/branch-2.8 Archived: 2.8 is end of life label Aug 1, 2022
Gleiphir2769 pushed a commit to Gleiphir2769/pulsar that referenced this pull request Aug 4, 2022
@mattisonchao mattisonchao added the cherry-picked/branch-2.9 Archived: 2.9 is end of life label Aug 4, 2022
mattisonchao pushed a commit that referenced this pull request Aug 4, 2022
codelipenghui pushed a commit that referenced this pull request Aug 8, 2022
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Aug 16, 2022
…-ledger fail (apache#16841)

(cherry picked from commit 5faac76)
(cherry picked from commit 86ceb3f)
@poorbarcode poorbarcode deleted the fix/flaky/markDeleteConcurrentWithCreateLedger branch September 17, 2022 02:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-picked/branch-2.7 Archived: 2.7 is end of life cherry-picked/branch-2.8 Archived: 2.8 is end of life cherry-picked/branch-2.9 Archived: 2.9 is end of life cherry-picked/branch-2.10 doc-not-needed Your PR changes do not impact docs release/2.7.5 release/2.8.4 release/2.9.4 release/2.10.2 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky-test: ManagedCursorTest.markDeleteWithZKErrors
8 participants