Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

executor: kill tableReader for each connection correctly #18277

Merged
merged 10 commits into from
Jul 9, 2020

Conversation

fzhedu
Copy link
Contributor

@fzhedu fzhedu commented Jun 30, 2020

What problem does this PR solve?

Issue Number: close a part of #18057

Problem Summary:
The tableReader in the coprocessor may retry many times to read regions, druing which, it cannot be aware of the cancelation from client. It may take much of memory for a long time.

What is changed and how it works?

Proposal: detect Killed to stop the goroutine.

What's Changed: should not use CompareAndSwap to change the Killed before Close(), because many runninig goroutines should detect the Killed.

How it Works: the TableReader the detect the killed in every loop.
The query may not exit quickly, because the query do some close work.

Check List

Tests

  • Integration test
  • Manual test (add detailed scripts or steps below)
    test on mocktikv:
step1. explain analyze select * from lineitem;
step2. kill tidb sessionID on another client;
to see the connection is cancelled.

similarly test on TiFlash is ok.

Release note

  • To speed up canceling a query

@fzhedu fzhedu requested a review from a team as a code owner June 30, 2020 07:54
@fzhedu fzhedu requested review from XuHuaiyu and removed request for a team June 30, 2020 07:54
@qw4990 qw4990 self-requested a review June 30, 2020 07:55
@github-actions github-actions bot added the sig/execution SIG execution label Jun 30, 2020
@fzhedu fzhedu force-pushed the CloseTableReaderSlow branch from 3e3bf31 to ef97b2c Compare July 1, 2020 05:18
Comment on lines -535 to +538
atomic.CompareAndSwapUint32(&sessVars.Killed, 0, 1)
atomic.StoreUint32(&sessVars.Killed, 1)
Copy link
Contributor

@qw4990 qw4990 Jul 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some tricky reason that we use CAS instead of Store here at the begging? Please help us confirm this. @tiancaiamao

Copy link
Contributor

@tiancaiamao tiancaiamao Jul 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exception handling is a hard.

If we use CAS, exception handling is the same as error handling:

  • one of the goroutine detect kill (CAS), it return an error
  • the error is throw to the main loop
  • the main loop handle worker's Close

This processing is exactly the same as the error handling, when one of the worker get an error.

Exception handling is much difficult than error handling.

  • The the worker and main loop does not know when the exception happen
  • Every worker has to handle the exception
  • The exception handle is tricky, should the worker kill itself or report error to the main loop and let it kill worker? Will the Close message repeat and Close multiple times?

@fzhedu fzhedu changed the title executor: add cancel fun in tableReader for each connection to kill it quickly executor: kill tableReader for each connection correctly Jul 1, 2020
@@ -217,7 +221,18 @@ func (s *RegionRequestSender) SendReqCtx(
if err != nil {
return nil, nil, errors.Trace(err)
}

// recheck whether the session/query is killed during the Next()
if bo.vars != nil && bo.vars.Killed != nil && atomic.LoadUint32(bo.vars.Killed) == 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the only change necessary is to check CAS(killed) here?
And another question is, why backoff does not check it.

@tiancaiamao
Copy link
Contributor

I've find the real cause.

@tiancaiamao
Copy link
Contributor

The problem is that current implementation do not really support "cancel" something.
The cancel is implemented by by caller to set the Killed flag and let the callee to inspect it.
If the callee is block on network sending, it has no change to detect Killed.

The callee may block on the network for a long time, because in the rpc error scenario,
send request timeout is not accumulated into the backoff duration.

https://github.com/pingcap/tidb/pull/17135/files#diff-deb6c7bcc2ba593559c11d2e92f8d8dcR56

I do not have any better ideas to fix those issues eithor, so this commit LGTM

@tiancaiamao tiancaiamao added the status/LGT1 Indicates that a PR has LGTM 1. label Jul 6, 2020
@fzhedu
Copy link
Contributor Author

fzhedu commented Jul 6, 2020

/run-unit-test

@fzhedu fzhedu force-pushed the CloseTableReaderSlow branch from ab6e0a4 to bfbdf42 Compare July 6, 2020 13:49
Copy link
Contributor

@qw4990 qw4990 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-srebot
Copy link
Contributor

@qw4990,Thanks for your review.

@qw4990
Copy link
Contributor

qw4990 commented Jul 7, 2020

/merge

@ti-srebot
Copy link
Contributor

Sorry @qw4990, you don't have permission to trigger auto merge event on this branch. The number of LGTM for this PR is 1 while it needs 2 at least

@qw4990 qw4990 added the status/LGT2 Indicates that a PR has LGTM 2. label Jul 7, 2020
@SunRunAway
Copy link
Contributor

/lgtm
/merge

@SunRunAway
Copy link
Contributor

/merge

@fzhedu
Copy link
Contributor Author

fzhedu commented Jul 9, 2020

/label type/3.0-cherry-pick

@fzhedu
Copy link
Contributor Author

fzhedu commented Jul 9, 2020

/run-cherry-picker

@fzhedu
Copy link
Contributor Author

fzhedu commented Jul 13, 2020

/label needs-cherry-pick-4.0

@fzhedu
Copy link
Contributor Author

fzhedu commented Jul 13, 2020

/label needs-cherry-pick-3.0

ti-srebot pushed a commit to ti-srebot/tidb that referenced this pull request Jul 13, 2020
@ti-srebot
Copy link
Contributor

cherry pick to release-4.0 in PR #18505

@fzhedu
Copy link
Contributor Author

fzhedu commented Jul 13, 2020

/label needs-cherry-pick-3.1

ti-srebot pushed a commit to ti-srebot/tidb that referenced this pull request Jul 13, 2020
@ti-srebot
Copy link
Contributor

cherry pick to release-3.0 in PR #18506

ti-srebot pushed a commit to ti-srebot/tidb that referenced this pull request Jul 13, 2020
@ti-srebot
Copy link
Contributor

cherry pick to release-3.1 in PR #18507

Comment on lines +403 to +405
type vars struct {
killed *uint32
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines are unnecessary, so how about removing them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines are unnecessary, so how about removing them?

ok, I will remove in my next PR.

@fzhedu
Copy link
Contributor Author

fzhedu commented Jul 15, 2020

/unlabel type/4.0-cherry-pick

@fzhedu
Copy link
Contributor Author

fzhedu commented Jul 15, 2020

/unlabel type/3.0-cherry-pick

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sig/execution SIG execution status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants