executor: kill tableReader for each connection correctly #18277
Conversation
force-pushed from 3e3bf31 to ef97b2c
-	atomic.CompareAndSwapUint32(&sessVars.Killed, 0, 1)
+	atomic.StoreUint32(&sessVars.Killed, 1)
Is there some tricky reason that we use CAS instead of Store here at the beginning? Please help us confirm this. @tiancaiamao
Exception handling is hard.
If we use CAS, exception handling is the same as error handling (a minimal sketch of this flow follows below):
- one of the goroutines detects the kill (via CAS) and returns an error
- the error is thrown to the main loop
- the main loop handles the workers' Close
This processing is exactly the same as error handling when one of the workers gets an error.
Exception handling is much more difficult than error handling:
- The worker and the main loop do not know when the exception happens
- Every worker has to handle the exception
- The exception handling is tricky: should the worker kill itself, or report an error to the main loop and let it kill the worker? Will the Close message repeat and Close run multiple times?
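A minimal, illustrative Go sketch of the CAS-based flow described in the list above, assuming a simplified worker/main-loop setup (the names `worker`, `resultCh`, and `errQueryKilled` are hypothetical, not TiDB's actual executor code):

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// errQueryKilled is an illustrative error value, not TiDB's real one.
var errQueryKilled = errors.New("query is killed")

// worker simulates one executor goroutine. In the CAS style, only the
// goroutine that wins CompareAndSwap(killed, 1, 0) reports the kill; it does
// so by returning an ordinary error on the result channel.
func worker(killed *uint32, resultCh chan<- error, done <-chan struct{}) {
	for {
		select {
		case <-done: // the main loop is closing the workers
			return
		default:
		}
		if atomic.CompareAndSwapUint32(killed, 1, 0) {
			resultCh <- errQueryKilled // the "exception" becomes a normal error
			return
		}
		time.Sleep(10 * time.Millisecond) // pretend to do one unit of work
	}
}

func main() {
	var killed uint32
	resultCh := make(chan error, 4)
	done := make(chan struct{})

	for i := 0; i < 4; i++ {
		go worker(&killed, resultCh, done)
	}

	// Some other goroutine (e.g. a KILL statement handler) sets the flag.
	time.AfterFunc(50*time.Millisecond, func() { atomic.StoreUint32(&killed, 1) })

	// Main loop: the kill is handled exactly like any worker error --
	// receive it once, then close every worker.
	err := <-resultCh
	close(done)
	fmt.Println("main loop got:", err)
}
```

With this style the kill is funneled through the existing error path, so the main loop's Close logic runs exactly once.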
@@ -217,7 +221,18 @@ func (s *RegionRequestSender) SendReqCtx(
 	if err != nil {
 		return nil, nil, errors.Trace(err)
 	}

+	// recheck whether the session/query is killed during the Next()
+	if bo.vars != nil && bo.vars.Killed != nil && atomic.LoadUint32(bo.vars.Killed) == 1 {
Maybe the only change necessary is to check CAS(killed) here?
And another question: why does the backoff not check it?
I've found the real cause.
The problem is that the current implementation does not really support cancelling something. The callee may block on the network for a long time in the RPC error scenario; see https://github.com/pingcap/tidb/pull/17135/files#diff-deb6c7bcc2ba593559c11d2e92f8d8dcR56. I do not have any better ideas to fix those issues either, so this commit LGTM.
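A small sketch of that limitation, assuming a hypothetical blocking call named `slowRPC` (not TiDB's real client code): a flag check between calls cannot interrupt a call that is already blocked on the network, so the kill only takes effect once the in-flight call returns.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// slowRPC is a hypothetical stand-in for a network call that may block for a
// long time; nothing inside it can observe the Killed flag while it blocks.
func slowRPC() {
	time.Sleep(2 * time.Second)
}

func main() {
	var killed uint32

	go func() {
		time.Sleep(100 * time.Millisecond)
		atomic.StoreUint32(&killed, 1) // the kill arrives while slowRPC is still blocked
	}()

	start := time.Now()
	for {
		// The flag is only rechecked between calls, so the kill takes effect
		// no sooner than the in-flight call returns.
		if atomic.LoadUint32(&killed) == 1 {
			fmt.Printf("kill observed after %v\n", time.Since(start))
			return
		}
		slowRPC()
	}
}
```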
/run-unit-test
… should detect kill == 1
force-pushed from ab6e0a4 to bfbdf42
LGTM
@qw4990, thanks for your review.
/merge
Sorry @qw4990, you don't have permission to trigger the auto merge event on this branch. The number of
/lgtm
/merge
/label type/3.0-cherry-pick
/run-cherry-picker
/label needs-cherry-pick-4.0
/label needs-cherry-pick-3.0
Signed-off-by: ti-srebot <[email protected]>
cherry pick to release-4.0 in PR #18505
/label needs-cherry-pick-3.1
Signed-off-by: ti-srebot <[email protected]>
cherry pick to release-3.0 in PR #18506
Signed-off-by: ti-srebot <[email protected]>
cherry pick to release-3.1 in PR #18507
type vars struct {
	killed *uint32
}
These lines are unnecessary, so how about removing them?
OK, I will remove them in my next PR.
/unlabel type/4.0-cherry-pick
/unlabel type/3.0-cherry-pick
…8506) Signed-off-by: ti-srebot <[email protected]>
What problem does this PR solve?
Issue Number: close a part of #18057
Problem Summary:
The tableReader in the coprocessor may retry many times to read regions, during which it cannot be aware of the cancellation from the client. It may hold a lot of memory for a long time.
What is changed and how it works?
Proposal: detect Killed to stop the goroutine.
What's Changed: should not use CompareAndSwap to change Killed before Close(), because many running goroutines should detect Killed.
How it Works: the TableReader detects killed in every loop. The query may not exit quickly, because the query does some close work.
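A hedged sketch of the pattern described above, with hypothetical names (`kill` and `tableReaderLoop` are illustrative, not the actual TiDB functions): the killer stores the flag unconditionally, and every running reader goroutine reloads it on each loop iteration, so all of them can observe it and stop themselves.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// kill stores the flag instead of CompareAndSwap-ing it, so it stays set and
// every running goroutine can still observe it afterwards.
func kill(killed *uint32) {
	atomic.StoreUint32(killed, 1)
}

// tableReaderLoop is an illustrative stand-in for a coprocessor read loop:
// it rechecks the flag on every iteration rather than only once.
func tableReaderLoop(id int, killed *uint32, wg *sync.WaitGroup) {
	defer wg.Done()
	for i := 0; ; i++ {
		if atomic.LoadUint32(killed) == 1 {
			fmt.Printf("reader %d: detected kill at iteration %d, stopping\n", id, i)
			return
		}
		time.Sleep(5 * time.Millisecond) // pretend to fetch one more region
	}
}

func main() {
	var killed uint32
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go tableReaderLoop(i, &killed, &wg)
	}
	time.Sleep(20 * time.Millisecond)
	kill(&killed)
	wg.Wait() // every reader exits on its own once it observes the flag
}
```

Compared with a CompareAndSwap that resets the value, the plain Store leaves the flag set, which is why multiple goroutines can all detect it.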
Check List
Tests
test on mocktikv:
A similar test on TiFlash is also OK.
Release note