fix: Handling grpc cancellation edge-case:: Cancelling at step START #7325
@@ -694,6 +694,16 @@ ModelInferHandler::Process(InferHandler::State* state, bool rpc_ok)
// Handle notification for cancellation which can be raised
// asynchronously if detected on the network.
if (state->IsGrpcContextCancelled()) {
if (rpc_ok && (state->step_ == Steps::START) &&
(state->context_->step_ != Steps::CANCELLED)) {
question: what does the second clause here imply? `!= Steps::CANCELLED`?
To avoid calling `StartNewRequest` twice. At first we fall into `HandleCancellation` and go through this block, which returns `true` for `resume`, so we enter the `if (state->IsGrpcContextCancelled())` block a second time, but this time `state->context_->step_` is `CANCELLED`.
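The two-pass behavior described in this reply can be illustrated with a toy model. This is a minimal sketch, not Triton's code: `ToyContext`, `ToyState`, `ToyHandler`, and `CountNewRequestsAfterCancelAtStart` are hypothetical names, and the behavior of `HandleCancellation` is assumed from the comment above (first pass marks the context as cancelled and asks to be resumed).

```cpp
#include <cassert>

// Toy model only: hypothetical stand-ins, not Triton's actual classes.
enum class Step { START, CANCELLED };

struct ToyContext {
  Step step = Step::START;
};

struct ToyState {
  Step step = Step::START;      // per-state step
  ToyContext context;           // per-context step
  bool grpc_cancelled = false;  // stands in for IsGrpcContextCancelled()
};

struct ToyHandler {
  int new_requests_started = 0;

  void StartNewRequest() { ++new_requests_started; }

  // Assumed behavior: the first call marks the context CANCELLED and
  // returns true ("resume"), so Process is entered a second time.
  bool HandleCancellation(ToyState& state) {
    if (state.context.step != Step::CANCELLED) {
      state.context.step = Step::CANCELLED;
      return true;
    }
    return false;
  }

  // Returns true when the same state should be processed again.
  bool Process(ToyState& state, bool rpc_ok) {
    if (state.grpc_cancelled) {
      // The guard from the diff: only the first pass starts a new request.
      if (rpc_ok && state.step == Step::START &&
          state.context.step != Step::CANCELLED) {
        StartNewRequest();
      }
      return HandleCancellation(state);
    }
    return false;
  }
};

// Drives one cancelled-at-START state until it no longer asks to resume.
inline int CountNewRequestsAfterCancelAtStart() {
  ToyHandler handler;
  ToyState state;
  state.grpc_cancelled = true;
  while (handler.Process(state, /*rpc_ok=*/true)) {
  }
  assert(state.context.step == Step::CANCELLED);
  return handler.new_requests_started;
}
```

In this model, dropping the `state.context.step != Step::CANCELLED` clause would make the second pass call `StartNewRequest` again, which is the double call the guard is meant to prevent.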
Late to the game, but what is the reasoning for not moving the original "`StartNewRequest()` if at START" to before handling the cancellation? Although I think other code needs to be moved around as well.
I am not 100% aware of all underlying processes, meaning the `state->step_` and `state->context_->step_` combinations. This change addresses the bug with known symptoms. Refactoring the `Process` logic needs proper time and testing IMHO.
@kthui thoughts? If feasible, this can be done as a follow-up and by someone else. I want to check whether there is room for improvement.
Yes, I think there is definitely room for improvement/refactoring, i.e. I think the `if (shutdown) { ... }` could also be moved into the `if (state->step_ == Steps::START) { ... } else ...` block, so all procedures for `Steps::START` would be inside the `if (state->step_ == Steps::START) { ... }` block, but it can be done as a follow-up later.
Jira ticket: DLIS-6831
LGTM - had question on the condition
I think we would need a similar fix for ModelStreamInferHandler.
What does the PR do?
This PR fixes an issue where the gRPC ModelInferHandler stops accepting requests after a number of cancellations are received.
Root cause and preconditions:
At some point a cancellation notification is received at step `START`. When this happens, we never skip this block. Since we never skip it, we don't create a new state for future incoming requests, i.e. we don't call `StartNewRequest`. Thus, the completion queue becomes exhausted at some point: gRPC requests arrive, but nothing accepts them on Triton's side.
The introduced changes make sure we create a new request handler in those situations.
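The exhaustion described above can be pictured with a small counting model. This is only a sketch under simplifying assumptions: `PendingHandlersAfterCancellations` is a hypothetical helper, `pending` stands for handler states waiting to accept a request, and each cancellation at `START` is assumed to consume exactly one of them.

```cpp
#include <cassert>

// Hypothetical model, not Triton code. 'pending' counts handler states that
// are still waiting to accept an incoming request.
inline int PendingHandlersAfterCancellations(int initial, int cancellations,
                                             bool replenish) {
  assert(initial >= 0 && cancellations >= 0);
  int pending = initial;
  for (int i = 0; i < cancellations && pending > 0; ++i) {
    --pending;  // the waiting state is consumed by the cancelled request
    if (replenish) {
      ++pending;  // post-fix: StartNewRequest() registers a replacement
    }
  }
  return pending;
}
```

With 2 initial handlers, as in the test description, `PendingHandlersAfterCancellations(2, 2, false)` leaves nothing to accept new requests, while the replenishing variant keeps both slots available regardless of how many cancellations arrive.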
Added test logic:
I start the server, send a large number of inference requests, and cancel them right away. Pre-fix, the clear indication that there are no `ModelInferHandler`s left for any incoming request is that the server stops logging `"New request handler for ModelInfer"`, i.e. `grep -c "New request handler for ModelInfer"` doesn't change. In all pre-fix scenarios this happens after `"Cancellation notification received for ModelInferHandler, rpc_ok=1, context 0, [0-9]* step START"` was logged 4 times, 2 times per request. Since we start 2 `ModelInferHandler` threads initially, that makes sense: no new infer handlers were created to handle incoming requests, and Triton just keeps processing what it already has.
After the fix, `StartNewRequest` is called properly and the count of `"New request handler for ModelInfer"` keeps growing, as does the count of `"Cancellation notification received for ModelInferHandler, rpc_ok=1, context 0, [0-9]* step START"`.
Checklist
<commit_type>: <Title>
Commit Type:
Check the conventional commit type box here and add the label to the GitHub PR.
Related PRs:
Where should the reviewer start?
Test plan:
pre-fix: 15596577
post-fix: 15596692
Caveats:
Background
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)