fix: Handling grpc cancellation edge-case:: Cancelling at step START #7325

oandreeva-nv · 2024-06-05T18:49:37Z

What does the PR do?

This PR fixes an issue where the gRPC ModelInferHandler stops accepting requests after a number of cancellations have been received.

Root cause and preconditions:
At some point a cancellation notification is received at step START. When this happens, execution never gets past this block (the if (state->IsGrpcContextCancelled()) branch in ModelInferHandler::Process).

Since we never get past it, we don't create a new state for future incoming requests, i.e. we never call StartNewRequest.

Thus the completion queue eventually becomes exhausted: gRPC requests arrive, but nothing on Triton's side accepts them.

The introduced changes make sure we create a new request handler in those situations.
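The exhaustion described above can be illustrated with a toy model. This is only a sketch, not Triton's code: HandlerPool, deliver, accepted_requests, and the replenish_on_cancel_at_start flag are hypothetical names, and it assumes that each waiting handler state accepts exactly one request and has to run a StartNewRequest-like step to register its replacement.

```python
from dataclasses import dataclass


@dataclass
class HandlerPool:
    """Toy stand-in for handler states waiting to accept a request."""

    replenish_on_cancel_at_start: bool  # False ~ pre-fix, True ~ post-fix
    waiting: int = 2  # assumption: two handler states exist at startup

    def start_new_request(self) -> None:
        # Plays the role of StartNewRequest: register one more waiting state.
        self.waiting += 1

    def deliver(self, cancelled_at_start: bool) -> bool:
        """Offer one request; return False if no state is left to accept it."""
        if self.waiting == 0:
            return False
        self.waiting -= 1  # one waiting state picks up the request
        if not cancelled_at_start or self.replenish_on_cancel_at_start:
            self.start_new_request()
        return True


def accepted_requests(pool: HandlerPool, total: int) -> int:
    """Send `total` requests, all cancelled at START; count how many are accepted."""
    return sum(pool.deliver(cancelled_at_start=True) for _ in range(total))


pre_fix = accepted_requests(HandlerPool(replenish_on_cancel_at_start=False), 10)
post_fix = accepted_requests(HandlerPool(replenish_on_cancel_at_start=True), 10)
print(pre_fix, post_fix)  # 2 10
```

In this model the pre-fix pool accepts only as many requests as it had states at startup, while the post-fix pool keeps accepting.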
Added test logic:

I start the server, send a large number of inference requests, and cancel them right away. Before the fix, the clear sign that no ModelInferHandler is left for incoming requests is that the server stops logging "New request handler for ModelInfer", i.e. the output of
grep -c "New request handler for ModelInfer" stops changing. In all pre-fix scenarios this happens after
"Cancellation notification received for ModelInferHandler, rpc_ok=1, context 0, [0-9]* step START" has been logged 4 times, 2 times per request. Since we start 2 ModelInferHandler threads initially, that makes sense: no new infer handlers are created to handle incoming requests, and Triton just keeps processing what it already has.

After the fix, StartNewRequest is called properly and the count of "New request handler for ModelInfer" keeps growing, as does the count of "Cancellation notification received for ModelInferHandler, rpc_ok=1, context 0, [0-9]* step START".
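The log check described above could be expressed roughly as follows. This is a sketch and not the test added in this PR: count_markers and handler_count_stalled are hypothetical helpers, and the sample log is synthetic, built only from the two messages quoted above. Counting is done per line to mirror grep -c.

```python
import re

NEW_HANDLER = "New request handler for ModelInfer"
CANCEL_AT_START = re.compile(
    r"Cancellation notification received for ModelInferHandler, "
    r"rpc_ok=1, context 0, [0-9]* step START"
)


def count_markers(log_text: str) -> tuple:
    """Return (new-handler lines, cancellation-at-START lines) in a server log."""
    lines = log_text.splitlines()
    new_handlers = sum(NEW_HANDLER in line for line in lines)
    cancellations = sum(bool(CANCEL_AT_START.search(line)) for line in lines)
    return new_handlers, cancellations


def handler_count_stalled(earlier_log: str, later_log: str) -> bool:
    """True if no new handler line appeared between two log snapshots."""
    return count_markers(later_log)[0] == count_markers(earlier_log)[0]


# Synthetic sample, not real server output.
sample = "\n".join(
    [
        "New request handler for ModelInfer",
        "Cancellation notification received for ModelInferHandler, "
        "rpc_ok=1, context 0, 1 step START",
        "Cancellation notification received for ModelInferHandler, "
        "rpc_ok=1, context 0, 1 step START",
        "New request handler for ModelInfer",
    ]
)
print(count_markers(sample))  # (2, 2)
```

Taking one snapshot while requests are being sent and another later, a stalled handler count corresponds to the pre-fix symptom and a growing count to the post-fix behavior.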

Checklist

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

Related PRs:

Where should the reviewer start?

Test plan:

CI Pipeline ID:
pre-fix: 15596577
post-fix: 15596692

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

nnshah1 · 2024-06-05T19:02:23Z

src/grpc/infer_handler.cc

@@ -694,6 +694,16 @@ ModelInferHandler::Process(InferHandler::State* state, bool rpc_ok)
  // Handle notification for cancellation which can be raised
  // asynchronously if detected on the network.
  if (state->IsGrpcContextCancelled()) {
+    if (rpc_ok && (state->step_ == Steps::START) &&
+        (state->context_->step_ != Steps::CANCELLED)) {


Question: what does the second clause here, state->context_->step_ != Steps::CANCELLED, imply?

It is there to avoid calling StartNewRequest twice. On the first pass we fall into HandleCancellation and go through this block, which returns true for resume, so we enter the if (state->IsGrpcContextCancelled()) branch a second time, but this time state->context_->step_ is CANCELLED.
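The effect of the second clause can be sketched with a small model of the two passes. This is a simplification under stated assumptions, not the actual Process or HandleCancellation code: the Python names are hypothetical, handle_cancellation is assumed to mark the context CANCELLED and request a resume on its first call only, and state.step is assumed to stay at START between the two passes.

```python
from enum import Enum, auto


class Step(Enum):
    START = auto()
    CANCELLED = auto()


class Context:
    def __init__(self) -> None:
        self.step = Step.START

    def handle_cancellation(self) -> bool:
        # Assumption: first call marks the context CANCELLED and asks to resume.
        if self.step != Step.CANCELLED:
            self.step = Step.CANCELLED
            return True
        return False


class State:
    def __init__(self) -> None:
        self.step = Step.START
        self.context = Context()


def process_cancelled(state: State, rpc_ok: bool, new_requests: list, guard: bool) -> bool:
    """Model of the cancellation branch only; returns the resume flag."""
    not_yet_cancelled = state.context.step != Step.CANCELLED
    if rpc_ok and state.step == Step.START and (not_yet_cancelled or not guard):
        new_requests.append("StartNewRequest")
    return state.context.handle_cancellation()


def run_until_done(guard: bool) -> int:
    """Re-enter the branch while resume is requested; count StartNewRequest calls."""
    state, new_requests = State(), []
    while process_cancelled(state, True, new_requests, guard):
        pass
    return len(new_requests)


print(run_until_done(guard=True), run_until_done(guard=False))  # 1 2
```

With the guard, the replacement handler is created once; without it, the second pass would create another one.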

Late to the game, but what is the reasoning for not moving the original "StartNewRequest() if at START" to before the cancellation handling? Although I think other code would need to be moved around as well.

I am not 100% aware of all the underlying processes, meaning the possible state->step_ and state->context_->step_ combinations. This change addresses the bug with the known symptoms. Refactoring the Process logic needs proper time and testing, IMHO.

@kthui thoughts? If feasible, this can be done as a follow-up and by someone else. I want to check whether there is room for improvement.

Yes, I think there is definitely room for improvement/refactoring, i.e. I think the if (shutdown) { ... } could also be moved into the if (state->step_ == Steps::START) { ... } else ... block, so all procedures for Steps::START would be inside the if (state->step_ == Steps::START) { ... } block, but it can be done as a follow-up later.

Jira ticket: DLIS-6831

nnshah1

LGTM - had question on the condition

qa/L0_request_cancellation/grpc_cancellation_test.py

tanmayv25 · 2024-06-05T19:09:05Z

I think we would need similar fix for ModelStreamInferHandler.

qa/L0_request_cancellation/test.sh

oandreeva-nv added 3 commits June 5, 2024 11:04

grpc side fixes

5237c24

Tests

e82d5c7

Clean up

6857283

oandreeva-nv added the labels module: server (Issues related to the server core) and PR: fix (A bug fix) Jun 5, 2024

oandreeva-nv requested review from kthui, tanmayv25 and nnshah1 and removed request for kthui June 5, 2024 18:49

nnshah1 reviewed Jun 5, 2024

nnshah1 previously approved these changes Jun 5, 2024

tanmayv25 reviewed Jun 5, 2024

qa/L0_request_cancellation/grpc_cancellation_test.py (outdated, resolved)

oandreeva-nv added 3 commits June 5, 2024 14:47

Was validating stream infer

cd34d14

Added delay for ModelInferHandler::Process for debug

707b3d9

Test refactor

763070d

oandreeva-nv dismissed nnshah1’s stale review via 763070d June 6, 2024 00:08

oandreeva-nv added 2 commits June 5, 2024 17:18

Clean up

f0acc1a

Clean up

2b14468

oandreeva-nv requested review from tanmayv25 and nnshah1 June 6, 2024 00:24

kthui reviewed Jun 6, 2024

qa/L0_request_cancellation/test.sh (outdated, resolved)

Converted tests to python

222c9a9

oandreeva-nv requested review from kthui and GuanLuo June 6, 2024 18:14

GuanLuo approved these changes Jun 6, 2024

kthui approved these changes Jun 6, 2024

tanmayv25 approved these changes Jun 6, 2024

oandreeva-nv merged commit 42742a3 into main Jun 6, 2024
3 checks passed

oandreeva-nv deleted the oandreeva_grpc_fix branch June 6, 2024 22:23

krishung5 pushed a commit that referenced this pull request Jun 11, 2024

fix: Handling grpc cancellation edge-case:: Cancelling at step START (#7325)

8c2d359

rmccorm4 mentioned this pull request Jul 11, 2024

gRPC Segfaults in Triton 24.08 due to Low Request Cancellation Timeout #7368

Closed


oandreeva-nv commented Jun 5, 2024

nnshah1 Jun 5, 2024

oandreeva-nv Jun 5, 2024

GuanLuo Jun 6, 2024

oandreeva-nv Jun 6, 2024

GuanLuo Jun 6, 2024

kthui Jun 6, 2024

oandreeva-nv Jun 6, 2024

nnshah1 left a comment

tanmayv25 commented Jun 5, 2024
