
Fix the pending_request_count to release the failed requests #286

Merged: 3 commits into main, Nov 8, 2023

Conversation

tanmayv25 (Contributor)

Before

After sending 136 requests that will fail in Enqueue:

curl -s localhost:8002/metrics | grep pending
# HELP nv_inference_pending_request_count Instantaneous number of pending requests awaiting execution per-model.
# TYPE nv_inference_pending_request_count gauge
nv_inference_pending_request_count{model="test_model",version="1"} 136

After the fix

After sending 136 requests that will fail in Enqueue:

curl -s localhost:8002/metrics | grep pending
# HELP nv_inference_pending_request_count Instantaneous number of pending requests awaiting execution per-model.
# TYPE nv_inference_pending_request_count gauge
nv_inference_pending_request_count{model="test_model",version="1"} 0

I couldn't find any tests for the nv_inference_pending_request_count metric. We need to add testing of this behavior to avoid regressions. Will open another ticket to enhance the testing.
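A regression check along these lines could be added to the QA suite. This is a hypothetical sketch, not the actual test: a real test would scrape a live server (as with `curl -s localhost:8002/metrics` above); here the exposition text is inlined, and `parse_pending_count` is an assumed helper name.

```python
# Hypothetical sketch of a regression check for the pending-count gauge.
# A real QA test would scrape a running server's /metrics endpoint;
# here the Prometheus exposition text is inlined for illustration.

def parse_pending_count(metrics_text, model, version):
    """Return the nv_inference_pending_request_count gauge value for a model."""
    target = (
        f'nv_inference_pending_request_count{{model="{model}",version="{version}"}}'
    )
    for line in metrics_text.splitlines():
        if line.startswith(target):
            return float(line.split()[-1])
    raise ValueError(f"metric not found for {model} v{version}")


metrics = """\
# HELP nv_inference_pending_request_count Instantaneous number of pending requests awaiting execution per-model.
# TYPE nv_inference_pending_request_count gauge
nv_inference_pending_request_count{model="test_model",version="1"} 0
"""

# After the fix, requests that fail in Enqueue should leave the gauge at zero.
assert parse_pending_count(metrics, "test_model", "1") == 0
```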

auto status = request->model_raw_->Enqueue(request);
if (!status.IsOk()) {
  LOG_STATUS_ERROR(
      request->SetState(InferenceRequest::State::RELEASED),
      "..." /* message elided in this excerpt */);
}
@rmccorm4 (Contributor) commented Nov 4, 2023

Should the request release callback be called internally on failure? Calling request release should set the state and update accordingly. That said, I see in our contract that we don't take ownership of the request if InferAsync -> Run -> Enqueue fails, so it's expected that the user would release the request.

@rmccorm4 (Contributor) commented Nov 7, 2023

As mentioned offline, I don't love the idea of setting the state to RELEASED when the request hasn't actually had its release callbacks called. The state should represent that the request has actually been released. Is there another way?

tanmayv25 (Contributor, Author)

Ideally, the request should be set to the pending state iff the Enqueue call was a success. However, there is an asynchronous interaction between the enqueue thread and the backend threads that schedule the requests and move them out of the queue.
We should not invoke callbacks for the failed requests, as failed requests are owned by the caller.

tanmayv25 (Contributor, Author)

Added a new state for tracking the failed requests. Discussed offline that the logic might need some clean-up.
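The behavior under discussion can be modeled as a toy state machine. This is a sketch, not actual Triton code: the `State` names, `Model`, and `Request` classes are illustrative stand-ins for the discussion above, where the pending gauge is rolled back on a failed Enqueue without invoking the caller-owned release callback.

```python
# Toy model (not actual Triton code) of the fix discussed above: the
# pending-request gauge is incremented when a request enters the queue
# and is decremented again if Enqueue fails, WITHOUT invoking the
# release callback, since failed requests remain owned by the caller.

from enum import Enum, auto


class State(Enum):
    INITIALIZED = auto()
    PENDING = auto()         # counted by the pending-request gauge
    ENQUEUE_FAILED = auto()  # hypothetical failure state; caller still owns request
    RELEASED = auto()        # release callback has actually run


class Request:
    def __init__(self):
        self.state = State.INITIALIZED
        self.release_callback_ran = False


class Model:
    def __init__(self, fail=False):
        self.pending_count = 0  # the gauge
        self._fail = fail

    def enqueue(self, request):
        request.state = State.PENDING
        self.pending_count += 1
        if self._fail:
            # Roll back the gauge; do NOT run the release callback,
            # because the failed request is owned by the caller.
            self.pending_count -= 1
            request.state = State.ENQUEUE_FAILED
            return False
        return True


model = Model(fail=True)
req = Request()
ok = model.enqueue(req)
assert not ok
assert model.pending_count == 0           # gauge no longer leaks on failure
assert req.state is State.ENQUEUE_FAILED  # not RELEASED: no callback ran
assert not req.release_callback_ran
```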

@rmccorm4 (Contributor) commented Nov 4, 2023

Also, existing tests can be found here: https://github.com/triton-inference-server/server/blob/main/qa/L0_metrics/metrics_queue_size_test.py

Feel free to rename the test script from queue size to pending request count or something similar if that makes it easier to find.

@tanmayv25 tanmayv25 merged commit 1dfb6de into main Nov 8, 2023
@tanmayv25 tanmayv25 deleted the tanmayv-metrics branch May 31, 2024 00:50