Fix gRPC frontend race condition #7110

Merged: kthui merged 3 commits into main from jacky-grpc-stream on Apr 17, 2024

Conversation

kthui (Contributor) commented on Apr 12, 2024:

There is a race condition on the InferHandlerState::complete_ variable. When the ModelStreamInferHandler::StreamInferResponseComplete() function is called with the TRITONSERVER_RESPONSE_COMPLETE_FINAL flag, it sets InferHandlerState::complete_ to true before it returns. Concurrently, the ModelStreamInferHandler::Process() function can observe InferHandlerState::complete_ == true and instruct its caller, InferHandler::Start(), to release the InferHandlerState object, even though ModelStreamInferHandler::StreamInferResponseComplete() still needs that object to finish its execution, for instance to check InferHandlerState::IsGrpcContextCancelled().

A possible sequence of actions:

  1. ModelStreamInferHandler::StreamInferResponseComplete() is called.
  2. ModelStreamInferHandler::StreamInferResponseComplete() sets InferHandlerState::complete_ = true.
  3. InferHandler::Start() begins its next iteration.
  4. Since InferHandlerState::complete_ == true, the InferHandlerState object is released.
  5. ModelStreamInferHandler::StreamInferResponseComplete() continues executing, unaware that the InferHandlerState object has been released.
  6. The InferHandlerState object is dereferenced, resulting in a segmentation fault.

The fix is to have ModelStreamInferHandler::StreamInferResponseComplete() update the InferHandlerState::complete_ variable at the end of its execution, after all accesses to the InferHandlerState object have completed.
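
A minimal sketch of that reordering, using stand-in types and a condensed callback body (the real ModelStreamInferHandler::StreamInferResponseComplete() has more logic; the flag constant and helper definitions below are placeholders, not the actual Triton source):

// Sketch only: stand-in types, simplified logic.
#include <cstdint>

enum class Steps { START, CANCELLED };

struct InferHandlerState;

struct Context {
  // Re-enqueues the state for ModelStreamInferHandler::Process() (no-op here).
  void PutTaskBackToQueue(InferHandlerState* /*state*/) {}
};

struct InferHandlerState {
  Context* context_ = nullptr;
  Steps step_ = Steps::START;
  bool complete_ = false;  // watched by InferHandler::Start() to release the state
  bool IsGrpcContextCancelled() const { return false; }  // needs `this` to be alive
};

// Placeholder for the real TRITONSERVER_RESPONSE_COMPLETE_FINAL bit.
constexpr uint32_t RESPONSE_COMPLETE_FINAL = 1u;

void StreamInferResponseComplete(InferHandlerState* state, uint32_t flags)
{
  const bool is_complete = (flags & RESPONSE_COMPLETE_FINAL) != 0;

  // Before the fix, complete_ was set around here, so InferHandler::Start()
  // could release `state` while the code below still dereferences it.

  if (state->IsGrpcContextCancelled()) {
    state->step_ = Steps::CANCELLED;
    state->context_->PutTaskBackToQueue(state);
  }

  // After the fix: publish completion only after the last access to `state`.
  state->complete_ = is_complete;
}

int main()
{
  Context ctx;
  InferHandlerState state;
  state.context_ = &ctx;
  StreamInferResponseComplete(&state, RESPONSE_COMPLETE_FINAL);
  return state.complete_ ? 0 : 1;
}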

Before fix:

I0412 19:57:10.187564 1474 grpc_server.cc:2470] Started GRPCInferenceService at 0.0.0.0:8001
I0412 19:57:10.187729 1474 http_server.cc:4693] Started HTTPService at 0.0.0.0:8000
I0412 19:57:10.228720 1474 http_server.cc:362] Started Metrics Service at 0.0.0.0:8002
Signal (11) received.
 0# 0x0000564C107D8B43 in tritonserver
 1# 0x00007FD194BA9520 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# 0x0000564C10839564 in tritonserver
 3# 0x00007FD19560720C in /opt/tritonserver/bin/../lib/libtritonserver.so
 4# TRITONBACKEND_ResponseFactorySendFlags in /opt/tritonserver/bin/../lib/libtritonserver.so
 5# 0x00007FD180736728 in /opt/tritonserver/backends/python/libtriton_python.so
 6# 0x00007FD180737134 in /opt/tritonserver/backends/python/libtriton_python.so
 7# 0x00007FD18074624D in /opt/tritonserver/backends/python/libtriton_python.so
 8# 0x00007FD194C00EE8 in /usr/lib/x86_64-linux-gnu/libc.so.6
 9# 0x00007FD18072DCD0 in /opt/tritonserver/backends/python/libtriton_python.so
10# 0x00007FD18075B55B in /opt/tritonserver/backends/python/libtriton_python.so
11# 0x00007FD18074B997 in /opt/tritonserver/backends/python/libtriton_python.so
12# 0x00007FD18072FBC1 in /opt/tritonserver/backends/python/libtriton_python.so
13# 0x00007FD18074F59D in /opt/tritonserver/backends/python/libtriton_python.so
14# 0x00007FD180745484 in /opt/tritonserver/backends/python/libtriton_python.so
15# 0x00007FD194BFBAC3 in /usr/lib/x86_64-linux-gnu/libc.so.6
16# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

Segmentation fault (core dumped)
root@tritonserver_qa:/opt/tritonserver# I0412 19:57:13.289491 1637 pb_stub.cc:2119]  Non-graceful termination detected. 
I0412 19:57:13.295702 1644 pb_stub.cc:2119]  Non-graceful termination detected.

After fix:

I0412 19:53:56.744054 612 grpc_server.cc:2470] Started GRPCInferenceService at 0.0.0.0:8001
I0412 19:53:56.744225 612 http_server.cc:4693] Started HTTPService at 0.0.0.0:8000
I0412 19:53:56.785186 612 http_server.cc:362] Started Metrics Service at 0.0.0.0:8002
^CSignal (2) received.
...

kthui requested review from tanmayv25, rmccorm4 and Tabrizian, and removed the request for tanmayv25 and rmccorm4, on April 12, 2024 21:18
kthui marked this pull request as ready for review on April 12, 2024 21:19
kthui force-pushed the jacky-grpc-stream branch from 6e9e919 to 330134d on April 12, 2024 21:54
state->step_ = Steps::CANCELLED;
state->context_->PutTaskBackToQueue(state);
}

state->complete_ = is_complete;
A reviewer (Contributor) commented on the hunk above:
Based on the root cause, is the use of a local variable sufficient, or is some synchronization needed?

kthui (Contributor, Author) replied:

I think this is sufficient, because it is only a bool variable and it only moves from false to true (unless the function is called with the complete final flag and then called again without the flag, which should not happen). Given that the call with the complete final flag is the last call into this function, it is safe to mark the state as complete at the end, so that the gRPC thread can observe the complete signal and release the state.
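
On the synchronization question, a self-contained toy of the hand-off being described (not Triton code; std::atomic<bool> is used here purely to make the cross-thread publication explicit): the callback thread publishes completion only after its last access to the state, and the releasing thread frees the state only after observing that flag.

#include <atomic>
#include <chrono>
#include <iostream>
#include <memory>
#include <thread>

struct State {
  std::atomic<bool> complete_{false};
  int payload_ = 0;  // stands in for context_, step_, etc.
};

int main()
{
  auto state = std::make_unique<State>();
  State* raw = state.get();

  // "StreamInferResponseComplete" side: use the state, then publish completion.
  std::thread callback([raw] {
    raw->payload_ = 42;                                      // last access to the state
    raw->complete_.store(true, std::memory_order_release);   // publish only afterwards
  });

  // "InferHandler::Start" side: release the state only once complete_ is seen.
  std::thread poller([&state] {
    while (!state->complete_.load(std::memory_order_acquire)) {
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    state.reset();  // safe: the callback no longer touches the state by now
  });

  callback.join();
  poller.join();
  std::cout << "state released only after completion was published\n";
  return 0;
}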

kthui force-pushed the jacky-grpc-stream branch from 330134d to e36fec2 on April 15, 2024 23:57
kthui (Contributor, Author) commented on Apr 16, 2024:

Added a test case for the race condition, which delays ModelStreamInferHandler::StreamInferResponseComplete() completion after it sets state->complete_ to true.

Also discovered that when completion is delayed, instead of dereferencing a nullptr state->context_, the callback can see state->context_ set to the context of the next request, likely because the state object is reused. The added test specifically checks for this behavior, for example:

E0415 23:51:28.952381 6289 stream_infer_handler.cc:706] Should not print this! The state context object has changed after delay, pointer before: 0x7f36b8000b80, pointer after: 0x7f36b8002cd0

The test case checks the server log for the presence of the line above.
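
For illustration, a self-contained toy of the failure mode the test looks for (illustrative names and timing only, not the actual test code): completion is published before a delay, the state is recycled for a next request in the meantime, and the delayed callback then observes a different context pointer.

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

struct Context {};

struct State {
  Context* context_ = nullptr;          // unsynchronized on purpose: this is the bug shown
  std::atomic<bool> complete_{false};
};

int main()
{
  Context first_request_ctx, next_request_ctx;
  State state;                          // one state object, reused across requests
  state.context_ = &first_request_ctx;

  std::thread delayed_callback([&state] {
    const void* before = state.context_;
    state.complete_.store(true);        // published too early, before the delay
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
    const void* after = state.context_;
    if (before != after) {
      std::cerr << "Should not print this! The state context object has changed "
                << "after delay, pointer before: " << before
                << ", pointer after: " << after << "\n";
    }
  });

  // "Frontend" side: once complete_ is observed, recycle the state for the
  // next request (mirroring the reuse confirmed in the review discussion).
  while (!state.complete_.load()) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  state.context_ = &next_request_ctx;

  delayed_callback.join();
  return 0;
}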

kthui requested a review from rmccorm4 on April 16, 2024 00:17
tanmayv25 (Contributor) left a review comment:

Great work @kthui! Nice find and fix!

tanmayv25 (Contributor) commented on Apr 16, 2024:

> Also discovered that when completion is delayed, instead of dereferencing a nullptr state->context_, the callback can see state->context_ set to the context of the next request, likely because the state object is reused.

Your understanding is correct! The state objects can be reused; this is done to share the response protobuf message buffer across responses.
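
A hypothetical sketch of why that reuse matters here (the pool, names, and the std::string standing in for the response protobuf message buffer are all assumptions for illustration): a released state keeps its buffer allocation and is handed to the next request, so a stale pointer to it now observes that request's context instead of freed memory.

#include <memory>
#include <queue>
#include <string>

struct Context {};

struct State {
  Context* context_ = nullptr;
  std::string response_buffer_;  // stands in for the shared response protobuf buffer
};

class StatePool {
 public:
  std::unique_ptr<State> Get(Context* ctx)
  {
    std::unique_ptr<State> state;
    if (!free_.empty()) {
      state = std::move(free_.front());  // reuse keeps the buffer allocation
      free_.pop();
    } else {
      state = std::make_unique<State>();
    }
    state->context_ = ctx;  // a stale pointer to this state now sees the new context
    return state;
  }

  void Release(std::unique_ptr<State> state) { free_.push(std::move(state)); }

 private:
  std::queue<std::unique_ptr<State>> free_;
};

int main()
{
  StatePool pool;
  Context first, second;

  auto s = pool.Get(&first);
  State* stale = s.get();      // imagine a delayed callback still holding this pointer
  pool.Release(std::move(s));

  auto reused = pool.Get(&second);
  // stale == reused.get(): the old pointer now refers to the next request's context.
  return (stale == reused.get() && stale->context_ == &second) ? 0 : 1;
}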

rmccorm4 (Contributor) left a review comment:

🚀

kthui merged commit 1fbaf53 into main on Apr 17, 2024
3 checks passed
kthui deleted the jacky-grpc-stream branch on April 17, 2024 17:31
kthui added a commit that referenced this pull request on Apr 17, 2024

* Fix state complete_ race condition
* Add delay and error checking to StreamInferResponseComplete
* Add test for gRPC decoupled infer complete flag

GuanLuo pushed a commit that referenced this pull request on Apr 18, 2024

* Fix state complete_ race condition
* Add delay and error checking to StreamInferResponseComplete
* Add test for gRPC decoupled infer complete flag

mc-nv pushed a commit that referenced this pull request on Apr 18, 2024

* Fix state complete_ race condition
* Add delay and error checking to StreamInferResponseComplete
* Add test for gRPC decoupled infer complete flag

pvijayakrish pushed a commit that referenced this pull request on Jan 15, 2025

* Fix state complete_ race condition
* Add delay and error checking to StreamInferResponseComplete
* Add test for gRPC decoupled infer complete flag