-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix gRPC frontend race condition #7110
Conversation
6e9e919
to
330134d
Compare
state->step_ = Steps::CANCELLED; | ||
state->context_->PutTaskBackToQueue(state); | ||
} | ||
|
||
state->complete_ = is_complete; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the root cause - Is the use of local variable sufficient? Or is there some synchronization needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is sufficient, because it is only a bool variable and only moves from false
to true
(unless the function is called with a complete final flag and then called again without the flag, which should not happen). Given the call to this function with the complete final flag is the last call into this function, it is safe to mark the state as complete at the end, so that the gRPC thread can capture the complete signal and release the state.
330134d
to
e36fec2
Compare
Added a test case for the race condition, with delayed Also discovered when it is delayed, instead of dereferencing a
which the presence of the line above in the server log is checked by the test case. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @kthui ! Nice find and fix!
Your understanding is correct! The state objects can be re-used. It is done for sharing the response protobuf message buffer across responses. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
* Fix state complete_ race condition * Add delay and error checking to StreamInferResponseComplete * Add test for gRPC decoupled infer complete flag
* Fix state complete_ race condition * Add delay and error checking to StreamInferResponseComplete * Add test for gRPC decoupled infer complete flag
* Fix state complete_ race condition * Add delay and error checking to StreamInferResponseComplete * Add test for gRPC decoupled infer complete flag
* Fix state complete_ race condition * Add delay and error checking to StreamInferResponseComplete * Add test for gRPC decoupled infer complete flag
There is a race condition with the
InferHandlerState::complete_
variable. WhenModelStreamInferHandler::StreamInferResponseComplete()
function is called withTRITONSERVER_RESPONSE_COMPLETE_FINAL
flag, theInferHandlerState::complete_
variable is updated totrue
before the function returns. Simultaneously, the executingModelStreamInferHandler::Process()
function will see theInferHandlerState::complete_ == true
and instruct its callerInferHandler::Start()
function to release theInferHandlerState
object, which is required by theModelStreamInferHandler::StreamInferResponseComplete()
function to finish its execution, for instance, checking forInferHandlerState::IsGrpcContextCancelled()
.A possible sequence of actions,
ModelStreamInferHandler::StreamInferResponseComplete()
is called.ModelStreamInferHandler::StreamInferResponseComplete()
setInferHandlerState::complete_ = true
.InferHandler::Start()
begin its next iteration.InferHandlerState::complete_ == true
,InferHandlerState
object is released.ModelStreamInferHandler::StreamInferResponseComplete()
continue execution without knowingInferHandlerState
object is released.InferHandlerState
object is dereferenced and this action resulted in a segmentation fault.The issue is easily fixable by having the
ModelStreamInferHandler::StreamInferResponseComplete()
function to update theInferHandlerState::complete_
variable at the end of its execution, after all accesses to theInferHandlerState
object are completed.Before fix:
After fix: