Allow non-decoupled model to send response and FINAL flag separately #6017
Conversation
src/http_server.cc
Outdated
// Defer sending the response until FINAL flag is seen or
// there is error
if ((err == nullptr) && (flags & TRITONSERVER_RESPONSE_COMPLETE_FINAL) == 0) {
What would happen if there is an error when calling TRITONBACKEND_ResponseSend()? Would the error be returned to the client, with no information sent to the client on TRITONBACKEND_ResponseFactorySendFlags()?
I made the change to not send the error until the FINAL flag is seen; otherwise there can be an issue with the backend trying to call the response send function again after the userp has been released. Currently the only assumption is that the backend will not call the function again once the FINAL flag is sent.
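To make that deferral behavior concrete, here is a simplified, hypothetical sketch (not the actual http_server.cc code; the struct and function names are invented for illustration): both the response and any error are held until the FINAL flag arrives, and only then is anything completed toward the client.

#include "triton/core/tritonserver.h"  // TRITONSERVER_* types (assumed include path)

// Hypothetical illustration only; simplified stand-ins, not Triton internals.
struct PendingReply {
  TRITONSERVER_InferenceResponse* response = nullptr;  // last response seen
  TRITONSERVER_Error* error = nullptr;                 // first error seen
};

// Called from the response-complete callback. Returns true only when the
// accumulated result (response and/or error) should be sent to the client.
bool
DeferUntilFinal(
    PendingReply& pending, TRITONSERVER_InferenceResponse* response,
    TRITONSERVER_Error* err, const uint32_t flags)
{
  if (response != nullptr) {
    pending.response = response;
  }
  if ((err != nullptr) && (pending.error == nullptr)) {
    pending.error = err;  // remember the error, but do not send it yet
  }
  // Defer everything, including errors, until FINAL is seen so the backend
  // may still call into the response factory before signaling completion.
  return (flags & TRITONSERVER_RESPONSE_COMPLETE_FINAL) != 0;
}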
@GuanLuo do we still plan to implement this to help unify decoupled/non-decoupled logic in backends like RIVA/TRTLLM etc?
src/grpc/infer_handler.cc
Outdated
}

state->context_->EraseInflightState(state);

#ifdef TRITON_ENABLE_TRACING
I have a feeling that the INFER_RESPONSE_COMPLETE multiple-capture issue will show up here, since the deferred response send now happens later.
Delayed …with the meaning.
LGTM
I think it would be good to have some kind of test that asserts the traces come out correctly to avoid future breakage, but I won't block the PR on it.
@rmccorm4, I've created a ticket for this: DLIS-6308
Follow-up to triton-inference-server/core#229.
For a custom backend, one may send responses in the following style, even for a "non-decoupled" model:
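A minimal sketch of that calling pattern, assuming the standard TRITONBACKEND response-factory API (the function name SendResponseThenFinal and the surrounding scaffolding are illustrative, not taken from the PR):

#include "triton/core/tritonbackend.h"

// Send the single response without the FINAL flag, then signal FINAL
// separately through the response factory. Error checking, output
// population, and request release are omitted for brevity.
void
SendResponseThenFinal(TRITONBACKEND_Request* request)
{
  TRITONBACKEND_ResponseFactory* factory = nullptr;
  TRITONBACKEND_ResponseFactoryNew(&factory, request);

  TRITONBACKEND_Response* response = nullptr;
  TRITONBACKEND_ResponseNewFromFactory(&response, factory);
  // ... add output tensors to 'response' here ...

  // No flags: the response is sent but the request is not yet complete.
  TRITONBACKEND_ResponseSend(response, 0 /* send_flags */, nullptr /* error */);

  // Later (possibly from another thread), mark the request complete.
  TRITONBACKEND_ResponseFactorySendFlags(
      factory, TRITONSERVER_RESPONSE_COMPLETE_FINAL);
}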
From the client's perspective this yields a single response for the request, so a successful inference would be expected through HTTP / non-streaming GRPC since the model is "non-decoupled". The previous implementation was more restrictive: it simply assumed the decoupled use case whenever the response complete callback was invoked multiple times. This change relaxes that restriction and collapses the above into a single response to the client.