Write to /dev/termination-log on main loop exception #118
Conversation
Thanks @NickLucche!
I think they should be open to it, since vLLM provides an official image that's designed to be ready to deploy, so it's worth proposing at least. But from glancing at the diff here, it does look like we'd probably still need to implement this functionality separately, since we'd want to catch errors from the gRPC server as well. Regarding testing these changes, it'd be nice to see some examples of misconfiguring the server. I think that's the primary use case here: helping people debug what mistake they made with their deployment while getting it up and running. Some common things you can do are:
As is, I think most of those should fail when creating the engine, so the exception will be raised before the
For other errors that crash the server at runtime, I think we'd want to try to output the root cause and not the wrapping, like
Also, it looks like you're running into issues with the formatter, so you may want to try installing and enabling pre-commit to ensure everything is formatted correctly.
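To make the root-cause point concrete, here is a minimal sketch, assuming nothing about the adapter's actual code (the helper names are made up), of walking Python's exception chain so the original error, rather than the wrapper, is what gets logged:

```python
# Illustration only: unwrap __cause__/__context__ so the original error,
# not the wrapping exception, is what ends up in the termination log.
import traceback


def unwrap_root_cause(exc: BaseException) -> BaseException:
    """Follow the exception chain down to the first (root) exception."""
    seen = {id(exc)}
    while True:
        nxt = exc.__cause__ or exc.__context__
        if nxt is None or id(nxt) in seen:
            return exc
        seen.add(id(nxt))
        exc = nxt


def format_root_cause(exc: BaseException) -> str:
    """One-line summary of the root cause, suitable for a termination log."""
    root = unwrap_root_cause(exc)
    return "".join(traceback.format_exception_only(type(root), root)).strip()
```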
Thanks a lot for your reviews! With the help of @dtrifiro, the best thing I have now is yet another check on the state of the
FYI: I did try to remove the
I added a way to test for some of those cases you mentioned; getting all CLI errors is a bit tricky though, as they may happen when the engine is created (in vLLM space) but still in a separate server process.
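A hedged sketch of the kind of misconfiguration test being discussed; the `run_server` fixture, the `TERMINATION_LOG_PATH` override, and the invalid flag value are all hypothetical, not the adapter's actual test API:

```python
# Hypothetical test: start the server with a deliberately bad config, redirect
# the termination log to a temp file, and check that the failure reason lands there.
def test_bad_config_writes_termination_log(tmp_path, run_server):
    term_log = tmp_path / "termination-log"
    proc = run_server(
        extra_args=["--max-model-len", "-1"],         # intentionally invalid
        env={"TERMINATION_LOG_PATH": str(term_log)},  # hypothetical override
    )
    proc.wait(timeout=120)
    assert proc.returncode != 0
    assert term_log.read_text().strip()  # the root cause was recorded
```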
Force-pushed from f16dc48 to ef028f1
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #118      +/-   ##
==========================================
+ Coverage   55.64%   56.80%   +1.16%
==========================================
  Files          24       25       +1
  Lines        1488     1528      +40
  Branches      269      277       +8
==========================================
+ Hits          828      868      +40
+ Misses        583      582       -1
- Partials       77       78       +1

☔ View full report in Codecov by Sentry.
@dtrifiro it seems unrelated to this PR, but my tests are sometimes hanging locally when checking out master:

INFO 09-13 16:45:09 logs.py:155] generate{input=[b'The answer to life the universe ...', b'Medicinal herbs '] prefix_id= correlation_id=None adapter_id= input_chars=[66] params=stopping { max_new_tokens: 10 } tokenization_time=2.47ms queue_time=0.97ms inference_time=2465.34ms time_per_token=246.53ms total_time=2468.77ms input_toks=5}: Sub-request 1 from batch of 2 generated 10 tokens before MAX_TOKENS, output 10 chars: b' '
.Gracefully stopping gRPC server
INFO 09-13 16:45:09 launcher.py:67] Gracefully stopping http server
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO 09-13 16:45:09 server.py:228] vLLM ZMQ RPC Server was interrupted.
INFO 09-13 16:45:09 async_llm_engine.py:60] Engine is gracefully shutting down.
INFO 09-13 16:45:09 multiproc_worker_utils.py:136] Terminating local vLLM worker processes
Ran them with
Force-pushed from 6737b90 to 3bc15b1
This includes picking up server config errors, but it does NOT attempt to recover exception stacks from inside vLLM's RPC server (they are raised in a separate process); for those, only a RuntimeError is reported.
Force-pushed from 3bc15b1 to d392e95
Addresses https://github.ibm.com/ai-foundation/fmaas-inference-server/issues/722.
Description
Just as in TGIS (https://github.com/IBM/text-generation-inference/blob/9388f02d222c0dab695bea1fb595cacdf08d5467/server/text_generation_server/cli.py#L35), it is useful to have termination logs written to /dev/termination-log so that k8s automatically surfaces them.
I am not 100% sure this should be brought to vLLM upstream, as I can see them rightfully arguing it should be the responsibility of the wrapping script, but I am open to discussing it :)
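A minimal sketch of the idea (not the PR's exact diff; `run_servers` is a placeholder name): catch anything escaping the main loop, dump it to /dev/termination-log, then re-raise so the process still exits non-zero.

```python
import asyncio
import traceback
from pathlib import Path

TERMINATION_LOG = Path("/dev/termination-log")
MAX_BYTES = 4096  # k8s keeps at most 4096 bytes of the termination message


def write_termination_log(exc: BaseException) -> None:
    """Best-effort write of the exception traceback for k8s to surface."""
    try:
        text = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
        TERMINATION_LOG.write_text(text[-MAX_BYTES:])
    except OSError:
        # e.g. running outside a container where the path doesn't exist
        pass


async def run_servers() -> None:
    """Placeholder for starting the gRPC and HTTP servers concurrently."""
    ...


def main() -> None:
    try:
        asyncio.run(run_servers())
    except Exception as exc:
        write_termination_log(exc)
        raise
```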
How Has This Been Tested?
Tested by installing this adapter branch onto a dev pod with vLLM v0.5.5 and triggering an injected runtime failure.
Exceptions raised from the HTTP server are forwarded to the termination log just fine, but the ones raised during creation of the gRPC server currently get overshadowed by a uvicorn one.
The workflow is something like: the gRPC server exception is awaited -> task cancellation is sent to all servers -> vLLM crashes while awaiting the exit coroutine -> exceptions are gathered in check_for_failed_tasks -> only the vLLM crash is reported.
Uvicorn stacktrace:
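The uvicorn stacktrace itself is not reproduced here. To illustrate the shutdown ordering described above, here is a hedged, standalone sketch of the pattern (the real check_for_failed_tasks lives in the adapter; the names and errors below are illustrative), showing how an exception raised while the servers are torn down can mask the original gRPC failure:

```python
import asyncio


async def serve_grpc() -> None:
    raise RuntimeError("grpc server failed to start")  # the actual root cause


async def serve_http() -> None:
    try:
        await asyncio.Event().wait()  # run until cancelled
    finally:
        # A failure during shutdown replaces the CancelledError on this task
        # and competes with the gRPC error for being reported.
        raise RuntimeError("crash while awaiting the exit coroutine")


def check_for_failed_tasks(tasks: list) -> None:
    """Re-raise the first failed task's exception; iteration order decides
    which error "wins", which is how the root cause can get overshadowed."""
    for task in tasks:
        if task.done() and not task.cancelled() and task.exception() is not None:
            raise task.exception()


async def main() -> None:
    # The HTTP task is listed first, so its shutdown error masks the gRPC one.
    tasks = [asyncio.create_task(serve_http()), asyncio.create_task(serve_grpc())]
    await asyncio.wait(tasks, return_when=asyncio.FIRST_EXCEPTION)
    for task in tasks:
        task.cancel()  # cancel whatever is still running
    await asyncio.gather(*tasks, return_exceptions=True)
    check_for_failed_tasks(tasks)


if __name__ == "__main__":
    asyncio.run(main())
```

Running this reports the shutdown RuntimeError rather than the gRPC startup failure, mirroring the overshadowing described in the description above.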
@joerunde
Merge criteria: