
Write to /dev/termination-log on main loop exception #118

Merged (1 commit) on Sep 17, 2024

Conversation

NickLucche
Contributor

Addresses https://github.ibm.com/ai-foundation/fmaas-inference-server/issues/722.

Description

Just as is done in TGIS (https://github.com/IBM/text-generation-inference/blob/9388f02d222c0dab695bea1fb595cacdf08d5467/server/text_generation_server/cli.py#L35), it is useful to have termination logs written to /dev/termination-log so that k8s automatically surfaces them.
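A minimal sketch of the idea (the helper name and the entrypoint stub below are illustrative, not the exact diff in this PR):

```python
import traceback

TERMINATION_LOG = "/dev/termination-log"  # default path k8s reads on container exit


def write_termination_log(msg: str) -> None:
    """Best-effort write of the failure reason for k8s to surface."""
    try:
        with open(TERMINATION_LOG, "w") as f:
            f.write(msg)
    except OSError:
        # not running under k8s, or the path isn't writable: don't mask the real error
        pass


def run_main_loop() -> None:
    # placeholder for the adapter's actual entrypoint (starting both servers);
    # raising here simulates the injected runtime failure used for testing
    raise RuntimeError("injected failure")


try:
    run_main_loop()
except Exception:
    write_termination_log(traceback.format_exc())
    raise
```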

I am not 100% sure this should be brought to vLLM upstream, as I can see them rightfully arguing it should be the responsibility of the wrapping script, but I am still open to discussing it :)

How Has This Been Tested?

Tested by installing this adapter branch onto a dev pod with vLLM v0.5.5 and triggering an injected runtime failure.

Exceptions raised from the HTTP server are forwarded to the termination log just fine, but the ones raised during creation of the gRPC server currently get overshadowed by a uvicorn exception.
The workflow is something like: the gRPC server exception is awaited -> task cancellation is sent to all servers -> vLLM crashes while awaiting the exit coroutine -> exceptions are gathered in check_for_failed_tasks -> only the vLLM crash is reported.
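For context, a simplified sketch of what check_for_failed_tasks is doing (based on the traceback below, not the exact utils.py code):

```python
import asyncio
from collections.abc import Iterable


def check_for_failed_tasks(tasks: Iterable[asyncio.Task]) -> None:
    """Re-raise the first exception found among the finished server tasks."""
    for task in tasks:
        if not task.done() or task.cancelled():
            continue
        exc = task.exception()
        if exc is None:
            continue
        name = task.get_name()
        coro_name = task.get_coro().__qualname__
        # only one failure is surfaced, which is why the vLLM/uvicorn crash
        # ends up shadowing the original gRPC server exception described above
        raise RuntimeError(f"task={name} ({coro_name}) exception={exc}") from exc
```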
Uvicorn stacktrace:

INFO 09-04 16:24:09 launcher.py:67] Gracefully stopping http server
INFO:     Shutting down
INFO 09-04 16:24:09 server.py:228] vLLM ZMQ RPC Server was interrupted.
INFO 09-04 16:24:09 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Traceback (most recent call last):
  File "/home/nlucches/work/vllm-tgis-adapter/src/vllm_tgis_adapter/http.py", line 52, in run_http_server
    await shutdown_coro
  File "/home/nlucches/work/vllm-tgis-adapter/.venv/lib/python3.10/site-packages/uvicorn/server.py", line 261, in shutdown
    for server in self.servers:
AttributeError: 'Server' object has no attribute 'servers'. Did you mean: 'serve'?

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/nlucches/.local/share/uv/python/cpython-3.10.14-linux-x86_64-gnu/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/nlucches/.local/share/uv/python/cpython-3.10.14-linux-x86_64-gnu/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/nlucches/work/vllm-tgis-adapter/src/vllm_tgis_adapter/__main__.py", line 98, in <module>
    raise e
  File "/home/nlucches/work/vllm-tgis-adapter/src/vllm_tgis_adapter/__main__.py", line 94, in <module>
    loop.run_until_complete(start_servers(args))
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/home/nlucches/work/vllm-tgis-adapter/src/vllm_tgis_adapter/__main__.py", line 72, in start_servers
    check_for_failed_tasks(tasks)
  File "/home/nlucches/work/vllm-tgis-adapter/src/vllm_tgis_adapter/utils.py", line 27, in check_for_failed_tasks
    raise RuntimeError(f"task={name} ({coro_name}) exception={str(exc)}") from exc
RuntimeError: task=http_server (run_http_server) exception='Server' object has no attribute 'servers'

@joerunde

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

@joerunde
Contributor

joerunde commented Sep 4, 2024

Thanks @NickLucche!

I am not 100% sure this should be brought to vLLM upstream, as I can see them rightfully arguing it should be the responsibility of the wrapping script, but I am still open to discussing it :)

I think they should be open to it, since vLLM provides an official image that's designed to be ready to deploy. I think it's worth proposing, at least. But from glancing at the diff, it does look like we'd probably still need to implement this functionality separately here, since we'd want to catch errors from the gRPC server as well.

Regarding testing these changes, it'd be nice to see some examples of misconfiguring the server. I think that's the primary use case here: helping people debug what mistake they made with their deployment so they can get it up and running. Some common things you can do are (a test sketch follows the list):

  • Enable lora on a model format that doesn't support it (the default facebook/opt-125m will fail)
  • Supply lora modules but don't enable lora
  • Set a context length that's longer than what the model supports
  • Deploy an unsupported model architecture (all encoder-decoder models should fail)
  • Deploy a model that's too large to fit in memory
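A rough sketch of how one of these could be exercised end-to-end (this assumes the adapter is launched via `python -m vllm_tgis_adapter` with vLLM's `--model`/`--enable-lora` engine args, and that the failure reason lands in the default /dev/termination-log path; it is not the actual test added in this PR):

```python
import subprocess
from pathlib import Path

TERMINATION_LOG = Path("/dev/termination-log")


def check_lora_misconfig_is_reported() -> None:
    # facebook/opt-125m does not support LoRA, so startup should fail early
    proc = subprocess.run(
        [
            "python", "-m", "vllm_tgis_adapter",
            "--model", "facebook/opt-125m",
            "--enable-lora",
        ],
        capture_output=True,
        text=True,
    )
    assert proc.returncode != 0
    # the reason for the crash should have been written out for k8s to pick up
    assert TERMINATION_LOG.exists() and TERMINATION_LOG.read_text().strip()
```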

As is, I think most of those should fail when creating the engine, so the exception will be raised before the check_for_failed_tasks check and I think they'd format nicely.

For other errors that crash the server at runtime, I think we'd want to try to output the root cause and not the wrapper (RuntimeError: task=http_server (run_http_server) exception='{blah}'). There are some edge cases you can use to crash the server, but the easiest way is probably to edit the LLMEngine code locally so it raises an error when it goes to run the model.
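One possible way to do that (a sketch, assuming the `raise ... from exc` chaining stays in place) is to walk the exception chain and write the innermost traceback instead of the wrapper's message:

```python
import traceback


def format_root_cause(exc: BaseException) -> str:
    """Follow __cause__/__context__ down to the original exception and format it."""
    root = exc
    while root.__cause__ is not None or root.__context__ is not None:
        root = root.__cause__ or root.__context__
    return "".join(traceback.format_exception(type(root), root, root.__traceback__))
```

The termination log would then carry the underlying vLLM/engine traceback rather than the RuntimeError wrapper.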

Also, it looks like you're running into issues with the formatter, so you may want to try installing and enabling pre-commit to ensure everything is formatted correctly.

@NickLucche
Contributor Author

Thanks a lot for your reviews!
Now, while I work on adding those misconfigurations as tests, the main issue I had with serving-time exceptions is that the ones happening within the server (AsyncLLMEngine) occur in a separate process, and I couldn't see any way of surfacing them without touching upstream code.

With the help of @dtrifiro, the best thing I have now is yet another check on the state of the AsyncEngineClient, notifying the user to go check the logs where the server process printed its stack trace.
Let me know if you see a better way to do this @joerunde!
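Roughly what that extra check looks like (a sketch only; the attribute names come from the engine.errored / engine.is_running flags mentioned below, not from a confirmed AsyncEngineClient API, and write_termination_log is the helper sketched in the PR description):

```python
def check_engine_health(engine, write_termination_log) -> None:
    """If the engine process died, leave a pointer to its logs in the termination log."""
    if getattr(engine, "errored", False) or not getattr(engine, "is_running", True):
        # the actual stack trace was printed by the engine's own process,
        # so the best we can do here is point the operator at those logs
        write_termination_log(
            "AsyncLLMEngine crashed in a separate process; "
            "see the engine/server logs for the original stack trace."
        )
```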

FYI: I did try to remove the engine.errored and not engine.is_running check from within the gRPC server (along with the stop event) to get slightly cleaner logic, where both servers mind their own business and the wrapper can poke at both, but I see how shutting the gRPC server down requires access to a handler the task does not expose.

@NickLucche
Contributor Author

I added a way to test some of those cases you mentioned; getting all CLI errors is a bit tricky though, as they may happen when the engine is created (in vLLM space) but still in a separate server process.

@codecov-commenter

codecov-commenter commented Sep 13, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 9 lines in your changes missing coverage. Please review.

Project coverage is 56.80%. Comparing base (20db20b) to head (d392e95).

Files with missing lines Patch % Lines
src/vllm_tgis_adapter/utils.py 55.55% 4 Missing ⚠️
src/vllm_tgis_adapter/__main__.py 80.00% 1 Missing and 2 partials ⚠️
tests/test_termination_log.py 86.66% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #118      +/-   ##
==========================================
+ Coverage   55.64%   56.80%   +1.16%     
==========================================
  Files          24       25       +1     
  Lines        1488     1528      +40     
  Branches      269      277       +8     
==========================================
+ Hits          828      868      +40     
+ Misses        583      582       -1     
- Partials       77       78       +1     

☔ View full report in Codecov by Sentry.

@NickLucche
Contributor Author

@dtrifiro it seems unrelated to this PR, but my tests are sometimes hanging locally when checking out master at 7d94eae ([build(deps): bump mypy from 1.11.1 to 1.11.2]) with:

INFO 09-13 16:45:09 logs.py:155] generate{input=[b'The answer to life the universe ...', b'Medicinal herbs '] prefix_id= correlation_id=None adapter_id= input_chars=[66] params=stopping { max_new_tokens: 10 } tokenization_time=2.47ms queue_time=0.97ms inference_time=2465.34ms time_per_token=246.53ms total_time=2468.77ms input_toks=5}: Sub-request 1 from batch of 2 generated 10 tokens before MAX_TOKENS, output 10 chars: b'          '
.Gracefully stopping gRPC server
INFO 09-13 16:45:09 launcher.py:67] Gracefully stopping http server
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO 09-13 16:45:09 server.py:228] vLLM ZMQ RPC Server was interrupted.
INFO 09-13 16:45:09 async_llm_engine.py:60] Engine is gracefully shutting down.
INFO 09-13 16:45:09 multiproc_worker_utils.py:136] Terminating local vLLM worker processes

Ran them with python -m pytest -s tests 🤔

This includes picking up server config errors, but does NOT make any attempt at recovering exception stacks raised inside vllm's RPC server (they live in a separate process), other than reporting a RuntimeError.
@dtrifiro merged commit 355a088 into opendatahub-io:main on Sep 17, 2024
3 checks passed
@NickLucche deleted the termination-log branch on September 17, 2024 at 08:48