
Write to /dev/termination-log on main loop exception #118

Merged (1 commit) on Sep 17, 2024

Conversation

NickLucche
Contributor

Addresses https://github.ibm.com/ai-foundation/fmaas-inference-server/issues/722.

Description

Just as is done in TGIS (https://github.com/IBM/text-generation-inference/blob/9388f02d222c0dab695bea1fb595cacdf08d5467/server/text_generation_server/cli.py#L35), it is useful to have termination logs written to /dev/termination-log so that k8s automatically surfaces them.
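A minimal sketch of the idea (the helper name and the entrypoint stub below are illustrative, not the exact diff in this PR):

```python
import traceback

TERMINATION_LOG = "/dev/termination-log"  # default path k8s reads on container exit


def write_termination_log(msg: str) -> None:
    """Best-effort write of the failure reason for k8s to surface."""
    try:
        with open(TERMINATION_LOG, "w") as f:
            f.write(msg)
    except OSError:
        # not running under k8s, or the path isn't writable: don't mask the real error
        pass


def run_main_loop() -> None:
    # placeholder for the adapter's actual entrypoint (starting both servers);
    # raising here simulates the injected runtime failure used for testing
    raise RuntimeError("injected failure")


try:
    run_main_loop()
except Exception:
    write_termination_log(traceback.format_exc())
    raise
```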

I am not 100% sure this should be brought to vLLM upstream, as I can see them rightfully arguing it should be the responsibility of the wrapping script, but I am still open to discussing it :)

How Has This Been Tested?

Tested by installing this adapter branch onto a dev pod with vLLM v0.5.5 and triggering an injected runtime failure.

Exceptions raised from the HTTP server are forwarded to the termination log just fine, but the ones raised during creation of the gRPC server currently get overshadowed by a uvicorn exception.
The workflow is something like: the gRPC server exception is awaited -> task cancellation is sent to all servers -> vLLM crashes while awaiting the exit coroutine -> exceptions are gathered in check_for_failed_tasks -> only the vLLM crash is reported.
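For context, a simplified sketch of what check_for_failed_tasks is doing (based on the traceback below, not the exact utils.py code):

```python
import asyncio
from collections.abc import Iterable


def check_for_failed_tasks(tasks: Iterable[asyncio.Task]) -> None:
    """Re-raise the first exception found among the finished server tasks."""
    for task in tasks:
        if not task.done() or task.cancelled():
            continue
        exc = task.exception()
        if exc is None:
            continue
        name = task.get_name()
        coro_name = task.get_coro().__qualname__
        # only one failure is surfaced, which is why the vLLM/uvicorn crash
        # ends up shadowing the original gRPC server exception described above
        raise RuntimeError(f"task={name} ({coro_name}) exception={exc}") from exc
```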
Uvicorn stacktrace:

INFO 09-04 16:24:09 launcher.py:67] Gracefully stopping http server
INFO:     Shutting down
INFO 09-04 16:24:09 server.py:228] vLLM ZMQ RPC Server was interrupted.
INFO 09-04 16:24:09 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Traceback (most recent call last):
  File "/home/nlucches/work/vllm-tgis-adapter/src/vllm_tgis_adapter/http.py", line 52, in run_http_server
    await shutdown_coro
  File "/home/nlucches/work/vllm-tgis-adapter/.venv/lib/python3.10/site-packages/uvicorn/server.py", line 261, in shutdown
    for server in self.servers:
AttributeError: 'Server' object has no attribute 'servers'. Did you mean: 'serve'?

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/nlucches/.local/share/uv/python/cpython-3.10.14-linux-x86_64-gnu/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/nlucches/.local/share/uv/python/cpython-3.10.14-linux-x86_64-gnu/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/nlucches/work/vllm-tgis-adapter/src/vllm_tgis_adapter/__main__.py", line 98, in <module>
    raise e
  File "/home/nlucches/work/vllm-tgis-adapter/src/vllm_tgis_adapter/__main__.py", line 94, in <module>
    loop.run_until_complete(start_servers(args))
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/home/nlucches/work/vllm-tgis-adapter/src/vllm_tgis_adapter/__main__.py", line 72, in start_servers
    check_for_failed_tasks(tasks)
  File "/home/nlucches/work/vllm-tgis-adapter/src/vllm_tgis_adapter/utils.py", line 27, in check_for_failed_tasks
    raise RuntimeError(f"task={name} ({coro_name}) exception={str(exc)}") from exc
RuntimeError: task=http_server (run_http_server) exception='Server' object has no attribute 'servers'

@joerunde

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

@joerunde
Contributor

joerunde commented Sep 4, 2024

Thanks @NickLucche!

I am not 100% sure this should be brought to vLLM upstream, as I can see them rightfully arguing it should be the responsibility of the wrapping script, but I am still open to discussing it :)

I think they should be open to it, since vLLM provides an official image that's designed to be ready to deploy. I think it's worth proposing, at least. But from glancing at the diff, it does look like we'd probably still need to implement this functionality separately here, since we'd want to catch errors from the gRPC server as well.

Regarding testing these changes, it'd be nice to see some examples of misconfiguring the server. I think that's the primary use case here: helping people debug what mistake they made with their deployment so they can get it up and running. Some common things you can do are (a test sketch follows the list):

  • Enable lora on a model format that doesn't support it (the default facebook/opt-125m will fail)
  • Supply lora modules but don't enable lora
  • Set a context length that's longer than what the model supports
  • Deploy an unsupported model architecture (all encoder-decoder models should fail)
  • Deploy a model that's too large to fit in memory
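A rough sketch of how one of these could be exercised end-to-end (this assumes the adapter is launched via `python -m vllm_tgis_adapter` with vLLM's `--model`/`--enable-lora` engine args, and that the failure reason lands in the default /dev/termination-log path; it is not the actual test added in this PR):

```python
import subprocess
from pathlib import Path

TERMINATION_LOG = Path("/dev/termination-log")


def check_lora_misconfig_is_reported() -> None:
    # facebook/opt-125m does not support LoRA, so startup should fail early
    proc = subprocess.run(
        [
            "python", "-m", "vllm_tgis_adapter",
            "--model", "facebook/opt-125m",
            "--enable-lora",
        ],
        capture_output=True,
        text=True,
    )
    assert proc.returncode != 0
    # the reason for the crash should have been written out for k8s to pick up
    assert TERMINATION_LOG.exists() and TERMINATION_LOG.read_text().strip()
```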

As is, I think most of those should fail when creating the engine, so the exception will be raised before the check_for_failed_tasks check and I think they'd format nicely.

For other errors that crash the server at runtime, I think we'd want to try to output the root cause and not the wrapper (RuntimeError: task=http_server (run_http_server) exception='{blah}'). There are some edge cases you can use to crash the server, but the easiest way is probably to edit the LLMEngine code locally so it raises an error when it goes to run the model.
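One possible way to do that (a sketch, assuming the `raise ... from exc` chaining stays in place) is to walk the exception chain and write the innermost traceback instead of the wrapper's message:

```python
import traceback


def format_root_cause(exc: BaseException) -> str:
    """Follow __cause__/__context__ down to the original exception and format it."""
    root = exc
    while root.__cause__ is not None or root.__context__ is not None:
        root = root.__cause__ or root.__context__
    return "".join(traceback.format_exception(type(root), root, root.__traceback__))
```

The termination log would then carry the underlying vLLM/engine traceback rather than the RuntimeError wrapper.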

Also, it looks like you're running into issues with the formatter, so you may want to try installing and enabling pre-commit to ensure everything is formatted correctly.

@NickLucche
Contributor Author

Thanks a lot for your reviews!
Now, while I work on adding those misconfigurations as tests, the main issue I had with serving-time exceptions is that the ones happening within the server (AsyncLLMEngine) occur in a separate process, and I couldn't see any way of surfacing them without touching upstream code.

With the help of @dtrifiro, the best thing I have now is yet another check on the state of the AsyncEngineClient, notifying the user to go check the logs where the server process printed its stack trace.
Let me know if you see a better way to do this @joerunde!
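Roughly what that extra check looks like (a sketch only; the attribute names come from the engine.errored / engine.is_running flags mentioned below, not from a confirmed AsyncEngineClient API, and write_termination_log is the helper sketched in the PR description):

```python
def check_engine_health(engine, write_termination_log) -> None:
    """If the engine process died, leave a pointer to its logs in the termination log."""
    if getattr(engine, "errored", False) or not getattr(engine, "is_running", True):
        # the actual stack trace was printed by the engine's own process,
        # so the best we can do here is point the operator at those logs
        write_termination_log(
            "AsyncLLMEngine crashed in a separate process; "
            "see the engine/server logs for the original stack trace."
        )
```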

FYI: I did try to remove the engine.errored and not engine.is_running check from within the gRPC server (along with the stop event) to get slightly cleaner logic, where both servers mind their own business and the wrapper can poke at both, but I see how shutting the gRPC server down requires access to a handler the task does not expose.

@NickLucche
Contributor Author

I added a way to test some of those cases you mentioned; getting all CLI errors is a bit tricky though, as they may happen when the engine is created (in vLLM space) but still in a separate server process.

@codecov-commenter

codecov-commenter commented Sep 13, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 9 lines in your changes missing coverage. Please review.

Project coverage is 56.80%. Comparing base (20db20b) to head (d392e95).

Files with missing lines Patch % Lines
src/vllm_tgis_adapter/utils.py 55.55% 4 Missing ⚠️
src/vllm_tgis_adapter/__main__.py 80.00% 1 Missing and 2 partials ⚠️
tests/test_termination_log.py 86.66% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #118      +/-   ##
==========================================
+ Coverage   55.64%   56.80%   +1.16%     
==========================================
  Files          24       25       +1     
  Lines        1488     1528      +40     
  Branches      269      277       +8     
==========================================
+ Hits          828      868      +40     
+ Misses        583      582       -1     
- Partials       77       78       +1     

☔ View full report in Codecov by Sentry.

@NickLucche
Contributor Author

@dtrifiro it seems unrelated to this PR, but my tests are sometimes hanging locally when checking out master at 7d94eae ([build(deps): bump mypy from 1.11.1 to 1.11.2]) with:

INFO 09-13 16:45:09 logs.py:155] generate{input=[b'The answer to life the universe ...', b'Medicinal herbs '] prefix_id= correlation_id=None adapter_id= input_chars=[66] params=stopping { max_new_tokens: 10 } tokenization_time=2.47ms queue_time=0.97ms inference_time=2465.34ms time_per_token=246.53ms total_time=2468.77ms input_toks=5}: Sub-request 1 from batch of 2 generated 10 tokens before MAX_TOKENS, output 10 chars: b'          '
.Gracefully stopping gRPC server
INFO 09-13 16:45:09 launcher.py:67] Gracefully stopping http server
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO 09-13 16:45:09 server.py:228] vLLM ZMQ RPC Server was interrupted.
INFO 09-13 16:45:09 async_llm_engine.py:60] Engine is gracefully shutting down.
INFO 09-13 16:45:09 multiproc_worker_utils.py:136] Terminating local vLLM worker processes

Ran them with python -m pytest -s tests 🤔

This includes picking up server config errors, but does NOT make any attempt at recovering exception stacks raised inside vllm's RPC server (they live in a separate process), other than reporting a RuntimeError.
@dtrifiro merged commit 355a088 into opendatahub-io:main on Sep 17, 2024
3 checks passed
@NickLucche deleted the termination-log branch on September 17, 2024 at 08:48