
[CI/Build] Basic server correctness test #237

Merged: 5 commits from basic_server_correctness into main on May 29, 2024

Conversation

derekk-nm

Introducing an end-to-end test case that verifies basic correctness of the vllm server by comparing the tokens output by the vllm OpenAI server with tokens generated by the HuggingFace model created with AutoModelForCausalLM.from_pretrained().

Updates HfRunner() to accept a HuggingFace access token so that it can retrieve restricted-access models.

The new HfRunnerNM.generate_greedy_logprobs_nm_use_tokens() allows us to compare the HuggingFace-generated results (which report logprobs with token ids) with those from the vllm OpenAI server (which reports logprobs with token text). This required a new _decode_token_by_position_index() method to properly compute each token's string using a lookback over the generated tokens list.
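
For illustration, the lookback decode is roughly the following (a minimal sketch with hypothetical names and window size, not the actual HfRunnerNM code):

# Hypothetical sketch of the lookback decode; names and the window size are illustrative.
from transformers import AutoTokenizer

def decode_token_at_index(tokenizer, token_ids, index, lookback=4):
    # Decode a short window of ids ending at `index`, then the same window
    # without the final id; the difference is that token's surface text,
    # including any leading space or merged bytes that would be mangled if
    # the id were decoded in isolation.
    start = max(0, index - lookback)
    with_token = tokenizer.decode(token_ids[start:index + 1])
    without_token = tokenizer.decode(token_ids[start:index])
    return with_token[len(without_token):]

# Call shape only; gpt2 is used here just because it is ungated.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
ids = tokenizer("Absolutely! Here's an updated version").input_ids
print(decode_token_at_index(tokenizer, ids, 3))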

Enhances the output of the check_logprobs_close() function to provide more details about the failing tokens.
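
The failure output produced by the enhanced check looks like the phi-2 example later in this thread; a simplified sketch of the comparison (not the real check_logprobs_close() signature):

def check_tokens_close(prompt_idx, hf_tokens, vllm_top_tokens, hf_text, vllm_text):
    # For each generated position, the HuggingFace token must appear among the
    # tokens the vllm server reported in its top logprobs for that position.
    for token_idx, hf_tok in enumerate(hf_tokens):
        allowed = vllm_top_tokens[token_idx]
        assert hf_tok in allowed, (
            f"hf_model token {hf_tok!r} not in {allowed!r}\n"
            f"prompt index {prompt_idx}, token index {token_idx}:\n"
            f"hf_model:\t{hf_text!r}\n"
            f"vllm_model:\t{vllm_text!r}")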

Adds the test to the appropriate skip-*.txt files so that this long-running test won’t automatically run during dev push workflows.

To run this test manually:
[assumes that you’ve downloaded and installed the local nm-vllm package with pip install -e .[sparse] and all of the packages from requirements-common.txt, requirements-cuda.txt, and requirements-dev.txt]

  • Define the HF_TOKEN environment variable with a valid HuggingFace access token
  • cd to the nm-vllm directory
  • Run the test with the command:
    -- python3 -m pytest --forked tests/basic_correctness/test_basic_server_correctness.py -k test_models_on_server

[note that running this from my local env I needed to include the “--import-mode importlib“ option to work around a known issue in vllm]

@derekk-nm
Author

This test is failing today. Something's been broken over the weekend. The exception is:

==== server startup command args ====
--model mistralai/Mistral-7B-Instruct-v0.2 --max-model-len 4096 --disable-log-requests --tensor-parallel-size 2 --dtype half
====
(ServerRunner pid=1801782) Traceback (most recent call last):
(ServerRunner pid=1801782)   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
(ServerRunner pid=1801782)     return _run_code(code, main_globals, None,
(ServerRunner pid=1801782)   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
(ServerRunner pid=1801782)     exec(code, run_globals)
(ServerRunner pid=1801782)   File "/network/derekk/testdev1/nm-vllm/vllm/entrypoints/openai/api_server.py", line 23, in <module>
(ServerRunner pid=1801782)     from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
(ServerRunner pid=1801782)   File "/network/derekk/testdev1/nm-vllm/vllm/entrypoints/openai/serving_chat.py", line 15, in <module>
(ServerRunner pid=1801782)     from vllm.model_executor.guided_decoding import (
(ServerRunner pid=1801782)   File "/network/derekk/testdev1/nm-vllm/vllm/model_executor/guided_decoding/__init__.py", line 5, in <module>
(ServerRunner pid=1801782)     from vllm.model_executor.guided_decoding.lm_format_enforcer_decoding import (
(ServerRunner pid=1801782)   File "/network/derekk/testdev1/nm-vllm/vllm/model_executor/guided_decoding/lm_format_enforcer_decoding.py", line 5, in <module>
(ServerRunner pid=1801782)     from lmformatenforcer import (CharacterLevelParser, JsonSchemaParser,
(ServerRunner pid=1801782) ModuleNotFoundError: No module named 'lmformatenforcer'

@derekk-nm
Author

I don't understand why the build was skipped. I didn't try to skip it.

@dbarbuzzi

A couple of notes:

  • The one test failure in the remote-push job is an intermittent marlin-related failure
  • I ran these tests for the magic-wand and nm-vllm RCs for release testing and they both passed

derekk-nm force-pushed the basic_server_correctness branch from b00a664 to ba3866a on May 20, 2024 12:09
@derekk-nm
Author

After rebasing this branch onto main, the test is passing for me with the single Mistral model:

/root/pyvenv/nmv1/bin/python3 -m pytest --forked --import-mode importlib tests/basic_correctness/test_basic_server_correctness.py -k test_models_on_server 
============================= test session starts ==============================
platform linux -- Python 3.10.12, pytest-8.2.1, pluggy-1.5.0
rootdir: /network/derekk/testdev1/nm-vllm
configfile: pyproject.toml
plugins: rerunfailures-14.0, forked-1.6.0, anyio-4.3.0, shard-0.1.2, asyncio-0.23.7
asyncio: mode=strict
collected 2 items
Running 2 items in this shard

tests/basic_correctness/test_basic_server_correctness.py ..              [100%]

======================== 2 passed in 767.36s (0:12:47) =========================

@derekk-nm
Author

derekk-nm commented May 22, 2024

Per Slack discussions, I've updated the test to include most of the remaining models in the test execution (some need to be skipped if the model requires a GPU device capability greater than that available on the GPU under test). It was also necessary to ignore "special tokens" output by the HuggingFace runner for a few prompts in a number of models. The practice of simply converting any special token to an empty string worked for all but one test:

============================= test session starts ==============================
platform linux -- Python 3.10.12, pytest-8.2.1, pluggy-1.5.0
rootdir: /network/derekk/testdev1/nm-vllm
configfile: pyproject.toml
plugins: rerunfailures-14.0, forked-1.6.0, anyio-4.3.0, shard-0.1.2, asyncio-0.23.7
asyncio: mode=strict
collected 20 items
Running 20 items in this shard

tests/basic_correctness/test_basic_server_correctness.py ......Fsss..... [ 75%]
.Fsss                                                                    [100%]

....
=========================== short test summary info ============================
FAILED tests/basic_correctness/test_basic_server_correctness.py::test_models_on_server[None-3-32-microsoft/phi-2-2048-None-None]
FAILED tests/basic_correctness/test_basic_server_correctness.py::test_models_on_server[2-3-32-microsoft/phi-2-2048-None-None]
============= 2 failed, 12 passed, 6 skipped in 6544.64s (1:49:04) =============

The failure is the same for both executions with the same model:

E                   AssertionError: hf_model token '! Here’' not in [['’', '‘', '”']]
E                   prompt index 23, token index 4:
E                   hf_model:	'Absolutely! Here! Here’s an updated version of the essay that includes a few more anecdotes:\n\n<|im_start|>user\nWrite a'
E                   vllm_model:	'Absolutely! Here’s an updated version of the essay that includes a few more anecdotes:\n\nMy friendship with Sarah began in the tenth grade, during'

The HuggingFace response in this case without the hack had this error:

E                   AssertionError: hf_model token '�' not in [['', "'s", ' are']]
E                   prompt index 23, token index 3:
E                   hf_model:	'Absolutely! Here�! Here’s an updated version of the essay that includes a few more anecdotes:\n\n<|im_start|>user\nWrite a'
E                   vllm_model:	'Absolutely! Here’s an updated version of the essay that includes a few more anecdotes:\n\nI met Sarah in the tenth grade during a challenging time'

So, it's not really related to the special token.
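
For reference, the special-token hack is roughly the following (a hypothetical sketch, not the exact test code):

def strip_special_tokens(tokenizer, text):
    # Drop any tokenizer special tokens (e.g. <|im_start|>, <|endoftext|>) from
    # the decoded HuggingFace output so it lines up with the vllm server
    # output, which does not include them.
    for special in tokenizer.all_special_tokens:
        text = text.replace(special, "")
    return text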

derekk-nm added 3 commits May 28, 2024 11:25
Introducing an end-to-end test case that verifies basic correctness of the vllm server by comparing the tokens output by the vllm OpenAI server with tokens generated by the HuggingFace model created with AutoModelForCausalLM.from_pretrained().
Test other models.
Skip execution if the model requires a GPU device capability greater than that available on the current device (reusing approach from test_gptq_marlin.py).
Adds a hack to ignore special tokens after decoding the HuggingFace response so that we can fairly compare with the vllm server response.
This model fails the test with a specific prompt; to be addressed later.
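
The device-capability skip mentioned in the commits above is roughly the following (a hypothetical sketch assumed to mirror the approach in test_gptq_marlin.py; the helper name and threshold encoding are illustrative):

import pytest
import torch

def skip_if_capability_too_low(required_capability):
    # Skip the test when the current GPU cannot run the model, e.g. a model
    # that needs compute capability 80 (8.0) on an older device.
    major, minor = torch.cuda.get_device_capability()
    device_capability = major * 10 + minor
    if required_capability is not None and device_capability < required_capability:
        pytest.skip(f"device capability {device_capability} < required {required_capability}")
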
derekk-nm force-pushed the basic_server_correctness branch from a9451b9 to 67acb7f on May 28, 2024 15:25
derekk-nm marked this pull request as ready for review on May 28, 2024 15:27
@derekk-nm
Author

I've rebased this onto the latest nm-vllm/main. At this point, the test includes a number of models, but skips a few that don't work with HuggingFace out of the box, and one that fails the test for a specific prompt. I've got Asana tickets to address these later, so that we can get this committed and running now.

@andy-neuma (Member) left a comment

cool.

@andy-neuma
Member

@derekk-nm could you add a README in "neuralmagic" or "neuralmagic/tests" that outlines:

  • the goal of these tests (this can be rather brief, but should be enough for other folks to understand)
  • how to add/remove models

@andy-neuma (Member) left a comment

thanks

derekk-nm added 2 commits May 28, 2024 22:30
entries have been moved to the bug report, where failing models will be tracked.
removed some additional models that do not work in the build/test env (until a resolution is found)
expanded doc on the test case
added a README for the *_skip.txt files.
adding tests/basic_correctness/test_basic_server_correctness.py to skip-for-remote-push-tmp.txt
derekk-nm merged commit f687019 into main on May 29, 2024
12 checks passed
derekk-nm deleted the basic_server_correctness branch on May 29, 2024 13:53