[BUG] vLLM inference with Qwen-72B-Chat returns abnormal output #728

Closed
2 tasks done
a1164714 opened this issue Dec 5, 2023 · 9 comments

a1164714 commented Dec 5, 2023

Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in the FAQ?

  • I have searched the FAQ

Current Behavior

input:

你好?

output:

你好!很高兴为你提供帮助。<|im_end|>
<|endoftext|>Human:<|im_end|>
<|im_start|>
<|im_start|><|im_start|>
<|im_start|>
[... the line "<|im_start|>" is repeated on each of the remaining ~55 lines of the completion ...]

Expected Behavior

output:

你好!很高兴为你提供帮助。

Steps To Reproduce

1. Run the server command:

python -m vllm.entrypoints.openai.api_server --model ./models--Qwen--Qwen-72B-Chat/snapshots/87272d8b8fabbdd0727c376fe0271f0b5cd10b24 --host 0.0.0.0 --port 8081 --trust-remote-code --served-model-name qwen-72b-chat --tensor-parallel-size=4 --gpu-memory-utilization 0.98 --dtype bfloat16

2. Server log:

INFO 12-05 12:59:08 api_server.py:711] args: Namespace(host='0.0.0.0', port=8081, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name='qwen-72b-chat', chat_template=None, response_role='assistant', model='./models--Qwen--Qwen-72B-Chat/snapshots/87272d8b8fabbdd0727c376fe0271f0b5cd10b24', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='bfloat16', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.98, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2023-12-05 12:59:09,980 WARNING utils.py:581 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
2023-12-05 12:59:10,265 INFO worker.py:1673 -- Started a local Ray instance.
INFO 12-05 12:59:11 llm_engine.py:73] Initializing an LLM engine with config: model='./models--Qwen--Qwen-72B-Chat/snapshots/87272d8b8fabbdd0727c376fe0271f0b5cd10b24', tokenizer='./models--Qwen--Qwen-72B-Chat/snapshots/87272d8b8fabbdd0727c376fe0271f0b5cd10b24', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=4, quantization=None, seed=0)
WARNING 12-05 12:59:11 tokenizer.py:79] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 12-05 13:00:10 llm_engine.py:222] # GPU blocks: 203, # CPU blocks: 409
WARNING 12-05 13:00:14 tokenizer.py:79] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
WARNING 12-05 13:00:14 api_server.py:115] No chat template provided. Chat API will not work.
INFO:     Started server process [15094]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8081 (Press CTRL+C to quit)

No chat template is defined for this tokenizer - using a default chat template that implements the ChatML format. If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.
INFO 12-05 13:01:44 async_llm_engine.py:379] Received request cmpl-6366bff6da6149b7b2f8b994f0347c20: prompt: '<|im_start|>user\n\n您好?\n<|im_end|>\n<|im_start|>assistant\n', sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.1, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], ignore_eos=False, max_tokens=128, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt token ids: [151644, 872, 271, 111308, 94432, 151645, 198, 151644, 77091, 198].
INFO 12-05 13:01:44 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%
INFO 12-05 13:01:49 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 23.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 4.4%, CPU KV cache usage: 0.0%
INFO 12-05 13:01:49 async_llm_engine.py:111] Finished request cmpl-6366bff6da6149b7b2f8b994f0347c20.
INFO:     11.176.16.154:27854 - "POST /v1/chat/completions HTTP/1.1" 200 OK
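
For reference, the client request itself is not shown in the report; a call along the following lines (a sketch, with the served model name and sampling values matched to the logged request) exercises the same /v1/chat/completions endpoint:

import requests  # sketch: assumes the server from step 1 is reachable on port 8081

resp = requests.post(
    "http://127.0.0.1:8081/v1/chat/completions",
    json={
        "model": "qwen-72b-chat",                            # --served-model-name
        "messages": [{"role": "user", "content": "你好？"}],
        "temperature": 0.1,                                  # matches the logged SamplingParams
        "max_tokens": 128,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])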

Environment

- OS: CentOS 7
- Python: 3.10
- Transformers: 4.35.0
- PyTorch: 2.1.0
- CUDA: 12.1
- vLLM: 0.2.3

Anything else?

No response

jklj077 (Contributor) commented Dec 5, 2023

You need to use vLLM together with FastChat; please follow the instructions in the README. vLLM by itself does not provide support for the Qwen chat models.

positive666 commented:

+1. I reproduced a similar problem following both the official vLLM instructions and the README (after deploying the vLLM API, the 1.8B chat model behaves like this; 7B and 14B are normal). Also, the LoRA-finetuned model's output is truncated under the vLLM framework, while it is normal without vLLM.

jklj077 (Contributor) commented Dec 13, 2023

@positive666 Did you adjust temperature/top_p or anything like that?

magnificent1208 commented:

Was this resolved in the end?

Modas-Li commented:

Same issue, unexpected truncation.

hzhwcmhf (Member) commented:

@Modas-Li @magnificent1208 Truncated model output can have many different causes (for the original poster it was most likely caused by not using FastChat). If you run into this, please open a separate issue, or reply in this thread with the specifics of your case (e.g. model size, how you launched it, the model output, decoding parameters, command-line output, etc.).

positive666 commented:

> @positive666 Did you adjust temperature/top_p or anything like that?

I used the official vLLM command, so everything should be at the default values. The same problem also reproduces through FastChat + vLLM. My guess is that, if it is a vLLM issue, it may be in vLLM's generate function code; I'll take another look once I finish the work on my plate.

Jankin66 commented Jan 1, 2024

I ran into the same problem. Has it been resolved?

jklj077 (Contributor) commented Jan 2, 2024

Based on your example:

  • The sampling configuration is
    sampling params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.1, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], ignore_eos=False, max_tokens=128, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True)

  • The input text is
    <|im_start|>user\n\n您好？\n<|im_end|>\n<|im_start|>assistant\n

  • The input token ids are
    [151644, 872, 271, 111308, 94432, 151645, 198, 151644, 77091, 198]

The configuration has the following problems:

  1. temperature=0.1, top_p=1.0: for Qwen models, please tune top_p rather than temperature.
  2. length_penalty: we recommend setting it to 1.1.
  3. stop_token_ids is not passed in.

The input has the following problems:

  1. The system prompt is missing.

The tokenization has the following problems:

  1. The role names and '\n' are not tokenized separately, so '\n\n' is merged into a single token and the template format is broken. The prompt is actually tokenized as
    [<|im_start|>, 'user', '\n\n', '您好', '？\n', <|im_end|>, '\n', <|im_start|>, 'assistant', '\n']

    but it should be
    [<|im_start|>, 'user', '\n', '\n', '您好', '？\n', <|im_end|>, '\n', <|im_start|>, 'assistant', '\n']

Using vLLM on its own is not recommended.
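
For illustration only, a minimal sketch of what these corrections could look like with vLLM's offline LLM API. The ids 151644 (<|im_start|>) and 151645 (<|im_end|>) appear in the logged prompt token ids; treating 151643 as <|endoftext|>, using "You are a helpful assistant." as the system prompt, and top_p=0.8 are assumptions for the sketch, not values confirmed in this thread:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Special-token ids (151643 = <|endoftext|> is an assumption; the other two are
# taken from the prompt token ids logged above).
IM_START, IM_END, ENDOFTEXT = 151644, 151645, 151643

MODEL = "Qwen/Qwen-72B-Chat"
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)

def chatml_ids(query, system="You are a helpful assistant."):
    # Encode the role names, the message text, and each '\n' separately, so that
    # '\n' cannot merge with neighbouring text into a single '\n\n' token.
    enc, nl = tokenizer.encode, tokenizer.encode("\n")
    ids = [IM_START] + enc("system") + nl + enc(system) + [IM_END] + nl
    ids += [IM_START] + enc("user") + nl + enc(query) + [IM_END] + nl
    ids += [IM_START] + enc("assistant") + nl
    return ids

llm = LLM(model=MODEL, trust_remote_code=True, tensor_parallel_size=4)
params = SamplingParams(
    top_p=0.8,                           # tune top_p rather than temperature
    max_tokens=128,
    stop_token_ids=[IM_END, ENDOFTEXT],  # stop on <|im_end|> / <|endoftext|>
)
outputs = llm.generate(prompt_token_ids=[chatml_ids("你好？")], sampling_params=params)
print(outputs[0].outputs[0].text)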
