[Bug]: Meta-Llama-3-3-70B-Instruct Outputs "!!!!" With Context Length above 10k #738
Comments
@ppatel-eng thank you for submitting the issue. Llama 3.3 is not yet fully validated by the team; any feedback is valuable, but we need some time to put the model on the official list of supported models. In the meantime, please collect your environment details with collect_env.py:
wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
For security purposes, please feel free to check the contents of collect_env.py before running it.
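(A minimal sketch of collecting the environment report, assuming the script is run with the same Python interpreter that has vLLM installed; only the wget URL comes from the comment above.)

```bash
# Download the environment-collection script (URL from the comment above),
# optionally review it, then run it to print the environment report.
# Running it with plain `python` is an assumption; use the interpreter of
# the environment where vLLM is installed.
wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
less collect_env.py     # optional: inspect the script before executing it
python collect_env.py   # prints the environment report to stdout
```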
Understood, thanks! The results from collect_env.py are below:
Hi @ppatel-eng, thank you for the update. It seems that the vllm-fork for Gaudi is not installed. Please try the following steps and run the test once again to see if that helps: $ git clone https://github.com/HabanaAI/vllm-fork.git
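(A minimal install sketch for the Gaudi fork; only the git clone URL comes from the comment above. The editable pip install is an assumption, and the exact requirements files, environment variables, and build steps are documented in the fork's README and may differ for your Gaudi software release.)

```bash
# Clone the Habana Gaudi fork of vLLM (URL from the comment above).
git clone https://github.com/HabanaAI/vllm-fork.git
cd vllm-fork

# The command below is an assumption (standard editable install);
# follow the fork's README for the exact build steps for your Gaudi release.
pip install -e .
```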
Hi, I tried that on the Llama 3.3 70B as well as on a Llama 3.1 70B and am seeing similar issues on both:
What is interesting is that once one of these issues occurs, subsequent smaller requests (<300 tokens) exhibit the same behavior. It appears the model gets "stuck" generating, because the logs show:
The other flag settings depend on the context length. For example, for a 32K context length you may need to decrease VLLM_GRAPH_RESERVED_MEM; the right value depends on the model and the context length: VLLM_GRAPH_RESERVED_MEM=0.02 for llama3.1-8b and VLLM_GRAPH_RESERVED_MEM=0.1 for llama3.1-70b.
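(A minimal sketch of how such a flag is typically passed when launching the server. Only the VLLM_GRAPH_RESERVED_MEM value comes from the comment above; the entry point, model name, and --max-model-len value are assumptions for illustration.)

```bash
# Example: lower the HPU graph memory reservation when serving a llama3.1-70b
# model with a 32K context. The value 0.1 is the one quoted above; the model
# name and max-model-len are placeholders and should match your deployment.
VLLM_GRAPH_RESERVED_MEM=0.1 \
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 32768
```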
I have the same issue. In my case the context length is less than 8K, and the error still sometimes occurs (TP size 2 or 4).
@lkm2835 we would like to reproduce the test. Please provide all the steps you are using to run the benchmark; we must be sure that we are running exactly the same procedure. Thanks.
@PatrykWo
And, if you run heavy inference (with any dataset) for a few days using chat completion, the model
Your current environment
Environment Details
Running in a Kubernetes environment with Habana Gaudi2 accelerators:
- Hardware: Habana Gaudi2 accelerators
- Deployment: Kubernetes cluster
- Node Resources:
- Habana Gaudi software version: 1.18
- vLLM version: 0.6.2+geb0d42fc
- Python version: 3.10
How would you like to use vllm
I would like to serve the Meta-Llama-3-3-70B-Instruct model.
Current Configuration
Meta-Llama-3-3-70B-Instruct:
  arguments:
    - --gpu-memory-utilization 0.90
    - --max-logprobs 5
    - --enable-auto-tool-choice
    - --tool-call-parser llama3_json
    - --download-dir /data
    - --tensor-parallel-size 4
    - --chat-template /data/chat_templates/tool_chat_template_llama31_json.jinja
  gpuLimit: 1
  numGPU: 4
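(For reference, outside Kubernetes these arguments correspond roughly to the standalone launch command sketched below. The flags are taken from the configuration above; the Hugging Face model name is an assumption and may differ in your deployment.)

```bash
# Roughly equivalent standalone launch of the vLLM 0.6.x OpenAI-compatible server.
# Flags mirror the Kubernetes config above; the model identifier is assumed.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --gpu-memory-utilization 0.90 \
    --max-logprobs 5 \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json \
    --download-dir /data \
    --tensor-parallel-size 4 \
    --chat-template /data/chat_templates/tool_chat_template_llama31_json.jinja
```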
Model Input Dumps
No response
🐛 Describe the bug
When we provide a context over 10k tokens (but sometimes with as little as 3k tokens), the model starts outputting exclamation points instead of meaningful text. We tested the same script with the same model on Nvidia A100s, serving it with exactly the same vLLM settings (vLLM version 0.6.2), and did not see this issue with contexts up to 60k tokens.
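(A minimal request sketch that exercises the failing path, assuming the server above is reachable on localhost:8000; the host, port, model name, and document placeholder are all assumptions and should be replaced with your deployment's values and a prompt long enough to exceed roughly 10k tokens.)

```bash
# Send one long-context chat completion to the OpenAI-compatible endpoint.
# Host, port, and model name are assumptions; replace <LONG_DOCUMENT_TEXT>
# with real text long enough to push the prompt past ~10k tokens.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [
          {"role": "user", "content": "Summarize the following document: <LONG_DOCUMENT_TEXT>"}
        ],
        "max_tokens": 256
      }'
```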
Example Response: