Your current environment
The output of `python collect_env.py`:
Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.12.8 (main, Dec 4 2024, 08:54:12) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-1019-aws-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA L40S
Nvidia driver version: 560.35.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R13 Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 1
Stepping: 1
BogoMIPS: 5299.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 1.5 MiB (48 instances)
L1i cache: 1.5 MiB (48 instances)
L2 cache: 24 MiB (48 instances)
L3 cache: 192 MiB (6 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-95
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] flashinfer==0.1.6+cu121torch2.4
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.47.1
[pip3] triton==3.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.6.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0    X       0-95            0               N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=0
LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
CUDA_VERSION=12.1.0
NVIDIA_REQUIRE_CUDA=cuda>=12.1 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526
NVIDIA_DRIVER_CAPABILITIES=compute,utility
VLLM_USAGE_SOURCE=production-docker-image
NVIDIA_VISIBLE_DEVICES=all
CUDA_MODULE_LOADING=LAZY
Model Input Dumps
No response
🐛 Describe the bug
We have fine-tuned a model based on TinyLlama/TinyLlama-1.1B-Chat-v1.0 and used llama.cpp to generate a .gguf file for it.
When we serve the model with llama.cpp, it works.
When we serve the model with vLLM, we get garbage output, similar to what was described in issue #10675.
As you can see above, we are using the latest vLLM (we run it from the `vllm/vllm-openai:latest` image on Docker Hub).
As an example, this client code:
import openai

client = openai.OpenAI(base_url="http://ip-172-41-47-76.us-west-2.compute.internal:8500/v1", api_key="none")
results = client.chat.completions.create(
    model='/data/models/trained/rgenter/tinyllama-24-11-20_00-10-44.10epoch.gguf',
    messages=[{
        'role': 'user',
        'content': '/meraki what alerts are active on network xyz?',
    }],
    temperature=0.0,
)
print(results)
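For reference, the raw prompt that vLLM constructs for this chat request appears verbatim in the "Received request" log line below. A minimal sketch for replaying that prompt through the plain completions endpoint, assuming the same server and model path (a comparison aid for separating chat-template construction from generation, not part of the original serving setup):

import openai

client = openai.OpenAI(base_url="http://ip-172-41-47-76.us-west-2.compute.internal:8500/v1", api_key="none")
# Prompt copied verbatim from the "Received request" log line below.
prompt = "<|user|>\n/meraki what alerts are active on network xyz?<s>\n<|assistant|>\n"
completion = client.completions.create(
    model='/data/models/trained/rgenter/tinyllama-24-11-20_00-10-44.10epoch.gguf',
    prompt=prompt,
    temperature=0.0,
    max_tokens=256,  # assumed cap for this sketch
)
print(completion.choices[0].text)

The vLLM server output is: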
INFO 01-23 18:10:34 api_server.py:712] vLLM API server version 0.6.6.post1
INFO 01-23 18:10:34 api_server.py:713] args: Namespace(subparser='serve', model_tag='/data/models/trained/rgenter/tinyllama-24-11-20_00-10-44.10epoch.gguf', config='', host=None, port=8500, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/data/models/trained/rgenter/tinyllama-24-11-20_00-10-44.10epoch.gguf', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x77e7c6648d60>)
INFO 01-23 18:10:34 api_server.py:199] Started engine process with PID 679125
INFO 01-23 18:10:37 config.py:2272] Downcasting torch.float32 to torch.float16.
INFO 01-23 18:10:41 config.py:2272] Downcasting torch.float32 to torch.float16.
INFO 01-23 18:10:42 config.py:510] This model supports multiple tasks: {'embed', 'score', 'reward', 'classify', 'generate'}. Defaulting to 'generate'.
WARNING 01-23 18:10:42 config.py:588] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Merges were not in checkpoint, building merges on the fly.
INFO 01-23 18:10:46 config.py:510] This model supports multiple tasks: {'reward', 'classify', 'generate', 'score', 'embed'}. Defaulting to 'generate'.
WARNING 01-23 18:10:46 config.py:588] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 01-23 18:10:47 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='/data/models/trained/rgenter/tinyllama-24-11-20_00-10-44.10epoch.gguf', speculative_config=None, tokenizer='/data/models/trained/rgenter/tinyllama-24-11-20_00-10-44.10epoch.gguf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.GGUF, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/models/trained/rgenter/tinyllama-24-11-20_00-10-44.10epoch.gguf, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
13%|█▎ | 4269/32000 [00:02<00:18, 1481.84it/s]You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
19%|█▉ | 6099/32000 [00:03<00:19, 1309.19it/s]Merges were not in checkpoint, building merges on the fly.
100%|██████████| 32000/32000 [00:27<00:00, 1172.00it/s]
100%|██████████| 32000/32000 [00:27<00:00, 1166.88it/s]
INFO 01-23 18:11:20 selector.py:120] Using Flash Attention backend.
INFO 01-23 18:11:21 model_runner.py:1094] Starting to load model /data/models/trained/rgenter/tinyllama-24-11-20_00-10-44.10epoch.gguf...
/usr/local/lib/python3.12/dist-packages/torch/nested/__init__.py:226: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
return _nested.nested_tensor(
INFO 01-23 18:11:24 model_runner.py:1099] Loading model weights took 1.1184 GB
INFO 01-23 18:11:25 worker.py:241] Memory profiling takes 0.49 seconds
INFO 01-23 18:11:25 worker.py:241] the current vLLM instance can use total_gpu_memory (44.42GiB) x gpu_memory_utilization (0.90) = 39.98GiB
INFO 01-23 18:11:25 worker.py:241] model weights take 1.12GiB; non_torch_memory takes 0.07GiB; PyTorch activation peak memory takes 0.30GiB; the rest of the memory reserved for KV Cache is 38.49GiB.
INFO 01-23 18:11:25 gpu_executor.py:76] # GPU blocks: 114661, # CPU blocks: 11915
INFO 01-23 18:11:25 gpu_executor.py:80] Maximum concurrency for 2048 tokens per request: 895.79x
INFO 01-23 18:11:28 model_runner.py:1415] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:12<00:00, 2.83it/s]
INFO 01-23 18:11:41 model_runner.py:1535] Graph capturing finished in 12 secs, took 0.22 GiB
INFO 01-23 18:11:41 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 16.41 seconds
Exception in thread Thread-3 (_report_usage_worker):
Traceback (most recent call last):
File "/usr/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
File "/usr/lib/python3.12/threading.py", line 1012, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/usage/usage_lib.py", line 148, in _report_usage_worker
self._report_usage_once(model_architecture, usage_context, extra_kvs)
File "/usr/local/lib/python3.12/dist-packages/vllm/usage/usage_lib.py", line 187, in _report_usage_once
self._write_to_file(data)
File "/usr/local/lib/python3.12/dist-packages/vllm/usage/usage_lib.py", line 216, in _write_to_file
os.makedirs(os.path.dirname(_USAGE_STATS_JSON_PATH), exist_ok=True)
File "<frozen os>", line 225, in makedirs
PermissionError: [Errno 13] Permission denied: '/home/rgenter/.config/vllm'
INFO 01-23 18:11:42 api_server.py:640] Using supplied chat template:
INFO 01-23 18:11:42 api_server.py:640] None
INFO 01-23 18:11:42 launcher.py:19] Available routes are:
INFO 01-23 18:11:42 launcher.py:27] Route: /openapi.json, Methods: HEAD, GET
INFO 01-23 18:11:42 launcher.py:27] Route: /docs, Methods: HEAD, GET
INFO 01-23 18:11:42 launcher.py:27] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 01-23 18:11:42 launcher.py:27] Route: /redoc, Methods: HEAD, GET
INFO 01-23 18:11:42 launcher.py:27] Route: /health, Methods: GET
INFO 01-23 18:11:42 launcher.py:27] Route: /tokenize, Methods: POST
INFO 01-23 18:11:42 launcher.py:27] Route: /detokenize, Methods: POST
INFO 01-23 18:11:42 launcher.py:27] Route: /v1/models, Methods: GET
INFO 01-23 18:11:42 launcher.py:27] Route: /version, Methods: GET
INFO 01-23 18:11:42 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 01-23 18:11:42 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 01-23 18:11:42 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO 01-23 18:11:42 launcher.py:27] Route: /pooling, Methods: POST
INFO 01-23 18:11:42 launcher.py:27] Route: /score, Methods: POST
INFO 01-23 18:11:42 launcher.py:27] Route: /v1/score, Methods: POST
INFO: Started server process [678607]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8500 (Press CTRL+C to quit)
INFO 01-23 18:12:24 chat_utils.py:333] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO 01-23 18:12:24 logger.py:37] Received request chatcmpl-60b9e13b0bcb480cadcbe59324ce12e4: prompt: '<|user|>\n/meraki what alerts are active on network xyz?<s>\n<|assistant|>\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=2019, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 01-23 18:12:24 engine.py:267] Added request chatcmpl-60b9e13b0bcb480cadcbe59324ce12e4.
INFO 01-23 18:12:27 metrics.py:467] Avg prompt throughput: 5.8 tokens/s, Avg generation throughput: 203.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO: 172.41.47.71:57808 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 01-23 18:12:40 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 76.6 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 01-23 18:12:50 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.