[Bug]: Fail to use deepseek-vl2 #12118

Closed
1 task done
gystar opened this issue Jan 16, 2025 · 17 comments · Fixed by #12143
Labels
bug Something isn't working

Comments

@gystar

gystar commented Jan 16, 2025

Your current environment

The output of `python collect_env.py`
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A800 80GB PCIe
GPU 1: NVIDIA A800 80GB PCIe
GPU 2: NVIDIA A800 80GB PCIe
GPU 3: NVIDIA A800 80GB PCIe

Nvidia driver version: 560.35.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        52 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               192
On-line CPU(s) list:                  0-191
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) Platinum 8488C
CPU family:                           6
Model:                                143
Thread(s) per core:                   2
Core(s) per socket:                   48
Socket(s):                            2
Stepping:                             8
CPU max MHz:                          3800.0000
CPU min MHz:                          800.0000
BogoMIPS:                             4800.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect user_shstk avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req hfi vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                       VT-x
L1d cache:                            4.5 MiB (96 instances)
L1i cache:                            3 MiB (96 instances)
L2 cache:                             192 MiB (96 instances)
L3 cache:                             210 MiB (2 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-47,96-143
NUMA node1 CPU(s):                    48-95,144-191
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] open_clip_torch==2.30.0
[pip3] pyzmq==26.2.0
[pip3] torch==2.5.1
[pip3] torchaudio==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.49.0.dev0
[pip3] triton==3.1.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
[conda] nvidia-ml-py              12.560.30                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi
[conda] open-clip-torch           2.30.0                   pypi_0    pypi
[conda] pyzmq                     26.2.0                   pypi_0    pypi
[conda] torch                     2.5.1                    pypi_0    pypi
[conda] torchaudio                2.5.1                    pypi_0    pypi
[conda] torchvision               0.20.1                   pypi_0    pypi
[conda] transformers              4.49.0.dev0              pypi_0    pypi
[conda] triton                    3.1.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.6.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    SYS     SYS     NODE    0-47,96-143     0               N/A
GPU1    NODE     X      SYS     SYS     NODE    0-47,96-143     0               N/A
GPU2    SYS     SYS      X      NODE    SYS     48-95,144-191   1               N/A
GPU3    SYS     SYS     NODE     X      SYS     48-95,144-191   1               N/A
NIC0    NODE    NODE    SYS     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

CUDA_HOME=/usr/local/cuda
LD_LIBRARY_PATH=/home/xx/miniconda3/envs/xx/lib/python3.10/site-packages/cv2/../../lib64:/usr/local/cuda/lib64:
CUDA_MODULE_LOADING=LAZY

Model Input Dumps

No response

🐛 Describe the bug

I ran the original deepseek-vl2 example from the documentation:

import os
import random
import sys
from argparse import ArgumentParser
from typing import Dict, List

import PIL.Image
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset
from vllm.assets.video import VideoAsset
from vllm.utils import FlexibleArgumentParser


os.environ['https_proxy'] = 'http://127.0.0.1:7890'
os.environ['http_proxy'] = 'http://127.0.0.1:7890'


os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
# Deepseek-VL2
def run_deepseek_vl2(question: str, modality: str):
    assert modality == "image"

    #model_name = "deepseek-ai/deepseek-vl2"
    model_name="/home/xxxx/model/deepseek-vl2"

    llm = LLM(model=model_name,
              max_model_len=4096,
              max_num_seqs=2,
              disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache,
              hf_overrides={"architectures": ["DeepseekVLV2ForCausalLM"]})

    prompt = f"<|User|>: <image>\n{question}\n\n<|Assistant|>:"
    stop_token_ids = None
    return llm, prompt, stop_token_ids


def get_multi_modal_input(args):
    """
    return {
        "data": image or video,
        "question": question,
    }
    """
    if args.modality == "image":
        # Input image and question
        image = ImageAsset("cherry_blossom") \
            .pil_image.convert("RGB")
        img_question = "What is the content of this image?"

        return {
            "data": image,
            "question": img_question,
        }

    if args.modality == "video":
        # Input video and question
        video = VideoAsset(name="sample_demo_1.mp4",
                           num_frames=args.num_frames).np_ndarrays
        vid_question = "Why is this video funny?"

        return {
            "data": video,
            "question": vid_question,
        }

    msg = f"Modality {args.modality} is not supported."
    raise ValueError(msg)


def apply_image_repeat(image_repeat_prob, num_prompts, data, prompt, modality):
    """Repeats images with provided probability of "image_repeat_prob". 
    Used to simulate hit/miss for the MM preprocessor cache.
    """
    assert (image_repeat_prob <= 1.0 and image_repeat_prob >= 0)
    no_yes = [0, 1]
    probs = [1.0 - image_repeat_prob, image_repeat_prob]

    inputs = []
    cur_image = data
    for i in range(num_prompts):
        if image_repeat_prob is not None:
            res = random.choices(no_yes, probs)[0]
            if res == 0:
                # No repeat => Modify one pixel
                cur_image = cur_image.copy()
                new_val = (i // 256 // 256, i // 256, i % 256)
                cur_image.putpixel((0, 0), new_val)

        inputs.append({
            "prompt": prompt,
            "multi_modal_data": {
                modality: cur_image
            }
        })

    return inputs

def main(args):
    modality = args.modality
    mm_input = get_multi_modal_input(args)
    data = mm_input["data"]
    question = mm_input["question"]

    llm, prompt, stop_token_ids = run_deepseek_vl2(question, modality)

    # We set temperature to 0.2 so that outputs can be different
    # even when all prompts are identical when running batch inference.
    sampling_params = SamplingParams(temperature=0.2,
                                     max_tokens=64,
                                     stop_token_ids=stop_token_ids)

    assert args.num_prompts > 0
    if args.num_prompts == 1:
        # Single inference
        inputs = {
            "prompt": prompt,
            "multi_modal_data": {
                modality: data
            },
        }

    else:
        # Batch inference
        if args.image_repeat_prob is not None:
            # Repeat images with specified probability of "image_repeat_prob"
            inputs = apply_image_repeat(args.image_repeat_prob,
                                        args.num_prompts, data, prompt,
                                        modality)
        else:
            # Use the same image for all prompts
            inputs = [{
                "prompt": prompt,
                "multi_modal_data": {
                    modality: data
                },
            } for _ in range(args.num_prompts)]

    if args.time_generate:
        import time
        start_time = time.time()
        outputs = llm.generate(inputs, sampling_params=sampling_params)
        elapsed_time = time.time() - start_time
        print("-- generate time = {}".format(elapsed_time))

    else:
        outputs = llm.generate(inputs, sampling_params=sampling_params)

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)



if __name__ == "__main__":
    parser = FlexibleArgumentParser(
        description='Demo on using vLLM for offline inference with '
        'vision language models for text generation')
    parser.add_argument('--num-prompts',
                        type=int,
                        default=4,
                        help='Number of prompts to run.')
    parser.add_argument('--modality',
                        type=str,
                        default="image",
                        choices=['image', 'video'],
                        help='Modality of the input.')
    parser.add_argument('--num-frames',
                        type=int,
                        default=16,
                        help='Number of frames to extract from the video.')

    parser.add_argument(
        '--image-repeat-prob',
        type=float,
        default=None,
        help='Simulates the hit-ratio for multi-modal preprocessor cache'
        ' (if enabled)')

    parser.add_argument(
        '--disable-mm-preprocessor-cache',
        action='store_true',
        help='If True, disables caching of multi-modal preprocessor/mapper.')

    parser.add_argument(
        '--time-generate',
        action='store_true',
        help='If True, then print the total generate() call time')

    args = parser.parse_args()
    main(args)

but got the following issue:

Traceback (most recent call last):
  File "/home/xxxxx/miniconda3/envs/xxxxx/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1073, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/home/xxxxx/miniconda3/envs/xxxxx/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 775, in __getitem__
    raise KeyError(key)
KeyError: 'deepseek_vl_v2'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xxxxx/xxxxx/DeepSeek-VL2/direct_acc.py", line 298, in <module>
    main(args)
  File "/home/xxxxx/xxxxx/DeepSeek-VL2/direct_acc.py", line 212, in main
    llm, prompt, stop_token_ids = run_deepseek_vl2(question, modality)
  File "/home/xxxxx/xxxxx/DeepSeek-VL2/direct_acc.py", line 134, in run_deepseek_vl2
    llm = LLM(model=model_name,
  File "/home/xxxxx/miniconda3/envs/xxxxx/lib/python3.10/site-packages/vllm/utils.py", line 986, in inner
    return fn(*args, **kwargs)
  File "/home/xxxxx/miniconda3/envs/xxxxx/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 230, in __init__
    self.llm_engine = self.engine_class.from_engine_args(
  File "/home/xxxxx/miniconda3/envs/xxxxx/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 514, in from_engine_args
    engine_config = engine_args.create_engine_config(usage_context)
  File "/home/xxxxx/miniconda3/envs/xxxxx/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 1044, in create_engine_config
    model_config = self.create_model_config()
  File "/home/xxxxx/miniconda3/envs/xxxxx/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 970, in create_model_config
    return ModelConfig(
  File "/home/xxxxx/miniconda3/envs/xxxxx/lib/python3.10/site-packages/vllm/config.py", line 276, in __init__
    hf_config = get_config(self.model, trust_remote_code, revision,
  File "/home/xxxxx/miniconda3/envs/xxxxx/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 239, in get_config
    raise e
  File "/home/xxxxx/miniconda3/envs/xxxxx/lib/python3.10/site-packages/vllm/transformers_utils/config.py", line 219, in get_config
    config = AutoConfig.from_pretrained(
  File "/home/xxxxx/miniconda3/envs/xxxxx/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1075, in from_pretrained
    raise ValueError(
ValueError: The checkpoint you are trying to load has model type `deepseek_vl_v2` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`

The deepseek-vl2 model has not been included in the newest version of Transformers, so how can we use it with vLLM?

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
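
The `KeyError: 'deepseek_vl_v2'` in the traceback above comes from Transformers looking up the checkpoint's `model_type` in its own registry. A minimal sketch for checking what a local checkpoint declares, assuming the model directory contains a standard Hugging Face `config.json`. The helper name `describe_checkpoint` is hypothetical and is not part of vLLM or Transformers:

```python
import json
from pathlib import Path


def describe_checkpoint(model_dir: str) -> dict:
    """Return the model_type and architectures declared in config.json.

    Hypothetical helper: it only reads the JSON file and does not
    import transformers or vllm.
    """
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return {
        "model_type": config.get("model_type"),
        # May be missing; the script above passes it via hf_overrides.
        "architectures": config.get("architectures"),
    }
```

If `architectures` comes back as `None`, that would be consistent with the script above having to pass `hf_overrides={"architectures": ["DeepseekVLV2ForCausalLM"]}`.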
@gystar gystar added the bug Something isn't working label Jan 16, 2025
@gystar gystar changed the title [Bug]: Failt to use deepseek-vl2 [Bug]: Fail to use deepseek-vl2 Jan 16, 2025
@DarkLight1337
Member

You need to use the latest code (corresponding to the latest docs), not the latest release (corresponding to the stable docs) of vLLM. Check here for installation instructions.

@gystar
Author

gystar commented Jan 16, 2025

@DarkLight1337 It works, thanks a lot!
I have to point out that the version of transformers must be 4.45.2.
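
Since the working setup reported here depends on both the vLLM build and the Transformers version, it can help to print what is actually installed in the active environment. A minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("vllm", "transformers", "torch")) -> dict:
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this inside the same conda environment or container as the failing script shows which versions the script actually sees.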

@gystar
Author

gystar commented Jan 16, 2025

@DarkLight1337
However, I encountered a "CUDA out of memory" error because the model is quite large and is mainly using the first GPU (there are two GPUs in total). Is it possible to configure a more balanced usage of both GPUs?

@DarkLight1337
Member

You can use TP to distribute the model across your GPUs. We assume that your two GPUs are identical.
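
As a rough illustration of the tensor-parallel (TP) suggestion, the sketch below collects the `LLM` constructor arguments used elsewhere in this thread and sets `tensor_parallel_size` to the number of GPUs. The helper names `build_llm_kwargs` and `make_llm` are hypothetical, and the other values are copied from the script in this issue rather than being recommendations:

```python
def build_llm_kwargs(model_path: str, num_gpus: int = 2) -> dict:
    """Collect LLM constructor arguments for tensor-parallel inference."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return {
        "model": model_path,
        "max_model_len": 4096,
        "max_num_seqs": 2,
        # Shard the model weights across this many GPUs.
        "tensor_parallel_size": num_gpus,
        "hf_overrides": {"architectures": ["DeepseekVLV2ForCausalLM"]},
    }


def make_llm(model_path: str, num_gpus: int = 2):
    """Create the engine; vllm is imported lazily so the helper above
    can be inspected without a GPU environment."""
    from vllm import LLM

    return LLM(**build_llm_kwargs(model_path, num_gpus))
```

With `CUDA_VISIBLE_DEVICES="0,1"` as in the original script, `make_llm(model_path, num_gpus=2)` would ask vLLM to split the model over both visible devices.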

@gystar
Author

gystar commented Jan 16, 2025

You can use TP to distribute the model across your GPUs. We assume that your two GPUs are identical.

@DarkLight1337 Thank you very much for your suggestion! But how can I use TP? Is there any reference?

@DarkLight1337
Member

@gystar
Author

gystar commented Jan 16, 2025

@gystar
Author

gystar commented Jan 17, 2025

Dear @DarkLight1337 ,
My actual use case requires inputting the system prompt, user prompt, and images. I need to implement the following code:

def message_formator(prompt, texts_imgs):
    return [
        {
            "role": "system",
            "content": [
                {
                    'type': 'text',
                    'text': prompt
                }
            ],
        },
        {
            "role": "user",
            "content": [
                {
                    'type': 'image_url',
                    'image_url': {"url": f"data:image/png;base64,{encode_image(c)}"}
                } if isinstance(c, Image.Image) else
                {
                    'type': 'text',
                    'text': c
                }
                for c in texts_imgs
            ],
        },
    ]

def run_chat(image_urls: List[Image.Image]):
    model_path="/home/xxx/model/deepseek-vl2"
    llm = LLM(model=model_path,
              max_model_len=4096,
              max_num_seqs=2,
              tensor_parallel_size=2,
              hf_overrides={"architectures": ["DeepseekVLV2ForCausalLM"]},
              limit_mm_per_prompt={"image": len(image_urls)})

    sampling_params = SamplingParams(temperature=0.0,
                                     max_tokens=128)
    outputs = llm.chat(
        message_formator("help write a code", ['hi'] + image_urls + ['hi']),
        sampling_params=sampling_params,
        chat_template=None,
    )

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

I referenced the example, but it throws an error:

INFO 01-17 09:20:03 chat_utils.py:330] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/xxxxx/xxxxx/refactored_version/deepseek.py", line 191, in <module>
[rank0]:     main()
[rank0]:   File "/home/xxxxx/xxxxx/refactored_version/deepseek.py", line 188, in main
[rank0]:     run_chat()
[rank0]:   File "/home/xxxxx/xxxxx/refactored_version/deepseek.py", line 161, in run_chat
[rank0]:     outputs = llm.chat(
[rank0]:   File "/home/xxxxx/miniconda3/envs/xxxxx/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 698, in chat
[rank0]:     prompt_data = apply_hf_chat_template(
[rank0]:   File "/home/xxxxx/miniconda3/envs/xxxxx/lib/python3.10/site-packages/vllm/entrypoints/chat_utils.py", line 967, in apply_hf_chat_template
[rank0]:     raise ValueError(
[rank0]: ValueError: As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.
INFO 01-17 09:20:08 multiproc_worker_utils.py:126] Killing local vLLM worker processes
Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=0x00005e5781605980)

How should I write a chat template?

@DarkLight1337
Member

DarkLight1337 commented Jan 17, 2025

cc @Isotr0py are you aware of an existing chat template for this model? If not, maybe we should add one to the examples directory.

@Isotr0py
Collaborator

This model uses its HF processor, rather than the tokenizer, to format the prompt from the conversation. 😅

Let me add a template to the examples directory.

@Isotr0py
Collaborator

@gystar You can use the chat template introduced in #12143, it works with online serving.
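
For offline use without a chat template, one alternative is to build the prompt string by hand in the format the example script in this issue uses (`<|User|>: <image>\n{question}\n\n<|Assistant|>:`) and call `llm.generate` instead of `llm.chat`. The sketch below is not the template from #12143. `build_prompt` is a hypothetical helper, and both the placement of the system text and the use of one `<image>` tag per image are assumptions:

```python
def build_prompt(question: str, num_images: int = 1, system: str = "") -> str:
    """Format a single-turn DeepSeek-VL2 prompt with image placeholders.

    Follows the prompt string in the example script from this issue.
    Placing the system text before the user turn is an assumption.
    """
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    # Assumption: one "<image>" tag per attached image.
    image_tags = "<image>\n" * num_images
    prefix = f"{system}\n\n" if system else ""
    return f"{prefix}<|User|>: {image_tags}{question}\n\n<|Assistant|>:"
```

The resulting string can be passed as `"prompt"` alongside `"multi_modal_data"`, in the same way as the script at the top of this thread.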

@shiva-vardhineedi

@gystar Can you let me know how you solved this Transformers error? Which version of the vLLM image exactly did you use?

@gystar
Author

gystar commented Jan 22, 2025

@gystar Can you let me know how you solved this Transformers error? Which version of the vLLM image exactly did you use?

4.45.2

@leoribeiro

@gystar Which versions of vLLM and Transformers are you using?
I'm trying to start a vLLM Docker container with deepseek-vl2 but am still getting an error.
I'm using:
image: docker.io/vllm/vllm-openai:v0.7.0
I tried installing the latest Transformers but still get an error:

              pip install git+https://github.com/huggingface/transformers.git
              &&
              python3 -u -m vllm.entrypoints.openai.api_server 
              --host 0.0.0.0 
              --port 8000
              --model /checkpoints/local/deepseek-vl2
              --tensor-parallel-size 8 
              --load-format safetensors 
              --trust-remote-code 
              --max-model-len 4096 
              --gpu-memory-utilization 0.97 
              --enforce-eager

Error:

WARNING 01-27 11:53:10 registry.py:377] No model architectures are specified
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 899, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 863, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 133, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 214, in build_async_engine_client_from_engine_args
    engine_config = engine_args.create_engine_config()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1047, in create_engine_config
    model_config = self.create_model_config()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 972, in create_model_config
    return ModelConfig(
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 343, in __init__
    self.multimodal_config = self._init_multimodal_config(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 402, in _init_multimodal_config
    if ModelRegistry.is_multimodal_model(architectures):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 432, in is_multimodal_model
    model_cls, _ = self.inspect_model_cls(architectures)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/registry.py", line 387, in inspect_model_cls
    for arch in architectures:

@DarkLight1337
Member

Please see the note in the supported models page:

To use DeepSeek-VL2 series models, you have to pass --hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}' when running vLLM.
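
Quoting the JSON value of `--hf_overrides` correctly is easy to get wrong in shell or container arguments. A minimal sketch that builds the argument list in Python, using only flags that appear in this thread; `build_server_command` is a hypothetical helper, and the model path is the one from the comment above:

```python
import json
import shlex


def build_server_command(model_path: str, tensor_parallel_size: int = 1) -> list:
    """Build an argv list for the OpenAI-compatible server with hf_overrides."""
    overrides = {"architectures": ["DeepseekVLV2ForCausalLM"]}
    return [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_path,
        "--tensor-parallel-size", str(tensor_parallel_size),
        # json.dumps produces valid JSON with double-quoted strings.
        "--hf_overrides", json.dumps(overrides),
    ]


if __name__ == "__main__":
    cmd = build_server_command("/checkpoints/local/deepseek-vl2", 8)
    # shlex.join quotes the JSON argument for a POSIX shell.
    print(shlex.join(cmd))
```

The list form can be handed to `subprocess.run` directly, which avoids shell quoting altogether.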

@kar9999

kar9999 commented Feb 13, 2025

@gystar You can use the chat template introduced in #12143, it works with online serving.

How can I use the chat template introduced in #12143? My version:

[Image: screenshots, not preserved]

My chat template:

[Image: screenshots, not preserved]

@ozzmanmuhammad

@kar9999 Have you managed to find a solution for this? I'm also facing the same issue.
