
[Model] Add support for 360zhinao #4078

Closed

Conversation

garycaokai

Add support for 360zhinao model

We released the 360Zhinao model series:

  • 360Zhinao-7B-Base
  • 360Zhinao-7B-Chat-4K
  • 360Zhinao-7B-Chat-32K
  • 360Zhinao-7B-Chat-360K

Notable features of our 360Zhinao models are:

  • Base Model: Leveraging a high-quality corpus of 3.4 trillion tokens consisting mainly of Chinese, English, and code, we achieved competitive performance on relevant benchmarks against other 7B models.
  • Chat Models: Powerful chat capabilities and three context lengths of 4K, 32K, and 360K. 360K (around 500k Chinese characters) is the longest context length among open-sourced Chinese models at the time of release (Apr. 11, 2024).

@garycaokai
Author

@simon-mo can you help us review the code?

@simon-mo
Copy link
Collaborator

Getting

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/xmo/vllm/vllm/entrypoints/openai/api_server.py", line 157, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/home/xmo/vllm/vllm/engine/async_llm_engine.py", line 347, in from_engine_args
    engine = cls(
  File "/home/xmo/vllm/vllm/engine/async_llm_engine.py", line 311, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/xmo/vllm/vllm/engine/async_llm_engine.py", line 421, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/xmo/vllm/vllm/engine/llm_engine.py", line 121, in __init__
    self.model_executor = executor_class(
  File "/home/xmo/vllm/vllm/executor/gpu_executor.py", line 39, in __init__
    self._init_worker()
  File "/home/xmo/vllm/vllm/executor/gpu_executor.py", line 66, in _init_worker
    self.driver_worker.load_model()
  File "/home/xmo/vllm/vllm/worker/worker.py", line 113, in load_model
    self.model_runner.load_model()
  File "/home/xmo/vllm/vllm/worker/model_runner.py", line 158, in load_model
    self.model = get_model(
  File "/home/xmo/vllm/vllm/model_executor/model_loader.py", line 58, in get_model
    model_class = _get_model_architecture(model_config)[0]
  File "/home/xmo/vllm/vllm/model_executor/model_loader.py", line 41, in _get_model_architecture
    model_cls = ModelRegistry.load_model_cls(arch)
  File "/home/xmo/vllm/vllm/model_executor/models/__init__.py", line 99, in load_model_cls
    module = importlib.import_module(
  File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/xmo/vllm/vllm/model_executor/models/zhinao.py", line 43, in <module>
    from vllm.model_executor.parallel_utils.parallel_state import (
ModuleNotFoundError: No module named 'vllm.model_executor.parallel_utils.parallel_state'

On

python -m vllm.entrypoints.openai.api_server --model qihoo360/360Zhinao-7B-Chat-4K --trust-remote-code

@garycaokai
Author


This branch works against vLLM 0.4.0. I will merge in these two recent refactors; the import change they require is sketched after the list:

  • [Core] Refactor model loading code (https://github.com/vllm-project/vllm/pull/4097), committed by Yard1 (https://github.com/vllm-project/vllm/commit/69e1d2fb6922b2d388bae41286d8867976cbd6c6)
  • [Core][Refactor] move parallel_utils into vllm/distributed (https://github.com/vllm-project/vllm/pull/3950), committed by youkaichao (https://github.com/vllm-project/vllm/commit/63e7176f265be43dcc425f5ab4ab45c90234f5c3)
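
A minimal sketch of what the second refactor means for the failing import in zhinao.py, assuming the parallel-state helpers now live under vllm.distributed (symbol names are taken from the pre-/post-refactor layout, not from this PR's diff):

# Old location imported by zhinao.py, removed by #3950:
# from vllm.model_executor.parallel_utils.parallel_state import (
#     get_tensor_model_parallel_rank, get_tensor_model_parallel_world_size)

# New location after the refactor:
from vllm.distributed import (get_tensor_model_parallel_rank,
                              get_tensor_model_parallel_world_size)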

@garycaokai
Author


Finished merging #4097 and #3950.

@simon-mo
Collaborator

I'm running into the following issues:

  • Completion is not working.
  • The chat template is missing from the tokenizer config; the default one just keeps generating forever without emitting EOS.
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qihoo360/360Zhinao-7B-Chat-4K",
        "prompt": "Who are you?"
    }'
{"id":"cmpl-480c0d4beba84d43a9474e4b83615800","object":"text_completion","created":1713430511,"model":"qihoo360/360Zhinao-7B-Chat-4K","choices":[{"index":0,"text":"<|im_end|>\n<|im_start|><|im_start|><|im_start|>\n<|im_start|>\n<|im_start|>\n<|im_start|>\n<|im_start|>\n<|im_start|>\n","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":4,"total_tokens":20,"completion_tokens":16}}

@garycaokai
Author


It is a chat model; we use the chat API:

curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "qihoo360/360Zhinao-7B-Chat-4K",
    "messages": [
        {
            "role": "user",
            "content": "who are you"
        }
    ],
    "stream": false,
    "messages": [
        {
            "role": "user",
            "content": "who are you"
        }
    ],
    "stop_token_ids": [
        158326,
        158333,
        158332
    ],
    "stop": [
        "<eod>",
        "<|im_end|>",
        "<|im_start|>"
    ]
}'

The result is:

{
    "id": "cmpl-afab46b914ac40c192cde2c1d4870b92",
    "object": "chat.completion",
    "created": 12789567,
    "model": "qihoo360/360Zhinao-7B-Chat-4K",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "I am an AI trained to assist with a wide range of tasks and questions. I can help with information on a variety of topics, such as answering questions, setting reminders, and providing news updates."
            },
            "logprobs": null,
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 21,
        "total_tokens": 62,
        "completion_tokens": 41
    }
}

We will add this config to tokenizer_config.json later:
"chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"

@simon-mo
Collaborator

The tokenizer_config.json should also include the following so clients do not need to specify it on every request. Please let me know once the HF or ModelScope version is updated.

    "stop_token_ids": [
        158326,
        158333,
        158332
    ],
    "stop": [
        "<eod>",
        "<|im_end|>",
        "<|im_start|>"
    ]
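
For reference, until those defaults land in the model repo, a client has to send the stop settings on every request. A minimal sketch with the openai Python client against the server started above (stop_token_ids is a vLLM-specific extension passed via extra_body; the values are the ones quoted here):

from openai import OpenAI

# vLLM's OpenAI-compatible server started earlier on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="qihoo360/360Zhinao-7B-Chat-4K",
    messages=[{"role": "user", "content": "who are you"}],
    stop=["<eod>", "<|im_end|>", "<|im_start|>"],
    extra_body={"stop_token_ids": [158326, 158333, 158332]},
)
print(resp.choices[0].message.content)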

@garycaokai
Author


Thanks, we will fix it.

@garycaokai
Author


@simon-mo
Noticed that vLLM 0.4.1 now reads generation_config.get("eos_token_id"), so we use the generation_config eos_token_id as the default stop_token_ids, and we added the default chat template. It works now:

curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "qihoo360/360Zhinao-7B-Chat-4K",
    "messages": [
        {
            "role": "user",
            "content": "Who are you?"
        }
    ]
}'
{
    "id": "cmpl-5be15427a5ad4562b5a1aa792fe12c7e",
    "object": "chat.completion",
    "created": 1714274507,
    "model": "qihoo360/360Zhinao-7B-Chat-4K",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "I am an AI, a computer program designed to assist users with various tasks."
            },
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": 158333
        }
    ],
    "usage": {
        "prompt_tokens": 22,
        "total_tokens": 39,
        "completion_tokens": 17
    }
}
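
For reference, a rough sketch of where those default stop IDs come from, assuming the model repo ships a generation_config.json whose eos_token_id lists the three special-token IDs from this thread (an assumption, not a copy of the released file):

from transformers import GenerationConfig

# Assumes generation_config.json contains e.g. {"eos_token_id": [158326, 158333, 158332]}.
gen_cfg = GenerationConfig.from_pretrained("qihoo360/360Zhinao-7B-Chat-4K")

eos = gen_cfg.eos_token_id  # an int or a list of ints
default_stop_token_ids = eos if isinstance(eos, list) else [eos]
print(default_stop_token_ids)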

@simon-mo
Collaborator

simon-mo commented May 2, 2024

Looks good, please fix lint by running ./format.sh

@garycaokai
Author

./format.sh

OK

$ ./format.sh 
vLLM yapf: Done
vLLM mypy:
Success: no issues found in 3 source files
Success: no issues found in 7 source files
Success: no issues found in 4 source files
Success: no issues found in 3 source files
Success: no issues found in 6 source files
Success: no issues found in 2 source files
Success: no issues found in 10 source files
Success: no issues found in 4 source files
vLLM codespell: Done
vLLM ruff:
vLLM isort: Done

@simon-mo
Collaborator

simon-mo commented May 3, 2024

https://github.com/vllm-project/vllm/actions/runs/8720404543/job/23921845033?pr=4078#step:5:1

Run yapf --diff --recursive .
--- ./vllm/model_executor/models/zhinao.py	(original)
+++ ./vllm/model_executor/models/zhinao.py	(reformatted)
@@ -327,7 +327,9 @@
         super().__init__()
         self.config = config
         self.linear_method = linear_method
-        self.model = ZhinaoModel(config, linear_method, lora_config=lora_config)
+        self.model = ZhinaoModel(config,
+                                 linear_method,
+                                 lora_config=lora_config)
         self.unpadded_vocab_size = config.vocab_size
         if lora_config:
             self.unpadded_vocab_size += lora_config.lora_extra_vocab_size

@simon-mo
Collaborator

simon-mo commented May 3, 2024

Did you push the changes?

@garycaokai
Author

@simon-mo Is it ready to merge?
