
[Model] Add support for 360zhinao #4078

Closed

Conversation

garycaokai

Add support for 360zhinao model

We released the 360Zhinao model series:

  • 360Zhinao-7B-Base
  • 360Zhinao-7B-Chat-4K
  • 360Zhinao-7B-Chat-32K
  • 360Zhinao-7B-Chat-360K

Notable features of our 360Zhinao models are:

  • Base Model: Leveraging a high-quality corpus of 3.4 trillion tokens consisting mainly of Chinese, English, and code, we achieved competitive performance on relevant benchmarks against other 7B models.
  • Chat Models: Powerful chat capabilities and three context lengths of 4K, 32K, and 360K. 360K (around 500k Chinese characters) is the longest context length among open-sourced Chinese models at the time of release (Apr. 11, 2024).

@garycaokai
Author

@simon-mo can you help us review the code?

@simon-mo
Copy link
Collaborator

Getting

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/xmo/vllm/vllm/entrypoints/openai/api_server.py", line 157, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/home/xmo/vllm/vllm/engine/async_llm_engine.py", line 347, in from_engine_args
    engine = cls(
  File "/home/xmo/vllm/vllm/engine/async_llm_engine.py", line 311, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/xmo/vllm/vllm/engine/async_llm_engine.py", line 421, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/xmo/vllm/vllm/engine/llm_engine.py", line 121, in __init__
    self.model_executor = executor_class(
  File "/home/xmo/vllm/vllm/executor/gpu_executor.py", line 39, in __init__
    self._init_worker()
  File "/home/xmo/vllm/vllm/executor/gpu_executor.py", line 66, in _init_worker
    self.driver_worker.load_model()
  File "/home/xmo/vllm/vllm/worker/worker.py", line 113, in load_model
    self.model_runner.load_model()
  File "/home/xmo/vllm/vllm/worker/model_runner.py", line 158, in load_model
    self.model = get_model(
  File "/home/xmo/vllm/vllm/model_executor/model_loader.py", line 58, in get_model
    model_class = _get_model_architecture(model_config)[0]
  File "/home/xmo/vllm/vllm/model_executor/model_loader.py", line 41, in _get_model_architecture
    model_cls = ModelRegistry.load_model_cls(arch)
  File "/home/xmo/vllm/vllm/model_executor/models/__init__.py", line 99, in load_model_cls
    module = importlib.import_module(
  File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/xmo/vllm/vllm/model_executor/models/zhinao.py", line 43, in <module>
    from vllm.model_executor.parallel_utils.parallel_state import (
ModuleNotFoundError: No module named 'vllm.model_executor.parallel_utils.parallel_state'

On

python -m vllm.entrypoints.openai.api_server --model qihoo360/360Zhinao-7B-Chat-4K --trust-remote-code

@garycaokai
Author


This branch works against vLLM 0.4.0. I will merge in these two recent refactors; the import change they require is sketched after the list:

  • [Core] Refactor model loading code (https://github.com/vllm-project/vllm/pull/4097), committed by Yard1 (https://github.com/vllm-project/vllm/commit/69e1d2fb6922b2d388bae41286d8867976cbd6c6)
  • [Core][Refactor] move parallel_utils into vllm/distributed (https://github.com/vllm-project/vllm/pull/3950), committed by youkaichao (https://github.com/vllm-project/vllm/commit/63e7176f265be43dcc425f5ab4ab45c90234f5c3)
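
A minimal sketch of what the second refactor means for the failing import in zhinao.py, assuming the parallel-state helpers now live under vllm.distributed (symbol names are taken from the pre-/post-refactor layout, not from this PR's diff):

# Old location imported by zhinao.py, removed by #3950:
# from vllm.model_executor.parallel_utils.parallel_state import (
#     get_tensor_model_parallel_rank, get_tensor_model_parallel_world_size)

# New location after the refactor:
from vllm.distributed import (get_tensor_model_parallel_rank,
                              get_tensor_model_parallel_world_size)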

@garycaokai
Author


Finished merging #4097 and #3950.

@simon-mo
Collaborator

I'm running into the following issues:

  • Completion is not working.
  • The chat template is missing from the tokenizer config; the default one just keeps generating forever without emitting EOS.
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qihoo360/360Zhinao-7B-Chat-4K",
        "prompt": "Who are you?"
    }'
{"id":"cmpl-480c0d4beba84d43a9474e4b83615800","object":"text_completion","created":1713430511,"model":"qihoo360/360Zhinao-7B-Chat-4K","choices":[{"index":0,"text":"<|im_end|>\n<|im_start|><|im_start|><|im_start|>\n<|im_start|>\n<|im_start|>\n<|im_start|>\n<|im_start|>\n<|im_start|>\n","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":4,"total_tokens":20,"completion_tokens":16}}

@garycaokai
Author


It is a chat model; we use the chat API:

curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "qihoo360/360Zhinao-7B-Chat-4K",
    "messages": [
        {
            "role": "user",
            "content": "who are you"
        }
    ],
    "stream": false,
    "messages": [
        {
            "role": "user",
            "content": "who are you"
        }
    ],
    "stop_token_ids": [
        158326,
        158333,
        158332
    ],
    "stop": [
        "<eod>",
        "<|im_end|>",
        "<|im_start|>"
    ]
}'

The result is:

{
    "id": "cmpl-afab46b914ac40c192cde2c1d4870b92",
    "object": "chat.completion",
    "created": 12789567,
    "model": "qihoo360/360Zhinao-7B-Chat-4K",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "I am an AI trained to assist with a wide range of tasks and questions. I can help with information on a variety of topics, such as answering questions, setting reminders, and providing news updates."
            },
            "logprobs": null,
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 21,
        "total_tokens": 62,
        "completion_tokens": 41
    }
}

We will add this config to tokenizer_config.json later:
"chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"

@simon-mo
Collaborator

The tokenizer_config.json should also include the following so clients do not need to specify it on every request. Please let me know once the HF or ModelScope version is updated.

    "stop_token_ids": [
        158326,
        158333,
        158332
    ],
    "stop": [
        "<eod>",
        "<|im_end|>",
        "<|im_start|>"
    ]
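
For reference, until those defaults land in the model repo, a client has to send the stop settings on every request. A minimal sketch with the openai Python client against the server started above (stop_token_ids is a vLLM-specific extension passed via extra_body; the values are the ones quoted here):

from openai import OpenAI

# vLLM's OpenAI-compatible server started earlier on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="qihoo360/360Zhinao-7B-Chat-4K",
    messages=[{"role": "user", "content": "who are you"}],
    stop=["<eod>", "<|im_end|>", "<|im_start|>"],
    extra_body={"stop_token_ids": [158326, 158333, 158332]},
)
print(resp.choices[0].message.content)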

@garycaokai
Author


Thanks, we will fix it.

@garycaokai
Author


@simon-mo
Noticed that vLLM 0.4.1 now reads generation_config.get("eos_token_id"), so we use the generation_config eos_token_id as the default stop_token_ids, and we added the default chat template. It works now:

curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "qihoo360/360Zhinao-7B-Chat-4K",
    "messages": [
        {
            "role": "user",
            "content": "Who are you?"
        }
    ]
}'
{
    "id": "cmpl-5be15427a5ad4562b5a1aa792fe12c7e",
    "object": "chat.completion",
    "created": 1714274507,
    "model": "qihoo360/360Zhinao-7B-Chat-4K",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "I am an AI, a computer program designed to assist users with various tasks."
            },
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": 158333
        }
    ],
    "usage": {
        "prompt_tokens": 22,
        "total_tokens": 39,
        "completion_tokens": 17
    }
}
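
For reference, a rough sketch of where those default stop IDs come from, assuming the model repo ships a generation_config.json whose eos_token_id lists the three special-token IDs from this thread (an assumption, not a copy of the released file):

from transformers import GenerationConfig

# Assumes generation_config.json contains e.g. {"eos_token_id": [158326, 158333, 158332]}.
gen_cfg = GenerationConfig.from_pretrained("qihoo360/360Zhinao-7B-Chat-4K")

eos = gen_cfg.eos_token_id  # an int or a list of ints
default_stop_token_ids = eos if isinstance(eos, list) else [eos]
print(default_stop_token_ids)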

@simon-mo
Collaborator

simon-mo commented May 2, 2024

Looks good, please fix lint by running ./format.sh

@garycaokai
Author

./format.sh

OK

$ ./format.sh 
vLLM yapf: Done
vLLM mypy:
Success: no issues found in 3 source files
Success: no issues found in 7 source files
Success: no issues found in 4 source files
Success: no issues found in 3 source files
Success: no issues found in 6 source files
Success: no issues found in 2 source files
Success: no issues found in 10 source files
Success: no issues found in 4 source files
vLLM codespell: Done
vLLM ruff:
vLLM isort: Done

@simon-mo
Collaborator

simon-mo commented May 3, 2024

https://github.com/vllm-project/vllm/actions/runs/8720404543/job/23921845033?pr=4078#step:5:1

Run yapf --diff --recursive .
--- ./vllm/model_executor/models/zhinao.py	(original)
+++ ./vllm/model_executor/models/zhinao.py	(reformatted)
@@ -327,7 +327,9 @@
         super().__init__()
         self.config = config
         self.linear_method = linear_method
-        self.model = ZhinaoModel(config, linear_method, lora_config=lora_config)
+        self.model = ZhinaoModel(config,
+                                 linear_method,
+                                 lora_config=lora_config)
         self.unpadded_vocab_size = config.vocab_size
         if lora_config:
             self.unpadded_vocab_size += lora_config.lora_extra_vocab_size

@simon-mo
Collaborator

simon-mo commented May 3, 2024

Did you push the changes?

@garycaokai
Author

@simon-mo Is it ready to merge?
