I built the latest master branch, which includes the #2279 commit, and ran the following command:
python -m vllm.entrypoints.openai.api_server --model ./Mistral-7B-Instruct-v0.2-AWQ --quantization awq --dtype auto --host 0.0.0.0 --port 8081 --tensor-parallel-size 2
I get the following error:
INFO 01-29 09:41:47 api_server.py:209] args: Namespace(host='0.0.0.0', port=8081, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='./Mistral-7B-Instruct-v0.2-AWQ', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization='awq', enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 01-29 09:41:47 config.py:177] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-01-29 09:41:49,090 INFO worker.py:1724 -- Started a local Ray instance.
INFO 01-29 09:41:50 llm_engine.py:72] Initializing an LLM engine with config: model='./Mistral-7B-Instruct-v0.2-AWQ', tokenizer='./Mistral-7B-Instruct-v0.2-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, seed=0)
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/my/vllm/vllm/entrypoints/openai/api_server.py", line 217, in <module>
engine = AsyncLLMEngine.from_engine_args(engine_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/my/vllm/vllm/engine/async_llm_engine.py", line 615, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/my/vllm/vllm/engine/async_llm_engine.py", line 319, in __init__
self.engine = self._init_engine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/my/vllm/vllm/engine/async_llm_engine.py", line 364, in _init_engine
return engine_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/my/vllm/vllm/engine/llm_engine.py", line 109, in __init__
self._init_workers_ray(placement_group)
File "/home/my/vllm/vllm/engine/llm_engine.py", line 260, in _init_workers_ray
self.driver_worker = Worker(
^^^^^^^
TypeError: Worker.__init__() got an unexpected keyword argument 'cache_config'
2024-01-29 09:41:54,676 ERROR worker.py:405 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerVllm.init_worker() (pid=3160378, ip=10.20.4.57, actor_id=ca7bf2aa56e3f1a0c1a7678201000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7f043133e7d0>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/my/vllm/vllm/engine/ray_utils.py", line 23, in init_worker
self.worker = worker_init_fn()
^^^^^^^^^^^^^^^^
File "/home/my/vllm/vllm/engine/llm_engine.py", line 247, in <lambda>
lambda rank=rank, local_rank=local_rank: Worker(
^^^^^^^
TypeError: Worker.__init__() got an unexpected keyword argument 'cache_config'
I am running with python=3.11, CUDA 12.1, and driver 530 on 2x RTX 3090 with NVLink.
I noticed there is a discussion (#2279 (comment)) about cache_config; I am not sure whether it is related.
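If I read the traceback correctly, llm_engine.py now constructs Worker with a cache_config keyword that the Worker class actually being imported (both in the driver and in the Ray workers, since both raise the same message) does not declare. My guess, and it is only a guess, is a mismatch between the freshly built source tree and a stale vllm installation that the processes still pick up. A minimal sketch of the mechanism, with a made-up signature rather than vLLM's real one:

```python
# Illustration only: a hypothetical pre-#2279 Worker signature, not vLLM code.
class Worker:
    def __init__(self, model_config, parallel_config, scheduler_config):
        pass

try:
    # The refactored caller passes an extra cache_config keyword,
    # which the old signature does not accept.
    Worker(model_config=None, parallel_config=None,
           scheduler_config=None, cache_config=None)
except TypeError as e:
    print(e)  # Worker.__init__() got an unexpected keyword argument 'cache_config'
```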
Since the unexpected keyword argument 'cache_config' error is caused by #2279, I tried rolling back to the previous commit. That error disappears, but I then hit another error: Failed: Cuda error /home/ysq/vllm/csrc/custom_all_reduce.cuh:417 'resource already mapped'.
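The second failure points at csrc/custom_all_reduce.cuh, i.e. the custom all-reduce kernel. The args dump above shows a disable_custom_all_reduce option, so one test I am considering (an assumption on my side, not a confirmed fix) is to disable that path on the older commit and see whether the 'resource already mapped' error goes away, roughly:

```python
# Rough sketch with the custom all-reduce path disabled. Assumes LLM forwards
# disable_custom_all_reduce to the engine args, as the CLI flag in the args
# dump suggests; adjust the keyword if it differs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="./Mistral-7B-Instruct-v0.2-AWQ",
    quantization="awq",
    dtype="auto",
    tensor_parallel_size=2,
    disable_custom_all_reduce=True,  # skip csrc/custom_all_reduce.cuh
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```

The api_server command line should accept the equivalent --disable-custom-all-reduce flag, given that disable_custom_all_reduce=False already appears in the Namespace dump above.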