
[ci][distributed] add distributed test gptq_marlin with tp = 2 #6010

Closed
wants to merge 5 commits into from

Conversation

llmpros
Contributor

@llmpros llmpros commented Jul 1, 2024

follow-up pr of #6007

@llmpros llmpros force-pushed the add_test branch 3 times, most recently from 6a2004c to 8e128d2 Compare July 1, 2024 03:56
@youkaichao
Member

Thanks for the PR! You need to move the test from models to distributed:

https://github.com/vllm-project/vllm/blob/main/.buildkite/test-pipeline.yaml

In addition, because of some limitations, you should only test the tp=2 case here; it is not safe to test two vLLM instances together.

@llmpros llmpros force-pushed the add_test branch 3 times, most recently from 63b9545 to 49141fb Compare July 1, 2024 05:30
@llmpros llmpros changed the title add tp>1 test coverage for gptq_marlin [ci][distributed] move test gptq_marlin to distributed with tp = 2 Jul 1, 2024
@DarkLight1337
Member

DarkLight1337 commented Jul 1, 2024

Imo we should keep the original tp=1 test and add a new file in distributed tests for the tp=2 case.

@llmpros
Contributor Author

llmpros commented Jul 1, 2024

Imo we should keep the original tp=1 test and add a new file in distributed tests for the tp=2 case.

Makes sense - so is it better to abstract the common test code below into a shared helper (e.g. test_gptq_marlin_common) to avoid duplication, and have the tp=1 test (under tests/models) and the tp=2 test (under tests/distributed) each call it, or to simply copy the original tests/models/test_gptq_marlin.py to tests/distributed/ and only change tp=2 (accepting a small amount of duplication)?

    # test_gptq_marlin_common()
    # Run marlin.
    with vllm_runner(model_name=model_name,
                     revision=revision,
                     dtype=dtype,
                     quantization="marlin",
                     max_model_len=MAX_MODEL_LEN,
                     tensor_parallel_size=2,
                     distributed_executor_backend=distributed_executor_backend
                     ) as gptq_marlin_model:

        gptq_marlin_outputs = gptq_marlin_model.generate_greedy_logprobs(
            example_prompts[:-1], max_tokens, num_logprobs)
    _ROPE_DICT.clear()  # clear rope cache to avoid rope dtype error

    # Run gptq.
    # The naive gptq kernel doesn't support bf16 yet.
    # Here we always compare the fp16/bf16 gptq_marlin kernel
    # to the fp16 gptq kernel.
    with vllm_runner(model_name=model_name,
                     revision=revision,
                     dtype="half",
                     quantization="gptq",
                     max_model_len=MAX_MODEL_LEN,
                     tensor_parallel_size=2,
                     distributed_executor_backend=distributed_executor_backend
                     ) as gptq_model:
        gptq_outputs = gptq_model.generate_greedy_logprobs(
            example_prompts[:-1], max_tokens, num_logprobs)
    return [gptq_marlin_outputs, gptq_outputs]

@DarkLight1337
Member

Let's abstract out the code (similar to what I did for the multimodal distributed tests)
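
For illustration, a rough sketch of what this abstraction could look like - the helper name run_gptq_marlin_test, the distributed test's file name, and the fixture plumbing are hypothetical rather than the PR's final code; the helper just wraps the snippet quoted above and takes the parallelism settings as parameters:

# Shared helper in tests/models/test_gptq_marlin.py, parameterized on parallelism.
def run_gptq_marlin_test(vllm_runner, example_prompts, model_name, revision,
                         dtype, max_tokens, num_logprobs,
                         tensor_parallel_size=1,
                         distributed_executor_backend=None):
    # Run the gptq_marlin kernel.
    with vllm_runner(model_name=model_name,
                     revision=revision,
                     dtype=dtype,
                     quantization="marlin",
                     max_model_len=MAX_MODEL_LEN,
                     tensor_parallel_size=tensor_parallel_size,
                     distributed_executor_backend=distributed_executor_backend
                     ) as gptq_marlin_model:
        gptq_marlin_outputs = gptq_marlin_model.generate_greedy_logprobs(
            example_prompts[:-1], max_tokens, num_logprobs)
    _ROPE_DICT.clear()  # clear rope cache to avoid rope dtype error

    # Run the naive gptq kernel (fp16 only) as the reference.
    with vllm_runner(model_name=model_name,
                     revision=revision,
                     dtype="half",
                     quantization="gptq",
                     max_model_len=MAX_MODEL_LEN,
                     tensor_parallel_size=tensor_parallel_size,
                     distributed_executor_backend=distributed_executor_backend
                     ) as gptq_model:
        gptq_outputs = gptq_model.generate_greedy_logprobs(
            example_prompts[:-1], max_tokens, num_logprobs)
    return gptq_marlin_outputs, gptq_outputs


# tests/distributed/test_distributed_gptq_marlin.py (hypothetical name): the new
# test only changes the parallelism arguments; the existing tests/models test
# keeps calling the helper with the default tensor_parallel_size=1.
def test_distributed_gptq_marlin(vllm_runner, example_prompts, model, dtype,
                                 max_tokens, num_logprobs,
                                 distributed_executor_backend):
    model_name, revision = model
    gptq_marlin_outputs, gptq_outputs = run_gptq_marlin_test(
        vllm_runner, example_prompts, model_name, revision, dtype,
        max_tokens, num_logprobs, tensor_parallel_size=2,
        distributed_executor_backend=distributed_executor_backend)
    check_logprobs_close(outputs_0_lst=gptq_outputs,
                         outputs_1_lst=gptq_marlin_outputs,
                         name_0="gptq",
                         name_1="gptq_marlin")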

@llmpros llmpros force-pushed the add_test branch 2 times, most recently from 64c0686 to f12288d Compare July 1, 2024 18:29
@llmpros llmpros changed the title [ci][distributed] move test gptq_marlin to distributed with tp = 2 [ci][distributed] add distributed test gptq_marlin with tp = 2 Jul 1, 2024
@@ -17,8 +18,6 @@

from .utils import check_logprobs_close

os.environ["TOKENIZERS_PARALLELISM"] = "true"

Please keep this line as it avoids unnecessary warnings from HuggingFace

@llmpros
Contributor Author

llmpros commented Jul 2, 2024

@DarkLight1337 it looks like the new unit test (test_distributed_gptq_marlin with tp=2) failed with the following error. I may grab a box with 2 GPUs and install the current main to test in a real environment.


[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301) Process VllmWorkerProcess:
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301) Traceback (most recent call last):
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)     self.run()
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)     self._target(*self._args, **self._kwargs)
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 210, in _run_worker_process
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)     worker = worker_factory()
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 67, in _create_worker
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)     wrapper.init_worker(**self._get_worker_kwargs(local_rank, rank,
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 311, in init_worker
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)     self.worker = worker_class(*args, **kwargs)
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 87, in __init__
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)     self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 196, in __init__
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)     self.attn_backend = get_attn_backend(
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 45, in get_attn_backend
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)     backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 151, in which_attn_to_use
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)     if torch.cuda.get_device_capability()[0] < 8:
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)     prop = get_device_properties(device)
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)     _lazy_init()  # will define _get_device_properties
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 279, in _lazy_init
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301)     raise RuntimeError(
[2024-07-02T04:45:17Z] (VllmWorkerProcess pid=22301) RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method


@DarkLight1337
Member

DarkLight1337 commented Jul 2, 2024

This happens because you initialized CUDA too early (probably indirectly via imports). Try to avoid importing torch-related stuff in the top level code of your test.

@DarkLight1337
Member

If the issue persists, #6056 should help you.

@llmpros llmpros force-pushed the add_test branch 3 times, most recently from d1d19d7 to 74d32a3 Compare July 5, 2024 03:38
@DarkLight1337
Member

Please merge the latest main into your branch as it fixes some issues with distributed tests.

@DarkLight1337
Member

Also it's really difficult to keep track of your changes if you keep force-pushing.

@llmpros
Contributor Author

llmpros commented Jul 5, 2024

Also it's really difficult to keep track of your changes if you keep force-pushing.

Thanks for your suggestion 👍 I will pay attention to this in the future. BTW - I have merged the latest main into this PR, so we will see the result.

@llmpros
Contributor Author

llmpros commented Jul 6, 2024

With the latest main, the distributed tests still failed because of

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

@DarkLight1337
Member

I think it's because you initialized CUDA via is_quant_method_supported. Try running it inside the test function instead of calling it at the top level.
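
For concreteness, a sketch of that change - the decorator text and import path below are assumptions based on the existing tp=1 test, and as the later comments show, this alone did not make the failure go away:

import pytest

# Assumed import path for the helper in the vLLM test suite.
from tests.quantization.utils import is_quant_method_supported

# Before: the skipif argument is evaluated while pytest collects the module,
# so any CUDA query inside is_quant_method_supported runs in the parent process.
# @pytest.mark.skipif(not is_quant_method_supported("gptq_marlin"),
#                     reason="gptq_marlin is not supported on this GPU type.")


def test_distributed_gptq_marlin():
    # After: the check only runs once the test itself starts.
    if not is_quant_method_supported("gptq_marlin"):
        pytest.skip("gptq_marlin is not supported on this GPU type.")
    ...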

@llmpros
Contributor Author

llmpros commented Jul 6, 2024

Looks like the distributed tests still failed.

@DarkLight1337
Member

DarkLight1337 commented Jul 6, 2024

Hmm, I guess we can't even call is_quant_method_supported before initializing the vLLM runner. How should we solve this, @youkaichao? Maybe we have to fall back to using the fork method for this model specifically...

@youkaichao
Member

If you merged the latest main, is_quant_method_supported should not initialize CUDA.

The problem might be that you are testing with multiple models, and pytest uses a single process to run all of them, so after the first model has been tested, the process already has CUDA initialized.
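
For illustration, a minimal standalone reproduction of this failure mode, independent of vLLM and pytest (it assumes a Linux machine with at least one CUDA-capable GPU):

import multiprocessing as mp

import torch


def child():
    # Fails with "Cannot re-initialize CUDA in forked subprocess" because the
    # parent already holds an initialized CUDA context at fork time.
    print(torch.cuda.get_device_capability())


if __name__ == "__main__":
    torch.cuda.init()  # stands in for an earlier test that touched CUDA
    ctx = mp.get_context("fork")  # "fork" is only available on POSIX systems
    p = ctx.Process(target=child)
    p.start()
    p.join()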

@llmpros
Contributor Author

llmpros commented Jul 6, 2024

If you merged the latest main, is_quant_method_supported should not initialize CUDA.

The problem might be that you are testing with multiple models, and pytest uses a single process to run all of them, so after the first model has been tested, the process already has CUDA initialized.

Thanks so much for the details. If we have to pick one model for the distributed test, which of the following is better to use?
https://github.com/vllm-project/vllm/blob/main/tests/models/test_gptq_marlin.py#L24

MODELS = [
    # act_order==False, group_size=channelwise
    ("robertgshaw2/zephyr-7b-beta-channelwise-gptq", "main"),
    # act_order==False, group_size=128
    ("TheBloke/Llama-2-7B-GPTQ", "main"),

    # act_order==True, group_size=128
    ("TheBloke/TinyLlama-1.1B-Chat-v1.0-GPTQ", "main"),
    # act_order==True, group_size=64
    ("TheBloke/TinyLlama-1.1B-Chat-v1.0-GPTQ", "gptq-4bit-64g-actorder_True"),
    # act_order==True, group_size=32
    ("TheBloke/TinyLlama-1.1B-Chat-v1.0-GPTQ", "gptq-4bit-32g-actorder_True"),

    # 8-bit, act_order==True, group_size=channelwise
    ("TheBloke/TinyLlama-1.1B-Chat-v1.0-GPTQ", "gptq-8bit--1g-actorder_True"),
    # 8-bit, act_order==True, group_size=128
    ("TheBloke/TinyLlama-1.1B-Chat-v1.0-GPTQ", "gptq-8bit-128g-actorder_True"),
    # 8-bit, act_order==True, group_size=32
    ("TheBloke/TinyLlama-1.1B-Chat-v1.0-GPTQ", "gptq-8bit-32g-actorder_True"),

    # 4-bit, act_order==True, group_size=128
    ("TechxGenus/gemma-1.1-2b-it-GPTQ", "main")
]

@youkaichao
Member

If you can reproduce the problem locally, it will be easier to debug.

Here is a simple script you can put in the test code to find out which function initializes CUDA:

import sys
import traceback

import torch

found = False


def _trace_calls(frame, event, arg=None):
    if event in ['call', 'return']:
        # for every function call or return
        try:
            global found
            # Temporarily disable the trace function
            sys.settrace(None)
            # check condition here
            if not found and torch.cuda.is_initialized():
                found = True
                traceback.print_stack()
            # Re-enable the trace function
            sys.settrace(_trace_calls)
        except NameError:
            # modules are deleted during shutdown
            pass
    return _trace_calls


sys.settrace(_trace_calls)
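
One note on using it: sys.settrace hooks every subsequent Python call, so the test will run noticeably slower while it is active. The first stack it prints marks the point where CUDA was first observed to be initialized; remove the hook once you have found the offending call.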


This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale label Oct 25, 2024

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!

@github-actions github-actions bot closed this Nov 24, 2024
@mergify mergify bot added the ci/build label Nov 24, 2024