[core][distributed] support layer size undividable by pp size in pipeline parallel inference #6115
Conversation
There's a subtle point here:
vllm/vllm/executor/distributed_gpu_executor.py, lines 38 to 46 at commit 966fe72:
```python
num_blocks = self._run_workers("determine_num_available_blocks")
# Since we use a shared centralized controller, we take the minimum
# number of blocks across all workers to make sure all the memory
# operators can be applied to all workers.
num_gpu_blocks = min(b[0] for b in num_blocks)
num_cpu_blocks = min(b[1] for b in num_blocks)
return num_gpu_blocks, num_cpu_blocks
```
We take the min of blocks across all workers; as a result, the GPU memory utilization of ranks 0...n-2 will be slightly lower than expected.
I don't think it's a big deal, just something to be aware of.
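To make the effect concrete, here is a minimal sketch of the interaction, not the PR's actual implementation: when the layer count is not divisible by the pp size, some ranks hold an extra layer, report fewer available KV cache blocks, and the `min()` above then caps every rank at that smaller number. The helper names (`partition_layers`, `blocks_for_rank`) and the memory figures are hypothetical, chosen only for illustration.

```python
def partition_layers(num_layers: int, pp_size: int) -> list[int]:
    """Split num_layers across pp_size ranks; remainder layers go to
    the last ranks, so some ranks hold one extra layer."""
    base, rem = divmod(num_layers, pp_size)
    return [base + (1 if rank >= pp_size - rem else 0)
            for rank in range(pp_size)]

def blocks_for_rank(layers_on_rank: int, free_gpu_mem: int,
                    block_bytes_per_layer: int) -> int:
    """Each KV cache block must hold an entry for every layer on the
    rank, so more layers means fewer blocks fit in the same memory."""
    return free_gpu_mem // (block_bytes_per_layer * layers_on_rank)

layers = partition_layers(num_layers=13, pp_size=4)   # [3, 3, 3, 4]
per_rank = [blocks_for_rank(n, free_gpu_mem=1 << 30,
                            block_bytes_per_layer=1 << 20)
            for n in layers]                           # [341, 341, 341, 256]

# Taking the min caps ranks 0..2 at 256 blocks even though they could
# each fit 341, which is the slight memory waste discussed above.
num_gpu_blocks = min(per_rank)
```

Under these toy numbers, ranks 0..2 leave roughly a quarter of their KV cache budget unused; the heavier the imbalance, the larger the gap.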
Good point, I also noticed this. Some GPU memory is wasted, but that is better than the case not being supported at all.
LGTM
Signed-off-by: Alvant <[email protected]>
fixes #6114