Replies: 1 comment
-
vLLM currently does not support pipeline parallelism. The …
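A minimal sketch of what the reply implies, assuming a recent vLLM build (the model name and values below are illustrative, not from the original comment): multi-GPU runs use tensor parallelism via `tensor_parallel_size`, and incoming requests are batched by the engine's scheduler rather than being statically assigned to individual cards.

```python
# Sketch only: vLLM distributes work with tensor parallelism rather than
# pipeline parallelism; exact keyword support may vary by version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-6.7b",   # illustrative model name (assumption)
    tensor_parallel_size=2,      # shard the model's weights across 2 GPU cards
)

# 7 requests are not split "3 on one card, 4 on the other"; the scheduler
# batches them, and every forward pass runs on both cards together.
prompts = [f"Request {i}: tell me a fact." for i in range(7)]
for out in llm.generate(prompts, SamplingParams(max_tokens=64)):
    print(out.outputs[0].text)
```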
-
Is ParallelConfig.pipeline_parallel_size meant for running on multiple GPU cards, and can it be set to the number of cards? Is it related to processing multiple prompts and generating multiple results in parallel? For example, with 2 GPU cards and 7 requests, will the 7 requests be distributed across the 2 cards simultaneously, and how is that allocation done? Also, what do the max_num_batched_tokens and max_num_seqs parameters in SchedulerConfig represent, and how should they be set to preserve a longer context?
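For the scheduler knobs asked about above, a minimal sketch with illustrative values (the model name and numbers are assumptions): max_num_seqs caps how many sequences run in one scheduler step, max_num_batched_tokens caps the total tokens processed per step, and the context window itself is bounded by max_model_len.

```python
# Illustrative values only; for longer contexts, max_num_batched_tokens
# generally needs to be at least max_model_len so a single long prompt
# can still fit into one prefill step. Keyword support may vary by version.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # illustrative model (assumption)
    max_model_len=4096,                # longest context the engine will accept
    max_num_batched_tokens=4096,       # per-step token budget across all sequences
    max_num_seqs=16,                   # per-step cap on concurrent sequences
)
```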