
Implement preemption via recomputation & Refactor scheduling logic #12

Merged · 22 commits merged into main from recomp+sched on Mar 30, 2023

Conversation

@WoosukKwon (Collaborator) commented Mar 28, 2023

This PR implements a new preemption (eviction) mechanism, "recomputation". In our benchmarks, recomputation is more efficient than swapping, because swapping incurs significant overhead from the many small data transfers between CPU and GPU. Thus, we use recomputation as the default preemption mechanism.

However, we do not currently support recomputation for sequence groups with multiple sequences: when token blocks are shared, the recomputation logic becomes very complex and we do not have CUDA kernels to support it efficiently. We fall back to swapping in this case despite its overhead.

In addition, this PR refactors the scheduling logic to make it easier to understand.
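To make the trade-off concrete, here is a self-contained toy sketch of the two mechanisms (illustrative only; ToyBlockManager and the helper below are invented for this note and are not the code added in this PR):

    from typing import Dict, List

    class ToyBlockManager:
        """Toy bookkeeping of which KV cache blocks each request holds."""
        def __init__(self) -> None:
            self.gpu_blocks: Dict[str, List[int]] = {}
            self.cpu_blocks: Dict[str, List[int]] = {}

        def free(self, group_id: str) -> None:
            # Recomputation: simply drop the group's GPU blocks; the KV cache
            # is rebuilt by re-running prefill when the group is rescheduled.
            self.gpu_blocks.pop(group_id, None)

        def swap_out(self, group_id: str) -> Dict[int, int]:
            # Swapping: move each GPU block to a CPU block and return the
            # mapping, which becomes a batch of small GPU->CPU copies.
            gpu = self.gpu_blocks.pop(group_id, [])
            cpu = [b + 10_000 for b in gpu]  # fake CPU block ids
            self.cpu_blocks[group_id] = cpu
            return dict(zip(gpu, cpu))

    def preempt(group_id: str, mgr: ToyBlockManager, waiting: List[str],
                swapped: List[str], by_recompute: bool) -> None:
        if by_recompute:
            mgr.free(group_id)          # cheap to evict, pays recompute later
            waiting.insert(0, group_id)
        else:
            mgr.swap_out(group_id)      # preserves KV cache, pays transfer cost
            swapped.append(group_id)

    mgr = ToyBlockManager()
    mgr.gpu_blocks["req-0"] = [3, 7, 9]
    waiting, swapped = [], []
    preempt("req-0", mgr, waiting, swapped, by_recompute=True)  # req-0 re-enters waiting

The benchmark claim above is that the many small copies produced by swap_out are what make swapping slower in practice than simply re-running prefill.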

@WoosukKwon requested a review from zhuohan123 on March 30, 2023 00:04
@zhuohan123 (Member) left a comment

LGTM! Left some small comments.

# sequences, we only support swapping.
# TODO(woosuk): Support recomputation for sequence groups with multiple
# sequences.

@zhuohan123 (Member):

Should we add different preemption methods as options? For example, add a preempt_method function argument and can pick between swapping and recomputation.

@WoosukKwon (Collaborator, Author):

I added PreemptionMode and allowed the caller of _preempt to specify the mode. If the mode is not specified, we use recomputation for single-output requests and swapping for multi-output requests.
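A minimal, runnable sketch of the default-selection rule described here (the name PreemptionMode comes from the comment above; everything else is illustrative):

    from enum import Enum, auto
    from typing import Optional

    class PreemptionMode(Enum):
        SWAP = auto()        # move KV cache blocks to CPU memory
        RECOMPUTE = auto()   # drop KV cache and re-run prefill later

    def choose_preemption_mode(num_seqs_in_group: int,
                               requested: Optional[PreemptionMode] = None) -> PreemptionMode:
        """Pick a mode when the caller of _preempt does not specify one."""
        if requested is not None:
            return requested
        # Recomputation for single-output requests; swapping for multi-output
        # requests, since recomputation is unsupported when blocks may be shared.
        return PreemptionMode.RECOMPUTE if num_seqs_in_group == 1 else PreemptionMode.SWAP

    assert choose_preemption_mode(1) is PreemptionMode.RECOMPUTE
    assert choose_preemption_mode(4) is PreemptionMode.SWAP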

class PolicyFactory:

    _POLICY_REGISTRY = {
        'fcfs': FCFS,
    }
@zhuohan123 (Member):

Will we add SSF in another PR?

@WoosukKwon (Collaborator, Author):

Yes. In this PR, I tried to make minimal changes.

Comment on lines 60 to +147
# No other sequence groups can be swapped out.
if self.running:
# Preempt the lowest-priority sequence groups.
victim_seq_group = self.running.pop(-1)
self._preempt(victim_seq_group, blocks_to_swap_out)
preempted.append(victim_seq_group)
else:
# No other sequence groups can be preempted.
# Preempt the current sequence group.
self._preempt(seq_group, blocks_to_swap_out)
preempted.append(seq_group)
break
else:
# Append new slots to the sequence group.
self._append(seq_group, blocks_to_copy)
self.running = self.running[:victim_idx + 1]

# 2. Swap in the swapped sequences if possible.
# NOTE: Here we implicitly assume FCFS scheduling.
# The swapped sequences are in LIFO order.
for i, seq_group in enumerate(reversed(self.swapped)):
if self.block_manager.can_swap_in(seq_group):
self._swap_in(seq_group, blocks_to_swap_in)
self._append(seq_group, blocks_to_copy)
else:
# OOM. Stop swapping.
self.swapped = self.swapped[:len(self.swapped) - i]
running.append(seq_group)
self.running = running

# Swap in the sequence groups in the SWAPPED state if possible.
self.swapped = self.policy.sort_by_priority(now, self.swapped)
while self.swapped:
seq_group = self.swapped[0]
# If the sequence group has been preempted in this step, stop.
if seq_group in preempted:
break
# If the sequence group cannot be swapped in, stop.
if not self.block_manager.can_swap_in(seq_group):
break
else:
# All swapped sequences are swapped in.
self.swapped.clear()

# Ensure that swap-in and swap-out never happen at the same timestep.
if blocks_to_swap_in:
assert not blocks_to_swap_out
seq_group = self.swapped.pop(0)
self._swap_in(seq_group, blocks_to_swap_in)
self._append(seq_group, blocks_to_copy)
self.running.append(seq_group)

num_batched_tokens = sum(
seq_group.num_seqs(status=SequenceStatus.RUNNING)
for seq_group in self.running
)

# 3. Join new sequences if possible.
# NOTE: Here we implicitly assume FCFS scheduling.
# TODO(woosuk): Add a batching policy to control the batch size.
# Join waiting sequences if possible.
prompt_group_ids: List[int] = []
# NOTE(woosuk): The sequence groups in the SWAPPED state are strictly
# prioritized over the sequence groups in the WAITING state.
# This is because we want to bound the amount of CPU memory taken by
# the swapped sequence groups.
if not self.swapped:
for i, seq_group in enumerate(self.pending):
self.waiting = self.policy.sort_by_priority(now, self.waiting)
while self.waiting:
seq_group = self.waiting[0]
# If the sequence group has been preempted in this step, stop.
if seq_group in preempted:
break
# If the sequence group cannot be allocated, stop.
if not self.block_manager.can_allocate(seq_group):
break

# If the number of batched tokens exceeds the limit, stop.
num_prompt_tokens = seq_group.seqs[0].get_len()
if self.block_manager.can_allocate(seq_group):
if (num_batched_tokens + num_prompt_tokens
<= self.max_num_batched_tokens):
self._allocate(seq_group)
num_batched_tokens += num_prompt_tokens
continue

self.pending = self.pending[i:]
break
else:
self.pending.clear()
if (num_batched_tokens + num_prompt_tokens
> self.max_num_batched_tokens):
break

seq_group = self.waiting.pop(0)
self._allocate(seq_group)
self.running.append(seq_group)
num_batched_tokens += num_prompt_tokens
prompt_group_ids.append(seq_group.group_id)
@zhuohan123 (Member):

Maybe move this part to a new function dedicated to swapping and finding which sequences to run?

@WoosukKwon (Collaborator, Author):

Good point. I moved the scheduling logic to a new function _schedule.
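In outline, the refactored flow visible in the diff above looks roughly like this (a skeleton under assumptions, not the actual _schedule body):

    def _schedule(self):
        blocks_to_swap_in, blocks_to_swap_out, blocks_to_copy = {}, {}, {}
        preempted = []
        # 1. Keep RUNNING groups running; when new slots cannot be appended,
        #    preempt the lowest-priority groups and record them in `preempted`.
        ...
        # 2. Swap in SWAPPED groups while can_swap_in succeeds, skipping any
        #    group preempted in this step; swap-in and swap-out never happen
        #    in the same step.
        ...
        # 3. Admit WAITING groups (FCFS) while can_allocate succeeds and the
        #    batch stays under max_num_batched_tokens; SWAPPED groups are
        #    strictly prioritized over WAITING groups to bound CPU memory.
        ...
        return blocks_to_swap_in, blocks_to_swap_out, blocks_to_copy, preempted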

@@ -76,7 +76,8 @@ def __init__(
         self.block_tables: Dict[int, BlockTable] = {}

     def can_allocate(self, seq_group: SequenceGroup) -> bool:
-        # NOTE: Here we assume that all sequences in the group have the same prompt.
+        # FIXME(woosuk): Here we assume that all sequences in the group share
+        # the same prompt. This may not be true for preempted sequences.
@zhuohan123 (Member):

If I understand correctly, is this function only wrong when we use recomputation preemption for parallel decoding?

@WoosukKwon (Collaborator, Author):

Yes, and for beam search as well.
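To illustrate why the shared-prompt assumption breaks in those two cases, a toy calculation (the block size and helper below are invented for this note):

    from typing import List

    BLOCK_SIZE = 16  # tokens per KV cache block (illustrative value)

    def blocks_needed(token_counts: List[int], identical_seqs: bool) -> int:
        """KV cache blocks a sequence group needs at allocation time."""
        if identical_seqs:
            # Fresh prompts: every sequence in the group is identical, so
            # checking the first sequence is enough (can_allocate's assumption).
            return -(-token_counts[0] // BLOCK_SIZE)  # ceil division
        # A recompute-preempted parallel-decoding or beam-search group has
        # diverged sequences, so the single-sequence estimate undercounts.
        return sum(-(-n // BLOCK_SIZE) for n in token_counts)

    print(blocks_needed([40, 40, 40], identical_seqs=True))    # 3
    print(blocks_needed([40, 55, 70], identical_seqs=False))   # 3 + 4 + 5 = 12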

@WoosukKwon merged commit 7a7929a into main on Mar 30, 2023
@WoosukKwon deleted the recomp+sched branch on March 30, 2023 21:51
@masahi commented Dec 6, 2023

Hi @WoosukKwon, if we had a kernel that can do one of the following:

  • Prefill with a paged KV cache
  • A variant of the vLLM single-token decode attention kernel that can process multiple decode tokens

I think we could solve the problem of preemption by recomputation for multi-sequence requests. Do you agree?

We would first run a normal prefill pass on the shared prompt tokens, followed by the necessary copying of partially shared blocks.
Then we can run one of the two kernels above on the intermediate decode tokens. That way, we exploit the shared KV cache entries for the prompt tokens while correctly restoring the KV cache entries for the decode tokens of each sequence.
