
Implement preemption via recomputation & Refactor scheduling logic #12

Merged · 22 commits merged into main from recomp+sched on Mar 30, 2023

Conversation

@WoosukKwon (Collaborator) commented Mar 28, 2023

This PR implements a new preemption (eviction) mechanism, "recomputation". In our benchmarks, recomputation is more efficient than swapping, because swapping incurs significant overhead from the many small data transfers between CPU and GPU. Thus, we use recomputation as the default preemption mechanism.

However, we do not currently support recomputation for sequence groups with multiple sequences: when token blocks are shared, the recomputation logic becomes very complex and we do not have CUDA kernels to support it efficiently. We fall back to swapping in this case despite its overhead.

In addition, this PR refactors the scheduling logic to make it easier to understand.
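To make the trade-off concrete, here is a self-contained toy sketch of the two mechanisms (illustrative only; ToyBlockManager and the helper below are invented for this note and are not the code added in this PR):

    from typing import Dict, List

    class ToyBlockManager:
        """Toy bookkeeping of which KV cache blocks each request holds."""
        def __init__(self) -> None:
            self.gpu_blocks: Dict[str, List[int]] = {}
            self.cpu_blocks: Dict[str, List[int]] = {}

        def free(self, group_id: str) -> None:
            # Recomputation: simply drop the group's GPU blocks; the KV cache
            # is rebuilt by re-running prefill when the group is rescheduled.
            self.gpu_blocks.pop(group_id, None)

        def swap_out(self, group_id: str) -> Dict[int, int]:
            # Swapping: move each GPU block to a CPU block and return the
            # mapping, which becomes a batch of small GPU->CPU copies.
            gpu = self.gpu_blocks.pop(group_id, [])
            cpu = [b + 10_000 for b in gpu]  # fake CPU block ids
            self.cpu_blocks[group_id] = cpu
            return dict(zip(gpu, cpu))

    def preempt(group_id: str, mgr: ToyBlockManager, waiting: List[str],
                swapped: List[str], by_recompute: bool) -> None:
        if by_recompute:
            mgr.free(group_id)          # cheap to evict, pays recompute later
            waiting.insert(0, group_id)
        else:
            mgr.swap_out(group_id)      # preserves KV cache, pays transfer cost
            swapped.append(group_id)

    mgr = ToyBlockManager()
    mgr.gpu_blocks["req-0"] = [3, 7, 9]
    waiting, swapped = [], []
    preempt("req-0", mgr, waiting, swapped, by_recompute=True)  # req-0 re-enters waiting

The benchmark claim above is that the many small copies produced by swap_out are what make swapping slower in practice than simply re-running prefill.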

@WoosukKwon requested a review from zhuohan123 on March 30, 2023 00:04
@zhuohan123 (Member) left a comment

LGTM! Left some small comments.

# sequences, we only support swapping.
# TODO(woosuk): Support recomputation for sequence groups with multiple
# sequences.

@zhuohan123 (Member):

Should we add different preemption methods as options? For example, add a preempt_method function argument and can pick between swapping and recomputation.

@WoosukKwon (Collaborator, Author):

I added PreemptionMode and allowed the caller of _preempt to specify the mode. If the mode is not specified, we use recomputation for single-output requests and swapping for multi-output requests.
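A minimal, runnable sketch of the default-selection rule described here (the name PreemptionMode comes from the comment above; everything else is illustrative):

    from enum import Enum, auto
    from typing import Optional

    class PreemptionMode(Enum):
        SWAP = auto()        # move KV cache blocks to CPU memory
        RECOMPUTE = auto()   # drop KV cache and re-run prefill later

    def choose_preemption_mode(num_seqs_in_group: int,
                               requested: Optional[PreemptionMode] = None) -> PreemptionMode:
        """Pick a mode when the caller of _preempt does not specify one."""
        if requested is not None:
            return requested
        # Recomputation for single-output requests; swapping for multi-output
        # requests, since recomputation is unsupported when blocks may be shared.
        return PreemptionMode.RECOMPUTE if num_seqs_in_group == 1 else PreemptionMode.SWAP

    assert choose_preemption_mode(1) is PreemptionMode.RECOMPUTE
    assert choose_preemption_mode(4) is PreemptionMode.SWAP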

class PolicyFactory:

    _POLICY_REGISTRY = {
        'fcfs': FCFS,
    }
@zhuohan123 (Member):

Will we add SSF in another PR?

@WoosukKwon (Collaborator, Author):

Yes. In this PR, I tried to make minimal changes.

Comment on lines 60 to +147
# No other sequence groups can be swapped out.
if self.running:
# Preempt the lowest-priority sequence groups.
victim_seq_group = self.running.pop(-1)
self._preempt(victim_seq_group, blocks_to_swap_out)
preempted.append(victim_seq_group)
else:
# No other sequence groups can be preempted.
# Preempt the current sequence group.
self._preempt(seq_group, blocks_to_swap_out)
preempted.append(seq_group)
break
else:
# Append new slots to the sequence group.
self._append(seq_group, blocks_to_copy)
self.running = self.running[:victim_idx + 1]

# 2. Swap in the swapped sequences if possible.
# NOTE: Here we implicitly assume FCFS scheduling.
# The swapped sequences are in LIFO order.
for i, seq_group in enumerate(reversed(self.swapped)):
if self.block_manager.can_swap_in(seq_group):
self._swap_in(seq_group, blocks_to_swap_in)
self._append(seq_group, blocks_to_copy)
else:
# OOM. Stop swapping.
self.swapped = self.swapped[:len(self.swapped) - i]
running.append(seq_group)
self.running = running

# Swap in the sequence groups in the SWAPPED state if possible.
self.swapped = self.policy.sort_by_priority(now, self.swapped)
while self.swapped:
seq_group = self.swapped[0]
# If the sequence group has been preempted in this step, stop.
if seq_group in preempted:
break
# If the sequence group cannot be swapped in, stop.
if not self.block_manager.can_swap_in(seq_group):
break
else:
# All swapped sequences are swapped in.
self.swapped.clear()

# Ensure that swap-in and swap-out never happen at the same timestep.
if blocks_to_swap_in:
assert not blocks_to_swap_out
seq_group = self.swapped.pop(0)
self._swap_in(seq_group, blocks_to_swap_in)
self._append(seq_group, blocks_to_copy)
self.running.append(seq_group)

num_batched_tokens = sum(
seq_group.num_seqs(status=SequenceStatus.RUNNING)
for seq_group in self.running
)

# 3. Join new sequences if possible.
# NOTE: Here we implicitly assume FCFS scheduling.
# TODO(woosuk): Add a batching policy to control the batch size.
# Join waiting sequences if possible.
prompt_group_ids: List[int] = []
# NOTE(woosuk): The sequence groups in the SWAPPED state are strictly
# prioritized over the sequence groups in the WAITING state.
# This is because we want to bound the amount of CPU memory taken by
# the swapped sequence groups.
if not self.swapped:
for i, seq_group in enumerate(self.pending):
self.waiting = self.policy.sort_by_priority(now, self.waiting)
while self.waiting:
seq_group = self.waiting[0]
# If the sequence group has been preempted in this step, stop.
if seq_group in preempted:
break
# If the sequence group cannot be allocated, stop.
if not self.block_manager.can_allocate(seq_group):
break

# If the number of batched tokens exceeds the limit, stop.
num_prompt_tokens = seq_group.seqs[0].get_len()
if self.block_manager.can_allocate(seq_group):
if (num_batched_tokens + num_prompt_tokens
<= self.max_num_batched_tokens):
self._allocate(seq_group)
num_batched_tokens += num_prompt_tokens
continue

self.pending = self.pending[i:]
break
else:
self.pending.clear()
if (num_batched_tokens + num_prompt_tokens
> self.max_num_batched_tokens):
break

seq_group = self.waiting.pop(0)
self._allocate(seq_group)
self.running.append(seq_group)
num_batched_tokens += num_prompt_tokens
prompt_group_ids.append(seq_group.group_id)
@zhuohan123 (Member):

Maybe move this part to a new function dedicated to swapping and finding which sequences to run?

@WoosukKwon (Collaborator, Author):

Good point. I moved the scheduling logic to a new function _schedule.
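In outline, the refactored flow visible in the diff above looks roughly like this (a skeleton under assumptions, not the actual _schedule body):

    def _schedule(self):
        blocks_to_swap_in, blocks_to_swap_out, blocks_to_copy = {}, {}, {}
        preempted = []
        # 1. Keep RUNNING groups running; when new slots cannot be appended,
        #    preempt the lowest-priority groups and record them in `preempted`.
        ...
        # 2. Swap in SWAPPED groups while can_swap_in succeeds, skipping any
        #    group preempted in this step; swap-in and swap-out never happen
        #    in the same step.
        ...
        # 3. Admit WAITING groups (FCFS) while can_allocate succeeds and the
        #    batch stays under max_num_batched_tokens; SWAPPED groups are
        #    strictly prioritized over WAITING groups to bound CPU memory.
        ...
        return blocks_to_swap_in, blocks_to_swap_out, blocks_to_copy, preempted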

@@ -76,7 +76,8 @@ def __init__(
         self.block_tables: Dict[int, BlockTable] = {}

     def can_allocate(self, seq_group: SequenceGroup) -> bool:
-        # NOTE: Here we assume that all sequences in the group have the same prompt.
+        # FIXME(woosuk): Here we assume that all sequences in the group share
+        # the same prompt. This may not be true for preempted sequences.
@zhuohan123 (Member):

If I understand correctly, is this function only wrong when we use recomputation preemption for parallel decoding?

@WoosukKwon (Collaborator, Author):

Yes, and for beam search as well.
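To illustrate why the shared-prompt assumption breaks in those two cases, a toy calculation (the block size and helper below are invented for this note):

    from typing import List

    BLOCK_SIZE = 16  # tokens per KV cache block (illustrative value)

    def blocks_needed(token_counts: List[int], identical_seqs: bool) -> int:
        """KV cache blocks a sequence group needs at allocation time."""
        if identical_seqs:
            # Fresh prompts: every sequence in the group is identical, so
            # checking the first sequence is enough (can_allocate's assumption).
            return -(-token_counts[0] // BLOCK_SIZE)  # ceil division
        # A recompute-preempted parallel-decoding or beam-search group has
        # diverged sequences, so the single-sequence estimate undercounts.
        return sum(-(-n // BLOCK_SIZE) for n in token_counts)

    print(blocks_needed([40, 40, 40], identical_seqs=True))    # 3
    print(blocks_needed([40, 55, 70], identical_seqs=False))   # 3 + 4 + 5 = 12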

@WoosukKwon merged commit 7a7929a into main on Mar 30, 2023
@WoosukKwon deleted the recomp+sched branch on March 30, 2023 21:51
@masahi commented Dec 6, 2023

Hi @WoosukKwon, if we had a kernel that can do one of the following:

  • Prefill with a paged KV cache
  • A variant of the vLLM single-token decode attention kernel that can process multiple decode tokens

I think we could solve the problem of preemption by recomputation for multi-sequence requests. Do you agree?

We would first run a normal prefill pass on the shared prompt tokens, followed by the necessary copying of partially shared blocks.
Then we can run one of the two kernels above on the intermediate decode tokens. That way, we exploit the shared KV cache entries for the prompt tokens while correctly restoring the KV cache entries for the decode tokens of each sequence.
