[V1][WIP] 2nd try of Hybrid allocator for full attention & sliding window attention interleaved models #13296
Is this PR also built on top of #12086?
Hi @heheda12345, thanks for the great work and sorry again for the delays in my review. This work is truly amazing! Now I think I (almost) fully understand the idea. Most changes proposed in this PR look very reasonable and seem to be the "right" thing to do. What I'm unsure about are:

Let me write down my understanding here: this PR starts from the observation that if a certain pattern of layers is repeating, we can use the symmetry to simplify the memory view. For instance, if a single type of attention repeats across all layers, we can view them as if the model only has a single layer. As another example, for models with N:1 sliding & global attention layers, we can treat the model as if it only has N+1 layers. This is essentially the concept of "group" in this PR.

[Confusion 1] However, the N+1 layers are NOT a "group". IIRC, there's no name for this set of layers. I feel like these N+1 layers should be defined as a "group"... Then, the N+1 layers need to dynamically share a fixed amount of memory space. The good news is that, for the sliding & global attention mix, both types of attention require the same size of memory (e.g.,

Let's say L0 uses global attention while L1, ..., LN use sliding window attention. Then, in this PR, we create a cache manager for each layer; N+1 managers in total. These managers are controlled by

[Proposal 1] While this may be contradictory with my previous suggestion, now I feel that it's nicer to name this

[Question 1] I understand that we need N+1 block tables. However, do we really need N+1 memory managers? I feel like we can have 2 managers, one for global attention and another for SWA, and have the SWA memory manager manage the blocks for N layers. This could potentially reduce the overheads. For allocation, free, and prefix caching,

[Confusion 2] I don't fully understand how we manage the memory for sliding window attention (apologies if I missed something). Especially, I find it difficult to understand how prefix caching works when it's mixed with global attention. I think we should be clearer about this.

[Question 2] Specifically, we need to provide clear answers to the following questions:

The changes in the model executor seem relatively easy to understand. We only need to do some extra work because we now have N+1 block tables instead of one.

[Proposal 2] Do we really need
The goal of this PR is to support full attention & sliding window attention interleaved models in V1 via a hybrid KV cache allocator.
High Level Idea of Hybrid Allocator
When the model becomes hybrid, we want each layer to have its own `block_table`, so that we can allocate a different number of blocks to different layers. For example, we allocate blocks for all tokens in the full attention layers, but only for the tokens inside the sliding window in the sliding window layers. However, allocating blocks for each layer individually is inefficient, as we would need to perform the allocation `num_layer` times for every `block_size` tokens. So we introduce the KV Cache Group to group layers that can share the same `block_table`. We need to make sure of the following properties: (1) each group only contains layers of the same attention type, and (2) every group has the same number of layers.

For example, `mistralai/Ministral-8B-Instruct-2410` has 9 full attention layers and 27 sliding window layers. We can group the layers into one group of 9 full attention layers and three groups of 9 sliding window layers. Each group has 9 layers (property 2), and each group contains either only full attention layers or only sliding window layers (property 1). We can allocate blocks for each group, and share the block_table among the layers in the same group.
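To make the grouping rule concrete, here is a minimal sketch (not the PR's actual code; the helper name, its signature, and the assumed 3:1 layer layout are illustrative only):

```python
from collections import defaultdict

def group_layers(layer_types: list[str]) -> list[list[int]]:
    """Group layer indices so each group holds one attention type
    (property 1) and all groups have the same number of layers (property 2)."""
    by_type: dict[str, list[int]] = defaultdict(list)
    for idx, attn_type in enumerate(layer_types):
        by_type[attn_type].append(idx)

    # Use the smallest per-type count as the group size; this assumes the
    # larger counts are exact multiples of it (true for the 9:27 case).
    group_size = min(len(layers) for layers in by_type.values())
    groups: list[list[int]] = []
    for layers in by_type.values():
        assert len(layers) % group_size == 0
        groups.extend(layers[i:i + group_size]
                      for i in range(0, len(layers), group_size))
    return groups

# Hypothetical 36-layer layout with a 3:1 sliding:full pattern.
layer_types = ["full" if i % 4 == 3 else "sliding" for i in range(36)]
print(len(group_layers(layer_types)))  # 4 groups of 9 layers each
```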
To achieve this, we need to add a `group` dimension to the block_table, changing it from `List[int]` to `List[List[int]]`, and support block allocation & model execution with the group dimension. We also need an abstraction over the different types of layers, so that we can manage them in the same way.
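For illustration, a request's block table would conceptually change as follows (block IDs are made up; 4 groups as in the Ministral example):

```python
# Before: one flat block table per request.
block_table: list[int] = [0, 1, 2, 3]

# After: one block table per KV cache group, indexed as
# grouped_block_table[group_id][i].
grouped_block_table: list[list[int]] = [
    [0, 1, 2, 3],   # group 0: full attention layers, all tokens kept
    [4, 5],         # group 1: sliding window layers, only recent tokens
    [6, 7],         # group 2
    [8, 9],         # group 3
]
```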
For the layer abstraction, we introduce `SpecializedManager` with the following interfaces:

- `get_possible_cached_prefix` detects the hit prefix based on the hit rule of the layer;
- `get_num_new_blocks` returns the number of new blocks that need to be allocated for the layer;
- `remove_useless_blocks` frees the blocks that are not used anymore.

This PR implements this interface for full attention layers and sliding window layers. The abstraction also helps the current `KVCacheManager` to support sliding window models.
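A minimal sketch of how such an interface could look (the method names come from this PR, but the signatures and docstrings below are assumptions):

```python
from abc import ABC, abstractmethod


class SpecializedManager(ABC):
    """Per-attention-type allocation logic (sketch; real signatures may differ)."""

    def __init__(self, block_size: int):
        self.block_size = block_size

    @abstractmethod
    def get_possible_cached_prefix(self, block_hashes: list[int]) -> list[int]:
        """Return the blocks of the longest prefix that can be a cache hit
        under this layer type's hit rule."""

    @abstractmethod
    def get_num_new_blocks(self, request, num_new_tokens: int) -> int:
        """Return how many new blocks this layer type needs for the new tokens."""

    @abstractmethod
    def remove_useless_blocks(self, request) -> list[int]:
        """Return the blocks that are no longer needed (e.g., blocks that fell
        out of the sliding window) so they can be freed."""
```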
For block allocation, we introduce `HybridKVCacheManager` with the same interface as `KVCacheManager`, but with support for multiple groups:

- `get_computed_blocks` finds the longest prefix that is a cache hit for all groups;
- `allocate_slots` sums up the number of new blocks needed by each group to check whether we can allocate blocks for the given request, and then allocates blocks for each group;
- `free` frees the blocks of each group, sorting them by the eviction order and putting them back into the `free_block_queue`;
- `get_num_common_prefix_blocks` is computed for each group.
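As a sketch of the per-group accounting in `allocate_slots` (the class skeleton and helper names below are assumptions, not the PR's code):

```python
class HybridKVCacheManagerSketch:
    """Only the allocation path is sketched; specialized_managers and
    block_pool are assumed to follow the sketches in this description."""

    def __init__(self, specialized_managers, block_pool):
        self.specialized_managers = specialized_managers  # one per group
        self.block_pool = block_pool                      # shared by all groups

    def allocate_slots(self, request, num_new_tokens: int):
        # Each group's specialized manager decides how many new blocks it
        # needs (full attention and sliding window differ here).
        num_new_blocks = [
            manager.get_num_new_blocks(request, num_new_tokens)
            for manager in self.specialized_managers
        ]
        # All groups draw from one shared block pool, so check the total.
        if sum(num_new_blocks) > self.block_pool.get_num_free_blocks():
            return None  # cannot schedule this request right now
        # Allocate per group; block IDs of different groups never overlap.
        return [self.block_pool.allocate(n) for n in num_new_blocks]
```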
Note that the hybrid allocator still performs allocation / free / prefix caching at the block level, so we can reuse most of the operations on `_block_pool`, `_free_block_queue`, and `_cached_block_hash_to_block`. We move the operations on these objects from `KVCacheManager` to a new `BlockPool` class in `v1/core/block_pool.py` to avoid code duplication between the two managers.
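Conceptually, `BlockPool` owns the shared block-level bookkeeping; a simplified sketch (the method names and data layout here are assumptions, not the PR's exact API):

```python
class BlockPool:
    """Shared block-level state reused by both KV cache managers (sketch)."""

    def __init__(self, num_gpu_blocks: int):
        # Free blocks in eviction order (front = evicted first).
        self._free_block_queue: list[int] = list(range(num_gpu_blocks))
        # Prefix-cache lookup: block hash -> cached block ID.
        self._cached_block_hash_to_block: dict[int, int] = {}

    def get_num_free_blocks(self) -> int:
        return len(self._free_block_queue)

    def allocate(self, num_blocks: int) -> list[int]:
        assert num_blocks <= len(self._free_block_queue)
        blocks = self._free_block_queue[:num_blocks]
        del self._free_block_queue[:num_blocks]
        return blocks

    def free(self, blocks: list[int]) -> None:
        # Freed blocks go to the back so recently used blocks are evicted last.
        self._free_block_queue.extend(blocks)
```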
For model execution, we introduce `GroupedBlockTable` to record the block_table on the device side. It contains multiple `BlockTable` instances, one for each group. `GroupedBlockTable` has the same interface as `BlockTable`, and broadcasts each operation to all `BlockTable` instances. Instead of `num_layer` separate tensors for saving the KV cache at runtime, we now use `num_layer / num_group` tensors, and each tensor is shared by one layer in each group. As different groups own different block_ids, the memory allocated for each group does not overlap. For example, in the above Ministral model, the 36 layers share 9 KV cache tensors (36 layers / 4 groups), each used by one layer from each of the 4 groups.
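A rough sketch of the broadcast behaviour of `GroupedBlockTable` (only the broadcast idea is taken from this description; the constructor arguments and the exact `append_row` signature are assumptions):

```python
from vllm.v1.worker.block_table import BlockTable  # module path taken from the PR's file list


class GroupedBlockTable:
    """One BlockTable per KV cache group; every call is broadcast (sketch)."""

    def __init__(self, num_groups: int, *block_table_args, **block_table_kwargs):
        # BlockTable's exact signature is not spelled out here, so its
        # constructor arguments are simply forwarded.
        self.block_tables = [
            BlockTable(*block_table_args, **block_table_kwargs)
            for _ in range(num_groups)
        ]

    def append_row(self, row_idx: int, block_ids: list[list[int]]) -> None:
        # block_ids holds one list of block IDs per group; each BlockTable
        # only receives the IDs that belong to its own group.
        for group_id, block_table in enumerate(self.block_tables):
            block_table.append_row(row_idx, block_ids[group_id])
```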
To avoid introducing overhead for uniform models, we still keep the original KVCacheManager & BlockTable. When the model has only one group, we use KVCacheManager & BlockTable, passing a block_table of type `List[int]` between them in `SchedulerOutput`, which is the same as before. When the model has multiple groups, we use HybridKVCacheManager & GroupedBlockTable, passing a block_table of type `List[List[int]]` between them in `SchedulerOutput`. Therefore, the type of block_table in `NewRequestData` and `CachedRequestData` becomes `MayGroupedBlockIDs`:
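The alias itself is not shown here; presumably it is something along these lines (an assumption based on the two cases described above, not the PR's actual definition):

```python
from typing import List, Union

# Flat block IDs when there is a single KV cache group, or one list of
# block IDs per group when the model is hybrid.
MayGroupedBlockIDs = Union[List[int], List[List[int]]]
```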
Detailed Modifications
- [`v1/core/specialized_manager.py`] Makes some abstractions for the different logic of full attention layers and sliding window layers. In addition to supporting the hybrid allocator, we also use these abstractions to support pure sliding window models in KVCacheManager (`v1/core/kv_cache_manager.py`).
- [`v1/core/kv_cache_manager.py`, `v1/core/hybrid_kv_cache_manager.py`, `v1/core/block_pool.py`] Two KV cache managers with the same interface. The common functions are in the `BlockPool` class in `block_pool.py`.
- [`v1/core/hybrid_kv_cache_manager.py`] The HybridKVCacheManager; see its implementation in the section above.
- [`attention/layer.py` & `v1/worker/gpu_model_runner.py`] Change the attention metadata to a `Dict[layer_name, AttentionMetadata]`.
- [`v1/core/kv_cache_interface.py`, `v1/core/kv_cache_utils.py`, `v1/worker/gpu_model_runner.py`] Add `is_kv_cache_page_size_uniform` and `_get_kv_cache_config_uniform_page_size`, and tell the worker the KV cache memory layout with `KVCacheConfig.tensors`. In the Ministral example, it will be:
- [`config.py`] Small interface change of KVCacheSpec inside KVCacheConfig: move from dict[layer_name, KVCacheSpec] to an attribute of each group (`KVCacheGroup.kv_cache_spec`).
- [`v1/core/kv_cache_utils.py`] Add `kv_cache_group_id` to the block hash, to know which group the block belongs to. `block_id=-1`
- [`v1/core/scheduler.py`] Change the block_table type from `List[int]` to `MaybeGroupedBlockID`. For `if num_new_tokens == 0:` and the hybrid manager, ignore all computed tokens as a temporary solution. (It would be easier to handle this corner case in HybridKVCacheManager instead of the scheduler.)
- [`v1/worker/block_table.py`, `v1/worker/gpu_model_runner`] Add GroupedBlockTable. Small interface changes to the BlockTable class to make the broadcast easier, e.g., remove the `start_index` argument of `append_row`. `GPUModelRunner.__init__` instead of `GPUModelRunner._initialize_kv_caches`.
- [`forward_context.py`] When AttentionMetadata becomes a per-layer dict, it is difficult to get the global information that is held by all layers, e.g., `num_input_tokens`. Therefore, add `ForwardMetadata` to save the global information (a rough sketch follows at the end of this section).
- [`v1/request.py`] Cherry-pick "[V1] Move KV block hashes from Request to KVCacheManager" #12922.
- [`v1/worker/gpu_model_runner.py`] Remove model-related args like `num_attn_layers`, `num_query_heads`, `num_kv_heads`. They are available in the kv-cache-spec.

RFC #11382
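A rough sketch of the per-layer attention metadata plus the shared `ForwardMetadata` mentioned above (the field and class bodies below are illustrative assumptions, not the PR's exact definitions):

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AttentionMetadata:
    """Per-layer metadata (simplified placeholder for the real class)."""
    block_table: List[int]
    seq_lens: List[int]


@dataclass
class ForwardMetadata:
    """Global, layer-independent information that used to live in the single
    AttentionMetadata object, e.g. the padded token count for this step."""
    num_input_tokens: int


@dataclass
class ForwardContext:
    # One AttentionMetadata per attention layer, keyed by layer name, since
    # layers in different KV cache groups now see different block tables.
    attn_metadata: Dict[str, AttentionMetadata]
    forward_metadata: ForwardMetadata
```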