[RFC]: Support KV Cache Compaction #10646
Comments
@heheda12345 @ywang96 This seems to be quite related to your new memory allocator.
On the research side, @lynnliu030 is experimenting with token dropping's impact on memory allocation. I think we can revisit this around EOY and discuss the exact API change. Can you also list some example KV compaction methods you have in mind? I gather you currently have attention sink and H2O, but are there any other types you expect to support, and how would they affect the design?
The key challenge for supporting more methods is on the memory management side. Both attention sink and H2O maintain the same number of tokens across different heads and layers. If we want to support more advanced methods, we need a more flexible memory layout.
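To make the layout distinction concrete, here is a small sketch (illustrative only, not vLLM code; the sizes and function names are assumptions) contrasting a uniform layout, where every layer and head keeps the same number of tokens, with the flexible per-(layer, head) accounting that more advanced compaction methods would require:

```python
# Illustrative sizes, roughly a 7B-class model with fp16 KV cache.
num_layers, num_heads, head_dim, dtype_bytes = 32, 32, 128, 2


def uniform_kv_bytes(tokens: int) -> int:
    # Uniform layout (attention sink, H2O): a single token count is
    # shared by all layers and heads, so one block table suffices.
    # Factor 2 accounts for both the K and V tensors.
    return 2 * num_layers * num_heads * tokens * head_dim * dtype_bytes


def per_head_kv_bytes(tokens_per_head: list) -> int:
    # Flexible layout: tokens_per_head[layer][head] lets each
    # (layer, head) pair keep a different number of tokens, which
    # means the allocator needs per-(layer, head) block tables.
    return 2 * sum(t * head_dim * dtype_bytes
                   for layer in tokens_per_head for t in layer)


# With identical per-head counts, the two layouts cost the same;
# the flexible one only pays off when counts diverge across heads.
uniform = uniform_kv_bytes(1024)
flexible = per_head_kv_bytes([[1024] * num_heads
                              for _ in range(num_layers)])
```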
Motivation.
KV cache compaction (i.e., token dropping) can significantly reduce memory footprint in LLM serving, especially for long-generation and large-batch-size workloads. The plan is to support the latest KV compaction methods, such as FastGen and DuoAttention, and also to provide a flexible interface for developers to add their own compaction methods.
Proposed Change.
To support KV cache compaction, we need a `free_and_reallocate` functionality to reduce memory fragmentation after memory compaction. A workaround is to use `block_size=1`.

A prototype is available at https://github.com/LMCache/LMCache/blob/compaction/examples/compactor/README.md .
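A minimal sketch of what such a pluggable compaction interface could look like, assuming `block_size=1` so each token owns one block. The names here (`Compactor`, `AttentionSinkCompactor`, `free_and_reallocate`) are hypothetical illustrations, not the actual vLLM or LMCache API:

```python
from abc import ABC, abstractmethod
from typing import List, Set


class Compactor(ABC):
    """Decides which token positions of a sequence can be dropped."""

    @abstractmethod
    def tokens_to_drop(self, seq_len: int) -> List[int]:
        ...


class AttentionSinkCompactor(Compactor):
    """Keep the first `num_sink` tokens and the last `window` tokens
    (StreamingLLM-style attention sinks); drop everything in between."""

    def __init__(self, num_sink: int = 4, window: int = 1024):
        self.num_sink = num_sink
        self.window = window

    def tokens_to_drop(self, seq_len: int) -> List[int]:
        if seq_len <= self.num_sink + self.window:
            return []
        return list(range(self.num_sink, seq_len - self.window))


def free_and_reallocate(block_table: List[int],
                        dropped: Set[int]) -> List[int]:
    """Toy version of the proposed free_and_reallocate step: with
    block_size=1 each token maps to one block, so dropping tokens
    just frees their blocks and compacts the table in place, which
    is why block_size=1 avoids fragmentation."""
    return [blk for pos, blk in enumerate(block_table)
            if pos not in dropped]


# Usage: drop the middle of an 8-token sequence, keep sinks + window.
compactor = AttentionSinkCompactor(num_sink=2, window=3)
dropped = set(compactor.tokens_to_drop(8))          # positions {2, 3, 4}
new_table = free_and_reallocate(list(range(8)), dropped)  # [0, 1, 5, 6, 7]
```

A per-method subclass keeps the eviction policy separate from the allocator, so new compaction methods only need to implement `tokens_to_drop`.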
Feel free to share any thoughts and comments!!
Feedback Period.
Several weeks.
CC List.
@simon-mo @KuntaiDu @comaniac @youkaichao
Any Other Things.
No response