Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Sliding window for block manager v2 #4545
[Core] Sliding window for block manager v2 #4545
Changes from 19 commits
6b1d3b2
d55db5e
be53001
ac070a1
b37f028
368c3ee
c4e533d
7bc88f2
2825cd7
0ade169
f703af2
9661776
a0459b4
6326fbc
785aa19
165b7a8
e26631e
57678f4
22e9bb8
be984d1
34b0fef
b0261fd
0a3d3b6
c947103
a28116f
2a5436d
cc1467f
1edc9be
85d8fd9
29fd0d5
09c192a
54a5e93
609a9ce
5079d9a
83f82d9
0bb1f67
7e7bd02
728b722
96b71a0
0ac37c0
e086854
bdad3bb
d9521ba
24844dd
f4ede62
63d9e50
7f00204
fa1ea2f
97aa968
ce04cde
0a5a73d
60f0d25
2855de6
14f33b1
d58aefd
507d2dc
cc6cab4
dffca79
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid these tests won't catch issues with the block mapping. E.g. I expect error to accumulate over many tokens before we see a significant divergence in attention scores. 10/4096 tokens is not very much, same for 128/4096 although it's better.
WDYT? Is my intuition right? Should we test with larger generation size? Another option is to patch
sliding_window
to be smaller (e.g. two blocks) so the impact of any error is larger. If we go with patchingsliding_window
we could even use one of the 68m models for a faster test.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was an issue with the block tables passed to single token decode, which was causing different output with 1024 tokens. I have now fixed that and bumped test size to 1024.
However, it's still slightly incorrect because the decode kernel does not support sliding window natively - the way it works now it just takes all the blocks passed in (up to seq_len). With v1 manager, the sliding window uses blocks in a "ring buffer" fashion, so this is not a problem. With the new block manager we need potentially to start attention computation in the middle of a block, otherwise we pay attention to a few tokens too many. It doesn't seem to affect this test though.
I have started fixing the decode kernel, but I think that should be a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked at the changes but want to say that yes, this problem is known and we should fix it eventually (awesome if you want to do it). let's get this PR in with good tests for where we're at and future PR can fix the decode kernel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we have some comments on this code? (what is this branch for?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a unit test for this? see others here https://github.com/vllm-project/vllm/blob/main/tests/core/block/test_block_table.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not super familiar with interface, but if we use allocate_immutable, isn't this supposed to guarantee the hack below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem to prevent modifications, at least for the
NaiveBlocks
- we now use aNullBlock
wrapper anyways.allocate_immutable()
makes the block participate in hash tables of valid prefixes. I'm not sure we want the null block there (and even then, it's probably not very useful), so I keptallocate_mutable()