Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Sliding window for block manager v2 #4545
[Core] Sliding window for block manager v2 #4545
Changes from 44 commits
6b1d3b2
d55db5e
be53001
ac070a1
b37f028
368c3ee
c4e533d
7bc88f2
2825cd7
0ade169
f703af2
9661776
a0459b4
6326fbc
785aa19
165b7a8
e26631e
57678f4
22e9bb8
be984d1
34b0fef
b0261fd
0a3d3b6
c947103
a28116f
2a5436d
cc1467f
1edc9be
85d8fd9
29fd0d5
09c192a
54a5e93
609a9ce
5079d9a
83f82d9
0bb1f67
7e7bd02
728b722
96b71a0
0ac37c0
e086854
bdad3bb
d9521ba
24844dd
f4ede62
63d9e50
7f00204
fa1ea2f
97aa968
ce04cde
0a5a73d
60f0d25
2855de6
14f33b1
d58aefd
507d2dc
cc6cab4
dffca79
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm but this compares block manager v1 vs v2? Doesn't this mean everything else is equivalent (i.e., kernels and things like that).
cc @cadedaniel do you think this is expected outputs are different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, it's not the same kernel arguments. With v2 manager we're passing up 15 kv entries more to the paged attention kernel. With v1, the kv entries "wrap around" in the blocks, but they do not with v2 (needed for prefix caching etc.), see description in #4768
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, the reason why this is so hard to test is because the semantics of blocks are changed in V2. E.g. we now have clear distinction between mutable and immutable blocks. so the kernels that previously would overwrite blocks (causing U.B. with copy-on-write in V1) now don't, but the downside is they capture additional context since we don't yet have masking.
The tests in this PR are not really good enough to catch correctness issues with the sliding window block mapping. The error tolerance in this test is very high and the unit test
test_sliding_window
only checks num consumed is correct.That said, this is an improvement over the previous sliding window tests and I think we can merge and follow up later..
FWIW the way I'd test this is one/both of the following:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid these tests won't catch issues with the block mapping. E.g. I expect error to accumulate over many tokens before we see a significant divergence in attention scores. 10/4096 tokens is not very much, same for 128/4096 although it's better.
WDYT? Is my intuition right? Should we test with larger generation size? Another option is to patch
sliding_window
to be smaller (e.g. two blocks) so the impact of any error is larger. If we go with patchingsliding_window
we could even use one of the 68m models for a faster test.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was an issue with the block tables passed to single token decode, which was causing different output with 1024 tokens. I have now fixed that and bumped test size to 1024.
However, it's still slightly incorrect because the decode kernel does not support sliding window natively - the way it works now it just takes all the blocks passed in (up to seq_len). With v1 manager, the sliding window uses blocks in a "ring buffer" fashion, so this is not a problem. With the new block manager we need potentially to start attention computation in the middle of a block, otherwise we pay attention to a few tokens too many. It doesn't seem to affect this test though.
I have started fixing the decode kernel, but I think that should be a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked at the changes but want to say that yes, this problem is known and we should fix it eventually (awesome if you want to do it). let's get this PR in with good tests for where we're at and future PR can fix the decode kernel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any good way to assert the return prompt token len > 4k?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this the higest we can get? Or more of arbitrarty value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's arbitrary; I've been mostly getting 4/5 right, sometimes 5/5