llama : refactor k-shift implementation + initial defragmentation #5691
Conversation
It's ok to reuse the same scheduler - being able to use the same instance with different graphs was one of the goals. The reallocations are not really a performance problem because they are a very fast operation; previously this was essentially done on every eval. The exception is if it causes a buffer reallocation, but that shouldn't happen if the scheduler was properly initialized with a worst-case graph. Currently, reallocations happen all the time precisely because the K-shift is part of the worst-case graph used to initialize the scheduler, and it shouldn't be, because it has a different graph topology than normal evaluations. Reallocations are, however, a problem for the implementation of pipeline parallelism, because they can cause the addresses of some tensors to change, which in turn requires a full synchronization to avoid overwriting some tensors, which breaks the parallelism. So this is something that I am working on addressing. The obvious solution is just calling
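For reference, a minimal sketch of the reserve-once pattern being described; the graph-building helpers are hypothetical stand-ins for llama.cpp's own graph construction, only the ggml_backend_sched_* calls are real:

```c
// Sketch only: reserve the scheduler once with a worst-case graph so that later
// evaluations with the same topology do not trigger buffer reallocations.
#include "ggml.h"
#include "ggml-backend.h"

// hypothetical helpers (not part of ggml) that build the graphs to evaluate
extern struct ggml_cgraph * build_worst_case_graph(struct ggml_context * ctx);
extern struct ggml_cgraph * build_eval_graph      (struct ggml_context * ctx, int step);

void eval_loop(ggml_backend_sched_t sched, struct ggml_context * ctx, int n_steps) {
    // one-time reservation with the largest graph we ever expect to evaluate
    ggml_backend_sched_reserve(sched, build_worst_case_graph(ctx));

    for (int i = 0; i < n_steps; ++i) {
        // same topology as the worst-case graph -> no buffer reallocation expected
        struct ggml_cgraph * gf = build_eval_graph(ctx, i);
        ggml_backend_sched_graph_compute(sched, gf);
        ggml_backend_sched_reset(sched);
    }
}
```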
I see, I hadn't noticed where the reserves occur. Yeah, it's a bit cumbersome to have to keep track of the graph topology manually. I'm thinking of something like passing an optional callback that constructs a worst-case graph together with the graph we want to evaluate, and having the backend automatically decide whether to materialize the worst-case graph and reserve based on a hash of the topology (nodes + types). But it still sounds complicated, so probably not worth it.
I just realized that there is a substantial optimization possible in the defragmentation strategy. Will try to implement it, so I converted the PR back to draft.
The initial defragmentation strategy was very naive: start from the beginning of the cache, and when we find an empty cell, move the next non-empty cell into it. This caused moving the entire cache when the "hole" is at the start (a typical example is during context shift, where we remove the oldest cells). The implementation also prevented multi-threading.

With the new strategy, we find the "holes" starting from the beginning of the cache and fill them with data from the end of the cache. This now avoids overlapping moves. There is still some chance for a bug, so I will leave wider application for next PRs.

@slaren Let me know if you have concerns about the new K-shift and defrag graphs that are introduced here
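For illustration, a rough sketch of the new move-planning idea in plain C over a toy occupancy array; the real code operates on the llama_kv_cell bookkeeping and performs the copies via a ggml graph, so this is only a model of the strategy:

```c
#include <stdbool.h>
#include <stdio.h>

// Toy illustration: scan for holes from the start of the cache and fill them
// with occupied cells taken from the end, so that source and target ranges
// never overlap.
static void plan_defrag_moves(bool used[], int n_cells) {
    int hole = 0;           // first free cell, scanning from the start
    int src  = n_cells - 1; // last used cell, scanning from the end

    while (true) {
        while (hole < n_cells &&  used[hole]) hole++; // next hole
        while (src  >= 0      && !used[src])  src--;  // next used cell from the back
        if (hole >= src) break;                       // cache is now contiguous

        printf("move cell %d -> %d\n", src, hole);    // one non-overlapping move
        used[hole] = true;
        used[src]  = false;
    }
}

int main(void) {
    bool used[8] = { false, false, true, true, false, true, true, true };
    plan_defrag_moves(used, 8); // prints: 7 -> 0, 6 -> 1, 5 -> 4
    return 0;
}
```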
From what I can tell, these graphs always do the operations inplace, and they don't allocate any tensors in the compute buffer. So I don't think it will cause any issues with the scheduler, or buffer reallocations.
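For illustration, a generic ggml sketch of the inplace property being described (not the PR's actual K-shift or defrag graph): an *_inplace op writes into its source tensor, so the graph allocator does not need to reserve compute-buffer memory for its result.

```c
#include "ggml.h"

// Build a tiny graph whose only node modifies an existing tensor in place.
// k_cache stands in for a tensor that already lives in its own buffer
// (e.g. the KV cache); the scale is a stand-in for the real rope shift.
struct ggml_cgraph * build_inplace_graph(struct ggml_context * ctx, struct ggml_tensor * k_cache) {
    struct ggml_cgraph * gf = ggml_new_graph(ctx);

    struct ggml_tensor * cur = ggml_scale_inplace(ctx, k_cache, 1.0f);
    ggml_build_forward_expand(gf, cur);

    return gf;
}
```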
I think this commit introduces a bug causing bad model responses, at least based on the tests below.

commit bf08e00 (this commit / bugged)
commit f762501 (previous commit / fine)

EDIT: This seems to affect Gemma only - I tested with Mistral and Phi and they both produce the right output. Also tested on Linux, same outcome.
@dranger003 Thanks for spotting this - should be fixed now (269de86)
Thanks, it's working now. |
…g#5691)

* llama : refactor k-shift implementation ggml-ci
* llama : rename llama_kv_cache_seq_shift to llama_kv_cache_seq_add
* llama : cont k-shift refactoring + normalize type names ggml-ci
* minor : fix MPI builds
* llama : reuse n_rot from the build context ggml-ci
* llama : revert enum name changes from this PR ggml-ci
* llama : update llama_rope_type
* llama : add comment about rope values
* llama : fix build
* passkey : apply kv cache updates explicitly ggml-ci
* llama : change name to llama_kv_cache_update()
* llama : add llama_kv_cache_seq_pos_max()
* passkey : fix llama_kv_cache_seq_pos_max() usage
* llama : some llama_kv_cell simplifications
* llama : add llama_kv_cache_compress (EXPERIMENTAL)
* llama : add alternative KV cache merging (EXPERIMENTAL)
* llama : add llama_kv_cache_defrag
* llama : comments
* llama : remove llama_kv_cache_compress, will add in a separate PR ggml-ci
* llama : defragment via non-overlapping moves
* llama : ggml_graph based defrag implementation ggml-ci
* llama : switch the loop order in build_defrag
* llama : add comments
ref #3380
The goal is to separate the K-shift graph from the main LLM compute graph in order to be able to apply the shift on demand, instead of lazily through llama_decode(). This is a first step towards implementing more KV cache operations, such as defragmentation and compression.

API changes:

- rename llama_kv_cache_seq_shift() to llama_kv_cache_seq_add()
- add llama_kv_cache_seq_pos_max()
- add llama_kv_cache_defrag()
- add llama_kv_cache_update()

The defragmentation should work, though it is probably not as performant as it could be, since we are copying the KV cache data to host memory. However, since we expect to do this operation rarely, it could still be useful as it is. For now, using llama_kv_cache_defrag() is completely optional and is only demonstrated with the passkey example. Later I think we can calculate how many cells will be recovered from the defragmentation and start applying it automatically only in cases where a certain fragmentation threshold is exceeded.

The next steps for improving the KV cache management will be to simplify typical usages (e.g. context shift, self-extend) with a simpler API.
make -j && ./passkey ./models/llama-7b-v2/ggml-model-f16.gguf 250 2 90
make -j && ./passkey ./models/llama-7b-v2/ggml-model-f16.gguf 250 1 130
TODO:

- [ ] avoid node reallocations
- [ ] rename llama_kv_cache_seq_shift() to llama_kv_cache_seq_add()
- [ ] llama_kv_cache_update() works when called directly