llama : fix defrag logic (ggerganov#11707)
* llama : fix defrag logic

ggml-ci

* cont : better logic

ggml-ci

* cont : clamp fragmentation to 0.0

ggml-ci
ggerganov authored Feb 7, 2025 · 1 parent 2d219b3 · commit ed926d8
Showing 1 changed file with 5 additions and 3 deletions.
src/llama.cpp:

```diff
@@ -8801,12 +8801,14 @@ static int llama_decode_impl(
     //llama_synchronize(&lctx);
 
     // decide if we need to defrag the kv cache
-    if (cparams.causal_attn && cparams.defrag_thold >= 0.0f) {
-        const float fragmentation = kv_self.n >= 128 ? 1.0f - float(kv_self.used)/float(kv_self.n) : 0.0f;
+    if (cparams.causal_attn && cparams.defrag_thold > 0.0f) {
+        // - do not defrag small contexts (i.e. < 2048 tokens)
+        // - count the padding towards the number of used tokens
+        const float fragmentation = kv_self.n >= 2048 ? std::max(0.0f, 1.0f - float(kv_self.used + llama_kv_cache_get_padding(cparams))/float(kv_self.n)) : 0.0f;
 
         // queue defragmentation for next llama_kv_cache_update
         if (fragmentation > cparams.defrag_thold) {
-            //LLAMA_LOG_INFO("fragmentation: %.2f\n", fragmentation);
+            LLAMA_LOG_DEBUG("%s: fragmentation: %.2f - requesting defrag\n", __func__, fragmentation);
 
             llama_kv_cache_defrag(kv_self);
         }
```
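For intuition, here is a minimal standalone sketch of the new heuristic; it is an illustration, not the llama.cpp source. The `fragmentation` helper mirrors the expression in the diff, while the cache size `n`, the used-cell count `used`, the `padding` value, and the 0.1 threshold are hypothetical numbers chosen to show the three behaviors described in the commit message.

```cpp
// Sketch of the updated defrag heuristic; all values below are hypothetical.
#include <algorithm>
#include <cstdio>

// Mirrors the expression in the diff: skip small contexts (< 2048 cells),
// count padding towards the used cells, and clamp the result at 0.0.
static float fragmentation(int n, int used, int padding) {
    return n >= 2048 ? std::max(0.0f, 1.0f - float(used + padding)/float(n)) : 0.0f;
}

int main() {
    const float defrag_thold = 0.1f; // hypothetical stand-in for cparams.defrag_thold

    const int cases[][3] = { // {n, used, padding}
        {1024,  100, 32}, // small context (< 2048): forced to 0.0, never defragged
        {4096, 3500, 32}, // ~0.14: above the threshold, defrag would be queued
        {4096, 4090, 32}, // used + padding > n: clamped to 0.0 instead of going negative
    };

    for (const auto & c : cases) {
        const float f = fragmentation(c[0], c[1], c[2]);
        printf("n=%d used=%d pad=%d -> fragmentation=%.2f%s\n",
               c[0], c[1], c[2], f, f > defrag_thold ? " (defrag)" : "");
    }
    return 0;
}
```

The clamp matters because `used + padding` can exceed `n`, which under the old expression would have produced a negative fragmentation value. The stricter `> 0.0f` check on the threshold also means a `defrag_thold` of exactly 0.0 now disables defragmentation entirely instead of requesting it for any nonzero fragmentation.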
