Describe the bug
I have manually installed a model (Mistral Instruct 7B, v0.2) and given it a model.json file with a max context length of 32,768. In Jan's Engine Parameters, the "Context Length" slider's maximum is 32,768, as expected. However, when submitting a message, the maximum context length appears to be used regardless of the slider's actual value. On macOS, I can observe very high RAM usage in Activity Monitor and in Jan, even when I lower the context length via the slider to a much smaller value such as 4,096, which normally incurs far lower memory usage. Note: this issue wasn't present until the last 2 updates or so.
Steps to reproduce
Steps to reproduce the behavior:
1. Open Mistral 7B's model.json and set its ctx_len setting to 32768
2. Launch Jan, create a thread, and set its Context Length to 4096
3. Submit a message and observe memory usage indicative of a context length of 32,768
Expected behavior
The value of the Context Length slider should be used, instead of always falling back to the maximum context length defined in the model's model.json.
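To narrow down whether the regression is in Jan or in the engine, Nitro can be driven directly over its local HTTP API (port 3928, per the logs below). This is a minimal sketch only: the endpoint path and field names are my assumption from the Nitro docs, and the model path matches my install.

import requests  # sketch; assumes Nitro's llamacpp loadmodel endpoint

# Ask Nitro to load the model with a small context, bypassing Jan's UI.
r = requests.post(
    "http://127.0.0.1:3928/inferences/llamacpp/loadmodel",
    json={
        "llama_model_path": "/Users/ef/jan/models/mistral-instruct-7b-q4-v0.2/"
                            "mistral-7b-instruct-v0.2.Q4_K_M.gguf",
        "ctx_len": 4096,
    },
)
print(r.status_code, r.text)
# If the subsequent llama_new_context_with_model log line reports n_ctx = 4096,
# the engine honors ctx_len and the bug is in what Jan sends it.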
Screenshots
If applicable, add screenshots to help explain your issue.
Environment details
Operating System: macOS Sonoma 14.2.1
Jan Version: 0.4.3-143
Processor: Apple M2 Pro
RAM: 16GB
Logs
20240114 08:03:17.803817 UTC 160592 INFO Nitro version: - main.cc:44
20240114 08:03:17.804000 UTC 160592 INFO Server started, listening at: 127.0.0.1:3928 - main.cc:48
20240114 08:03:17.804000 UTC 160592 INFO Please load your model - main.cc:49
20240114 08:03:17.804001 UTC 160592 INFO Number of thread is:10 - main.cc:52
20240114 08:03:17.805000 UTC 160592 INFO Not found models folder, start server as usual - llamaCPP.h:2510
{"timestamp":1705219398,"level":"INFO","function":"loadModelImpl","line":478,"message":"system info","n_threads":10,"total_threads":10,"system_info":"AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | "}
2024-01-14T08:03:18.116Z [NITRO]::Error: llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /Users/ef/jan/models/mistral-instruct-7b-q4-v0.2/mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
2024-01-14T08:03:18.116Z [NITRO]::Error: llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 15
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
2024-01-14T08:03:18.123Z [NITRO]::Error: llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
2024-01-14T08:03:18.134Z [NITRO]::Error: llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
2024-01-14T08:03:18.135Z [NITRO]::Error: llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
2024-01-14T08:03:18.151Z [NITRO]::Error: llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
2024-01-14T08:03:18.151Z [NITRO]::Error: llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MiB
2024-01-14T08:03:18.207Z [NITRO]::Error: ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 4166.08 MiB, ( 4166.14 / 10922.67)
llm_load_tensors: system memory used = 4165.48 MiB
2024-01-14T08:03:18.208Z [NITRO]::Error: ...................
2024-01-14T08:03:18.208Z [NITRO]::Error: ..............................
2024-01-14T08:03:18.208Z [NITRO]::Error: ...................
2024-01-14T08:03:18.208Z [NITRO]::Error: ...........................
2024-01-14T08:03:18.209Z [NITRO]::Error: llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
2024-01-14T08:03:18.209Z [NITRO]::Error: ggml_metal_init: found device: Apple M2 Pro
2024-01-14T08:03:18.209Z [NITRO]::Error: ggml_metal_init: picking default device: Apple M2 Pro
2024-01-14T08:03:18.210Z [NITRO]::Error: ggml_metal_init: default.metallib not found, loading from source
2024-01-14T08:03:18.210Z [NITRO]::Error: ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/ef/jan/extensions/@janhq/inference-nitro-extension/dist/bin/mac-arm64/ggml-metal.metal'
2024-01-14T08:03:18.214Z [NITRO]::Error: ggml_metal_init: GPU name: Apple M2 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 11453.25 MB
ggml_metal_init: maxTransferRate = built-in GPU
2024-01-14T08:03:18.222Z [NITRO]::Error: ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 4096.00 MiB, ( 8263.70 / 10922.67)
2024-01-14T08:03:18.551Z [NITRO]::Error: llama_new_context_with_model: KV self size = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
2024-01-14T08:03:18.551Z [NITRO]::Error: ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 0.02 MiB, ( 8263.72 / 10922.67)
2024-01-14T08:03:18.551Z [NITRO]::Error: llama_build_graph: non-view tensors processed: 676/676
2024-01-14T08:03:18.552Z [NITRO]::Error: llama_new_context_with_model: compute buffer total size = 2139.19 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 2136.02 MiB, (10399.72 / 10922.67)
2024-01-14T08:03:19.271Z [NITRO]::Debug: [1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 590][ initialize] Available slots:
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 598][ initialize] -> Slot 0 - max context: 32768
2024-01-14T08:03:19.274Z [NITRO]::Debug: 20240114 08:03:19.266078 UTC 160596 INFO Started background task here! - llamaCPP.cc:487
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1538][ update_slots] all slots are idle and system prompt is empty, clear the KV cache
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 876][ launch_slot_with_data] slot 0 is processing [task id: 0]
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1728][ update_slots] slot 0 : kv cache rm - [0, end)
2024-01-14T08:03:19.548Z [NITRO]::Debug: [1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 472][ print_timings]
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 477][ print_timings] print_timings: prompt eval time = 180.72 ms / 2 tokens ( 90.36 ms per token, 11.07 tokens per second)
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 482][ print_timings] print_timings: eval time = 93.73 ms / 4 runs ( 23.43 ms per token, 42.68 tokens per second)
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 484][ print_timings] print_timings: total time = 274.45 ms
2024-01-14T08:03:19.548Z [NITRO]::Debug: [1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1591][ update_slots] slot 0 released (7 tokens in cache)
2024-01-14T08:03:19.670Z [NITRO]::Debug: 20240114 08:03:19.548062 UTC 160596 INFO {"content":" and welcome to my blog","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"min_p":0.05000000074505806,"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"/Users/ef/jan/models/mistral-instruct-7b-q4-v0.2/mistral-7b-instruct-v0.2.Q4_K_M.gguf","n_ctx":32768,"n_keep":0,"n_predict":2,"n_probs":0,"penalize_nl":true,"penalty_prompt_tokens":[],"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.100000023841858,"seed":4294967295,"stop":[],"stream":false,"temperature":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0,"use_penalty_prompt_tokens":false},"model":"/Users/ef/jan/models/mistral-instruct-7b-q4-v0.2/mistral-7b-instruct-v0.2.Q4_K_M.gguf","prompt":"Hello","slot_id":0,"stop":true,"stopped_eos":false,"stopped_limit":true,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":93.731,"predicted_n":4,"predicted_per_second":42.675315530614206,"predicted_per_token_ms":23.43275,"prompt_ms":180.722,"prompt_n":2,"prompt_per_second":11.06672126249156,"prompt_per_token_ms":90.361},"tokens_cached":6,"tokens_evaluated":2,"tokens_predicted":4,"truncated":false} - llamaCPP.cc:135
20240114 08:03:19.668054 UTC 160598 INFO Resolved request for task_id:1 - llamaCPP.cc:289
20240114 08:03:19.668203 UTC 160598 DEBUG [makeHeaderString] send stream with transfer-encoding chunked - HttpResponseImpl.cc:533
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 876][ launch_slot_with_data] slot 0 is processing [task id: 1]
2024-01-14T08:03:19.670Z [NITRO]::Debug: [1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1728][ update_slots] slot 0 : kv cache rm - [0, end)
2024-01-14T08:03:30.523Z [NITRO]::Debug: [1705219410] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 472][ print_timings]
[1705219410] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 477][ print_timings] print_timings: prompt eval time = 312.66 ms / 53 tokens ( 5.90 ms per token, 169.51 tokens per second)
[1705219410] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 482][ print_timings] print_timings: eval time = 10539.71 ms / 325 runs ( 32.43 ms per token, 30.84 tokens per second)
[1705219410] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 484][ print_timings] print_timings: total time = 10852.36 ms
2024-01-14T08:03:30.523Z [NITRO]::Debug: [1705219410] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1591][ update_slots] slot 0 released (379 tokens in cache)
2024-01-14T08:03:30.535Z [NITRO]::Debug: 20240114 08:03:30.522973 UTC 160598 INFO reached result stop - llamaCPP.cc:327
20240114 08:03:30.523761 UTC 160598 INFO Connection closed or buffer is null. Reset context - llamaCPP.cc:297
[1705219410] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1591][ update_slots] slot 0 released (379 tokens in cache)
2024-01-14T08:03:35.299Z [NITRO]::Debug: Request to kill Nitro
2024-01-14T08:03:35.333Z [NITRO]::Error: ggml_metal_free: deallocating
2024-01-14T08:03:35.398Z [NITRO]::Debug: 20240114 08:03:35.304921 UTC 160599 INFO Program is exitting, goodbye! - processManager.cc:8
20240114 08:03:35.305367 UTC 160599 INFO changed to false - llamaCPP.cc:538
20240114 08:03:35.310214 UTC 160646 INFO Background task stopped! - llamaCPP.cc:529
20240114 08:03:35.311003 UTC 160646 INFO KV cache cleared! - llamaCPP.cc:531
2024-01-14T08:03:35.402Z [NITRO]::Debug: Nitro process is terminated
2024-01-14T08:03:35.402Z [NITRO]::Debug: Nitro exited with code: 0
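For reference, the "KV self size = 4096.00 MiB" line above is exactly what an f16 KV cache at n_ctx = 32768 implies for this model, which suggests the full model.json context was allocated rather than the slider's 4,096. A quick back-of-the-envelope check, with figures taken from the llm_load_print_meta output above:

# n_ctx = 32768, n_layer = 32, n_embd_k_gqa = n_embd_v_gqa = 1024, f16 = 2 bytes
n_ctx, n_layer, n_embd_kv, bytes_f16 = 32768, 32, 1024, 2
kv_mib = 2 * n_ctx * n_layer * n_embd_kv * bytes_f16 / 1024**2  # K half + V half
print(kv_mib)                  # 4096.0 MiB -> matches the log, full context allocated
print(kv_mib * 4096 / n_ctx)   # 512.0 MiB -> what a 4096-token cache would need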
Additional context
Here is my custom model.json file:
{
  "source_url": "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
  "id": "mistral-instruct-7b-q4-v0.2",
  "object": "model",
  "name": "Mistral Instruct 7B Q4 v0.2",
  "version": "1.0",
  "description": "This is a 4-bit quantized iteration of MistralAI's Mistral Instruct 7b model, specifically designed for a comprehensive understanding through training on extensive internet data.",
  "format": "gguf",
  "settings": {
    "ctx_len": 32768,
    "prompt_template": "<s>[INST]{prompt}\n[/INST]"
  },
  "parameters": {
    "max_tokens": 4096
  },
  "metadata": {
    "author": "MistralAI, The Bloke",
    "tags": ["Featured", "7B", "Foundational Model"],
    "size": 4370000000,
    "cover": "https://raw.githubusercontent.com/janhq/jan/main/models/mistral-ins-7b-q4/cover.png"
  },
  "engine": "nitro"
}
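As a temporary workaround, lowering ctx_len in model.json itself should cap the allocation, assuming the engine keeps using the model.json value instead of the slider's. A small sketch (the path assumes my install location):

import json
from pathlib import Path

# Hypothetical workaround: pin ctx_len in model.json to the context size I
# actually want, until the slider value is honored again.
model_json = Path.home() / "jan/models/mistral-instruct-7b-q4-v0.2/model.json"
cfg = json.loads(model_json.read_text())
cfg["settings"]["ctx_len"] = 4096
model_json.write_text(json.dumps(cfg, indent=2))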