Describe the bug
I have manually installed a model (Mistral Instruct 7B, v0.2) and given it a model.json file with a max context length of 32,768. In Jan's Engine Parameters, the "Context Length" slider's maximum is 32,768, as expected. However, when submitting a message, the maximum context length appears to be used regardless of the slider's actual value. On macOS, I can observe very high RAM usage in Activity Monitor and in Jan, even when I lower the context length via the slider to a much smaller value such as 4,096, which normally incurs far lower memory usage. Note: this issue wasn't present until the last 2 updates or so.
Steps to reproduce
Steps to reproduce the behavior:
1. Open Mistral 7B's model.json and set its ctx_len setting to 32768
2. Launch Jan, create a thread, and set its Context Length to 4096
3. Submit a message and observe memory usage indicative of a context length of 32,768
Expected behavior
The value of the Context Length slider should be used, instead of always falling back to the maximum context length defined in the model's model.json.
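To narrow down whether the regression is in Jan or in the engine, Nitro can be driven directly over its local HTTP API (port 3928, per the logs below). This is a minimal sketch only: the endpoint path and field names are my assumption from the Nitro docs, and the model path matches my install.

import requests  # sketch; assumes Nitro's llamacpp loadmodel endpoint

# Ask Nitro to load the model with a small context, bypassing Jan's UI.
r = requests.post(
    "http://127.0.0.1:3928/inferences/llamacpp/loadmodel",
    json={
        "llama_model_path": "/Users/ef/jan/models/mistral-instruct-7b-q4-v0.2/"
                            "mistral-7b-instruct-v0.2.Q4_K_M.gguf",
        "ctx_len": 4096,
    },
)
print(r.status_code, r.text)
# If the subsequent llama_new_context_with_model log line reports n_ctx = 4096,
# the engine honors ctx_len and the bug is in what Jan sends it.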
Screenshots
If applicable, add screenshots to help explain your issue.
Environment details
Operating System: macOS Sonoma 14.2.1
Jan Version: 0.4.3-143
Processor: Apple M2 Pro
RAM: 16GB
Logs
20240114 08:03:17.803817 UTC 160592 INFO Nitro version: - main.cc:44
20240114 08:03:17.804000 UTC 160592 INFO Server started, listening at: 127.0.0.1:3928 - main.cc:48
20240114 08:03:17.804000 UTC 160592 INFO Please load your model - main.cc:49
20240114 08:03:17.804001 UTC 160592 INFO Number of thread is:10 - main.cc:52
20240114 08:03:17.805000 UTC 160592 INFO Not found models folder, start server as usual - llamaCPP.h:2510
{"timestamp":1705219398,"level":"INFO","function":"loadModelImpl","line":478,"message":"system info","n_threads":10,"total_threads":10,"system_info":"AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | "}
2024-01-14T08:03:18.116Z [NITRO]::Error: llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /Users/ef/jan/models/mistral-instruct-7b-q4-v0.2/mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
2024-01-14T08:03:18.116Z [NITRO]::Error: llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 11: general.file_type u32 = 15
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
2024-01-14T08:03:18.123Z [NITRO]::Error: llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
2024-01-14T08:03:18.134Z [NITRO]::Error: llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
2024-01-14T08:03:18.135Z [NITRO]::Error: llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
2024-01-14T08:03:18.151Z [NITRO]::Error: llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
2024-01-14T08:03:18.151Z [NITRO]::Error: llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MiB
2024-01-14T08:03:18.207Z [NITRO]::Error: ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 4166.08 MiB, ( 4166.14 / 10922.67)
llm_load_tensors: system memory used = 4165.48 MiB
2024-01-14T08:03:18.208Z [NITRO]::Error: ...................
2024-01-14T08:03:18.208Z [NITRO]::Error: ..............................
2024-01-14T08:03:18.208Z [NITRO]::Error: ...................
2024-01-14T08:03:18.208Z [NITRO]::Error: ...........................
2024-01-14T08:03:18.209Z [NITRO]::Error: llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
2024-01-14T08:03:18.209Z [NITRO]::Error: ggml_metal_init: found device: Apple M2 Pro
2024-01-14T08:03:18.209Z [NITRO]::Error: ggml_metal_init: picking default device: Apple M2 Pro
2024-01-14T08:03:18.210Z [NITRO]::Error: ggml_metal_init: default.metallib not found, loading from source
2024-01-14T08:03:18.210Z [NITRO]::Error: ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/ef/jan/extensions/@janhq/inference-nitro-extension/dist/bin/mac-arm64/ggml-metal.metal'
2024-01-14T08:03:18.214Z [NITRO]::Error: ggml_metal_init: GPU name: Apple M2 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 11453.25 MB
ggml_metal_init: maxTransferRate = built-in GPU
2024-01-14T08:03:18.222Z [NITRO]::Error: ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 4096.00 MiB, ( 8263.70 / 10922.67)
2024-01-14T08:03:18.551Z [NITRO]::Error: llama_new_context_with_model: KV self size = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
2024-01-14T08:03:18.551Z [NITRO]::Error: ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 0.02 MiB, ( 8263.72 / 10922.67)
2024-01-14T08:03:18.551Z [NITRO]::Error: llama_build_graph: non-view tensors processed: 676/676
2024-01-14T08:03:18.552Z [NITRO]::Error: llama_new_context_with_model: compute buffer total size = 2139.19 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 2136.02 MiB, (10399.72 / 10922.67)
2024-01-14T08:03:19.271Z [NITRO]::Debug: [1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 590][ initialize] Available slots:
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 598][ initialize] -> Slot 0 - max context: 32768
2024-01-14T08:03:19.274Z [NITRO]::Debug: 20240114 08:03:19.266078 UTC 160596 INFO Started background task here! - llamaCPP.cc:487
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1538][ update_slots] all slots are idle and system prompt is empty, clear the KV cache
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 876][ launch_slot_with_data] slot 0 is processing [task id: 0]
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1728][ update_slots] slot 0 : kv cache rm - [0, end)
2024-01-14T08:03:19.548Z [NITRO]::Debug: [1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 472][ print_timings]
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 477][ print_timings] print_timings: prompt eval time = 180.72 ms / 2 tokens ( 90.36 ms per token, 11.07 tokens per second)
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 482][ print_timings] print_timings: eval time = 93.73 ms / 4 runs ( 23.43 ms per token, 42.68 tokens per second)
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 484][ print_timings] print_timings: total time = 274.45 ms
2024-01-14T08:03:19.548Z [NITRO]::Debug: [1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1591][ update_slots] slot 0 released (7 tokens in cache)
2024-01-14T08:03:19.670Z [NITRO]::Debug: 20240114 08:03:19.548062 UTC 160596 INFO {"content":" and welcome to my blog","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"min_p":0.05000000074505806,"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"/Users/ef/jan/models/mistral-instruct-7b-q4-v0.2/mistral-7b-instruct-v0.2.Q4_K_M.gguf","n_ctx":32768,"n_keep":0,"n_predict":2,"n_probs":0,"penalize_nl":true,"penalty_prompt_tokens":[],"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.100000023841858,"seed":4294967295,"stop":[],"stream":false,"temperature":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0,"use_penalty_prompt_tokens":false},"model":"/Users/ef/jan/models/mistral-instruct-7b-q4-v0.2/mistral-7b-instruct-v0.2.Q4_K_M.gguf","prompt":"Hello","slot_id":0,"stop":true,"stopped_eos":false,"stopped_limit":true,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":93.731,"predicted_n":4,"predicted_per_second":42.675315530614206,"predicted_per_token_ms":23.43275,"prompt_ms":180.722,"prompt_n":2,"prompt_per_second":11.06672126249156,"prompt_per_token_ms":90.361},"tokens_cached":6,"tokens_evaluated":2,"tokens_predicted":4,"truncated":false} - llamaCPP.cc:135
20240114 08:03:19.668054 UTC 160598 INFO Resolved request for task_id:1 - llamaCPP.cc:289
20240114 08:03:19.668203 UTC 160598 DEBUG [makeHeaderString] send stream with transfer-encoding chunked - HttpResponseImpl.cc:533
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 876][ launch_slot_with_data] slot 0 is processing [task id: 1]
2024-01-14T08:03:19.670Z [NITRO]::Debug: [1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1728][ update_slots] slot 0 : kv cache rm - [0, end)
2024-01-14T08:03:30.523Z [NITRO]::Debug: [1705219410] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 472][ print_timings]
[1705219410] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 477][ print_timings] print_timings: prompt eval time = 312.66 ms / 53 tokens ( 5.90 ms per token, 169.51 tokens per second)
[1705219410] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 482][ print_timings] print_timings: eval time = 10539.71 ms / 325 runs ( 32.43 ms per token, 30.84 tokens per second)
[1705219410] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 484][ print_timings] print_timings: total time = 10852.36 ms
2024-01-14T08:03:30.523Z [NITRO]::Debug: [1705219410] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1591][ update_slots] slot 0 released (379 tokens in cache)
2024-01-14T08:03:30.535Z [NITRO]::Debug: 20240114 08:03:30.522973 UTC 160598 INFO reached result stop - llamaCPP.cc:327
20240114 08:03:30.523761 UTC 160598 INFO Connection closed or buffer is null. Reset context - llamaCPP.cc:297
[1705219410] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1591][ update_slots] slot 0 released (379 tokens in cache)
2024-01-14T08:03:35.299Z [NITRO]::Debug: Request to kill Nitro
2024-01-14T08:03:35.333Z [NITRO]::Error: ggml_metal_free: deallocating
2024-01-14T08:03:35.398Z [NITRO]::Debug: 20240114 08:03:35.304921 UTC 160599 INFO Program is exitting, goodbye! - processManager.cc:8
20240114 08:03:35.305367 UTC 160599 INFO changed to false - llamaCPP.cc:538
20240114 08:03:35.310214 UTC 160646 INFO Background task stopped! - llamaCPP.cc:529
20240114 08:03:35.311003 UTC 160646 INFO KV cache cleared! - llamaCPP.cc:531
2024-01-14T08:03:35.402Z [NITRO]::Debug: Nitro process is terminated
2024-01-14T08:03:35.402Z [NITRO]::Debug: Nitro exited with code: 0
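For reference, the "KV self size = 4096.00 MiB" line above is exactly what an f16 KV cache at n_ctx = 32768 implies for this model, which suggests the full model.json context was allocated rather than the slider's 4,096. A quick back-of-the-envelope check, with figures taken from the llm_load_print_meta output above:

# n_ctx = 32768, n_layer = 32, n_embd_k_gqa = n_embd_v_gqa = 1024, f16 = 2 bytes
n_ctx, n_layer, n_embd_kv, bytes_f16 = 32768, 32, 1024, 2
kv_mib = 2 * n_ctx * n_layer * n_embd_kv * bytes_f16 / 1024**2  # K half + V half
print(kv_mib)                  # 4096.0 MiB -> matches the log, full context allocated
print(kv_mib * 4096 / n_ctx)   # 512.0 MiB -> what a 4096-token cache would need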
Additional context
Here is my custom model.json file:
{
  "source_url": "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
  "id": "mistral-instruct-7b-q4-v0.2",
  "object": "model",
  "name": "Mistral Instruct 7B Q4 v0.2",
  "version": "1.0",
  "description": "This is a 4-bit quantized iteration of MistralAI's Mistral Instruct 7b model, specifically designed for a comprehensive understanding through training on extensive internet data.",
  "format": "gguf",
  "settings": {
    "ctx_len": 32768,
    "prompt_template": "<s>[INST]{prompt}\n[/INST]"
  },
  "parameters": {
    "max_tokens": 4096
  },
  "metadata": {
    "author": "MistralAI, The Bloke",
    "tags": ["Featured", "7B", "Foundational Model"],
    "size": 4370000000,
    "cover": "https://raw.githubusercontent.com/janhq/jan/main/models/mistral-ins-7b-q4/cover.png"
  },
  "engine": "nitro"
}
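As a temporary workaround, lowering ctx_len in model.json itself should cap the allocation, assuming the engine keeps using the model.json value instead of the slider's. A small sketch (the path assumes my install location):

import json
from pathlib import Path

# Hypothetical workaround: pin ctx_len in model.json to the context size I
# actually want, until the slider value is honored again.
model_json = Path.home() / "jan/models/mistral-instruct-7b-q4-v0.2/model.json"
cfg = json.loads(model_json.read_text())
cfg["settings"]["ctx_len"] = 4096
model_json.write_text(json.dumps(cfg, indent=2))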