bug: Context Length slider's value is not considered when performing inference #1569

Closed
leitdeux opened this issue Jan 14, 2024 · 0 comments
Labels: P1: important Important feature / fix · type: bug Something isn't working
Milestone: v0.4.4
Describe the bug
I have a manually installed model (Mistral Instruct 7B, v0.2) and have given it a model.json file with a max context length (ctx_len) of 32,768. In Jan's Engine Parameters, the "Context Length" slider's max value is 32,768, as expected. However, when submitting a message, the maximum context length appears to always be used regardless of the slider's actual value. On macOS, I can observe very high RAM usage in Activity Monitor and in Jan even after lowering the slider to a much smaller value such as 4,096, which normally incurs far lower memory usage. Note: this issue wasn't present until the last 2 updates or so.
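For context on the memory numbers: llama.cpp's f16 KV cache grows linearly with n_ctx, which is why the slider value is directly visible in RAM usage. A minimal sketch of that arithmetic, plugging in the n_layer and n_embd_k/v_gqa values printed in the logs below (standard f16 KV-cache formula; treat it as an approximation):

```typescript
// Approximate f16 KV-cache size: n_ctx tokens × n_layer layers ×
// (K + V) embedding widths × 2 bytes per f16 element.
function kvCacheMiB(nCtx: number, nLayer = 32, nEmbdKGqa = 1024, nEmbdVGqa = 1024): number {
  const bytes = nCtx * nLayer * (nEmbdKGqa + nEmbdVGqa) * 2; // 2 bytes per f16
  return bytes / (1024 * 1024);
}

console.log(kvCacheMiB(32768)); // 4096 MiB -- matches "KV self size = 4096.00 MiB" in the logs
console.log(kvCacheMiB(4096));  //  512 MiB -- what a slider value of 4,096 should allocate
```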

Steps to reproduce
Steps to reproduce the behavior:

  1. Open Mistral 7B's model.json and set its ctx_len to 32768
  2. Launch Jan, create a thread, and set the thread's Context Length to 4096
  3. Submit a message, and observe memory usage indicative of a context length of 32,768

Expected behavior
The value of the Context Length slider should be used for inference, rather than always the maximum context length defined in the model's model.json.
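Concretely, the slider value should end up in the ctx_len field of the loadmodel request that Jan's inference-nitro-extension sends to Nitro. The endpoint and body fields below follow Nitro's documented /inferences/llamacpp/loadmodel API; the surrounding function is a hypothetical sketch, not Jan's actual code:

```typescript
// Hypothetical sketch: load the model with the context length chosen in the
// thread's Engine Parameters, not the maximum from model.json.
const NITRO_URL = "http://127.0.0.1:3928"; // port from the server log below

async function loadModelWithSliderValue(modelPath: string, sliderCtxLen: number): Promise<void> {
  const res = await fetch(`${NITRO_URL}/inferences/llamacpp/loadmodel`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      llama_model_path: modelPath,
      ctx_len: sliderCtxLen, // expected: 4096 from the slider, not 32768 from model.json
    }),
  });
  if (!res.ok) throw new Error(`loadmodel failed: ${res.status}`);
}
```

With ctx_len: 4096, the startup log should read "llama_new_context_with_model: n_ctx = 4096" rather than the 32768 seen below.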


Environment details

  • Operating System: macOS Sonoma 14.2.1
  • Jan Version: 0.4.3-143
  • Processor: Apple M2 Pro
  • RAM: 16GB

Logs

20240114 08:03:17.803817 UTC 160592 INFO  Nitro version:  - main.cc:44
20240114 08:03:17.804000 UTC 160592 INFO  Server started, listening at: 127.0.0.1:3928 - main.cc:48
20240114 08:03:17.804000 UTC 160592 INFO  Please load your model - main.cc:49
20240114 08:03:17.804001 UTC 160592 INFO  Number of thread is:10 - main.cc:52
20240114 08:03:17.805000 UTC 160592 INFO  Not found models folder, start server as usual - llamaCPP.h:2510
{"timestamp":1705219398,"level":"INFO","function":"loadModelImpl","line":478,"message":"system info","n_threads":10,"total_threads":10,"system_info":"AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | "}

2024-01-14T08:03:18.116Z [NITRO]::Error: llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /Users/ef/jan/models/mistral-instruct-7b-q4-v0.2/mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8

2024-01-14T08:03:18.116Z [NITRO]::Error: llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama

2024-01-14T08:03:18.123Z [NITRO]::Error: llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...

2024-01-14T08:03:18.134Z [NITRO]::Error: llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...

2024-01-14T08:03:18.135Z [NITRO]::Error: llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors

2024-01-14T08:03:18.151Z [NITRO]::Error: llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear

2024-01-14T08:03:18.151Z [NITRO]::Error: llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW) 
llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB

2024-01-14T08:03:18.207Z [NITRO]::Error: ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  4166.08 MiB, ( 4166.14 / 10922.67)
llm_load_tensors: system memory used  = 4165.48 MiB

2024-01-14T08:03:18.208Z [NITRO]::Error: ...................
2024-01-14T08:03:18.208Z [NITRO]::Error: ..............................
2024-01-14T08:03:18.208Z [NITRO]::Error: ...................
2024-01-14T08:03:18.208Z [NITRO]::Error: ...........................

2024-01-14T08:03:18.209Z [NITRO]::Error: llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating

2024-01-14T08:03:18.209Z [NITRO]::Error: ggml_metal_init: found device: Apple M2 Pro

2024-01-14T08:03:18.209Z [NITRO]::Error: ggml_metal_init: picking default device: Apple M2 Pro

2024-01-14T08:03:18.210Z [NITRO]::Error: ggml_metal_init: default.metallib not found, loading from source

2024-01-14T08:03:18.210Z [NITRO]::Error: ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/ef/jan/extensions/@janhq/inference-nitro-extension/dist/bin/mac-arm64/ggml-metal.metal'

2024-01-14T08:03:18.214Z [NITRO]::Error: ggml_metal_init: GPU name:   Apple M2 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
ggml_metal_init: maxTransferRate               = built-in GPU

2024-01-14T08:03:18.222Z [NITRO]::Error: ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =  4096.00 MiB, ( 8263.70 / 10922.67)

2024-01-14T08:03:18.551Z [NITRO]::Error: llama_new_context_with_model: KV self size  = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB

2024-01-14T08:03:18.551Z [NITRO]::Error: ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, ( 8263.72 / 10922.67)

2024-01-14T08:03:18.551Z [NITRO]::Error: llama_build_graph: non-view tensors processed: 676/676

2024-01-14T08:03:18.552Z [NITRO]::Error: llama_new_context_with_model: compute buffer total size = 2139.19 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =  2136.02 MiB, (10399.72 / 10922.67)

2024-01-14T08:03:19.271Z [NITRO]::Debug: [1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h:  590][              initialize] Available slots:
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h:  598][              initialize]  -> Slot 0 - max context: 32768

2024-01-14T08:03:19.274Z [NITRO]::Debug: 20240114 08:03:19.266078 UTC 160596 INFO  Started background task here! - llamaCPP.cc:487
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1538][            update_slots] all slots are idle and system prompt is empty, clear the KV cache
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h:  876][   launch_slot_with_data] slot 0 is processing [task id: 0]
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1728][            update_slots] slot 0 : kv cache rm - [0, end)

2024-01-14T08:03:19.548Z [NITRO]::Debug: [1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h:  472][           print_timings] 
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h:  477][           print_timings] print_timings: prompt eval time =     180.72 ms /     2 tokens (   90.36 ms per token,    11.07 tokens per second)
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h:  482][           print_timings] print_timings:        eval time =      93.73 ms /     4 runs   (   23.43 ms per token,    42.68 tokens per second)
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h:  484][           print_timings] print_timings:       total time =     274.45 ms

2024-01-14T08:03:19.548Z [NITRO]::Debug: [1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1591][            update_slots] slot 0 released (7 tokens in cache)

2024-01-14T08:03:19.670Z [NITRO]::Debug: 20240114 08:03:19.548062 UTC 160596 INFO  {"content":" and welcome to my blog","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"min_p":0.05000000074505806,"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"/Users/ef/jan/models/mistral-instruct-7b-q4-v0.2/mistral-7b-instruct-v0.2.Q4_K_M.gguf","n_ctx":32768,"n_keep":0,"n_predict":2,"n_probs":0,"penalize_nl":true,"penalty_prompt_tokens":[],"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.100000023841858,"seed":4294967295,"stop":[],"stream":false,"temperature":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0,"use_penalty_prompt_tokens":false},"model":"/Users/ef/jan/models/mistral-instruct-7b-q4-v0.2/mistral-7b-instruct-v0.2.Q4_K_M.gguf","prompt":"Hello","slot_id":0,"stop":true,"stopped_eos":false,"stopped_limit":true,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":93.731,"predicted_n":4,"predicted_per_second":42.675315530614206,"predicted_per_token_ms":23.43275,"prompt_ms":180.722,"prompt_n":2,"prompt_per_second":11.06672126249156,"prompt_per_token_ms":90.361},"tokens_cached":6,"tokens_evaluated":2,"tokens_predicted":4,"truncated":false} - llamaCPP.cc:135
20240114 08:03:19.668054 UTC 160598 INFO  Resolved request for task_id:1 - llamaCPP.cc:289
20240114 08:03:19.668203 UTC 160598 DEBUG [makeHeaderString] send stream with transfer-encoding chunked - HttpResponseImpl.cc:533
[1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h:  876][   launch_slot_with_data] slot 0 is processing [task id: 1]

2024-01-14T08:03:19.670Z [NITRO]::Debug: [1705219399] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1728][            update_slots] slot 0 : kv cache rm - [0, end)

2024-01-14T08:03:30.523Z [NITRO]::Debug: [1705219410] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h:  472][           print_timings] 
[1705219410] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h:  477][           print_timings] print_timings: prompt eval time =     312.66 ms /    53 tokens (    5.90 ms per token,   169.51 tokens per second)
[1705219410] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h:  482][           print_timings] print_timings:        eval time =   10539.71 ms /   325 runs   (   32.43 ms per token,    30.84 tokens per second)
[1705219410] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h:  484][           print_timings] print_timings:       total time =   10852.36 ms

2024-01-14T08:03:30.523Z [NITRO]::Debug: [1705219410] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1591][            update_slots] slot 0 released (379 tokens in cache)

2024-01-14T08:03:30.535Z [NITRO]::Debug: 20240114 08:03:30.522973 UTC 160598 INFO  reached result stop - llamaCPP.cc:327
20240114 08:03:30.523761 UTC 160598 INFO  Connection closed or buffer is null. Reset context - llamaCPP.cc:297
[1705219410] [/Users/jan/actions-runner/_work/nitro/nitro/controllers/llamaCPP.h: 1591][            update_slots] slot 0 released (379 tokens in cache)

2024-01-14T08:03:35.299Z [NITRO]::Debug: Request to kill Nitro
2024-01-14T08:03:35.333Z [NITRO]::Error: ggml_metal_free: deallocating

2024-01-14T08:03:35.398Z [NITRO]::Debug: 20240114 08:03:35.304921 UTC 160599 INFO  Program is exitting, goodbye! - processManager.cc:8
20240114 08:03:35.305367 UTC 160599 INFO  changed to false - llamaCPP.cc:538
20240114 08:03:35.310214 UTC 160646 INFO  Background task stopped!  - llamaCPP.cc:529
20240114 08:03:35.311003 UTC 160646 INFO  KV cache cleared! - llamaCPP.cc:531

2024-01-14T08:03:35.402Z [NITRO]::Debug: Nitro process is terminated
2024-01-14T08:03:35.402Z [NITRO]::Debug: Nitro exited with code: 0

Additional context
Here is my custom model.json file:

```json
{
  "source_url": "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
  "id": "mistral-instruct-7b-q4-v0.2",
  "object": "model",
  "name": "Mistral Instruct 7B Q4 v0.2",
  "version": "1.0",
  "description": "This is a 4-bit quantized iteration of MistralAI's Mistral Instruct 7b model, specifically designed for a comprehensive understanding through training on extensive internet data.",
  "format": "gguf",
  "settings": {
    "ctx_len": 32768,
    "prompt_template": "<s>[INST]{prompt}\n[/INST]"
  },
  "parameters": {
    "max_tokens": 4096
  },
  "metadata": {
    "author": "MistralAI, The Bloke",
    "tags": ["Featured", "7B", "Foundational Model"],
    "size": 4370000000,
    "cover": "https://raw.githubusercontent.com/janhq/jan/main/models/mistral-ins-7b-q4/cover.png"
  },
  "engine": "nitro"
}
```
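The ctx_len of 32,768 here is intentionally the model's maximum, so the slider's upper bound is correct; the per-thread slider value is what should override it when the model is loaded. A hypothetical sketch of that expected precedence (resolveSettings and the parameter names are illustrative assumptions, not Jan's internals):

```typescript
// Hypothetical sketch of the expected precedence: per-thread engine
// parameters (the slider) override the defaults from model.json.
interface EngineSettings {
  ctx_len: number;
  prompt_template?: string;
}

function resolveSettings(
  modelJsonSettings: EngineSettings,        // defaults, e.g. { ctx_len: 32768, ... }
  threadOverrides: Partial<EngineSettings>, // e.g. { ctx_len: 4096 } from the slider
): EngineSettings {
  // Thread-level values win; model.json only supplies defaults and upper bounds.
  return { ...modelJsonSettings, ...threadOverrides };
}

const effective = resolveSettings(
  { ctx_len: 32768, prompt_template: "<s>[INST]{prompt}\n[/INST]" },
  { ctx_len: 4096 },
);
console.log(effective.ctx_len); // 4096 -- the value that should reach Nitro's loadmodel
```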