
iOS Exception: Could not load model at ... #44

Open
LukeMoody01 opened this issue Nov 26, 2024 · 12 comments

@LukeMoody01
Hi there,

First off, thanks for the hard work creating this package.

I am having some issues getting the package to run on iOS. I am loading both dynamic libraries, libggml.dylib and libllama.dylib, like so:

    Llama.libraryPath = "libllama.dylib";
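
For completeness, the same assignment can be made platform-aware. This is only a sketch; the .so name for other platforms is my assumption, not something from the package docs:

    import 'dart:io' show Platform;

    // Sketch: pick the library file per platform. On iOS the dylib
    // must actually be bundled with the app for it to resolve.
    Llama.libraryPath = Platform.isIOS || Platform.isMacOS
        ? 'libllama.dylib'
        : 'libllama.so';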

The issue comes when trying to load a model like so:

    final model = 'ggml-vocab-gpt-2.gguf';

    final directory = await getApplicationDocumentsDirectory();
    final filePath = '${directory.path}/$model';

    // Copy the bundled asset out to the documents directory once,
    // so the native code gets a real file path to load from.
    final fileExists = await File(filePath).exists();
    if (!fileExists) {
      final byteData = await rootBundle.load('assets/ai/$model');
      final file = File(filePath);
      await file.writeAsBytes(byteData.buffer
          .asUint8List(byteData.offsetInBytes, byteData.lengthInBytes));
    }

    Llama llama = Llama(
      filePath,
      modelParams,
      contextParams,
      samplerParams,
    );

Even using a raw path:

    Llama llama = Llama(
      '/Users/Luke/Workspace/llama.cpp/models/tinyllama-2-1b-miniguanaco.Q3_K_L.gguf',
      modelParams,
      contextParams,
      samplerParams,
    );

also does not work.
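
One cheap sanity check before constructing Llama is to verify that the copied file starts with the four-byte GGUF magic ("GGUF"), which rules out a truncated or mangled asset copy. A minimal sketch (the helper name is mine, not part of the package):

    import 'dart:io';

    // Returns true if the file at [path] begins with the ASCII magic
    // "GGUF" that every GGUF file carries in its first four bytes.
    Future<bool> looksLikeGguf(String path) async {
      final raf = await File(path).open();
      try {
        final magic = await raf.read(4);
        return String.fromCharCodes(magic) == 'GGUF';
      } finally {
        await raf.close();
      }
    }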

We are using the latest dev branch (commit hash 231a3e8).

Any help or guidance here would be greatly appreciated.

Error:

Could not load model XYZ
flutter: Error: LateInitializationError: Field 'context' has not been initialized.

@LukeMoody01
Author

Upon further debugging, here is the lovely Xcode output:

llama_load_model_from_file: using device Metal (Apple A15 GPU) - 2727 MiB free
llama_model_loader: loaded meta data with 16 key-value pairs and 0 tensors from /var/mobile/Containers/Data/Application/C44C7A73-3C3E-4778-953F-B3F8412A71BF/Documents/ggml-vocab-gpt-2.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gpt2
llama_model_loader: - kv 1: general.name str = gpt-2
llama_model_loader: - kv 2: gpt2.block_count u32 = 12
llama_model_loader: - kv 3: gpt2.context_length u32 = 1024
llama_model_loader: - kv 4: gpt2.embedding_length u32 = 768
llama_model_loader: - kv 5: gpt2.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: gpt2.attention.head_count u32 = 12
llama_model_loader: - kv 7: gpt2.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 8: general.file_type u32 = 1
llama_model_loader: - kv 9: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 10: tokenizer.ggml.pre str = gpt-2
llama_model_loader: - kv 11: tokenizer.ggml.tokens arr[str,50257] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 12: tokenizer.ggml.token_type arr[i32,50257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 13: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv 14: tokenizer.ggml.bos_token_id u32 = 50256
llama_model_loader: - kv 15: tokenizer.ggml.eos_token_id u32 = 50256
llm_load_vocab: special tokens cache size = 1
llm_load_vocab: token to piece cache size = 0.3060 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = gpt2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 50257
llm_load_print_meta: n_merges = 50000
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 1024
llm_load_print_meta: n_embd = 768
llm_load_print_meta: n_layer = 12
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 12
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 768
llm_load_print_meta: n_embd_v_gqa = 768
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 3072
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = -1
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 1024
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 0.1B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 0.00 K
llm_load_print_meta: model size = 0.00 MiB (nan BPW)
llm_load_print_meta: general.name = gpt-2
llm_load_print_meta: BOS token = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token = 50256 '<|endoftext|>'
llm_load_print_meta: EOT token = 50256 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 50256 '<|endoftext|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.03 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'token_embd.weight' not found
llama_load_model_from_file: failed to load model

@LukeMoody01
Author

After even more investigation, it seems the file I was using was never a full model: ggml-vocab-gpt-2.gguf is one of llama.cpp's vocab-only tokenizer test files, which matches the "0 tensors" in the log above. Using a real model, I now get the following error:

/llama.cpp/src/llama-sampling.cpp:279: GGML_ASSERT(cur_p.selected >= 0 && cur_p.selected < (int32_t) cur_p.size) failed

Any ideas on this?

Cheers!

@LukeMoody01
Author

Further update: I got past the above GGML_ASSERT by modifying the model params and context params.
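
For anyone hitting the same assert, the kind of change involved was along these lines; the field names below are illustrative, so check the package's ContextParams definition for the real ones:

    // Illustrative field names, not necessarily the package's exact API.
    final contextParams = ContextParams()
      ..nCtx = 2048    // give the prompt enough context
      ..nBatch = 512;  // keep the batch size within the context size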

I have hit another wall when it comes to encoding:

llama.cpp/src/llama.cpp:15342: GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") failed

Any ideas here?

@netdur
Owner

netdur commented Nov 26, 2024

@LukeMoody01 I could run tinyllama-2-1b-miniguanaco.Q3_K_L.gguf with the scripts in the example folder, both simple.dart and chat.dart. Could it be that your prompt is larger than the context? I will try to investigate.
Also, thank you for trying the dev branch.

@netdur
Owner

netdur commented Nov 26, 2024

@LukeMoody01 please try again

@LukeMoody01
Author

Hey @netdur,

I will give it a go using the model you just mentioned! I was using a flan-T5 model, which is an encoder-decoder architecture, so that may explain the llama_encode() assert above.

For reference, I am on iOS.

I'll get back to you soon.

@LukeMoody01
Author

Alright, so it "works".

The AI likes to cut off its response very early, but I feel like that could be a config issue on my end. How do you usually allow the A.I to have lengthier responses? @netdur
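
For context, llama.cpp itself has an n_predict knob where -1 means "generate until EOS"; I'm guessing the fix is the package equivalent of something like this (field names illustrative):

    // Illustrative field names -- check the package's param classes.
    samplerParams.nPredict = -1; // llama.cpp convention: -1 = until EOS
    contextParams.nCtx = 4096;   // longer replies also need context room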

@netdur
Owner

netdur commented Nov 27, 2024 via email

@LukeMoody01
Author

That'd be awesome. Great work @netdur 😄

@LukeMoody01
Author

Can I also ask: where do you find the models you test with? Some of the models I find on Hugging Face throw errors such as the ones above, as well as "GGML_ASSERT(strcmp(res->name, "result_output") == 0 && "missing result_output tensor")".

@netdur
Owner

netdur commented Nov 29, 2024

@LukeMoody01 I am currently testing on iOS. How do you build llama.cpp for iOS?
