IMPORTANT: Introduce C-style API - Major Refactoring #370
Conversation
Our lord and saviour @ggerganov has answered our prayers! 😄
This is exactly what we were looking for! While I was talking the talk in various PRs and discussions, you were already walking the walk. Kudos!
// TODO: show sample usage
//

struct llama_context;
It would be really cool if the memory in the llama context and the actual model parameters were in separate structs, so you could initialize a model from a file, then create multiple contexts from it and inference them in parallel. this would be useful if you want to have multiple prompts going at the same time so they can all share the model weights, but have their own k/v memory
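A rough sketch of what such a split could look like (all of these names are illustrative only; none of them exist in this PR):

```cpp
// Hypothetical split: weights live in an immutable llama_model that many
// llama_context instances can share, each with its own k/v memory.
struct llama_model;    // model weights, loaded once and shared
struct llama_context;  // per-prompt state: its own k/v cache and logits

// Load the weights from disk a single time (illustrative signature).
struct llama_model * llama_load_model_from_file(const char * path_model);

// Create any number of contexts that reference the shared weights and can
// be evaluated in parallel, one prompt per context.
struct llama_context * llama_new_context_with_model(struct llama_model * model);
```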
I think it would also be useful if the thread safety of each function was mentioned. For example, can I call llama_eval() and llama_tokenize() concurrently?
Yes, this will be added via llama_state, similar to the way it is implemented in whisper.cpp.
I didn't include it in the first pass of the implementation in order to deal with a smaller change. Will probably add this on Thursday with a second pass. It will not change the existing API - it will simply extend it with new per-state calls.
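For reference, a sketch of how such per-state calls might look, modeled loosely on the whisper.cpp approach (none of these names exist in this PR yet; they are purely illustrative):

```cpp
// Hypothetical per-state extension: the existing calls keep working on the
// context's default state, while new *_with_state variants let each thread
// use its own k/v memory and logits.
struct llama_state;

struct llama_state * llama_state_init(struct llama_context * ctx);
void                 llama_state_free(struct llama_state * state);

int llama_eval_with_state(struct llama_context * ctx,
                          struct llama_state   * state,
                          const llama_token    * tokens,
                          int                    n_tokens,
                          int                    n_past,
                          int                    n_threads);
```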
for (int i = 0; i < nb; i++) {
    float min = FLT_MAX;
    float max = -FLT_MAX;
This went from
float max = std::numeric_limits<float>::min();
to
float max = -FLT_MAX;
Can't really argue against it, but mentioning it because it's easy to lose these details...
Exactly as it should be 👍
Is FLT_MAX standard C++? Edit: Yes, it comes from <cfloat> (float.h in C).
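For context on why the change matters: std::numeric_limits<float>::min() is the smallest positive normalized float, not the most negative value, so it is the wrong initializer for a running maximum. A small standalone check:

```cpp
#include <cfloat>   // FLT_MAX
#include <cstdio>
#include <limits>

int main() {
    // min() is a tiny *positive* number (~1.18e-38), not a very negative one.
    printf("numeric_limits<float>::min()    = %g\n", std::numeric_limits<float>::min());
    // lowest() and -FLT_MAX are the actual most negative finite float.
    printf("numeric_limits<float>::lowest() = %g\n", std::numeric_limits<float>::lowest());
    printf("-FLT_MAX                        = %g\n", -FLT_MAX);

    // A running max initialized with min() would incorrectly report ~1e-38
    // if all input values are negative; -FLT_MAX (or lowest()) is safe.
    return 0;
}
```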
LLAMA_API llama_token llama_token_bos();
LLAMA_API llama_token llama_token_eos();
Suggestion: expand the BOS and EOS acronyms here
BOS/EOS abbreviations are used by SentencePiece itself (and in other tokenizers). I think the acronyms are probably fine in context here.
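If the acronyms are kept, a one-line comment in the header would still help newcomers; something along these lines (wording is just a suggestion):

```cpp
LLAMA_API llama_token llama_token_bos(); // BOS: beginning-of-sentence token id
LLAMA_API llama_token llama_token_eos(); // EOS: end-of-sentence token id
```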
auto lparams = llama_context_default_params();

lparams.f16_kv     = params.memory_f16;
lparams.logits_all = params.perplexity;
Missing parameters:

    lparams.n_parts = params.n_parts;
    lparams.n_ctx   = params.n_ctx;
They are initialized with defaults from the llama_context_default_params() call above.
Not copying these values from params breaks their respective command line switches (--n_parts and -c).
Oops - missed that!
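For clarity, the fix being discussed is simply to copy the two values explicitly before creating the context (field names taken from the snippets above):

```cpp
auto lparams = llama_context_default_params();

// Without the first two lines, the --n_parts and -c (context size) switches
// are silently ignored, because the defaults from
// llama_context_default_params() are kept instead.
lparams.n_ctx      = params.n_ctx;
lparams.n_parts    = params.n_parts;
lparams.f16_kv     = params.memory_f16;
lparams.logits_all = params.perplexity;
```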
alpaca 30B q4_0 opinion:
Trying to run the --perplexity option:

    llama_tokenize: too many tokens
    perplexity : calculating perplexity over 0 chunks

    $ wc -m wikitext-2-raw/wiki.test.raw
    1288556 wikitext-2-raw/wiki.test.raw
main.cpp (outdated):
    return s.c_str();
}

int main(int argc, char ** argv) {
    ggml_time_init();
    const int64_t t_main_start_us = ggml_time_us();
The t_main_start_us variable is now unused, and I think it needs to be replaced with a call to llama_reset_timings(ctx).
Actually, ggml_time_init() too. That one moved to llama_init_from_file(), so llama_reset_timings(ctx) should likely be there too.
t_sample_us and t_predict_us in lines 229-230 are also unused now.
Should be resolved now
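For reference, a sketch of how the start of main() could look after that cleanup. This assumes ggml_time_init() is now called inside llama_init_from_file(), as noted above, and that the gpt_params helpers and the llama_init_from_file() signature match the ones used elsewhere in this PR; treat the details as illustrative:

```cpp
int main(int argc, char ** argv) {
    gpt_params params; // command-line parameters, parsed by the existing helper
    if (!gpt_params_parse(argc, argv, params)) {
        return 1;
    }

    auto lparams = llama_context_default_params();
    // ... copy fields from params into lparams as shown earlier ...

    // llama_init_from_file() is expected to call ggml_time_init() itself,
    // so main() no longer needs its own t_main_start_us bookkeeping.
    llama_context * ctx = llama_init_from_file(params.model.c_str(), lparams);
    if (ctx == NULL) {
        return 1;
    }

    // Reset the timing counters right before the real work starts.
    llama_reset_timings(ctx);

    // ... tokenize, eval, sample ...
}
```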
        int n_tokens,
        int n_past,
        int n_threads) {
    if (!llama_eval_internal(*ctx, tokens, n_tokens, n_past, n_threads)) {
Evaluation timings are broken. ctx->n_eval should be increased here and this function call should be timed and added to ctx->t_eval_us.
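Concretely, the suggestion amounts to something like the following inside llama_eval(). The parameter list is completed from the diff quoted above, the first two parameters and the 0/1 return convention are inferred, and the ctx member names are the ones mentioned in this comment, so treat this as a sketch:

```cpp
int llama_eval(
        struct llama_context * ctx,
           const llama_token * tokens,
                         int   n_tokens,
                         int   n_past,
                         int   n_threads) {
    // Time the evaluation and record it in the context so the timing
    // report stays meaningful.
    const int64_t t_start_us = ggml_time_us();

    if (!llama_eval_internal(*ctx, tokens, n_tokens, n_past, n_threads)) {
        fprintf(stderr, "%s: failed to eval\n", __func__);
        return 1;
    }

    ctx->t_eval_us += ggml_time_us() - t_start_us;
    ctx->n_eval    += n_tokens; // count evaluated tokens

    return 0;
}
```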
        temp,
        repeat_penalty);

    ctx->t_sample_us += ggml_time_us() - t_start_sample_us;
ctx->n_sample should also be increased here.
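I.e. right next to the timing update quoted above, something along these lines:

```cpp
    ctx->t_sample_us += ggml_time_us() - t_start_sample_us;
    ctx->n_sample    += 1; // count sampled tokens so per-token stats stay correct
```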
llama.cpp (outdated):
if (n_max_tokens < (int) res.size()) {
    fprintf(stderr, "%s: too many tokens\n", __func__);
    return 1;
Returning 1 here is ambiguous: it is impossible to tell whether there was an error or whether only 1 token was returned. Consider returning a negative number to indicate failure (maybe the number of tokens, negated?).
According to the documentation in llama.h it should return -1 on failure. However, it may be necessary to rethink this or add another API to support cases like the perplexity computation, in which a big document has to be tokenized.
Good catch. Let's fix the perplexity computation on master to avoid blocking the development for too long.
With the new API change that you suggested, we can now query for the number of needed tokens and allocate the necessary memory. Some improvements might also be possible.
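A sketch of the calling pattern this enables, e.g. for tokenizing a large document for perplexity. It assumes the negative-return convention discussed here (the negated required token count when the buffer is too small) and a llama_tokenize() signature inferred from the snippets above:

```cpp
#include <string>
#include <vector>

#include "llama.h"

// Tokenize an arbitrarily large document in two passes: try once, and if the
// buffer was too small, resize to the required count and retry.
static std::vector<llama_token> tokenize_all(struct llama_context * ctx,
                                             const std::string & text,
                                             bool add_bos) {
    // Start with a guess; for very large inputs this may be too small.
    std::vector<llama_token> tokens(1024);

    int n = llama_tokenize(ctx, text.c_str(), tokens.data(), (int) tokens.size(), add_bos);
    if (n < 0) {
        // -n is the number of tokens actually required; retry with that size.
        tokens.resize(-n);
        n = llama_tokenize(ctx, text.c_str(), tokens.data(), (int) tokens.size(), add_bos);
    }

    tokens.resize(n > 0 ? n : 0);
    return tokens;
}
```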
Thank you all for the detailed look! TODO: The perplexity computation needs a small fix - see comment above by @slaren
This is a pretty big change, but it has to happen in order to allow building more examples by reusing the code properly. It will also allow other programming languages to interface with the code.

- Move ggml_quantize_q4_0() and ggml_quantize_q4_1() from utils to ggml
- Add llama.h and llama.cpp that implement most of the model handling that was previously in main.cpp
- main.cpp now uses the llama library
- quantize now uses the llama library, exposed through the API

Please test that everything is working. I haven't tested the perplexity parameter at all.

Sorry if this conflicts with your ongoing work!
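Since the header above still says "// TODO: show sample usage", here is a rough sketch of what minimal usage of the new C-style API could look like. Signatures are inferred from the snippets in this thread and may differ from the final llama.h; llama_free() in particular is assumed to exist:

```cpp
#include <cstdio>
#include <vector>

#include "llama.h"

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model.bin>\n", argv[0]);
        return 1;
    }

    auto lparams = llama_context_default_params();

    llama_context * ctx = llama_init_from_file(argv[1], lparams);
    if (ctx == NULL) {
        fprintf(stderr, "%s: failed to load model '%s'\n", __func__, argv[1]);
        return 1;
    }

    // Tokenize a prompt, letting the library prepend the BOS token.
    std::vector<llama_token> tokens(64);
    const int n_tokens = llama_tokenize(ctx, "Hello world", tokens.data(),
                                        (int) tokens.size(), /*add_bos=*/true);
    if (n_tokens < 0) {
        fprintf(stderr, "%s: tokenization failed\n", __func__);
        return 1;
    }

    // Evaluate the whole prompt: no past context yet, 4 threads.
    if (llama_eval(ctx, tokens.data(), n_tokens, 0, 4) != 0) {
        fprintf(stderr, "%s: eval failed\n", __func__);
        return 1;
    }

    llama_free(ctx); // assumed cleanup call
    return 0;
}
```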