LLAMA_CPP plugin - basic version with direct file loading #891
Conversation
```cpp
void LlamaCppModel::export_model(std::ostream& output_stream) const {
    std::ifstream in(m_gguf_fname, std::ios::binary);
    output_stream << in.rdbuf();
```
do we need to implement this method at all?
It's pure virtual in `ov::ICompiledModel`, and the only possible implementation in this flow is as easy as shown.
or throw exception ;-)
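For reference, the two options being weighed here, sketched under the assumption that `m_gguf_fname` holds the GGUF path captured at compile time; the first is essentially what the diff above already does, and the second uses the plugin-local `THROW_NOT_IMPLEMENTED` macro mentioned later in this review:

```cpp
// Option 1 (as in the diff): "export" by streaming the original GGUF file back out,
// since the compiled model is nothing more than that file on disk.
void LlamaCppModel::export_model(std::ostream& output_stream) const {
    std::ifstream in(m_gguf_fname, std::ios::binary);
    output_stream << in.rdbuf();
}

// Option 2 (reviewer's alternative): keep the override only because it is pure virtual
// in ov::ICompiledModel, and simply refuse to export.
// void LlamaCppModel::export_model(std::ostream&) const {
//     THROW_NOT_IMPLEMENTED;
// }
```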
```cpp
const int64_t* sequence_start_ptr = data_ptr /* + seq_idx */;

for (size_t tok_idx = 0; tok_idx < sequence_length; ++tok_idx) {
```
am I right that llama.cpp processes prompt token by token?
This exact loop just adds tokens to the "llama.cpp batch", where the "batch" is the set of tokens for a single text sequence to be processed. `llama_batch_add_reimpl` is just a carbon copy of the internal helper function in llama.cpp (https://github.com/ggerganov/llama.cpp/blob/2c4fb69246834503db7b78bcbedcef506bbc60c4/common/common.cpp#L1328); filling this struct can probably be done better, but I didn't want to move too far from llama.cpp at that moment.
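For context, the internal llama.cpp helper being mirrored looks roughly like this (paraphrased from the linked `common.cpp`; the `llama_batch`, `llama_token`, `llama_pos`, and `llama_seq_id` types come from `llama.h`), so the reimplementation only appends tokens to a batch rather than running inference token by token:

```cpp
#include <vector>
#include "llama.h"  // llama_batch, llama_token, llama_pos, llama_seq_id

// Paraphrase of llama.cpp's llama_batch_add: append one token to the batch, recording
// its position, the sequence ids it belongs to, and whether logits are wanted for it.
void llama_batch_add(struct llama_batch& batch,
                     llama_token id,
                     llama_pos pos,
                     const std::vector<llama_seq_id>& seq_ids,
                     bool logits) {
    batch.token   [batch.n_tokens] = id;
    batch.pos     [batch.n_tokens] = pos;
    batch.n_seq_id[batch.n_tokens] = seq_ids.size();
    for (size_t i = 0; i < seq_ids.size(); ++i) {
        batch.seq_id[batch.n_tokens][i] = seq_ids[i];
    }
    batch.logits  [batch.n_tokens] = logits;
    batch.n_tokens++;
}
```

The diff continues below with the call to the reimplemented helper: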
```cpp
    const int64_t token_id = sequence_start_ptr[tok_idx];
    llama_batch_add_reimpl(batch,
                           token_id,
                           *(m_compiled_model_ptr->num_tokens_processed_ptr),
```
why do you need to store this value? You can obtain it via the `position_ids` parameter
Ok, that makes the job easier
Done
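A minimal sketch of the agreed change, with illustrative names only: `position_ids_tensor` stands for however the infer request exposes the `position_ids` input, and the trailing arguments assume `llama_batch_add_reimpl` mirrors the `llama_batch_add` signature shown above:

```cpp
// Illustrative only: take the token position from the position_ids input
// instead of a counter stored on the compiled model.
const int64_t* position_ptr = position_ids_tensor.data<int64_t>();  // assumed to mirror input_ids layout

const int64_t token_id = sequence_start_ptr[tok_idx];
llama_batch_add_reimpl(batch,
                       token_id,
                       static_cast<llama_pos>(position_ptr[tok_idx]),
                       /* seq_ids = */ {0},
                       /* output logits = */ tok_idx == sequence_length - 1);
```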
Can we have a README file that explains how to build and set up the plugin, prepare the model, and run inference?
…#23432) Added an extra conditional branch specifically for LLAMA_CPP_* plugins (openvinotoolkit/openvino_contrib#891) that need to manage loading the model directly from disk on their own without instantiating ov::Model.
@vshampor, can you please try two subsequent text generations (with two prompts)? I faced problems on my end, and the reason may be in the KV-cache reset.
I checked differently and it seems like it works, but I am getting errors in another workflow (with llm_bench); maybe it is my problem.
```cpp
LlamaCppModel(const std::shared_ptr<ov::Model>& ov_model,
              std::istream& input_file,
              const std::shared_ptr<const IPlugin>& plugin);
```
Do we need the 2 ctors above? I thought we agreed to use `gguf_fname` only.
The ctors' definitions had only `THROW_NOT_IMPLEMENTED` inside them, but ok, I removed the ctors and now throw not-implemented in the corresponding `compile_model` overloads (can't remove these since they are pure virtual).
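A sketch of what that resolution looks like, assuming the plugin class is named `LlamaCppPlugin` (the name is illustrative) and using the `ov::IPlugin::compile_model` overload that takes an `ov::Model`:

```cpp
// Sketch: the ov::Model-based overloads cannot be removed (they are pure virtual in
// ov::IPlugin), so they stay but immediately throw, since this plugin only supports
// direct GGUF file loading by path.
std::shared_ptr<ov::ICompiledModel> LlamaCppPlugin::compile_model(
        const std::shared_ptr<const ov::Model>& model,
        const ov::AnyMap& properties) const {
    THROW_NOT_IMPLEMENTED;
}
```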
```cpp
public:
    LlamaCppState() = delete;
    LlamaCppState(const std::shared_ptr<const LlamaCppModel>& model_ptr)
        : m_model_ptr(model_ptr),
          IVariableState("llama_cpp_state") {}
    void reset() override {
```
we could also implement `get_state` and `set_state`
I suppose this could be done to manually set the KV-cache, but I would need a reference to a real use case first so that I could set up some kind of acceptance testing for it.
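For the two-prompt scenario raised earlier in this thread, a sketch of what `reset()` plausibly needs to do; `get_llama_context()` is a hypothetical accessor, and `llama_kv_cache_clear` is the llama.cpp call for dropping all cached KV entries:

```cpp
// Sketch only: clear llama.cpp's KV cache so a second prompt starts from a clean state.
// get_llama_context() is a hypothetical accessor; the real plugin may hold the context elsewhere.
void LlamaCppState::reset() {
    llama_context* ctx = m_model_ptr->get_llama_context();
    llama_kv_cache_clear(ctx);  // drop all cached KV entries
}
```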
We need to make CI green before merge
```yaml
types:
  - opened
  - reopened
  - synchronize
```
You can just use `pull_request` without any `types`; it covers the `synchronize` type that, in turn, covers the `opened` and `reopened` types.
Done
```yaml
jobs:
  build_ubuntu20:
    runs-on: ubuntu-20.04
```
It might be better to use a more powerful runner for the build job, something like `ubuntu-20.04-8-cores`, to reduce the build time.
Done, I hope the self-hosted runner pool is big enough to actually improve the total check runtime compared to the GitHub-hosted pool.
…)" This reverts commit 8759969.
…openvinotoolkit#23432) Added an extra conditional branch specifically for LLAMA_CPP_* plugins (openvinotoolkit/openvino_contrib#891) that need to manage loading the model directly from disk on their own without instantiating ov::Model.
Adds the plugin with the `LLAMA_CPP` device name, which performs inference using `libllama.so` internally while providing the familiar OV API to the user. The GGUF files are loaded directly in the `core.compile_model` call by providing the path to the .gguf on disk.
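A minimal usage sketch based on the description above (the `model.gguf` path is illustrative; input/output tensor names are not specified here, so the example stops after creating the request):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // The plugin loads the GGUF file from disk itself; no ov::Model is built on the OV side.
    ov::CompiledModel compiled = core.compile_model("model.gguf", "LLAMA_CPP");
    ov::InferRequest request = compiled.create_infer_request();
    // ... fill the request's input tensors and call request.infer() as with any other OV device.
    return 0;
}
```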