
LLAMA_CPP plugin - basic version with direct file loading #891

Merged — 27 commits merged into openvinotoolkit:master on Mar 25, 2024

Conversation

@vshampor (Contributor) commented Mar 13, 2024

Adds a plugin with the LLAMA_CPP device name that performs inference via libllama.so internally while exposing the familiar OV API to the user. GGUF files are loaded directly in core.compile_model by providing the path to the .gguf file on disk.
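A minimal usage sketch of what this enables (the model path and the tensor handling are illustrative; the plugin is driven through the standard OpenVINO runtime API):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // The .gguf file is handed to compile_model() directly; no ov::Model or IR
    // conversion step is involved for the LLAMA_CPP device.
    ov::CompiledModel compiled = core.compile_model("/path/to/model.gguf", "LLAMA_CPP");
    ov::InferRequest request = compiled.create_infer_request();
    // input_ids / position_ids tensors would be filled here as in any OV LLM loop,
    // followed by request.infer() and reading the logits output.
    return 0;
}
```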

@github-actions github-actions bot added category: build OpenVINO cmake script / infra dependencies Pull requests that update a dependency file labels Mar 13, 2024
@vshampor vshampor changed the title from "LLAMA_CPP plugin - basic version with direct file" to "LLAMA_CPP plugin - basic version with direct file loading" Mar 13, 2024
@github-actions github-actions bot added the category: CI OpenVINO public CI label Mar 13, 2024
@vshampor vshampor changed the title from "LLAMA_CPP plugin - basic version with direct file loading" to "[DRAFT] LLAMA_CPP plugin - basic version with direct file loading" Mar 13, 2024
@vshampor vshampor marked this pull request as ready for review March 13, 2024 16:01
@vshampor vshampor requested review from a team as code owners March 13, 2024 16:01
@vshampor vshampor force-pushed the llama_cpp_plugin branch 5 times, most recently from 8b19bef to c3da5c5 Compare March 13, 2024 16:17

void LlamaCppModel::export_model(std::ostream& output_stream) const {
    std::ifstream in(m_gguf_fname, std::ios::binary);
    output_stream << in.rdbuf();
Contributor:

Do we need to implement this method at all?

vshampor (Contributor, Author):

It's pure virtual in ov::ICompiledModel, and the only possible implementation in this flow is as easy as shown.

Contributor:

or throw exception ;-)
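For comparison, the throwing variant would be a one-liner along these lines (a sketch using OpenVINO's stock exception macro; whether export is kept or rejected is exactly the open question here):

```cpp
void LlamaCppModel::export_model(std::ostream& /* output_stream */) const {
    OPENVINO_THROW("Exporting a model is not supported by the LLAMA_CPP plugin");
}
```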


const int64_t* sequence_start_ptr = data_ptr /* + seq_idx */;

for (size_t tok_idx = 0; tok_idx < sequence_length; ++tok_idx) {
Contributor:

Am I right that llama.cpp processes the prompt token by token?

vshampor (Contributor, Author):

This exact loop just adds tokens to the llama.cpp "batch", where the batch holds the tokens of a single text sequence to be processed. llama_batch_add_reimpl is just a carbon copy of the internal helper function in llama.cpp (https://github.com/ggerganov/llama.cpp/blob/2c4fb69246834503db7b78bcbedcef506bbc60c4/common/common.cpp#L1328). Filling this struct can probably be done better, but I didn't want to move too far away from llama.cpp at this point.
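For reference, the upstream helper being re-implemented looks roughly like this (paraphrased from common/common.cpp at the linked llama.cpp revision):

```cpp
void llama_batch_add(struct llama_batch& batch,
                     llama_token id,
                     llama_pos pos,
                     const std::vector<llama_seq_id>& seq_ids,
                     bool logits) {
    // Append one token to the batch: its id, its position in the sequence,
    // the sequence ids it belongs to, and whether logits are requested for it.
    batch.token   [batch.n_tokens] = id;
    batch.pos     [batch.n_tokens] = pos;
    batch.n_seq_id[batch.n_tokens] = seq_ids.size();
    for (size_t i = 0; i < seq_ids.size(); ++i) {
        batch.seq_id[batch.n_tokens][i] = seq_ids[i];
    }
    batch.logits  [batch.n_tokens] = logits;
    batch.n_tokens++;
}
```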

const int64_t token_id = sequence_start_ptr[tok_idx];
llama_batch_add_reimpl(batch,
                       token_id,
                       *(m_compiled_model_ptr->num_tokens_processed_ptr),
Contributor:

Why do you need to store this value? You can obtain it via the position_ids parameter.

vshampor (Contributor, Author):

Ok, that makes the job easier

vshampor (Contributor, Author):

Done
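A sketch of what the reworked loop amounts to (variable names are illustrative, and the reimplemented helper is assumed to keep the upstream llama_batch_add signature):

```cpp
// Hypothetical shape of the loop once positions come from the position_ids input
// instead of a counter stored on the compiled model.
const int64_t* token_ids    = input_ids_tensor.data<int64_t>();
const int64_t* position_ids = position_ids_tensor.data<int64_t>();
for (size_t tok_idx = 0; tok_idx < sequence_length; ++tok_idx) {
    llama_batch_add_reimpl(batch,
                           static_cast<llama_token>(token_ids[tok_idx]),
                           static_cast<llama_pos>(position_ids[tok_idx]),
                           {0},     // single sequence id
                           false);  // logits are requested separately for the last token
}
```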

@AlexKoff88 (Contributor):

Can we have a README file that explains how to build and set up the plugin, prepare the model, and run inference?

github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this pull request on Mar 14, 2024 (…#23432):

Added an extra conditional branch specifically for LLAMA_CPP_* plugins (openvinotoolkit/openvino_contrib#891) that need to manage loading the model directly from disk on their own, without instantiating ov::Model.
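The actual change lives inside openvino's Core::compile_model; the following self-contained sketch only illustrates the resulting behavior from the API side (the helper name is made up):

```cpp
#include <memory>
#include <string>
#include <openvino/openvino.hpp>

// Hypothetical helper mirroring the new dispatch: devices whose names start with
// LLAMA_CPP receive the model path as-is instead of an instantiated ov::Model.
bool wants_direct_file_loading(const std::string& device_name) {
    return device_name.rfind("LLAMA_CPP", 0) == 0;  // name starts with "LLAMA_CPP"
}

ov::CompiledModel compile_from_path(ov::Core& core,
                                    const std::string& model_path,
                                    const std::string& device_name) {
    if (wants_direct_file_loading(device_name)) {
        // The plugin opens the .gguf itself; no ov::Model is built in core.
        return core.compile_model(model_path, device_name);
    }
    // Regular flow: read the model first, then compile the resulting ov::Model.
    std::shared_ptr<ov::Model> model = core.read_model(model_path);
    return core.compile_model(model, device_name);
}
```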
@AlexKoff88 (Contributor):

@vshampor, can you please try two subsequent text generations (with two prompts)? I ran into problems on my end, and the cause may be in the KV-cache reset.

@AlexKoff88 (Contributor):

> @vshampor, can you please try two subsequent text generations (with two prompts)? I ran into problems on my end, and the cause may be in the KV-cache reset.

I checked differently and it seems to work, but I am getting errors in another workflow (with llm_bench); then again, that may be a problem on my side.
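A hedged sketch of the scenario being discussed, i.e. two generations through one infer request with an explicit state reset in between (standard OpenVINO API; prompt tokenization and sampling are elided):

```cpp
#include <openvino/openvino.hpp>

ov::Core core;
ov::CompiledModel compiled = core.compile_model("/path/to/model.gguf", "LLAMA_CPP");
ov::InferRequest request = compiled.create_infer_request();

// ...generate from the first prompt: fill input_ids/position_ids, call infer() in a loop...

// Without clearing the llama.cpp KV-cache, tokens and positions from the first
// prompt leak into the second one, so the plugin state is reset between generations.
for (auto& state : request.query_state()) {
    state.reset();
}

// ...generate from the second prompt...
```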

@vshampor vshampor force-pushed the llama_cpp_plugin branch 8 times, most recently from 17bec0c to 7046854 Compare March 18, 2024 09:50

LlamaCppModel(const std::shared_ptr<ov::Model>& ov_model,
              std::istream& input_file,
              const std::shared_ptr<const IPlugin>& plugin);
Contributor:

Do we need the 2 ctors above? I thought we agreed to use gguf_fname only.

vshampor (Contributor, Author):

The ctors' definitions only contained THROW_NOT_IMPLEMENTED, but OK — I removed the ctors and now throw "not implemented" from the corresponding compile_model overloads (those can't be removed since they are pure virtual).
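i.e., roughly along these lines in the plugin class (a sketch; the exact macro and class name may differ in the actual code):

```cpp
std::shared_ptr<ov::ICompiledModel> LlamaCppPlugin::compile_model(
        const std::shared_ptr<const ov::Model>& /* model */,
        const ov::AnyMap& /* properties */) const {
    // Only the path-based compile_model flow is supported for GGUF files.
    OPENVINO_THROW("The LLAMA_CPP plugin only supports compile_model() called with a path to a .gguf file");
}
```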

public:
LlamaCppState() = delete;
LlamaCppState(const std::shared_ptr<const LlamaCppModel>& model_ptr) : m_model_ptr(model_ptr), IVariableState("llama_cpp_state") {}
void reset() override {
Contributor:

we could also implement get_state and set_state

vshampor (Contributor, Author):

I suppose this could be done to set the KV-cache manually, but I would first need a reference to a real use case so that I could set up some kind of acceptance testing for it.
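If a use case does show up, the llama.cpp side of get_state/set_state could plausibly be built on its state (de)serialization API, roughly like this (C API names as of the llama.cpp revision used in this PR; the wiring into ov::IVariableState is omitted and untested):

```cpp
#include <cstdint>
#include <vector>
#include "llama.h"

// Serialize the full llama.cpp context state (KV-cache, RNG, ...) into a buffer.
std::vector<uint8_t> dump_llama_state(llama_context* ctx) {
    std::vector<uint8_t> buf(llama_get_state_size(ctx));
    llama_copy_state_data(ctx, buf.data());
    return buf;
}

// Restore a previously captured state into the context.
void restore_llama_state(llama_context* ctx, std::vector<uint8_t>& buf) {
    llama_set_state_data(ctx, buf.data());
}
```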

@ilya-lavrenov ilya-lavrenov added this to the 2024.1 milestone Mar 22, 2024
@ilya-lavrenov (Contributor):

We need to make CI green before merge

@vshampor vshampor force-pushed the llama_cpp_plugin branch 2 times, most recently from d019d8f to cbc77b7 Compare March 25, 2024 10:01
Comment on lines 5 to 8
types:
  - opened
  - reopened
  - synchronize
Collaborator:

You can just use pull_request without any types; the default activity types already cover opened, reopened, and synchronize.

vshampor (Contributor, Author):

Done


jobs:
  build_ubuntu20:
    runs-on: ubuntu-20.04
Collaborator:

It might be better to use a more powerful runner for the build job, something like ubuntu-20.04-8-cores, to reduce the build time.

vshampor (Contributor, Author):

Done. I hope the self-hosted runner pool is big enough to actually improve the total runtime of the check compared to the GitHub-hosted pool.

@vshampor vshampor requested a review from akashchi March 25, 2024 16:29
@ilya-lavrenov ilya-lavrenov merged commit 8759969 into openvinotoolkit:master Mar 25, 2024
4 of 6 checks passed
akladiev added a commit that referenced this pull request Mar 26, 2024
akladiev added a commit that referenced this pull request Mar 26, 2024
alvoron pushed a commit to alvoron/openvino that referenced this pull request on Apr 29, 2024 (…openvinotoolkit#23432):

Added an extra conditional branch specifically for LLAMA_CPP_* plugins (openvinotoolkit/openvino_contrib#891) that need to manage loading the model directly from disk on their own, without instantiating ov::Model.
Labels: category: build (OpenVINO cmake script / infra), category: CI (OpenVINO public CI)
4 participants