LLAMA_CPP plugin - basic version with direct file loading #891
Conversation
```cpp
void LlamaCppModel::export_model(std::ostream& output_stream) const {
    std::ifstream in(m_gguf_fname, std::ios::binary);
    output_stream << in.rdbuf();
```
do we need to implement this method at all?
It's pure virtual in `ov::ICompiledModel`, and the only possible implementation in this flow is as easy as shown.
or throw exception ;-)
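For reference, the two options being weighed here, sketched under the assumption that `m_gguf_fname` holds the GGUF path captured at compile time; the first is essentially what the diff above already does, and the second uses the plugin-local `THROW_NOT_IMPLEMENTED` macro mentioned later in this review:

```cpp
// Option 1 (as in the diff): "export" by streaming the original GGUF file back out,
// since the compiled model is nothing more than that file on disk.
void LlamaCppModel::export_model(std::ostream& output_stream) const {
    std::ifstream in(m_gguf_fname, std::ios::binary);
    output_stream << in.rdbuf();
}

// Option 2 (reviewer's alternative): keep the override only because it is pure virtual
// in ov::ICompiledModel, and simply refuse to export.
// void LlamaCppModel::export_model(std::ostream&) const {
//     THROW_NOT_IMPLEMENTED;
// }
```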
```cpp
const int64_t* sequence_start_ptr = data_ptr /* + seq_idx */;

for (size_t tok_idx = 0; tok_idx < sequence_length; ++tok_idx) {
```
am I right that llama.cpp processes prompt token by token?
This exact loop just adds tokens to the "llama.cpp batch", where the "batch" is the set of tokens for a single text sequence to be processed. `llama_batch_add_reimpl` is just a carbon copy of the internal helper function in llama.cpp (https://github.com/ggerganov/llama.cpp/blob/2c4fb69246834503db7b78bcbedcef506bbc60c4/common/common.cpp#L1328); filling this struct can probably be done better, but I didn't want to move too far from llama.cpp at that moment.
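For context, the internal llama.cpp helper being mirrored looks roughly like this (paraphrased from the linked `common.cpp`; the `llama_batch`, `llama_token`, `llama_pos`, and `llama_seq_id` types come from `llama.h`), so the reimplementation only appends tokens to a batch rather than running inference token by token:

```cpp
#include <vector>
#include "llama.h"  // llama_batch, llama_token, llama_pos, llama_seq_id

// Paraphrase of llama.cpp's llama_batch_add: append one token to the batch, recording
// its position, the sequence ids it belongs to, and whether logits are wanted for it.
void llama_batch_add(struct llama_batch& batch,
                     llama_token id,
                     llama_pos pos,
                     const std::vector<llama_seq_id>& seq_ids,
                     bool logits) {
    batch.token   [batch.n_tokens] = id;
    batch.pos     [batch.n_tokens] = pos;
    batch.n_seq_id[batch.n_tokens] = seq_ids.size();
    for (size_t i = 0; i < seq_ids.size(); ++i) {
        batch.seq_id[batch.n_tokens][i] = seq_ids[i];
    }
    batch.logits  [batch.n_tokens] = logits;
    batch.n_tokens++;
}
```

The diff continues below with the call to the reimplemented helper: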
```cpp
    const int64_t token_id = sequence_start_ptr[tok_idx];
    llama_batch_add_reimpl(batch,
                           token_id,
                           *(m_compiled_model_ptr->num_tokens_processed_ptr),
```
why do you need to store this value? You can obtain it via the `position_ids` parameter
Ok, that makes the job easier
Done
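A minimal sketch of the agreed change, with illustrative names only: `position_ids_tensor` stands for however the infer request exposes the `position_ids` input, and the trailing arguments assume `llama_batch_add_reimpl` mirrors the `llama_batch_add` signature shown above:

```cpp
// Illustrative only: take the token position from the position_ids input
// instead of a counter stored on the compiled model.
const int64_t* position_ptr = position_ids_tensor.data<int64_t>();  // assumed to mirror input_ids layout

const int64_t token_id = sequence_start_ptr[tok_idx];
llama_batch_add_reimpl(batch,
                       token_id,
                       static_cast<llama_pos>(position_ptr[tok_idx]),
                       /* seq_ids = */ {0},
                       /* output logits = */ tok_idx == sequence_length - 1);
```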
Can we have a README file that explains how to build and set up the plugin, prepare the model, and run inference?
…#23432) Added an extra conditional branch specifically for LLAMA_CPP_* plugins (openvinotoolkit/openvino_contrib#891) that need to manage loading the model directly from disk on their own without instantiating ov::Model.
@vshampor, can you please try two subsequent text generations (with two prompts)? I faced problems on my end, and the reason may be in the KV-cache reset.
I checked differently and it seems like it works, but I am getting errors in another workflow (with llm_bench); maybe it is my problem.
```cpp
LlamaCppModel(const std::shared_ptr<ov::Model>& ov_model,
              std::istream& input_file,
              const std::shared_ptr<const IPlugin>& plugin);
```
Do we need the 2 ctors above? I thought we agreed to use `gguf_fname` only.
The ctors' definitions had only `THROW_NOT_IMPLEMENTED` inside them, but ok, I removed the ctors and now throw not-implemented in the corresponding `compile_model` overloads (can't remove these since they are pure virtual).
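A sketch of what that resolution looks like, assuming the plugin class is named `LlamaCppPlugin` (the name is illustrative) and using the `ov::IPlugin::compile_model` overload that takes an `ov::Model`:

```cpp
// Sketch: the ov::Model-based overloads cannot be removed (they are pure virtual in
// ov::IPlugin), so they stay but immediately throw, since this plugin only supports
// direct GGUF file loading by path.
std::shared_ptr<ov::ICompiledModel> LlamaCppPlugin::compile_model(
        const std::shared_ptr<const ov::Model>& model,
        const ov::AnyMap& properties) const {
    THROW_NOT_IMPLEMENTED;
}
```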
```cpp
public:
    LlamaCppState() = delete;
    LlamaCppState(const std::shared_ptr<const LlamaCppModel>& model_ptr)
        : m_model_ptr(model_ptr),
          IVariableState("llama_cpp_state") {}
    void reset() override {
```
we could also implement `get_state` and `set_state`
I suppose this could be done to manually set the KV-cache, but I would need a reference to a real use case first so that I could set up some kind of acceptance testing for it.
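For the two-prompt scenario raised earlier in this thread, a sketch of what `reset()` plausibly needs to do; `get_llama_context()` is a hypothetical accessor, and `llama_kv_cache_clear` is the llama.cpp call for dropping all cached KV entries:

```cpp
// Sketch only: clear llama.cpp's KV cache so a second prompt starts from a clean state.
// get_llama_context() is a hypothetical accessor; the real plugin may hold the context elsewhere.
void LlamaCppState::reset() {
    llama_context* ctx = m_model_ptr->get_llama_context();
    llama_kv_cache_clear(ctx);  // drop all cached KV entries
}
```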
We need to make CI green before merge
```yaml
types:
  - opened
  - reopened
  - synchronize
```
You can just use `pull_request` without any `types`; it covers the `synchronize` type that, in turn, covers the `opened` and `reopened` types.
Done
```yaml
jobs:
  build_ubuntu20:
    runs-on: ubuntu-20.04
```
It might be better to use a more powerful runner for the build job, something like `ubuntu-20.04-8-cores`, to reduce the build time.
Done, I hope the self-hosted runner pool is big enough to actually improve the total check runtime compared to the GitHub-hosted pool.
…)" This reverts commit 8759969.
…openvinotoolkit#23432) Added an extra conditional branch specifically for LLAMA_CPP_* plugins (openvinotoolkit/openvino_contrib#891) that need to manage loading the model directly from disk on their own without instantiating ov::Model.
Adds the plugin with the `LLAMA_CPP` device name, which performs inference using `libllama.so` internally while providing the familiar OV API to the user. The GGUF files are loaded directly in the `core.compile_model` call by providing the path to the .gguf on disk.
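A minimal usage sketch based on the description above (the `model.gguf` path is illustrative; input/output tensor names are not specified here, so the example stops after creating the request):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    // The plugin loads the GGUF file from disk itself; no ov::Model is built on the OV side.
    ov::CompiledModel compiled = core.compile_model("model.gguf", "LLAMA_CPP");
    ov::InferRequest request = compiled.create_infer_request();
    // ... fill the request's input tensors and call request.infer() as with any other OV device.
    return 0;
}
```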