Support llama.cpp #121

Closed
ParetoOptimalDev opened this issue Oct 31, 2023 · 32 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@ParetoOptimalDev

ParetoOptimalDev commented Oct 31, 2023

I can't get ollama to work with GPU acceleration, so I'm using llama.cpp, which has a Nix flake that worked perfectly (once I understood "cuda" was the CUDA version and not the CUDA library) 😍

It looks like llama.cpp has a different API, so I can't just use gptel-make-ollama. Does that sound correct?

Then again, I see something about llama-cpp-python having an "OpenAI-like API". The downside of this is that I'll have to package llama-cpp-python for Nix.

Maybe I can use that and gptel somehow? Just looking for a bit of guidance, but I'll tinker around when I get time and try things. If I find anything useful, I'll report back here.

@ParetoOptimalDev
Author

ParetoOptimalDev commented Oct 31, 2023

Good news: it looks like llama-cpp-python is packaged by this awesome repo:

https://github.com/nixified-ai/flake

and I'll soon find out if it can be run with:

nix shell github:nixified-ai/flake -c llama-cpp-python

Edit: hmm, it doesn't expose that; I'll dig in more later to try out the compatibility of llama-cpp-python and gptel.

@karthink
Owner

karthink commented Oct 31, 2023

I couldn't find any info on llama-cpp-python's web API except for what's in the GitHub README, but if what it says is correct, support for it in gptel should be trivial:

(defvar gptel--llama-cpp-python 
  (gptel-make-openai
   "llama-cpp-python"
   :stream t                   ;If llama-cpp-python supports streaming responses
   :protocol "ws"
   :host "localhost:8000"
   :endpoint "/api/v1/chat-stream"
   :models '("list" "of" "available" "model" "names"))
  "GPTel backend for llama-cpp-python.")

;; Make it the default
(setq-default gptel-backend gptel--llama-cpp-python
              gptel-model   "name")

Unfortunately I can't test this -- no GPU, and I'm also on Nix so it's not easy to install.

@karthink
Owner

karthink commented Oct 31, 2023

It looks like llama.cpp has a different API, so I can't just use gptel-make-ollama. Does that sound correct?

Do you have a link to llama.cpp's (not -python) API documentation?

EDIT:

I can't get ollama to work with GPU acceleration

Incidentally, I couldn't get it to run on NixOS at all, and couldn't get the package to build when I tried the latest version. The latest binary release from Ollama worked perfectly (including GPU support) on Arch on a different machine.

@ParetoOptimalDev
Author

There isn't any; I found this related issue:

ggerganov/llama.cpp#1742

That's where I learned about llama-cpp-python.

@ParetoOptimalDev
Author

ParetoOptimalDev commented Oct 31, 2023

I think that I'm going to be able to use what you linked above after this finishes (but it's 17GB):

nix run github:nixified-ai/flake#packages.x86_64-linux.textgen-nvidia 

@karthink
Owner

I think that I'm going to be able to use what you linked above after this finishes (but it's 16GB or more):

Cool, please let me know if it works as expected -- including the streaming responses bit.

@ParetoOptimalDev
Author

It doesn't seem to work. I noticed there are examples in the text-generation-webui repo though:

https://github.com/oobabooga/text-generation-webui/blob/main/api-examples/api-example-chat-stream.py

So I modified the above to use port 5005:

(defvar gptel--llama-cpp-python
  (gptel-make-openai
   "llama-cpp-python"
   :stream t               ;If llama-cpp-python supports streaming responses
   :protocol "http"
   :host "localhost:5005"
   :models '("nous-hermes-llama2-13b.Q4_0.gguf"))
  "GPTel backend for llama-cpp-python.")

It still didn't work and gave a 404 though.

@karthink
Owner

karthink commented Oct 31, 2023

It still didn't work and gave a 404 though.

I edited the snippet (added an :endpoint field), any luck?

EDIT: Also it looks like the protocol is not http, it's ws. I'm checking if Curl handles that...

@ParetoOptimalDev
Author

It still didn't work and gave a 404 though.

I edited the snippet (added an :endpoint field), any luck?

EDIT: Also it looks like the protocol is not http, it's ws. I'm checking if Curl handles that...

Ah, you are right. It didn't work. curl should support ws.

@karthink
Owner

karthink commented Oct 31, 2023

Did you try it with the :protocol set to "ws"?

@karthink
Owner

karthink commented Oct 31, 2023

Ah, I just realized it's going to fail anyway because gptel expects an HTTP 200/OK message. But it will help to check if the API works as expected with the following Curl command:

curl --location --silent --compressed --disable -XPOST -w(abcdefgh . %{size_header}) -m60 -D- -d'{"model":"nous-hermes-llama2-13b.Q4_0.gguf","messages":[{"role":"system","content":"You are a large language model living in Emacs and a helpful assistant. Respond concisely."},{"role":"user","content":"Hello"}],"stream":true,"temperature":1.0}' -H"Content-Type: application/json" "ws://localhost:5005/api/v1/chat-stream"

The output will help me add support for it as well.

@ParetoOptimalDev
Author

$ curl --location --silent --compressed --disable -XPOST -w(abcdefgh . %{size_header}) -m60 -D- -d'{"model":"nous-hermes-llama2-13b.Q4_0.gguf","messages":[{"role":"system","content":"You are a large language model living in Emacs and a helpful assistant. Respond concisely."},{"role":"user","content":"Hello"}],"stream":true,"temperature":1.0}' -H"Content-Type: application/json" "ws://localhost:5005/api/v1/chat-stream"
Malformed access time modifier ‘a’
$ curl --location --silent --compressed --disable -XPOST -w "(abcdefgh . %{size_header})" -m60 -D- -d'{"model":"nous-hermes-llama2-13b.Q4_0.gguf","messages":[{"role":"system","content":"You are a large language model living in Emacs and a helpful assistant. Respond concisely."},{"role":"user","content":"Hello"}],"stream":true,"temperature":1.0}' -H"Content-Type: application/json" "ws://localhost:5005/api/v1/chat-stream"
(abcdefgh . 0)

@ParetoOptimalDev
Author

I'm actually unable to get textgen from the nixified-ai flake working anyway, or even just llama-cpp-python. I might look at interoperating purely with llama.cpp again.

The reason being that it's hard to tell which versions of llama-cpp-python will even work with llama.cpp, and I don't understand how to debug them well.

@karthink
Owner

karthink commented Oct 31, 2023

Hmm, I'm guessing I need to look into Curl's websocket support. I don't think there's a quick fix to support llama-cpp-python in gptel after all.

The reason being that it's hard to tell which versions of llama-cpp-python will even work with llama.cpp, and I don't understand how to debug them well.

Local LLM support is a bit of a mess across the board right now.

@ParetoOptimalDev
Author

ParetoOptimalDev commented Nov 3, 2023

It might also be useful to know that litellm exposes tons of LLMs behind an OpenAI-compatible proxy:

https://docs.litellm.ai/docs/simple_proxy
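
If that route pans out, pointing gptel at a locally running litellm proxy should look just like the other OpenAI-compatible configs in this thread. A rough sketch, assuming the proxy speaks the OpenAI chat API over plain HTTP on localhost; the port and model name below are placeholders:

(defvar gptel--litellm
  (gptel-make-openai
   "litellm"
   :stream t                   ;if the proxy supports streaming responses
   :protocol "http"
   :host "localhost:8000"
   :models '("some-local-model"))
  "GPTel backend for a local litellm proxy.")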

However... I'm concerned by this:

This is not even touching on the privacy implications of potentially unnecessarily routing every MemGPT user's personal traffic through a startup's servers. - letta-ai/letta#86 (comment)

Not sure if that's a misunderstanding on my part or if I'm missing something about litellm.

Edit: Maybe I'm misunderstanding... idk... maybe you can sort out whether this is both private and useful, or me-after-a-nap can 😉

litellm isn't a proxy server. we let users spin up an openai-compatible server if they'd like.

It's just a python package for translating llm api calls. I agree with you, unnecessarily routing things through a proxy would be a bit weird.

@havaker

havaker commented Nov 15, 2023

I can't get ollama to work with GPU acceleration

@ParetoOptimalDev
I faced a similar issue recently, but I was able to make a flake that provides GPU-accelerated (CUDA) ollama. If you're using an x86_64-linux system, feel free to check it out: github.com:havaker/ollama-nix.
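
On the gptel side, the stock Ollama backend should then work unchanged. A rough sketch, assuming Ollama's default port 11434 and that gptel-make-ollama accepts the same :host/:stream/:models keys as the gptel-make-openai configs above (the model name is a placeholder):

(defvar gptel--ollama
  (gptel-make-ollama
   "Ollama"
   :stream t
   :host "localhost:11434"
   :models '("mistral:latest"))
  "GPTel backend for a local GPU-accelerated Ollama.")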

@richardmurri

llama.cpp recently added support for the OpenAI API to their built-in server. It was pretty easy to get working with gptel using the following config:

(defvar gptel--llama-cpp
  (gptel-make-openai
   "llama-cpp"
   :stream t
   :protocol "http"
   :host "localhost:8000"
   :models '("test"))
  "GPTel backend for llama-cpp.")

(setq-default gptel-backend gptel--llama-cpp
              gptel-model   "test")

@karthink
Owner

@richardmurri That's fantastic!

@ParetoOptimalDev Let me know if Richard's config works for you, and I can close this issue.

@karthink karthink added the enhancement and help wanted labels Dec 15, 2023
@ParetoOptimalDev
Author

llama.cpp recently added support for the OpenAI API to their built-in server. It was pretty easy to get working with gptel using the following config:

(defvar gptel--llama-cpp
  (gptel-make-openai
   "llama-cpp"
   :stream t
   :protocol "http"
   :host "localhost:8000"
   :models '("test"))
  "GPTel backend for llama-cpp.")

(setq-default gptel-backend gptel--llama-cpp
              gptel-model   "test")

I just tried this and it didn't work for me using llama-server, but perhaps that's not the one with OpenAI support referenced here:

https://github.com/ggerganov/llama.cpp/blob/708e179e8562c2604240df95a2241dea17fd808b/examples/server/README.md?plain=1#L329

@ParetoOptimalDev
Author

ParetoOptimalDev commented Dec 23, 2023

Oh, I think llama-server is specific to the Nix expression, and in the Makefile it points to:

https://github.com/ggerganov/llama.cpp/blob/708e179e8562c2604240df95a2241dea17fd808b/Makefile#L625

The issue is that I usually use nix shell github:ggerganov/llama.cpp -c llama-server, and that doesn't point to an OpenAI-compatible server.

@ParetoOptimalDev
Author

ParetoOptimalDev commented Dec 23, 2023

So I got it working with:

~/code/llama.cpp $ nix develop -c python examples/server/api_like_OAI.py
~/code/llama.cpp $ git diff
diff --git a/flake.nix b/flake.nix
index 4cf28d5..eba31cc 100644
--- a/flake.nix
+++ b/flake.nix
@@ -49,7 +49,7 @@
           ];
         };
         llama-python =
-          pkgs.python3.withPackages (ps: with ps; [ numpy sentencepiece ]);
+          pkgs.python3.withPackages (ps: with ps; [ numpy sentencepiece flask requests ]);
         # TODO(Green-Sky): find a better way to opt-into the heavy ml python runtime
         llama-python-extra =
           pkgs.python3.withPackages (ps: with ps; [ numpy sentencepiece torchWithoutCuda transformers ]);
~/code/llama.cpp $ python examples/server/api_like_OAI.py
 * Serving Flask app 'api_like_OAI'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on http://127.0.0.1:8081
Press CTRL+C to quit

And a modification of the above config to use the default port 8081, like below:

(defvar gptel--llama-cpp
  (gptel-make-openai
   "llama-cpp"
   :stream t
   :protocol "http"
   :host "localhost:8081"
   :models '("test"))
  "GPTel backend for llama-cpp.")

(setq-default gptel-backend gptel--llama-cpp
              gptel-model   "test")

Maybe I can convince llama.cpp to add a flake app for the OpenAI proxy?

@ParetoOptimalDev
Author

I made a pull request to add the OpenAI proxy as a flake app:

ggerganov/llama.cpp#4612

If merged, the process would be simplified to:

Run the server and the proxy:

nix run github:ggerganov/llama.cpp#llama-server
nix run github:ggerganov/llama.cpp#llama-server-openai-proxy

Create a backend to connect to the OpenAI proxy:

(defvar gptel--llama-cpp
  (gptel-make-openai
   "llama-cpp"
   :stream t
   :protocol "http"
   :host "localhost:8081"
   :models '("test"))
  "GPTel backend for llama-cpp.")

(setq-default gptel-backend gptel--llama-cpp
              gptel-model   "test")

@karthink
Owner

@ParetoOptimalDev Thanks for pursuing this. I'm curious to know if the OpenAI-compatible API is easily accessible in the imperative, non-nix version of llama.cpp. If it is, I can add the instructions to the README.

@ParetoOptimalDev
Author

ParetoOptimalDev commented Dec 28, 2023

I'm curious to know if the OpenAI-compatible API is easily accessible in the imperative, non-nix version of llama.cpp. If it is, I can add the instructions to the README.

I think it would work to just do this in the llama.cpp repo:

  • create a Python venv
  • install the flask and requests requirements
  • run python examples/server/api_like_OAI.py

I just created this locally and verified it works with the Nix version, btw:

(defvar gptel--llama-cpp-openai
  (gptel-make-openai
   "llama-cpp-openai"
   :stream nil
   :protocol "http"
   :host "localhost:8081"
   :models '("dolphin-2.2.1-mistral-7b.Q5_K_M.gguf"))
  "GPTel backend for llama-cpp-openai.")

I was actually inspired by your recent, very well put together video, @karthink! 😄

@richardmurri

FWIW, I wasn't using api_like_OAI.py when I said it was working in llama.cpp. I was using the default server binary, created when running make in the base directory. Just specify a port on the command line, something like ./server -m models/mistral-7b-instruct-v0.2.Q5_K_S.gguf --port 8000 -c 4096, and you should be good to go. Make sure your checked-out version is fairly recent.

@ParetoOptimalDev
Author

FWIW, I wasn't using api_like_OAI.py when I said it was working in llama.cpp. I was using the default server binary, created when running make in the base directory. Just specify a port on the command line, something like ./server -m models/mistral-7b-instruct-v0.2.Q5_K_S.gguf --port 8000 -c 4096, and you should be good to go. Make sure your checked-out version is fairly recent.

Ohhh! Thank you... I need to read more carefully:

llama.cpp recently added support for the OpenAI API to their built-in server.

@karthink
Owner

@richardmurri Thanks for the clarification! I'll add instructions for llama.cpp (with the caveat that you need a recent version) to the README.

@karthink
Owner

karthink commented Dec 29, 2023

Do you have a link to the commit or some documentation for the Llama.cpp version that adds support for the OpenAI-compatible API?

EDIT: I found the official documentation but it's a little fuzzy.

karthink added a commit that referenced this issue Dec 29, 2023
* README.org: The llama.cpp server supports OpenAI's API, so we
can reuse it.  Closes #121.
@karthink
Owner

Does llama.cpp respect the system message/directive when used from gptel for you? I don't have the hardware to test it, and have received a couple of mixed reports.

@richardmurri

It does seem to be using the directive in my usage, but I'll admit I haven't delved much into what it's actually doing under the hood. I haven't been involved in the development, just a happy user.

Here is also the link to the original pull request that added OpenAI support: ggerganov/llama.cpp#4198
