Support llama.cpp #121

Closed
ParetoOptimalDev opened this issue Oct 31, 2023 · 32 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@ParetoOptimalDev

ParetoOptimalDev commented Oct 31, 2023

I can't get ollama to work with GPU acceleration, so I'm using llama.cpp, which has a Nix flake that worked perfectly (once I understood "cuda" was the CUDA version and not the CUDA library) 😍

It looks like llama.cpp has a different API, so I can't just use gptel-make-ollama. Does that sound correct?

Then again, I see something about llama-cpp-python having an "OpenAI-like API". The downside of this is that I'll have to package llama-cpp-python for Nix.

Maybe I can use that and gptel somehow? Just looking for a bit of guidance, but I'll tinker around when I get time and try things. If I find anything useful, I'll report back here.

@ParetoOptimalDev
Author

ParetoOptimalDev commented Oct 31, 2023

Good news: it looks like llama-cpp-python is packaged by this awesome repo:

https://github.com/nixified-ai/flake

and I'll soon find out if it can be run with:

nix shell github:nixified-ai/flake -c llama-cpp-python

Edit: hmm, it doesn't expose that; I'll dig in more later to try out the compatibility of llama-cpp-python and gptel.

@karthink
Owner

karthink commented Oct 31, 2023

I couldn't find any info on llama-cpp-python's web API except for what's in the GitHub README, but if what it says is correct, support for it in gptel should be trivial:

(defvar gptel--llama-cpp-python 
  (gptel-make-openai
   "llama-cpp-python"
   :stream t                   ;If llama-cpp-python supports streaming responses
   :protocol "ws"
   :host "localhost:8000"
   :endpoint "/api/v1/chat-stream"
   :models '("list" "of" "available" "model" "names"))
  "GPTel backend for llama-cpp-python.")

;; Make it the default
(setq-default gptel-backend gptel--llama-cpp-python
              gptel-model   "name")

Unfortunately I can't test this -- no GPU, and I'm also on Nix so it's not easy to install.

@karthink
Owner

karthink commented Oct 31, 2023

It looks like llama.cpp has a different API, so I can't just use gptel-make-ollama. Does that sound correct?

Do you have a link to llama.cpp's (not -python) API documentation?

EDIT:

I can't get ollama to work with GPU acceleration

Incidentally, I couldn't get it to run on NixOS at all, and couldn't get the package to build when I tried the latest version. The latest binary release from Ollama worked perfectly (including GPU support) on Arch on a different machine.

@ParetoOptimalDev
Author

There isn't any; I found this related issue:

ggerganov/llama.cpp#1742

That's where I learned about llama-cpp-python.

@ParetoOptimalDev
Author

ParetoOptimalDev commented Oct 31, 2023

I think that I'm going to be able to use what you linked above after this finishes (but it's 17GB):

nix run github:nixified-ai/flake#packages.x86_64-linux.textgen-nvidia 

@karthink
Owner

I think that I'm going to be able to use what you linked above after this finishes (but it's 16GB or more):

Cool, please let me know if it works as expected -- including the streaming responses bit.

@ParetoOptimalDev
Author

It doesn't seem to work. I noticed there are examples in the text-generation-webui repo though:

https://github.com/oobabooga/text-generation-webui/blob/main/api-examples/api-example-chat-stream.py

So I modified the above to use port 5005:

(defvar gptel--llama-cpp-python
  (gptel-make-openai
   "llama-cpp-python"
   :stream t               ;If llama-cpp-python supports streaming responses
   :protocol "http"
   :host "localhost:5005"
   :models '("nous-hermes-llama2-13b.Q4_0.gguf"))
  "GPTel backend for llama-cpp-python.")

It still didn't work and gave a 404 though.

@karthink
Owner

karthink commented Oct 31, 2023

It still didn't work and gave a 404 though.

I edited the snippet (added an :endpoint field), any luck?

EDIT: Also it looks like the protocol is not http, it's ws. I'm checking if Curl handles that...

@ParetoOptimalDev
Author

It still didn't work and gave a 404 though.

I edited the snippet (added an :endpoint field), any luck?

EDIT: Also it looks like the protocol is not http, it's ws. I'm checking if Curl handles that...

Ah, you are right. It didn't work. curl should support ws.

@karthink
Owner

karthink commented Oct 31, 2023

Did you try it with the :protocol set to "ws"?

@karthink
Owner

karthink commented Oct 31, 2023

Ah, I just realized it's going to fail anyway because gptel expects an HTTP 200/OK message. But it will help to check if the API works as expected with the following Curl command:

curl --location --silent --compressed --disable -XPOST -w(abcdefgh . %{size_header}) -m60 -D- -d'{"model":"nous-hermes-llama2-13b.Q4_0.gguf","messages":[{"role":"system","content":"You are a large language model living in Emacs and a helpful assistant. Respond concisely."},{"role":"user","content":"Hello"}],"stream":true,"temperature":1.0}' -H"Content-Type: application/json" "ws://localhost:5005/api/v1/chat-stream"

The output will help me add support for it as well.

@ParetoOptimalDev
Author

$ curl --location --silent --compressed --disable -XPOST -w(abcdefgh . %{size_header}) -m60 -D- -d'{"model":"nous-hermes-llama2-13b.Q4_0.gguf","messages":[{"role":"system","content":"You are a large language model living in Emacs and a helpful assistant. Respond concisely."},{"role":"user","content":"Hello"}],"stream":true,"temperature":1.0}' -H"Content-Type: application/json" "ws://localhost:5005/api/v1/chat-stream"
Malformed access time modifier ‘a’
$ curl --location --silent --compressed --disable -XPOST -w "(abcdefgh . %{size_header})" -m60 -D- -d'{"model":"nous-hermes-llama2-13b.Q4_0.gguf","messages":[{"role":"system","content":"You are a large language model living in Emacs and a helpful assistant. Respond concisely."},{"role":"user","content":"Hello"}],"stream":true,"temperature":1.0}' -H"Content-Type: application/json" "ws://localhost:5005/api/v1/chat-stream"
(abcdefgh . 0)

@ParetoOptimalDev
Author

I'm actually unable to get textgen from the nixified-ai flake working anyway, or even just llama-cpp-python. I might look at interoperating purely with llama.cpp again.

The reason being that it's hard to tell which versions of llama-cpp-python will even work with llama.cpp, and I don't understand how to debug them well.

@karthink
Owner

karthink commented Oct 31, 2023

Hmm, I'm guessing I need to look into Curl's websocket support. I don't think there's a quick fix to support llama-cpp-python in gptel after all.

The reason being that it's hard to tell which versions of llama-cpp-python will even work with llama.cpp, and I don't understand how to debug them well.

Local LLM support is a bit of a mess across the board right now.

@ParetoOptimalDev
Author

ParetoOptimalDev commented Nov 3, 2023

It might also be useful to know that litellm exposes tons of LLMs behind an OpenAI-compatible proxy:

https://docs.litellm.ai/docs/simple_proxy
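
If that route pans out, pointing gptel at a locally running litellm proxy should look just like the other OpenAI-compatible configs in this thread. A rough sketch, assuming the proxy speaks the OpenAI chat API over plain HTTP on localhost; the port and model name below are placeholders:

(defvar gptel--litellm
  (gptel-make-openai
   "litellm"
   :stream t                   ;if the proxy supports streaming responses
   :protocol "http"
   :host "localhost:8000"
   :models '("some-local-model"))
  "GPTel backend for a local litellm proxy.")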

However... I'm concerned by this:

This is not even touching on the privacy implications of potentially unnecessarily routing every MemGPT user's personal traffic through a startup's servers. - letta-ai/letta#86 (comment)

Not sure if that's a misunderstanding on my part or if I'm missing something about litellm.

Edit: Maybe I'm misunderstanding... idk... maybe you can sort out whether this is both private and useful, or me-after-a-nap can 😉

litellm isn't a proxy server. we let users spin up an openai-compatible server if they'd like.

It's just a python package for translating llm api calls. I agree with you, unnecessarily routing things through a proxy would be a bit weird.

@havaker

havaker commented Nov 15, 2023

I can't get ollama to work with GPU acceleration

@ParetoOptimalDev
I faced a similar issue recently, but I was able to make a flake that provides GPU-accelerated (CUDA) ollama. If you're using an x86_64-linux system, feel free to check it out: github.com:havaker/ollama-nix.
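
On the gptel side, the stock Ollama backend should then work unchanged. A rough sketch, assuming Ollama's default port 11434 and that gptel-make-ollama accepts the same :host/:stream/:models keys as the gptel-make-openai configs above (the model name is a placeholder):

(defvar gptel--ollama
  (gptel-make-ollama
   "Ollama"
   :stream t
   :host "localhost:11434"
   :models '("mistral:latest"))
  "GPTel backend for a local GPU-accelerated Ollama.")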

@richardmurri

llama.cpp recently added support for the OpenAI API to their built-in server. It was pretty easy to get working with gptel using the following config:

(defvar gptel--llama-cpp
  (gptel-make-openai
   "llama-cpp"
   :stream t
   :protocol "http"
   :host "localhost:8000"
   :models '("test"))
  "GPTel backend for llama-cpp.")

(setq-default gptel-backend gptel--llama-cpp
              gptel-model   "test")

@karthink
Owner

@richardmurri That's fantastic!

@ParetoOptimalDev Let me know if Richard's config works for you, and I can close this issue.

@karthink karthink added the enhancement and help wanted labels Dec 15, 2023
@ParetoOptimalDev
Author

llama.cpp recently added support for the OpenAI API to their built-in server. It was pretty easy to get working with gptel using the following config:

(defvar gptel--llama-cpp
  (gptel-make-openai
   "llama-cpp"
   :stream t
   :protocol "http"
   :host "localhost:8000"
   :models '("test"))
  "GPTel backend for llama-cpp.")

(setq-default gptel-backend gptel--llama-cpp
              gptel-model   "test")

I just tried this and it didn't work for me using llama-server, but perhaps that's not the one with OpenAI support referenced here:

https://github.com/ggerganov/llama.cpp/blob/708e179e8562c2604240df95a2241dea17fd808b/examples/server/README.md?plain=1#L329

@ParetoOptimalDev
Author

ParetoOptimalDev commented Dec 23, 2023

Oh, I think llama-server is specific to the Nix expression, and in the Makefile it points to:

https://github.com/ggerganov/llama.cpp/blob/708e179e8562c2604240df95a2241dea17fd808b/Makefile#L625

The issue is that I usually use nix shell github:ggerganov/llama.cpp -c llama-server, and that doesn't point to an OpenAI-compatible server.

@ParetoOptimalDev
Author

ParetoOptimalDev commented Dec 23, 2023

So I got it working with:

~/code/llama.cpp $ nix develop -c python examples/server/api_like_OAI.py
~/code/llama.cpp $ git diff
diff --git a/flake.nix b/flake.nix
index 4cf28d5..eba31cc 100644
--- a/flake.nix
+++ b/flake.nix
@@ -49,7 +49,7 @@
           ];
         };
         llama-python =
-          pkgs.python3.withPackages (ps: with ps; [ numpy sentencepiece ]);
+          pkgs.python3.withPackages (ps: with ps; [ numpy sentencepiece flask requests ]);
         # TODO(Green-Sky): find a better way to opt-into the heavy ml python runtime
         llama-python-extra =
           pkgs.python3.withPackages (ps: with ps; [ numpy sentencepiece torchWithoutCuda transformers ]);
~/code/llama.cpp $ python examples/server/api_like_OAI.py
 * Serving Flask app 'api_like_OAI'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on http://127.0.0.1:8081
Press CTRL+C to quit

And a modification of the above config to use the default port 8081, like below:

(defvar gptel--llama-cpp
  (gptel-make-openai
   "llama-cpp"
   :stream t
   :protocol "http"
   :host "localhost:8081"
   :models '("test"))
  "GPTel backend for llama-cpp.")

(setq-default gptel-backend gptel--llama-cpp
              gptel-model   "test")

Maybe I can convince llama.cpp to add a flake app for the OpenAI proxy?

@ParetoOptimalDev
Author

I made a pull request to add the OpenAI proxy as a flake app:

ggerganov/llama.cpp#4612

If merged, the process would be simplified to:

Run the server and the proxy:

nix run github:ggerganov/llama.cpp#llama-server
nix run github:ggerganov/llama.cpp#llama-server-openai-proxy

Create a backend to connect to the OpenAI proxy:

(defvar gptel--llama-cpp
  (gptel-make-openai
   "llama-cpp"
   :stream t
   :protocol "http"
   :host "localhost:8081"
   :models '("test"))
  "GPTel backend for llama-cpp.")

(setq-default gptel-backend gptel--llama-cpp
              gptel-model   "test")

@karthink
Owner

@ParetoOptimalDev Thanks for pursuing this. I'm curious to know if the OpenAI-compatible API is easily accessible in the imperative, non-nix version of llama.cpp. If it is, I can add the instructions to the README.

@ParetoOptimalDev
Author

ParetoOptimalDev commented Dec 28, 2023

I'm curious to know if the OpenAI-compatible API is easily accessible in the imperative, non-nix version of llama.cpp. If it is, I can add the instructions to the README.

I think it would work to just do this in the llama.cpp repo:

  • create a Python venv
  • install the flask and requests requirements
  • run python examples/server/api_like_OAI.py

I just created this locally and verified it works with the Nix version, btw:

(defvar gptel--llama-cpp-openai
  (gptel-make-openai
   "llama-cpp-openai"
   :stream nil
   :protocol "http"
   :host "localhost:8081"
   :models '("dolphin-2.2.1-mistral-7b.Q5_K_M.gguf"))
  "GPTel backend for llama-cpp-openai.")

I was actually inspired by your recent, very well put together video, @karthink! 😄

@richardmurri

FWIW, I wasn't using api_like_OAI.py when I said it was working in llama.cpp. I was using the default server binary, created when running make in the base directory. Just specify a port on the command line, something like ./server -m models/mistral-7b-instruct-v0.2.Q5_K_S.gguf --port 8000 -c 4096, and you should be good to go. Make sure your checked-out version is fairly recent.

@ParetoOptimalDev
Author

FWIW, I wasn't using api_like_OAI.py when I said it was working in llama.cpp. I was using the default server binary, created when running make in the base directory. Just specify a port on the command line, something like ./server -m models/mistral-7b-instruct-v0.2.Q5_K_S.gguf --port 8000 -c 4096, and you should be good to go. Make sure your checked-out version is fairly recent.

Ohhh! Thank you... I need to read more carefully:

llama.cpp recently added support for the OpenAI API to their built-in server.

@karthink
Owner

@richardmurri Thanks for the clarification! I'll add instructions for llama.cpp (with the caveat that you need a recent version) to the README.

@karthink
Owner

karthink commented Dec 29, 2023

Do you have a link to the commit or some documentation for the Llama.cpp version that adds support for the OpenAI-compatible API?

EDIT: I found the official documentation but it's a little fuzzy.

karthink added a commit that referenced this issue Dec 29, 2023
* README.org: The llama.cpp server supports OpenAI's API, so we
can reuse it.  Closes #121.
@karthink
Owner

Does llama.cpp respect the system message/directive when used from gptel for you? I don't have the hardware to test it, and have received a couple of mixed reports.

@richardmurri

It does seem to be using the directive in my usage, but I'll admit I haven't delved much into what it's actually doing under the hood. I haven't been involved in the development, just a happy user.

Here is also the link to the original pull request that added OpenAI support: ggerganov/llama.cpp#4198
