Support llama.cpp #121
Good news: it looks like llama-cpp-python is packaged by this awesome repo: https://github.com/nixified-ai/flake. I'll soon find out whether it can be run from there.
Edit: hmm, it doesn't expose that; I'll dig in more later to try out compatibility of llama-cpp-python and gptel.
I couldn't find any info on llama-cpp-python's web API except for what's in the GitHub README, but if what it says is correct, support for it in gptel should be trivial:

```emacs-lisp
(defvar gptel--llama-cpp-python
  (gptel-make-openai
   "llama-cpp-python"
   :stream t                        ;if llama-cpp-python supports streaming responses
   :protocol "ws"
   :host "localhost:8000"
   :endpoint "/api/v1/chat-stream"
   :models '("list" "of" "available" "model" "names"))
  "GPTel backend for llama-cpp-python.")

;; Make it the default
(setq-default gptel-backend gptel--llama-cpp-python
              gptel-model "name")
```

Unfortunately I can't test this -- no GPU, and I'm also on Nix so it's not easy to install.
Do you have a link to llama.cpp's (not -python) API documentation?

EDIT: Incidentally, I couldn't get it to run on NixOS at all, and couldn't get the package to build when I tried the latest version. The latest binary release from Ollama worked perfectly (including GPU support) on Arch on a different machine.
There isn't any. I found a related issue, which is where I learned about it.
I think I'm going to be able to use what you linked above after this finishes (but it's 17GB):

```
nix run github:nixified-ai/flake#packages.x86_64-linux.textgen-nvidia
```
Cool, please let me know if it works as expected -- including the streaming responses bit.
It doesn't seem to work. I noticed there are examples in the text-generation-webui repo, though, so I modified the above to use port 5005:
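(The modified snippet wasn't preserved here; a minimal sketch of it, assuming only the port changes from the earlier example:)

```emacs-lisp
;; Hypothetical reconstruction: the earlier backend with the port changed to 5005.
(defvar gptel--llama-cpp-python
  (gptel-make-openai
   "llama-cpp-python"
   :stream t
   :protocol "ws"
   :host "localhost:5005"
   :endpoint "/api/v1/chat-stream"
   :models '("list" "of" "available" "model" "names"))
  "GPTel backend for llama-cpp-python.")
```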
It still didn't work and gave a 404, though.
I edited the snippet above.

EDIT: Also, it looks like the protocol is not http, it's ws. I'm checking if Curl handles that...
More details on their endpoint support: https://github.com/oobabooga/text-generation-webui/blob/262f8ae5bb49b2fb1d9aac9af01e3e5cd98765db/extensions/openai/README.md?plain=1#L190
Ah, you are right. It didn't work. curl should support ws.
Did you try it with the edited snippet?
Ah, I just realized it's going to fail anyway because gptel expects an HTTP 200/OK message. But it will help to check if the API works as expected with the following Curl command:

```
curl --location --silent --compressed --disable -XPOST -w'(abcdefgh . %{size_header})' -m60 -D- \
  -d'{"model":"nous-hermes-llama2-13b.Q4_0.gguf","messages":[{"role":"system","content":"You are a large language model living in Emacs and a helpful assistant. Respond concisely."},{"role":"user","content":"Hello"}],"stream":true,"temperature":1.0}' \
  -H"Content-Type: application/json" "ws://localhost:5005/api/v1/chat-stream"
```

The output will help me add support for it as well.
I'm actually unable to get textgen from the nixified-ai flake working anyway, or even just llama-cpp-python. I might look at interoperating purely with llama.cpp again, since it's hard to tell which versions of llama-cpp-python will even work with llama.cpp, and I don't understand how to debug them well.
Hmm, I'm guessing I need to look into Curl's websocket support. I don't think there's a quick fix to support llama-cpp-python in gptel after all.
Local LLM support is a bit of a mess across the board right now.
This may help: https://github.com/kurnevsky/llama-cpp.el
It might also be useful to know that litellm converts tons of LLMs to an OpenAI-compatible proxy: https://docs.litellm.ai/docs/simple_proxy However, I'm concerned by something in their docs. Not sure if that's a misunderstanding on my part or I'm missing something about litellm.

Edit: Maybe I'm misunderstanding... maybe you can sort out whether this is both private and useful, or I can after a nap 😉
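(For reference, pointing gptel at a locally running litellm proxy would presumably look like any other OpenAI-compatible backend. A sketch, where the host, port, and model name are assumptions for illustration rather than anything from this thread:)

```emacs-lisp
;; Sketch: gptel backend for a local litellm proxy.
;; Host/port and the model name below are placeholders.
(defvar gptel--litellm
  (gptel-make-openai
   "litellm"
   :stream t
   :protocol "http"
   :host "localhost:8000"           ;wherever the litellm proxy is listening
   :models '("my-proxied-model"))   ;hypothetical model name
  "GPTel backend for a local litellm proxy.")
```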
@ParetoOptimalDev llama.cpp recently added support for the OpenAI API to their built-in server. It was pretty easy to get working with gptel using the following config:
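(The config itself wasn't captured above; a minimal sketch of this kind of setup, assuming llama.cpp's built-in server on its default port 8080 and a placeholder model name -- not necessarily the exact config being described:)

```emacs-lisp
;; Sketch only: llama.cpp's server exposing the OpenAI-compatible API on port 8080.
(defvar gptel--llama-cpp
  (gptel-make-openai
   "llama-cpp"
   :stream t
   :protocol "http"
   :host "localhost:8080"
   :models '("some-model.gguf"))    ;placeholder model name
  "GPTel backend for llama.cpp's OpenAI-compatible server.")
```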
@richardmurri That's fantastic! @ParetoOptimalDev Let me know if Richard's config works for you, and I can close this issue.
I just tried this and it didn't work for me.
Oh, I think I usually use...
So I got it working by running the server together with the OpenAI proxy, plus a modification of the above config to use the default 8081 port (essentially the snippet I post further down). Maybe I can convince llama.cpp to add a flake app for the OpenAI proxy.
I made a pull request to add the OpenAI proxy as a flake app. If merged, the process would be simplified to:
1. Run the server and the proxy
2. Create a backend to connect to the OpenAI proxy
@ParetoOptimalDev Thanks for pursuing this. I'm curious to know if the OpenAI-compatible API is easily accessible in the imperative, non-nix version of llama.cpp. If it is, I can add the instructions to the README. |
I think it would work to just do this in the llama.cpp repo:
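(The exact command wasn't captured here. Presumably it was something along these lines, assuming the api_like_OAI.py proxy script that llama.cpp shipped under examples/server at the time, which listens on port 8081 and forwards to the server on 8080 by default -- treat paths and flags as an illustration rather than a verified recipe:)

```sh
# Build and start the llama.cpp server (default port 8080); model path is a placeholder.
make server
./server -m models/some-model.gguf

# In another terminal, start the OpenAI-compatible proxy (default port 8081).
python examples/server/api_like_OAI.py
```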
I just created this locally and verified it works with the nix version, btw:

```emacs-lisp
(defvar gptel--llama-cpp-openai
  (gptel-make-openai
   "llama-cpp--openai"
   :stream nil                      ;set to t if streaming responses are supported
   :protocol "http"
   :host "localhost:8081"
   :models '("dolphin-2.2.1-mistral-7b.Q5_K_M.gguf"))
  "GPTel backend for llama-cpp-openai.")
```

I was actually inspired by your recent, very well put together video, @karthink! 😄
FWIW, I wasn't using the OpenAI proxy at all -- the built-in server supports the OpenAI API directly now.
Ohhh! Thank you... I need to read more carefully.
@richardmurri Thanks for the clarification! I'll add instructions for llama.cpp (with the caveat that you need a recent version) to the README.
Do you have a link to the commit or some documentation for the llama.cpp version that adds support for the OpenAI-compatible API?

EDIT: I found the official documentation, but it's a little fuzzy.
* README.org: The llama.cpp server supports OpenAI's API, so we can reuse it. Closes #121.
Does llama.cpp respect the system-message/directive when used from gptel for you? I don't have the hardware to test it, and received a couple of mixed reports.
It does seem to be using the directive in my usage, but I'll admit I haven't delved much into what it's actually doing under the hood. I haven't been involved in the development, just a happy user. Here is also a link to the original pull request that added OpenAI support: ggerganov/llama.cpp#4198
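(For anyone who wants to check this outside gptel, one rough way is to send an obviously identifiable system message straight to the server's OpenAI-compatible endpoint and see whether the reply reflects it -- a sketch, assuming the default port 8080 and a placeholder model name:)

```sh
# Sketch: check whether the system message is honored. Host, port, and model are placeholders.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "some-model.gguf",
        "messages": [
          {"role": "system", "content": "Always answer in French."},
          {"role": "user", "content": "What is the capital of Italy?"}
        ]
      }'
```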
I can't get ollama to work with GPU acceleration, so I'm using llama.cpp, which has a Nix flake that worked perfectly (once I understood "cuda" was the cuda version and not the cuda library) 😍

It looks like llama.cpp has a different API, so I can't just use `gptel-make-ollama`. Does this sound correct? Then again, I see something about llama-cpp-python having an "OpenAI-like API". The downside of this is that I'll have to package llama-cpp-python for Nix.

Maybe I can use that with gptel somehow? Just looking for a bit of guidance, but I'll tinker around when I get time and try things. If I find anything useful I'll report back here.
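(For context, the Ollama-style gptel setup being referred to looks roughly like this -- a sketch with a placeholder model name, shown only to illustrate the constructor that llama.cpp's different API wouldn't fit directly:)

```emacs-lisp
;; Sketch of a typical gptel Ollama backend; the model name is a placeholder.
;; llama.cpp's server speaks an OpenAI-compatible API instead, which is why
;; the question above asks about an alternative to this constructor.
(gptel-make-ollama
 "Ollama"
 :host "localhost:11434"            ;Ollama's default port
 :stream t
 :models '("mistral:latest"))
```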