[BOUNTY - $500] Llama.cpp inference engine #167
Comments
Thanks @AlexCheema. Many others and I likely have only Windows systems, for which llama.cpp is practically the only option. MLX is macOS-only, and tinygrad:
For this opening statement to be true, it would need to include Windows-based systems, especially old gaming rigs.
At the very least, a thorough guide on setting up tinygrad via WSL/WSL2 would be appreciated, because this is your only documentation:
|
I'd like to look into this. Adjacently, llamafiles might be worth looking into, as they are binaries that run on multiple desktop OSes without any configuration, though I'm not sure about Android or iOS support. |
Go for it! |
@bayedieng I'd recommend looking at https://github.com/abetlen/llama-cpp-python -- it should hopefully be low level enough to do what we need to do. Also, I'd recommend looking at #139 for a minimal implementation of an inference engine that doesn't require explicitly defining every model -- it's a general solution. |
Thanks for the suggestion. Yeah, I had already seen the Python bindings and went ahead and began a draft PR. |
I wonder whether WebGPU could be plugged in on top of llama.cpp via the https://github.com/AnswerDotAI/gpu.cpp wrapper? |
It would seem that the llama.cpp API is too high-level to perform sharded inference, as it doesn't provide access to individual layers. To do so, one would have to use the GGML bindings directly to create a suitable inference engine compatible with Exo. It was stated that it would be nice to have an API similar to the PyTorch engine's, where a base model is defined and reused everywhere. @AlexCheema, please let me know if you'd like to go forward and build the inference engine with the GGML bindings. |
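To make the per-layer requirement concrete, here is a minimal sketch of what sharding by layer means: each node is assigned a contiguous range of layers, which is why an engine has to expose individual layers rather than only end-to-end inference. The names `Shard` and `partition_layers` are hypothetical and the even split is an assumption for illustration; this is not Exo's actual partitioning code.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Shard:
    """Hypothetical description of the layers one node is responsible for."""
    node_id: str
    start_layer: int  # inclusive
    end_layer: int    # inclusive


def partition_layers(n_layers: int, node_ids: Sequence[str]) -> List[Shard]:
    """Split n_layers into contiguous, near-equal ranges, one per node.

    Assumption: an even split. A real scheduler might weight by memory instead.
    """
    if not node_ids:
        raise ValueError("at least one node is required")
    if n_layers < len(node_ids):
        raise ValueError("fewer layers than nodes")

    base, extra = divmod(n_layers, len(node_ids))
    shards = []
    start = 0
    for i, node_id in enumerate(node_ids):
        # The first `extra` nodes take one additional layer each.
        count = base + (1 if i < extra else 0)
        shards.append(Shard(node_id, start, start + count - 1))
        start += count
    return shards


if __name__ == "__main__":
    for shard in partition_layers(32, ["node-a", "node-b", "node-c"]):
        print(shard)
```

An engine that can only run the whole model cannot execute just `start_layer..end_layer`, which is the limitation described above.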
Heads up, LocalAI now has distributed inference: |
I'm not familiar with them, but they likely just use the underlying GGML API as well, considering llama.cpp does inference end-to-end. llama.cpp also has a distributed inference example using GGML: https://github.com/ggerganov/llama.cpp/tree/master/examples/rpc |
I've been caught up in a whole bunch of work recently and have had little time to work on the llama.cpp support. I wouldn't want to hold the PR hostage, so if someone's capable of completing it, go ahead! |
agree
Thanks for the info. I looked through the RPC example and found out that it uses the GGML RPC backend. With other backends, it is scheduled by GGML, which requires all nodes to be GGML nodes. However, in Exo, each node is independent, and some nodes may not use the GGML backend. So, I think we can just use the GGML backend API and then wrap it up for Exo. |
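As a rough illustration of the "wrap it up for Exo" idea, the sketch below shows independent nodes that share only a small interface, so one node could be backed by GGML while another uses a different engine. Everything here is hypothetical (`ShardEngine`, `ToyEngine`, `run_pipeline`, and the toy layers); no real GGML or Exo calls are used, and activations are plain lists of floats.

```python
from typing import Callable, List, Protocol, Sequence

# A "layer" here is just a function on a list of floats (toy stand-in for a tensor op).
Layer = Callable[[List[float]], List[float]]


class ShardEngine(Protocol):
    """Hypothetical interface each node's engine would implement."""

    def forward(self, activations: List[float]) -> List[float]: ...


class ToyEngine:
    """Toy engine that runs its own slice of layers; the backend name is only a label."""

    def __init__(self, backend: str, layers: Sequence[Layer]) -> None:
        self.backend = backend
        self.layers = list(layers)

    def forward(self, activations: List[float]) -> List[float]:
        for layer in self.layers:
            activations = layer(activations)
        return activations


def run_pipeline(engines: Sequence[ShardEngine], activations: List[float]) -> List[float]:
    """Pass activations through each node in order, regardless of its backend."""
    for engine in engines:
        activations = engine.forward(activations)
    return activations


def add_one(xs: List[float]) -> List[float]:
    return [x + 1.0 for x in xs]


def double(xs: List[float]) -> List[float]:
    return [x * 2.0 for x in xs]


if __name__ == "__main__":
    nodes = [ToyEngine("ggml-like", [add_one, add_one]), ToyEngine("other", [double])]
    print(run_pipeline(nodes, [1.0, 2.0]))
```

The point of the design is that the pipeline only depends on `forward`, so a GGML-backed implementation would be one engine among several rather than a requirement for every node.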
Related to exo-explore#167