[BOUNTY - $500] Llama.cpp inference engine #167

Open
AlexCheema opened this issue Aug 22, 2024 · 11 comments

Comments

@AlexCheema
Contributor

  • It should automatically detect the best device to run on.
  • It should require zero manual configuration from the user. By default, llama.cpp, for example, requires specifying the device.
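
A minimal sketch of what "automatically detect the best device" could look like, assuming that checking the host platform and the presence of vendor tools on `PATH` is good enough as a first pass. The helper name `detect_backend` and the returned labels are hypothetical; they are not llama.cpp flags or part of exo.

```python
import platform
import shutil


def detect_backend(system=None, machine=None, which=shutil.which):
    """Guess a compute backend for this host (hypothetical helper).

    The returned strings are illustrative labels only. `system`, `machine`
    and `which` can be overridden so the logic is testable without hardware.
    """
    system = system or platform.system()
    machine = machine or platform.machine()

    # Apple Silicon Macs: Metal is the natural GPU backend.
    if system == "Darwin" and machine == "arm64":
        return "metal"
    # An NVIDIA driver install normally ships the nvidia-smi tool.
    if which("nvidia-smi"):
        return "cuda"
    # AMD's ROCm stack provides the rocminfo tool.
    if which("rocminfo"):
        return "rocm"
    # Nothing recognised: fall back to the CPU.
    return "cpu"
```

Calling `detect_backend()` with no arguments inspects the current machine, so the user never has to name a device. A real engine would likely also need to confirm that the chosen backend was compiled in.
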
@danny-avila

Thanks @AlexCheema

Many others and I likely only have Windows systems, and llama.cpp is practically the only option.

MLX is macOS-only, and tinygrad states:

Windows support has been dropped to focus on Linux and Mac OS.
Some functionality may work on Windows but no support will be provided, use WSL instead.
source: https://github.com/tinygrad/tinygrad/releases/tag/v0.7.0

For this opening statement to be true, it would need to include Windows-based systems, especially old gaming rigs.

Forget expensive NVIDIA GPUs, unify your existing devices into one powerful GPU: iPhone, iPad, Android, Mac, Linux, pretty much any device!

At the very least, a thorough guide on setting up tinygrad via WSL/WSL2 would be appreciated, because this is your only documentation:

Example Usage on Multiple MacOS Devices

@bayedieng
Contributor

I'd like to look into this. Adjacently, llamafiles might be worth looking into, as they are binaries able to run on multiple desktop OSes without any configuration, though I'm not sure about Android or iOS support.

@AlexCheema
Contributor Author

I'd like to look into this. Adjacently, llamafiles might be worth looking into, as they are binaries able to run on multiple desktop OSes without any configuration, though I'm not sure about Android or iOS support.

Go for it!

@AlexCheema
Contributor Author

@bayedieng I'd recommend looking at https://github.com/abetlen/llama-cpp-python -- it should hopefully be low level enough to do what we need to do. Also, I'd recommend looking at #139 for a minimal implementation of an inference engine that doesn't require explicitly defining every model -- it's a general solution.

@bayedieng
Contributor

Thanks for the suggestion. Yeah I had already seen the python bindings and went ahead and began a draft PR.

@thegodone

I wonder if WebGPU can be plugged in on top of llama.cpp via this wrapper: https://github.com/AnswerDotAI/gpu.cpp

@bayedieng
Contributor

It would seem that the llama.cpp API is too high-level to perform sharded inference, as it doesn't provide access to individual layers. To do so, one would have to use the GGML bindings directly to create a suitable inference engine compatible with Exo.
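
To make the layer-access requirement concrete, here is a rough sketch of how a model's layers might be split into contiguous ranges across nodes in proportion to each node's memory. The names `LayerRange` and `partition_layers` are hypothetical and the proportional-split policy is an assumption for illustration; this is not the exo or GGML API.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class LayerRange:
    """Contiguous block of layers assigned to one node (both ends inclusive)."""
    start_layer: int
    end_layer: int


def partition_layers(n_layers: int, node_memory: Sequence[float]) -> List[Optional[LayerRange]]:
    """Split n_layers across nodes in proportion to node_memory.

    Returns one entry per node, in order. A node whose share rounds down to
    zero layers gets None.
    """
    total = sum(node_memory)
    if n_layers <= 0 or total <= 0:
        raise ValueError("need at least one layer and some memory")

    ranges: List[Optional[LayerRange]] = []
    start = 0
    cumulative = 0.0
    for i, mem in enumerate(node_memory):
        cumulative += mem
        # The last node takes whatever remains so every layer is covered.
        is_last = i == len(node_memory) - 1
        end = n_layers if is_last else round(n_layers * cumulative / total)
        if end > start:
            ranges.append(LayerRange(start, end - 1))
            start = end
        else:
            ranges.append(None)
    return ranges
```

Each node would then need to evaluate only its own `LayerRange` and hand the resulting activations to the next node, which is exactly the step an end-to-end API does not expose.
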

It was stated that it would be nice to have an API similar to the PyTorch engine, where a base model is reused everywhere, as provided by AutoModelForCausalLM. However, GGML does not provide a similar API for inference, so each model would likely have to be implemented individually, the same way tinygrad and MLX do it.

@AlexCheema please let me know if you'd like to go forward and build the inference engine with ggml bindings.

@danny-avila

Heads up, LocalAI now has distributed inference:

https://localai.io/features/distribute/

@bayedieng
Contributor

Heads up, LocalAI now has distributed inference:

https://localai.io/features/distribute/

I'm not familiar with them, but they likely just use the underlying GGML API as well, considering llama.cpp does inference end-to-end. llama.cpp also has a distributed inference example using GGML:

https://github.com/ggerganov/llama.cpp/tree/master/examples/rpc

@bayedieng bayedieng mentioned this issue Oct 12, 2024
3 tasks
@bayedieng
Contributor

I've been caught up in a whole bunch of work recently and have had little time to work on the Llama support. I wouldn't want to hold the PR hostage, so if someone's capable of completing it, go ahead!

@bayedieng bayedieng removed their assignment Oct 25, 2024
@SureD

SureD commented Dec 8, 2024

agree

Heads up, LocalAI now has distributed inference:
https://localai.io/features/distribute/

I'm not familiar with them, but they likely just use the underlying GGML API as well, considering llama.cpp does inference end-to-end. llama.cpp also has a distributed inference example using GGML:

https://github.com/ggerganov/llama.cpp/tree/master/examples/rpc

Thanks for the info. I looked through the RPC example and found out that it uses the GGML RPC backend. With other backends, it is scheduled by GGML, which requires all nodes to be GGML nodes.

However, in Exo, each node is independent, and some nodes may not use the GGML backend. So, I think we can just use the GGML backend API and then wrap it up for Exo.
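
As a rough illustration of that wrapping idea, the sketch below puts a GGML-backed node and a non-GGML node behind one shared interface, so a pipeline can chain them without caring which backend each uses. Every name here (`ShardEngine`, `GgmlShardEngine`, `OtherShardEngine`, `run_pipeline`) is hypothetical, and the "layers" are plain Python callables standing in for real GGML graph evaluation.

```python
from abc import ABC, abstractmethod
from typing import Callable, List, Sequence

Activations = List[float]


class ShardEngine(ABC):
    """Hypothetical per-node interface: run this node's layers on incoming activations."""

    @abstractmethod
    def run_layers(self, activations: Activations) -> Activations:
        ...


class GgmlShardEngine(ShardEngine):
    """Placeholder for a GGML-backed node; the callables stand in for GGML compute."""

    def __init__(self, layers: Sequence[Callable[[Activations], Activations]]):
        self.layers = list(layers)

    def run_layers(self, activations: Activations) -> Activations:
        for layer in self.layers:
            activations = layer(activations)
        return activations


class OtherShardEngine(ShardEngine):
    """Stand-in for a node on a different backend (for example MLX or tinygrad)."""

    def __init__(self, scale: float):
        self.scale = scale

    def run_layers(self, activations: Activations) -> Activations:
        return [x * self.scale for x in activations]


def run_pipeline(engines: Sequence[ShardEngine], activations: Activations) -> Activations:
    """Pass activations through each node in order, regardless of its backend."""
    for engine in engines:
        activations = engine.run_layers(activations)
    return activations
```

The point of the design is that only the activations cross node boundaries, so GGML never has to schedule or even know about the other nodes.
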

vs4vijay added a commit to vs4vijay/exo that referenced this issue Dec 21, 2024