R2R ollama Docker GPU Support #770
Comments
Thanks for taking the time to share that this is an available feature in ollama! We can certainly add this. Do you have any idea how ollama with GPU support compares with vLLM in throughput? Perhaps we'd be better off bundling vLLM for users with GPUs? vLLM is optimized for GPU use cases and, last I checked, it offered significant speedups.
Thanks for the quick reply, and for taking the time to address my feature request. In my FR I opted for the simplicity of adding this feature, plus ollama's wide user base for easy troubleshooting. But if you would consider putting more resources into GPU-enabling the Docker deployment, that would be amazing. vLLM should be significantly faster and more efficient for local inference; I'm just unsure how well embeddings work, since there seemed to be issues in the past. Adding vLLM as a more efficient option, if it works with the required setup, would be an ideal next step. If there are temporary hurdles to getting vLLM working, a GPU-enabled ollama implementation would be great anyway, since it doesn't seem to interfere with anything else. I could give vLLM a try if it would help the R2R community and you devs.
Please do so and let us know how it goes; it should be easy to use thanks to LiteLLM. If you'd like to help author the PR to get vLLM into the codebase, we are happy to provide support. Otherwise, it is on our todo list and we can get back to you when it is fully online =).
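For anyone who wants to experiment before a PR lands, here is a minimal sketch of reaching a locally running vLLM server (OpenAI-compatible API) through LiteLLM. The model name, port, and startup command are assumptions for illustration, not R2R configuration:

```python
# Minimal sketch: call a local vLLM server (OpenAI-compatible API) via LiteLLM.
# Assumes vLLM was started separately, e.g. with its OpenAI-compatible server
# serving meta-llama/Meta-Llama-3-8B-Instruct on port 8000 (placeholders, not R2R defaults).
from litellm import completion

response = completion(
    model="openai/meta-llama/Meta-Llama-3-8B-Instruct",  # "openai/" routes to any OpenAI-compatible endpoint
    api_base="http://localhost:8000/v1",                 # assumed local vLLM server address
    api_key="EMPTY",                                      # vLLM does not require a real key by default
    messages=[{"role": "user", "content": "Hello from a GPU-backed vLLM server!"}],
)
print(response.choices[0].message.content)
```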
Can we use R2R outside of Docker on Arch?
vLLM just recently merged embedding support here, but from what I see they basically only support LLM-based embedding generation: fine-tuned regular LLMs that take an instruction and then extract the vector embeddings from the answer. They rank very high on the Massive Text Embedding Benchmark Leaderboard, but they take a ton of resources, since inference is as intense as a regular LLM, often with a minimum of 2B+ parameters. The upside, to my understanding, is that you can run them on any backend that can do regular LLM inference.

Usually we can reach almost-SOTA embedding quality by using SentenceTransformer, and many of the most widely used embedding models have SentenceTransformer support. Even FlagEmbeddings have a SentenceTransformer implementation, though it doesn't support some of the cool features that FlagEmbeddings have: https://huggingface.co/BAAI/bge-base-en-v1.5

Sadly, afaik even ollama doesn't support regular SentenceTransformer models yet and only supports a few embedding models. While going the vLLM route is quite cool for those of us who use R2R with enterprise clients (as I mainly do), supporting ollama first is maybe the better strategy: you would go from simple to complex. But consider choosing whatever supports SentenceTransformers now or in the future, as that gives you a broad range of embedding model support.
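For reference, a minimal sketch of what SentenceTransformer-based embedding looks like with the bge model linked above (standard library usage, not R2R code; the example sentences are placeholders):

```python
# Minimal sketch of SentenceTransformer embeddings with the bge model linked above.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
sentences = ["R2R makes GraphRAG accessible.", "Ollama runs models locally."]
embeddings = model.encode(sentences, normalize_embeddings=True)  # normalized, cosine-ready vectors
print(embeddings.shape)  # (2, 768) for bge-base
```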
Thank you so much for this project and your efforts to make GraphRAG accessible for the masses!
Is your feature request related to a problem? Please describe.
Systems with an appropriate GPU (or GPUs) might prefer to run the models of the local ollama Docker deployment with GPU support, and users can already do so by editing their compose file. However, implementing this in R2R directly might be a valuable addition for non-Mac users.
Describe the solution you'd like
Enable R2R's Docker containers to run with native CUDA GPU support by passing a GPU flag.
This might be implemented with the `--gpus=all` flag, e.g. `docker run --gpus=all -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama`, or preferably within the compose file, e.g. as sketched below.
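A compose snippet along these lines, using Docker Compose's standard GPU reservation syntax (the service name, volume, and port here simply mirror the `docker run` example above and are not R2R's actual compose file):

```yaml
# Sketch only: Compose equivalent of the `docker run --gpus=all ...` example above.
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all          # or a specific number, e.g. 2
              capabilities: [gpu]

volumes:
  ollama:
```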
By setting `count: all`, you're allowing the container to access all available GPUs on your system. If you want to limit it to a specific number of GPUs, you can replace `all` with a number, like `2`, to use only two GPUs.

From my understanding, this might also be solved by creating a flag like `use-gpu-all` or `use-gpu-n` (where `n` is the number of GPUs) to pass the GPU argument to the compose file or docker commands.

Describe alternatives you've considered
Running on CPU and RAM works, but it is much slower and less efficient.
Thanks again and best wishes.