If you want to integrate more backends into llmaz, please refer to this PR. Contributions are always welcome.
llama.cpp enables LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware, locally and in the cloud.
SGLang is a fast serving framework for large language models and vision language models.
text-generation-inference is a Rust, Python, and gRPC server for text generation inference, used in production at Hugging Face to power Hugging Chat, the Inference API, and Inference Endpoints.
ollama runs Llama 3.2, Mistral, Gemma 2, and other large language models. It is built on llama.cpp and aims at local deployment.
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.
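To illustrate how one of these backends is selected, here is a minimal sketch of deploying a model through a llmaz Playground pinned to vLLM. The API groups, the `backendRuntimeConfig` field, and the model/backend names below are assumptions drawn from llmaz's published examples and may differ across versions; check the project's API reference for the exact schema.

```yaml
# A minimal sketch, assuming the OpenModel/Playground CRD shapes from
# llmaz's examples. Field names and backend identifiers are assumptions;
# verify against the llmaz API reference before applying.
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0--5b                # hypothetical model name for illustration
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2-0.5B-Instruct
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b         # binds the Playground to the model above
  backendRuntimeConfig:
    backendName: vllm              # presumably swappable for the other backends listed above
```

The point of the design is that swapping the serving engine is a one-field change in the Playground spec rather than a rewrite of the deployment.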