ClipServe: AI-powered image and text analysis, plus zero-shot classification, all at lightning speed!
ClipServe is a blazing-fast inference server built on top of the powerful OpenAI CLIP model. It provides easy-to-use API endpoints for embedding texts, embedding images, and performing zero-shot classification. With ClipServe, you can seamlessly integrate CLIP's capabilities into your applications with minimal overhead.
- Text Embedding: Extract powerful embeddings for your texts using the CLIP model.
- Image Embedding: Convert your images into feature-rich embeddings in a snap.
- Zero-Shot Classification: Perform zero-shot classification on multiple images and labels without any additional fine-tuning.
- Powered by CUDA (or not): Experience lightning-fast inference powered by CLIP with CUDA for GPU acceleration, or run seamlessly on CPU-only for broader compatibility.
- API-Driven: Leverage the flexibility of a REST API built with FastAPI for scalable and robust integrations.
- Redis Queue: Efficient task management and concurrency with Redis for high-throughput systems.
- FastAPI: Fast and intuitive Python web framework.
- Redis: Asynchronous task queue for managing inference requests.
- CLIP: Multimodal vision-language model from OpenAI, utilized through the Hugging Face Transformers library for seamless integration.
- CPU or GPU: Supports inference on either GPU for accelerated performance or CPU for broader accessibility.
- Docker: Install Docker with the Docker Compose plugin (Overview of Installing Docker).
- GPU Requirements: For GPU-enabled Docker Compose, you need an NVIDIA GPU with updated drivers and the NVIDIA Container Toolkit (Installing NVIDIA Container Toolkit).
Warning
Make sure you follow this installation order for proper setup and compatibility. For more information, refer to the official installation guides of each dependency.
- Clone the repository:
  git clone https://github.com/Armaggheddon/ClipServe
  cd ClipServe
- Build the containers:
  - CPU-only version:
    docker compose -f cpu-docker-compose.yml build
  - GPU-enabled version:
    docker compose -f gpu-docker-compose.yml build
- Start the containers:
  - CPU-only version:
    docker compose -f cpu-docker-compose.yml up
  - GPU-enabled version:
    docker compose -f gpu-docker-compose.yml up
Tip
Add the -d option to the start command to run the containers in detached mode (replace cpu with gpu for the GPU-enabled version):
docker compose -f cpu-docker-compose.yml up -d
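If you script the build and start steps, the choice between the two compose files can be captured in a small helper. The following is only a sketch: the function name compose_command is hypothetical, and the only things taken from this README are the two file names, the build and up actions, and the -d flag.

```python
import shlex


def compose_command(variant: str, action: str = "up", detached: bool = False) -> list[str]:
    """Build the docker compose argument list for the CPU or GPU compose file.

    `compose_command` is a hypothetical helper, not part of ClipServe.
    """
    if variant not in ("cpu", "gpu"):
        raise ValueError(f"variant must be 'cpu' or 'gpu', got {variant!r}")
    if action not in ("build", "up"):
        raise ValueError(f"action must be 'build' or 'up', got {action!r}")
    if detached and action != "up":
        raise ValueError("detached mode (-d) only applies to 'up'")

    cmd = ["docker", "compose", "-f", f"{variant}-docker-compose.yml", action]
    if detached:
        cmd.append("-d")
    return cmd


if __name__ == "__main__":
    # Prints: docker compose -f gpu-docker-compose.yml up -d
    print(shlex.join(compose_command("gpu", "up", detached=True)))
```

The returned list can be passed directly to subprocess.run if you want to launch the containers from Python.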
ClipServe offers a variety of customization options through two environment configuration files: container_configs.env and .env.
The container_configs.env file lets you configure key aspects of the application, including API documentation visibility and the CLIP model to use for inference.
- SHOW_API_DOCS: Set to true or false to show or hide the OpenAPI documentation for the API.
- CLIP_MODEL_NAME: Choose which CLIP model to use for inference. Available models:
  - openai/clip-vit-base-patch32
  - openai/clip-vit-large-patch14
  - openai/clip-vit-base-patch16
  - openai/clip-vit-large-patch14-336
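Before rebuilding the containers it can help to sanity-check these two values. The sketch below is not part of ClipServe; validate_container_config is a hypothetical helper, and it assumes the values are written in lowercase exactly as listed above (the server itself may be more lenient).

```python
# Model names copied from the list of available models above.
ALLOWED_MODELS = {
    "openai/clip-vit-base-patch32",
    "openai/clip-vit-large-patch14",
    "openai/clip-vit-base-patch16",
    "openai/clip-vit-large-patch14-336",
}


def validate_container_config(config: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the config looks fine."""
    problems = []

    show_docs = config.get("SHOW_API_DOCS")
    if show_docs not in ("true", "false"):
        problems.append(f"SHOW_API_DOCS must be 'true' or 'false', got {show_docs!r}")

    model = config.get("CLIP_MODEL_NAME")
    if model not in ALLOWED_MODELS:
        problems.append(f"CLIP_MODEL_NAME {model!r} is not one of the listed models")

    return problems


if __name__ == "__main__":
    example = {"SHOW_API_DOCS": "true", "CLIP_MODEL_NAME": "openai/clip-vit-base-patch16"}
    print(validate_container_config(example))  # prints an empty list for a valid config
```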
The .env file configures the exposed ports for both the API and the web UI.
- WEB_API_EXPOSED_PORT: Set the port for accessing the API.
- WEB_UI_EXPOSED_PORT: Set the port for accessing the web UI.
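A client script can read the same two variables to work out where the services are reachable. This is a minimal sketch: parse_env and service_urls are hypothetical helpers, the parser only handles plain KEY=VALUE lines, and the sample ports 8000 and 8080 are taken from the localhost addresses mentioned later in this README.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def service_urls(env: dict[str, str], host: str = "localhost") -> dict[str, str]:
    """Build base URLs for the API and the web UI from the exposed ports."""
    return {
        "api": f"http://{host}:{env['WEB_API_EXPOSED_PORT']}",
        "web_ui": f"http://{host}:{env['WEB_UI_EXPOSED_PORT']}",
    }


if __name__ == "__main__":
    # Sample content; in practice you would read the .env file from disk.
    sample = "# exposed ports\nWEB_API_EXPOSED_PORT=8000\nWEB_UI_EXPOSED_PORT=8080\n"
    print(service_urls(parse_env(sample)))
    # {'api': 'http://localhost:8000', 'web_ui': 'http://localhost:8080'}
```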
If you don't need the Gradio-powered web UI, you can easily disable it by commenting out or removing the corresponding service in the cpu-docker-compose.yml or gpu-docker-compose.yml file:
services:
  # web_ui:
  #   build:
  #     context: ./web_ui
  #     dockerfile: Dockerfile_webui
  #   ports:
  #     - "${WEB_UI_EXPOSED_PORT}:7860"
  #   depends_on:
  #     - api
  api:
    build:
      ...
These configurations make ClipServe flexible and adaptable to different use cases. Customize it to fit your needs!
Embed one or multiple pieces of text
- Method: POST
- Request:
  { "text": "text to embed" }
  OR
  { "text": [ "text to embed1", ... ] }
- Response:
  {
    "model_name": "openai/clip-vit-base-patch32",
    "text_embeddings": [
      { "text": "text to embed", "embedding": [ 0.10656972229480743, ... ] },
      ...
    ]
  }
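A common next step with text embeddings is comparing them. The sketch below works on a response shaped like the one above and computes the cosine similarity between two texts. The helper names are hypothetical, and the three-dimensional vectors are made-up stand-ins for real CLIP embeddings, which are much longer.

```python
import math


def embeddings_by_text(response: dict) -> dict[str, list[float]]:
    """Index the returned embeddings by their input text."""
    return {item["text"]: item["embedding"] for item in response["text_embeddings"]}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Mock response with toy vectors; a real response comes from the API.
    mock_response = {
        "model_name": "openai/clip-vit-base-patch32",
        "text_embeddings": [
            {"text": "A photo of a cat", "embedding": [1.0, 0.0, 0.0]},
            {"text": "A photo of a kitten", "embedding": [0.8, 0.6, 0.0]},
        ],
    }
    vectors = embeddings_by_text(mock_response)
    score = cosine_similarity(vectors["A photo of a cat"], vectors["A photo of a kitten"])
    print(f"similarity: {score:.3f}")
```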
Embed one or multiple images. Images are sent as base64-encoded strings prefixed with the data URI metadata, e.g. data:image/jpeg;base64,<base64 encoded image>.
- Method: POST
- Request:
  { "image_b64": "data:image/jpeg;base64,<base64 encoded image>" }
  OR
  { "image_b64": [ "data:image/jpeg;base64,<base64 encoded image>", ... ] }
- Response:
  {
    "model_name": "openai/clip-vit-base-patch32",
    "image_embeddings": [
      { "image_id": "uuid_for_images_in_request", "embedding": [ -0.20458175241947174, ... ] },
      ...
    ]
  }
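Producing the data URI string is the only fiddly part of this request. The following sketch shows one way to build it with the standard library; to_data_uri, file_to_data_uri, and build_image_request are hypothetical helper names, and only the image_b64 key and the data:image/jpeg;base64, prefix come from the request format above.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_uri(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def file_to_data_uri(path: str) -> str:
    """Read an image file and encode it, guessing the MIME type from the extension."""
    file_path = Path(path)
    mime_type, _ = mimetypes.guess_type(file_path.name)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {file_path.name!r}")
    return to_data_uri(file_path.read_bytes(), mime_type)


def build_image_request(data_uris: list[str]) -> dict:
    """Wrap one or more data URIs in the request body described above."""
    return {"image_b64": data_uris}


if __name__ == "__main__":
    # The first three bytes of a JPEG file, used here as stand-in image data.
    payload = build_image_request([to_data_uri(b"\xff\xd8\xff")])
    print(payload)
```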
Perform zero-shot classification on images given a list of text labels.
- Method: POST
- Request:
  {
    "labels": [ "label1", ... ],
    "images_b64": [ "data:image/jpeg;base64,<base64 encoded image>", ... ]
  }
- Response:
  {
    "model_name": "openai/clip-vit-base-patch32",
    "text_embeddings": [
      { "text": "label1", "embedding": [ -0.21665547788143158, ... ] },
      ...
    ],
    "image_embeddings": [
      { "image_id": "uuid1", "embedding": [ 0.48072099685668945, ... ] },
      ...
    ],
    "classification_result": {
      "labels": [ "label1", ... ],
      "softmax_outputs": [
        { "image_id": "uuid1", "softmax_scores": [ 0.876521455, ... ] },
        ...
      ]
    }
  }
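To turn this response into a prediction, you would typically pick the highest-scoring label for each image. The sketch below does that on a mock response. It assumes each softmax_scores list is in the same order as classification_result.labels, which the response shape suggests but this README does not state explicitly; top_labels is a hypothetical helper name.

```python
def top_labels(response: dict) -> dict[str, tuple[str, float]]:
    """Map each image_id to its best (label, score) pair.

    Assumes softmax_scores[i] corresponds to labels[i].
    """
    result = response["classification_result"]
    labels = result["labels"]
    best = {}
    for entry in result["softmax_outputs"]:
        scores = entry["softmax_scores"]
        if len(scores) != len(labels):
            raise ValueError("each image needs exactly one score per label")
        # Index of the largest score for this image.
        best_index = max(range(len(scores)), key=scores.__getitem__)
        best[entry["image_id"]] = (labels[best_index], scores[best_index])
    return best


if __name__ == "__main__":
    # Mock response containing only the fields this helper reads.
    mock_response = {
        "classification_result": {
            "labels": ["a cat", "a dog"],
            "softmax_outputs": [
                {"image_id": "uuid1", "softmax_scores": [0.9, 0.1]},
                {"image_id": "uuid2", "softmax_scores": [0.2, 0.8]},
            ],
        }
    }
    for image_id, (label, score) in top_labels(mock_response).items():
        print(f"{image_id}: {label} ({score:.2f})")
```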
Here's a glimpse of ClipServe in action:
Easily explore and test the API with the built-in OpenAPI documentation served at localhost:8000/docs.
Interact with the model directly via the Gradio-powered web UI for an intuitive experience, served at localhost:8080.
To get started with ClipServe, we've included some example code in the client_example folder. This will help you quickly interact with the API endpoints for embedding text, embedding images, and performing zero-shot classification.
- Make sure ClipServe is up and running using Docker Compose.
- Navigate to the client_example folder and execute the provided scripts.
Here's an example of how to use the text embedding API:
import requests

# URL of the ClipServe API
api_url = "http://localhost:<WEB_API_EXPOSED_PORT>/embed-text"

# Sample text data
data = {
    "text": [
        "A photo of a cat",
        "A beautiful landscape with mountains"
    ]
}

# Make a POST request to the API
response = requests.post(api_url, json=data)

# Display the results
if response.status_code == 200:
    print(response.json())
else:
    print(f"Error: {response.status_code}")
For a more detailed example, check out the client_example.py file, which contains code for text embedding, image embedding, and zero-shot classification.
The clip_serve_models.py file includes all the models needed to work with the API. They help you format requests and handle responses more effectively.
ClipServe is written in Python and uses a few key libraries to enable fast, scalable, and efficient multimodal inference.
- Transformers (by Hugging Face): Used for the CLIP model, enabling text and image embedding, as well as zero-shot classification.
- Redis: Acts as a message broker for handling asynchronous task queues between the API and inference backend.
- FastAPI: Provides the API framework, offering fast, async request handling and automatic OpenAPI documentation.
We'd love to see your contributions! Found a bug? Have a feature idea? Open an issue or submit a pull request. Let's build something awesome together!
This project is licensed under the MIT License, so feel free to use it, modify it, and share it.