---
sidebar_position: 3
---

# Server-side RAG with LlamaEdge

RAG is an important technique for injecting new and updated knowledge into an LLM application. It improves accuracy and reduces hallucination, especially in the specific domains covered by the RAG knowledge base. In the past, most RAG setups were very complex, requiring the orchestration of multiple services and components in heavyweight Python frameworks.

LlamaEdge, on the other hand, provides a Rust-based development platform that enables developers to build the RAG logic directly into the chat API server! The compact and efficient single-binary API server is cross-platform and runs on any GPU or AI accelerator device, which allows it to be easily deployed across the cloud and the edge.

> Since the LlamaEdge API server encapsulates the RAG logic behind the OpenAI-compatible web service API, you can use any OpenAI-compatible frontend to interact with it. When you ask a question through the API, the answer is already RAG-enhanced. No client-side RAG (i.e., indexing and searching the RAG vector store in the client) is needed.

The LlamaEdge API server is a powerful demo of the LlamaEdge development platform. It showcases how to use the LlamaEdge SDK to build a generic RAG application. You can, of course, implement your own RAG logic using the LlamaEdge SDK and build your own RAG API server! In this tutorial, we will explain how to use the default RAG features built into our standard API server.

## Prerequisites

Install the [WasmEdge Runtime](https://github.com/WasmEdge/WasmEdge), our cross-platform LLM runtime.

```
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasmedge_rustls wasi_nn-ggml
```
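
To confirm that the runtime is installed, source the environment file created by the installer and print the version (a quick sanity check; the exact version depends on what the installer fetched):

```
source $HOME/.wasmedge/env
wasmedge --version
```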

Download the pre-built binary for the LlamaEdge API server.

```
curl -LO https://github.com/second-state/LlamaEdge/releases/latest/download/llama-api-server.wasm
```

And the chatbot web UI for the API server.

```
curl -LO https://github.com/second-state/chatbot-ui/releases/latest/download/chatbot-ui.tar.gz
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz
```

Download a chat model and an embedding model.

```
# The chat model is Llama2 7b chat
curl -LO https://huggingface.co/second-state/Llama-2-7B-Chat-GGUF/resolve/main/Llama-2-7b-chat-hf-Q5_K_M.gguf

# The embedding model is all-MiniLM-L6-v2
curl -LO https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-ggml-model-f16.gguf
```
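
These are sizable downloads. A quick listing confirms that both model files are in the working directory before you move on:

```
ls -lh Llama-2-7b-chat-hf-Q5_K_M.gguf all-MiniLM-L6-v2-ggml-model-f16.gguf
```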

The embedding model is a special kind of LLM that turns sentences into vectors. The vectors can then be stored in a vector database and searched later. When the sentences are from a body of text that represents a knowledge domain, that vector database becomes our RAG knowledge base.
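
Once the API server is running (see the sections below), you can see what this looks like by calling the OpenAI-compatible embeddings endpoint yourself. The following is a sketch that assumes the server exposes `/v1/embeddings` and that the embedding model is preloaded under the alias `embedding`, as in the startup command later in this guide:

```
# Sketch: returns a JSON object containing one embedding vector per input sentence
# (assumes the /v1/embeddings endpoint and the `embedding` model alias)
curl -X POST http://localhost:8080/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model": "embedding", "input": ["Paris is the capital city of France."]}'
```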

## Start a vector database

By default, we use Qdrant as the vector database. You can start a Qdrant instance on your server using Docker. The following command starts it in the background.

```
mkdir qdrant_storage

nohup docker run -d -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
    qdrant/qdrant
```
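
Before moving on, you can verify that Qdrant is up by listing its collections. A fresh instance returns an empty list:

```
curl 'http://localhost:6333/collections'
```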

## Create knowledge embeddings

The LlamaEdge API server provides an API endpoint `/create_rag` that takes a text file, segments it into small chunks, turns the chunks into embeddings (i.e., vectors), and then stores the embeddings into the Qdrant database. Note that for many use cases, you will need to create your own knowledge embeddings. You can jump ahead to the [Use your own embedding algos](#use-your-own-embedding-algos) section to learn how to do that!

Let's start the LlamaEdge API server with Qdrant first. It listens on port 8080 by default.

```
wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:Llama-2-7b-chat-hf-Q5_K_M.gguf \
  --nn-preload embedding:GGML:AUTO:all-MiniLM-L6-v2-ggml-model-f16.gguf \
  llama-api-server.wasm -p llama-2-chat \
  --qdrant-url http://127.0.0.1:6333 \
  --log-prompts --log-stat
```

The CLI arguments for WasmEdge are self-explanatory.

* The `--nn-preload` option loads the two models we just downloaded. The chat model is named `default` and the embedding model is named `embedding`.
* The `--qdrant-url` is the API URL to the Qdrant server we plan to use.
* The other `--qdrant-*` arguments will be explained later when we discuss chatting with the RAG server.

Next, you can submit a link to a text document for it to turn into embeddings. The document here is a travel guide for Paris, France.

```
curl -LO https://huggingface.co/datasets/gaianet/paris/raw/main/paris.txt
curl -X POST http://127.0.0.1:8080/v1/create_rag -F "[email protected]"
```

Now, the Qdrant database has a vector collection called `default`, which contains embeddings from the Paris guide. You can see the stats of the vector collection as follows.

```
curl 'http://localhost:6333/collections/default'
```

Of course, the `/create_rag` API is rather primitive in chunking documents and creating embeddings. For many use cases, you should [create your own embedding vectors](#use-your-own-embedding-algos).

> The `/create_rag` endpoint is a simple combination of several more basic API endpoints provided by the API server. You can learn more about them in the developer guide.

## Chat with supplemental RAG knowledge

When you start the LlamaEdge API server with the Qdrant options, it takes every new user request, searches for relevant embeddings based on the request, and then adds the search results to the prompt.
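
The RAG augmentation is transparent to the client, so any OpenAI-compatible request works. Here is a minimal request sketch against the `/v1/chat/completions` endpoint; the messages are just examples, and the `model` value assumes the chat model alias `default` from the startup command:

```
curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Where is Paris?"}
    ],
    "model": "default"
  }'
```

Because the server performs the vector search for you, the answer comes back already grounded in the Paris guide. You can also simply open the chatbot web UI at http://localhost:8080/ in a browser and ask the same questions there.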

## Use your own embedding algos

You could build your own embeddings database. By chunking the documents yourself, you will probably get better results. You can also then share the database snapshot with others.
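
The first step is to create a Qdrant collection whose vector size matches the embedding model, which is 384 dimensions for all-MiniLM-L6-v2. Below is a minimal sketch using Qdrant's REST API; the `Cosine` distance metric is an assumption you can change to fit your embedding model:

```
# If a collection already exists from the /create_rag step, drop it first
curl -X DELETE 'http://localhost:6333/collections/default'

# Create the `default` collection with 384-dimensional vectors
# (the Cosine distance metric is an assumption; use what fits your model)
curl -X PUT 'http://localhost:6333/collections/default' \
  -H 'Content-Type: application/json' \
  -d '{
    "vectors": {
      "size": 384,
      "distance": "Cosine"
    }
  }'
```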

Download a program to chunk a document and create embeddings.

```
curl -LO https://github.com/YuanTony/chemistry-assistant/raw/main/rag-embeddings/create_embeddings.wasm
```

It chunks the document based on empty lines. So, you MUST prepare your source document this way: segment it into sections of around 200 words, separated by empty lines. See the example document here. You can check out the [Rust source code here](https://github.com/YuanTony/chemistry-assistant/tree/main/rag-embeddings) and modify it if you need to use a different chunking strategy.

Next, you can run the program by passing a collection name, vector dimension, and the source document. Make sure that Qdrant is running on your local machine. The model is preloaded under the name `embedding`. The wasm app then uses the `embedding` model to create 384-dimension vectors from [paris.txt](https://huggingface.co/datasets/gaianet/paris/raw/main/paris.txt) and saves them into the `default` collection.

```
curl -LO https://huggingface.co/datasets/gaianet/paris/raw/main/paris.txt

wasmedge --dir .:. \
  --nn-preload embedding:GGML:AUTO:all-MiniLM-L6-v2-ggml-model-f16.gguf \
  create_embeddings.wasm embedding default 384 paris.txt
```
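
To confirm that the vectors actually landed in the collection, you can ask Qdrant for an exact point count; the number should match the number of chunks in the source document:

```
curl -X POST 'http://localhost:6333/collections/default/points/count' \
  -H 'Content-Type: application/json' \
  -d '{"exact": true}'
```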

You can create a snapshot of the collection, which can be shared and loaded into a different Qdrant database. You can find the snapshot file in the `qdrant_snapshots` directory.

```
curl -X POST 'http://localhost:6333/collections/default/snapshots'
```
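
The response includes the name of the generated snapshot. You can also list the snapshots for the collection at any time:

```
curl 'http://localhost:6333/collections/default/snapshots'
```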

If you already have a snapshot file `paris.snapshot`, you can import it as follows. [Learn more about creating and restoring Qdrant snapshots](https://qdrant.tech/documentation/tutorials/create-snapshot/).

```
curl -X POST 'http://localhost:6333/collections/default/snapshots/upload?priority=snapshot' \
  -H 'Content-Type:multipart/form-data' \
  -F '[email protected]'
```

Finally, start your LlamaEdge API server again. Go to http://localhost:8080/ and have it answer questions based on your new knowledge base!

## Next steps

Now it is time to build your own RAG-enabled API server! You can start by using the same embedding model but with a different document. Make sure that you segment the document into chunks of fewer than 200 words each, and separate the chunks with empty lines. You can experiment with different chunking strategies to find the one that produces the best search results when users ask new questions.
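
A quick way to sanity-check your chunking before you generate embeddings is to print the word count of every blank-line-separated chunk; `my_document.txt` below is a placeholder for your own file:

```
# Paragraph mode (RS=""): each blank-line-separated chunk is one record,
# and NF is its word count. Flag anything well over 200 words.
awk 'BEGIN{RS=""} {print "chunk " NR ": " NF " words"}' my_document.txt
```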

Next, you can experiment with a different embedding model. Note that each embedding model has a different context length (i.e., the max length of each chunk) and vector size. You must use the same embedding model to generate the embeddings and to perform the RAG search and chat.

Good luck!