Update server-side-rag.md
juntao authored Mar 31, 2024
1 parent 1704095 commit 306f95c
Showing 1 changed file with 10 additions and 46 deletions: docs/user-guide/server-side-rag.md

---
sidebar_position: 3
---


# Server-side RAG with LlamaEdge

RAG is an important technique for injecting new and updated knowledge into an LLM application. It improves accuracy and reduces hallucination, especially in the specific domains covered by the RAG knowledge base. In the past, most RAG setups were very complex, requiring the orchestration of multiple services and components in heavyweight Python frameworks.

LlamaEdge, on the other hand, provides a Rust-based development platform that enables developers to build RAG logic directly into the chat API server. The compact and efficient single-binary API server is cross-platform and runs on any GPU or AI accelerator device, which allows it to be easily deployed across the cloud and the edge.


> Since the LlamaEdge API server encapsulates the RAG logic behind the OpenAI-compatible web service API, you can use any OpenAI-compatible frontend to interact with it. When you ask a question through the API, the answer is already RAG-enhanced. No client-side RAG (i.e., indexing and searching the RAG vector store on the client) is needed.

The LlamaEdge API server is a powerful demo of the LlamaEdge development platform. It showcases how to use the LlamaEdge SDK to build a generic RAG application. You can, of course, implement your own RAG logic using the LlamaEdge SDK and build your own RAG API server! In this tutorial, we will explain how to use the default RAG features built into our standard API server.


## Prerequisites

Install the [WasmEdge Runtime](https://github.com/WasmEdge/WasmEdge), our cross-platform LLM runtime.

```
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasmedge_rustls wasi_nn-ggml
```


Download the pre-built binary for the LlamaEdge API server.


```
curl -LO https://github.com/second-state/LlamaEdge/releases/latest/download/llama-api-server.wasm
```


And the chatbot web UI for the API server.


```
curl -LO https://github.com/second-state/chatbot-ui/releases/latest/download/chatbot-ui.tar.gz
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz
```


Download a chat model and an embedding model.


```
# The chat model is Llama2 7b chat
curl -LO https://huggingface.co/second-state/Llama-2-7B-Chat-GGUF/resolve/main/Llama-2-7b-chat-hf-Q5_K_M.gguf
# The embedding model is all-MiniLM-L6-v2
curl -LO https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-ggml-model-f16.gguf
```


The embedding model is a special kind of LLM that turns sentences into vectors. The vectors can then be stored in a vector database and searched later. When the sentences are from a body of text that represents a knowledge domain, that vector database becomes our RAG knowledge base.
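
Once the API server described below is running with the embedding model loaded, you can see this in action by asking the server to vectorize a sentence. The request below is a minimal sketch; it assumes the server exposes the OpenAI-compatible `/v1/embeddings` endpoint and that the embedding model is registered under the alias `embedding` (as done later in this guide).

```
curl -X POST http://localhost:8080/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "embedding",
    "input": ["Paris is the capital of France."]
  }'
```

The response contains one embedding vector per input sentence; with the all-MiniLM-L6-v2 model, each vector has 384 dimensions.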


## Start a vector database

By default, we use Qdrant as the vector database. You can start a Qdrant instance on your server using Docker. The following command starts it in the background.

```
mkdir qdrant_storage
nohup docker run -d -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
qdrant/qdrant
```



## Create knowledge embeddings

The LlamaEdge API server provides an API endpoint `/create_rag` that takes a text file, segments it into small chunks, turns the chunks into embeddings (i.e., vectors), and then stores the embeddings into the Qdrant database. Note that for many use cases, you will need to create your own knowledge embeddings. You can jump ahead to the [Use your own embedding algos](#use-your-own-embedding-algos) section to learn how to do that!

Let's start the LlamaEdge API server with Qdrant support first. By default, the API server listens on port 8080.


```
wasmedge --dir .:. \
--nn-preload default:GGML:AUTO:Llama-2-7b-chat-hf-Q5_K_M.gguf \
  --nn-preload embedding:GGML:AUTO:all-MiniLM-L6-v2-ggml-model-f16.gguf \
  llama-api-server.wasm -p llama-2-chat \
  --qdrant-url http://127.0.0.1:6333 \
--log-prompts --log-stat
```


The CLI arguments for WasmEdge are self-explanatory.

* The `--nn-preload` option loads the two models we just downloaded. The chat model is named `default` and the embedding model is named `embedding`.
* The `--qdrant-url` argument is the API URL of the Qdrant server we plan to use.
* The other `--qdrant-*` arguments will be explained later when we discuss chatting with the RAG server.


Next, you can submit a text document for the server to turn into embeddings. The document here is a travel guide for Paris, France.


```
curl -LO https://huggingface.co/datasets/gaianet/paris/raw/main/paris.txt
curl -X POST http://127.0.0.1:8080/v1/create_rag -F "file=@paris.txt"
```


Now, the Qdrant database has a vector collection called `default` which contains embeddings from the Paris guide. You can see the stats of the vector collection as follows.


```
curl 'http://localhost:6333/collections/default'
```

Of course, the `/create_rag` API is rather primitive in chunking documents and creating embeddings. For many use cases, you should [create your own embedding vectors](#use-your-own-embedding-algos).

> The `/create_rag` endpoint is a simple combination of several more basic API endpoints provided by the API server. You can learn more about them in the developer guide.

## Chat with supplemental RAG knowledge

When you start the LlamaEdge API server with the Qdrant options, it takes every new user request, searches the vector collection for embeddings relevant to the request, and then adds the search results to the prompt.
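
For example, once the server is running, you can ask a question about Paris through the OpenAI-compatible chat API and the answer will be grounded in the Paris guide. The request below is a sketch; the system prompt, the question, and the `default` model name (matching the `--nn-preload` alias above) are illustrative assumptions.

```
curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Where is Paris located?"}
    ],
    "model": "default"
  }'
```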





## Use your own embedding algos

You could build your own embeddings database. By chunking the documents yourself, you will probably get better results. You can also then share the database snapshot with others.
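
First, you need a Qdrant collection to hold your vectors. The commands below are a sketch using Qdrant's standard collection API; the 384 vector size matches the all-MiniLM-L6-v2 embedding model used in this guide, and the cosine distance metric is an assumption.

```
# Remove the default collection if it already exists
curl -X DELETE 'http://localhost:6333/collections/default'

# Re-create it with 384-dimension vectors
curl -X PUT 'http://localhost:6333/collections/default' \
  -H 'Content-Type: application/json' \
  --data-raw '{
    "vectors": {
      "size": 384,
      "distance": "Cosine"
    }
  }'
```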




Download a program to chunk a document and create embeddings.


```
curl -LO https://github.com/YuanTony/chemistry-assistant/raw/main/rag-embeddings/create_embeddings.wasm
```


It chunks the document based on empty lines. So, you MUST prepare your source document this way: segment it into sections of roughly 200 words, separated by empty lines. See the [example document](https://huggingface.co/datasets/gaianet/paris/raw/main/paris.txt). You can check out the [Rust source code here](https://github.com/YuanTony/chemistry-assistant/tree/main/rag-embeddings) and modify it if you need to use a different chunking strategy.
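
For instance, a correctly prepared source document looks like the following (a made-up illustration; each section becomes one chunk, and chunks are separated by an empty line).

```
Paris is the capital and most populous city of France. It has been a major
European center of finance, commerce, culture, and science since the 17th
century.

The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars. It was
built for the 1889 World's Fair and has become a global icon of France.
```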

Next, you can run the program by passing a collection name, vector dimension, and the source document. Make sure that Qdrant is running on your local machine. The model is preloaded under the name `embedding`. The wasm app then uses the `embedding` model to create the 384-dimension vectors from [paris.txt](https://huggingface.co/datasets/gaianet/paris/raw/main/paris.txt) and saves them into the `default` collection.


```
curl -LO https://huggingface.co/datasets/gaianet/paris/raw/main/paris.txt
wasmedge --dir .:. \
  --nn-preload embedding:GGML:AUTO:all-MiniLM-L6-v2-ggml-model-f16.gguf \
create_embeddings.wasm embedding default 384 paris.txt
```


You can create a snapshot of the collection, which can be shared and loaded into a different Qdrant database. You can find the snapshot file in the `qdrant_snapshots` directory.


```
curl -X POST 'http://localhost:6333/collections/default/snapshots'
```
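
To confirm that the snapshot was written, you can list the snapshots for the collection. This is a quick check using Qdrant's snapshots API.

```
curl 'http://localhost:6333/collections/default/snapshots'
```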


If you already have a snapshot file `paris.snapshot`, you can import it as follows. [Learn more about creating and restoring Qdrant snapshots](https://qdrant.tech/documentation/tutorials/create-snapshot/).


```
curl -X POST 'http://localhost:6333/collections/default/snapshots/upload?priority=snapshot' \
-H 'Content-Type:multipart/form-data' \
  -F 'snapshot=@paris.snapshot'
```


Finally, restart your LlamaEdge API server. Go to http://localhost:8080/ and ask it questions based on your new knowledge base!


## Next steps

Now it is time to build your own RAG-enabled API server! You can start by using the same embedding model but with a different document. Make sure that you segment the document into chunks of less than 200 words each, and separate the chunks with empty lines. You can experiment with different chunking strategies to see which yields the best search results when users ask new questions.

Next, you can experiment with a different embedding model. Notice that each embedding model has a different context length (i.e., the max length for each chunk) and vector size. You must use the same embedding model to generate the embeddings and to perform the RAG search at chat time.
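
For example, if you switched to a hypothetical 768-dimension embedding model, you would re-create the Qdrant collection with the new vector size and re-generate the embeddings with the matching dimension. The model file name below is a placeholder.

```
# Re-create the collection with the new vector size
curl -X DELETE 'http://localhost:6333/collections/default'
curl -X PUT 'http://localhost:6333/collections/default' \
  -H 'Content-Type: application/json' \
  --data-raw '{"vectors": {"size": 768, "distance": "Cosine"}}'

# Re-generate the embeddings with the matching dimension
wasmedge --dir .:. \
  --nn-preload embedding:GGML:AUTO:your-768-dim-embedding-model.gguf \
  create_embeddings.wasm embedding default 768 paris.txt
```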

Good luck!
