Chunk documents.
Chonkit is an application for chunking documents; the resulting chunks can then be used for retrieval augmented generation (RAG).
RAG is a technique for providing LLMs with contextual information about arbitrary data. The gist of RAG is the following:
- User sends a prompt.
- Prompt is used for semantic search to retrieve context from the knowledge base.
- Context and prompt are sent to LLM, providing it the necessary information to answer the prompt accurately.
Chonkit focuses on step 2.
Documents come in many different shapes and sizes. A parser is responsible for turning a document's content into bytes (raw text) and forwarding them to the chunkers. Parsers can be configured to read only a specific range from the document, and they can be configured to skip arbitrary text elements.
Chonkit provides an API to configure parsers for fast iteration.
Embedding and retrieving whole documents is infeasible as they can be massive, so we need some way to split them up into smaller parts while still retaining information clarity.
Chonkit currently offers 3 flavors of chunkers:
- SlidingWindow - the simplest (and worst performing) chunking implementation.
- SnappingWindow - a better heuristic chunker that retains sentence stops.
- SemanticWindow - an experimental chunker that uses embeddings and their distances to determine chunk boundaries.
The optimal flavor depends on the document being chunked. There is no perfect chunking flavor and finding the best one will be a game of trial and error, which is why it is important to get fast feedback when chunking.
Chonkit provides APIs to configure how documents get chunked, as well as a preview API for fast iteration.
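For example, a chunker configuration can be tried out against a document via the preview API before it is persisted. The route and payload below are purely illustrative (the actual routes and schemas are listed in the OpenAPI documentation); they assume a sliding window configured with a chunk size and an overlap:

```bash
# Illustrative only - consult swagger-ui for the real route and payload shape.
curl -X POST http://localhost:42069/<chunk-preview-route> \
  -H 'Content-Type: application/json' \
  -d '{
        "document_id": "<document-uuid>",
        "chunker": { "sliding_window": { "size": 500, "overlap": 50 } }
      }'
```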
Once the documents are chunked, we have to store them somehow. We do this by embedding them into vectors and storing them in a collection in a vector database. Vector databases are specialised software for efficiently storing and retrieving these vectors.
Chonkit provides APIs to manipulate vector collections and store embeddings into them.
Chonkit uses a modular architecture that allows for easy integration of new vector database, embedding, and document storage providers. This section lists the available providers and their corresponding feature flags.
Provider | Feature | Description |
---|---|---|
Qdrant | `qdrant` | Enable qdrant as one of the vector database providers. |
Weaviate | `weaviate` | Enable weaviate as one of the vector database providers. |
Provider | Feature | Description |
---|---|---|
OpenAI | `openai` | Enable OpenAI as one of the embedding providers. |
Fastembed | `fe-local` / `fe-remote` | Enable Fastembed as one of the embedding providers. The local implementation uses the current machine to embed; the remote implementation uses a remote server and needs a URL to connect to. When running locally, the `cuda` feature flag enables CUDA support, falling back to the CPU if a CUDA-capable device is not found. |
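Providers are compiled in via cargo feature flags. For example, a build with Qdrant as the vector database and OpenAI as the embedder could be produced with the following (the full list of features is given in the build section below):

```bash
cargo build -F "qdrant openai" --release
```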
Provider | Feature | Capabilities |
---|---|---|
Local | Always enabled. | read/write |
Google Drive | `gdrive` | read |
The Local provider uses the machine's file system to store documents. It is always enabled and cannot be disabled.
When the `gdrive` feature is enabled, files can be imported from Google Drive.
Google Drive only accepts tokens generated by OAuth clients, so you need to set one up with a Google project.
To use any of the routes for importing files, authorization with Google is required beforehand. A code exchange route is exposed for obtaining an access token with read permissions for Drive.
When accessing any of this provider's routes, the access token must be either in the `google_drive_access_token` cookie or in the `X-Google-Drive-Acess-Token` header.
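As a sketch, an API client that has already obtained a token through the code exchange could pass it like this (the route is a placeholder; the actual import routes are listed in the OpenAPI documentation):

```bash
# Placeholder route - consult swagger-ui for the actual Google Drive routes.
# Via the header:
curl http://localhost:42069/<google-drive-import-route> \
  -H 'X-Google-Drive-Acess-Token: <access-token>'

# Or via the cookie (web clients):
curl http://localhost:42069/<google-drive-import-route> \
  --cookie 'google_drive_access_token=<access-token>'
```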
All files imported from Drive will be downloaded into the directory provided on application startup (see table below). This means changes from Drive will not be reflected in Chonkit unless manually refreshed. There is a route that lists all files imported from Drive and compares the local modification time with the current modification time of the file. If the external modification time is newer, the file will be re-downloaded.
Argument | Environment variable | Description | Default |
---|---|---|---|
`--google-drive-download-path` | `GOOGLE_DRIVE_DOWNLOAD_PATH` | The directory to download files to when importing from Drive. | `./upload/gdrive` |
`--google-oauth-client-id` | `GOOGLE_OAUTH_CLIENT_ID` | The client ID of the OAuth client. | - |
`--google-oauth-client-secret` | `GOOGLE_OAUTH_CLIENT_SECRET` | The client secret of the OAuth client. | - |
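For example, a Drive-enabled instance could be configured through the environment like this (the credential values are placeholders for your own OAuth client):

```bash
export GOOGLE_DRIVE_DOWNLOAD_PATH=./upload/gdrive
export GOOGLE_OAUTH_CLIENT_ID=<your-oauth-client-id>
export GOOGLE_OAUTH_CLIENT_SECRET=<your-oauth-client-secret>
```

The same values can be passed via the corresponding `--google-*` arguments instead.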
This workspace consists of the following binaries:
- `chonkit`; exposes an HTTP API around `chonkit`'s core functionality.
- `feserver`; used to initiate fastembed with CUDA and expose an HTTP API for embeddings.
Chonkit depends on `pdfium_render` to parse PDFs. This library depends on `libpdfium.so`. In order for compilation to succeed, the library must be installed on the system.
To download a version of `libpdfium` compatible with chonkit (6666), run the following (assuming Linux):
mkdir pdfium
wget https://github.com/bblanchon/pdfium-binaries/releases/download/chromium%2F6666/pdfium-linux-x64.tgz -O - | tar -xzvf - -C ./pdfium
The library can be found in `./pdfium/lib/libpdfium.so`.
In order to let cargo know of its existence, you have 2 options:

- Set the `LD_LIBRARY_PATH` environment variable. By default, the GNU linker is set up to search for libraries in `/usr/lib`. If you copy `libpdfium.so` into one of those directories, you do not need to set this variable. However, if you want to use the library from a different location, you need to tell the linker where it is: `export LD_LIBRARY_PATH=/path/to/dir/containing/pdfium:$LD_LIBRARY_PATH`
  Note: You need to pass the directory that contains the `libpdfium.so` file, not the file itself. This command could also be placed in your `.rc` file.

- Copy the `libpdfium.so` file to `/usr/lib`.

The latter is the preferred option as it is the least involved.
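Assuming the library was downloaded and extracted to `./pdfium` as shown above, the copy is a single command:

```bash
sudo cp ./pdfium/lib/libpdfium.so /usr/lib/
```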
See also: rpath.
Note: The same procedure is applicable on Mac, only the paths and actual library files will be different.
- Required when compiling with `fe-local`.
Fastembed models require an onnxruntime. This library can be downloaded from here, or via the system's native package manager.
- Required when compiling with `fe-local` and `cuda`.
If using the `cuda` feature flag with `fastembed`, the system will need to have the CUDA toolkit installed. Fastembed, and in turn `ort`, will then use the CUDA execution provider for the onnxruntime. `ort` is designed to fail gracefully if it cannot register CUDA as one of the execution providers, and the CPU provider will be used as fallback.
Additionally, if running `feserver` with Docker, these instructions need to be followed to enable GPUs in Docker.
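As a rough sketch (the image name is a placeholder), exposing the host's GPUs to the container typically looks like this and requires the NVIDIA Container Toolkit on the host:

```bash
docker run --gpus all <feserver-image>
```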
The following is a table of the supported build features.
Feature | Configuration | Description |
---|---|---|
`qdrant` | VectorDb provider | Enable qdrant as one of the vector database providers. |
`weaviate` | VectorDb provider | Enable weaviate as one of the vector database providers. |
`fe-local` | Embedder provider | Use the implementation of `Embedder` with `LocalFastEmbedder`. Mutually exclusive with `fe-remote`. |
`fe-remote` | Embedder provider | Use the implementation of `Embedder` with `RemoteFastEmbedder`. Mutually exclusive with `fe-local`. |
`openai` | Embedder provider | Enable openai as one of the embedding providers. |
`cuda` | Execution provider | Available when using `fe-local`. When enabled, uses the `CUDAExecutionProvider` for the onnxruntime. |
`gdrive` | Storage provider | Enable Google Drive as one of the document storage providers. |
cargo build -F "qdrant weaviate fe-local" --release
By default, Chonkit uses sqlx with Postgres.
During compilation, sqlx will use the `DATABASE_URL` environment variable to connect to the database. In order to prevent this default behaviour, run
cargo sqlx prepare --merged
This will cache the queries needed for 'offline' compilation.
The cached queries are stored in the `.sqlx` directory and are checked into version control. You can check whether the build works by unsetting the `DATABASE_URL` environment variable.
unset DATABASE_URL
cp .example.env .env
source setup.sh
The setup script creates the 'data/upload' and 'data/gdrive' directories for storing documents, starts the infrastructure containers (postgres, qdrant, weaviate), and exports the environment variables necessary to run chonkit.
Run `source setup.sh -h` to see all the available options for the setup script.
Chonkit accepts the following arguments:
Arg | Env | Feature | Default | Description |
---|---|---|---|---|
`--db-url` | `DATABASE_URL` | * | - | The database URL. |
`--log` | `RUST_LOG` | * | `info` | The `RUST_LOG` env filter string to use. |
`--upload-path` | `UPLOAD_PATH` | * | `./upload` | If using the `FsDocumentStore`, sets its upload path. |
`--address` | `ADDRESS` | * | `0.0.0.0:42069` | The address (host:port) to bind the server to. |
`--cors-allowed-origins` | `CORS_ALLOWED_ORIGINS` | * | - | Comma separated list of origins allowed to connect. |
`--cors-allowed-headers` | `CORS_ALLOWED_HEADERS` | * | - | Comma separated list of accepted headers. |
`--cookie-domain` | `COOKIE_DOMAIN` | * | `localhost` | Which domain to set on cookies. |
`--qdrant-url` | `QDRANT_URL` | `qdrant` | - | Qdrant vector database URL. |
`--weaviate-url` | `WEAVIATE_URL` | `weaviate` | - | Weaviate vector database URL. |
`--fembed-url` | `FEMBED_URL` | `fe-remote` | - | Remote fastembed URL. |
- | `OPENAI_KEY` | `openai` | - | OpenAI API key. |
The arguments have priority over the environment variables.
See `RUST_LOG` syntax here.
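Putting the arguments together, a minimal invocation of a build with the `qdrant` feature might look like this (all values are examples and should be adjusted to your environment):

```bash
./chonkit \
  --db-url postgresql://postgres:postgres@localhost:5432/chonkit \
  --address 0.0.0.0:42069 \
  --log info \
  --upload-path ./upload \
  --qdrant-url http://localhost:6334
```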
See Authorization for more information about authz specific arguments.
By default, Chonkit does not use any authentication mechanisms. This is fine for local deployments, but is problematic when chonkit is exposed to the outside world. The following is a list of supported authorization mechanisms.
Feature: auth-vault
Arg | Env | Description |
---|---|---|
`--vault-url` | `VAULT_URL` | The endpoint of the vault server. |
`--vault-role-id` | `VAULT_ROLE_ID` | Role ID for the application. Used to log in and obtain an access token. |
`--vault-secret-id` | `VAULT_SECRET_ID` | Secret ID for the application. Used to log in and obtain an access token. |
`--vault-key-name` | `VAULT_KEY_NAME` | Name of the key to use for verifying signatures. |
Chonkit can be configured to hook up to Hashicorp's Vault with approle authentication. If enabled, at the start of the application Chonkit will log in to the vault and middleware will be registered on all routes. The middleware will check for the existence of a token in the following request parameters:
- A cookie with the name `chonkit_access_token` (for web clients). If using this, the web frontend must be deployed on the same domain as Chonkit.
- The `Authorization` request header (Bearer) (for API clients).
The token is expected to be a valid JWT signed by Vault's transit engine. The JWT must contain the version of the key used to sign it, specified by the `version` claim. The signature must have the `vault:vN:` prefix stripped; Chonkit will add it back when verifying, using the `version` claim.
If the signature is valid, additional claims are checked to ensure the validity of the token (expiration, audience, etc.). Specifically, it checks for the following claims:
- `aud == chonkit`
- `exp > now`
To summarize:
- An authorization server, i.e. an endpoint that generates JWTs intended to be used by Chonkit, is set up on the same Vault as Chonkit.
- An application that intends to use Chonkit obtains the access token.
- The authorization server uses the sign endpoint to generate a signature for a JWT payload and constructs the JWT with it.
- Chonkit uses the verify endpoint to verify the token signature on the same Vault mount where the data was signed.
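As a sketch of the signing and verification steps using the Vault CLI (the `transit/` mount path and the `chonkit-key` key name are examples; the key corresponds to `--vault-key-name`):

```bash
# Sign the base64-encoded JWT signing input (header.payload) with the transit key.
# The returned signature has the form "vault:v1:...", where "v1" is the key version.
vault write transit/sign/chonkit-key input="$(echo -n '<header>.<payload>' | base64)"

# Verify the same input against a signature on the same mount.
vault write transit/verify/chonkit-key \
    input="$(echo -n '<header>.<payload>' | base64)" \
    signature="vault:v1:<signature>"
```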
OpenAPI documentation is available on any chonkit instance at `http://your-address/swagger-ui`.
This repository contains Chonkit, a part of Ragu, covered under the Apache License 2.0, except where noted (any Ragu logos or trademarks are not covered under the Apache License and should be explicitly noted by a LICENSE file).
Chonkit, a part of Ragu, is a product produced from this open source software, exclusively by Barrage d.o.o. It is distributed under our commercial terms.
Others are allowed to make their own distribution of the software, but they cannot use any of the Ragu trademarks, cloud services, etc.
We explicitly grant permission for you to make a build that includes our trademarks while developing Ragu itself. You may not publish or share the build, and you may not use that build to run Ragu for any other purpose.