Chonkit

Chunk documents.

General information

Chonkit is an application for chunking documents whose chunks can then be used for retrieval augmented generation (RAG).

RAG is a technique for providing LLMs with contextual information about arbitrary data. The gist of RAG is the following:

1. The user sends a prompt.
2. The prompt is used for semantic search to retrieve context from the knowledge base.
3. The context and prompt are sent to the LLM, providing it the necessary information to answer the prompt accurately.

Chonkit focuses on step 2.
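
The three steps can be sketched as follows. This is a toy illustration and not Chonkit's API: `retrieve` and `build_prompt` are hypothetical names, word overlap stands in for semantic search over embeddings, and the final call to the LLM is left out.

```python
def retrieve(prompt: str, knowledge_base: list[str], top_k: int = 2) -> list[str]:
    """Step 2: rank chunks by relevance to the prompt.

    Word overlap is a stand-in for semantic (embedding-based) search.
    """
    prompt_words = set(prompt.lower().split())

    def score(chunk: str) -> int:
        return len(prompt_words & set(chunk.lower().split()))

    ranked = sorted(knowledge_base, key=score, reverse=True)
    return ranked[:top_k]


def build_prompt(prompt: str, context: list[str]) -> str:
    """Step 3: combine the retrieved context with the user's prompt."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return f"Context:\n{joined}\n\nQuestion: {prompt}"


knowledge_base = [
    "Chonkit splits documents into chunks.",
    "Qdrant stores vectors in collections.",
    "Bananas are yellow.",
]
user_prompt = "Where does Qdrant store vectors?"  # Step 1: the user sends a prompt
context = retrieve(user_prompt, knowledge_base, top_k=1)
llm_input = build_prompt(user_prompt, context)  # this string would be sent to the LLM
```

The quality of step 2 depends directly on how the knowledge base was chunked, which is why the rest of this document concentrates on parsing and chunking.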

Parsers

Documents come in many different shapes and sizes. A parser is responsible for turning a document's content into bytes (raw text) and forwarding them to the chunkers. Parsers can be configured to read only a specific range of the document and to skip arbitrary text elements.

Chonkit provides an API to configure parsers for fast iteration.
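
To illustrate the two parser options (range and skip), the sketch below applies them to page text that has already been extracted. `ParserConfig`, its fields, and `parse_pages` are hypothetical names chosen for this example, not Chonkit's API.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ParserConfig:
    # Hypothetical configuration; the field names are assumptions.
    start: int = 0                      # first page to read (inclusive)
    end: Optional[int] = None           # last page to read (exclusive); None reads to the end
    skip_patterns: list[str] = field(default_factory=list)  # regexes for text to drop


def parse_pages(pages: list[str], config: ParserConfig) -> str:
    """Return the raw text of the selected pages with skipped elements removed."""
    selected = pages[config.start:config.end]
    text = "\n".join(selected)
    for pattern in config.skip_patterns:
        text = re.sub(pattern, "", text)
    return text


pages = ["Cover page", "Page 1 of 2\nFirst body.", "Page 2 of 2\nSecond body."]
config = ParserConfig(start=1, skip_patterns=[r"Page \d+ of \d+\n"])
raw_text = parse_pages(pages, config)  # cover page and page headers are gone
```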

Chunkers

Embedding and retrieving whole documents is infeasible because they can be massive, so we need a way to split them into smaller parts that still retain the clarity of the information.

Chonkit currently offers 3 flavors of chunkers:

SlidingWindow - the simplest (and worst performing) chunking implementation.
SnappingWindow - a better heuristic chunker that retains sentence stops.
SemanticWindow - an experimental chunker that uses embeddings and their distances to determine chunk boundaries.
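
The difference between the first two flavors can be sketched as follows. These are simplified illustrations of the general ideas, not Chonkit's implementations: the function names, the character-based sizes, and the sentence-splitting regex are assumptions.

```python
import re


def sliding_window(text: str, size: int, overlap: int) -> list[str]:
    """Cut fixed-size chunks; consecutive chunks share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def snapping_window(text: str, size: int) -> list[str]:
    """Greedily pack whole sentences into chunks of roughly `size` characters.

    A chunk is only closed at a sentence stop, so a sentence is never cut
    in half (a single sentence longer than `size` becomes its own chunk).
    """
    sentences = re.findall(r"[^.!?]+[.!?]*\s*", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > size:
            chunks.append(current.strip())
            current = ""
        current += sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks


text = "One. Two two. Three three three."
fixed = sliding_window(text, size=10, overlap=2)   # cuts words and sentences apart
snapped = snapping_window(text, size=15)           # keeps sentences whole
```

The sliding window produces pieces such as "One. Two t", while the snapping variant returns complete sentences, which is the property that tends to help retrieval.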

The optimal flavor depends on the document being chunked. There is no perfect chunking flavor and finding the best one will be a game of trial and error, which is why it is important to get fast feedback when chunking.
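
For comparison, the idea behind an embedding-distance chunker such as SemanticWindow can be sketched like this. It is a toy under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, and the threshold rule is one possible way to pick boundaries, not necessarily the one Chonkit uses.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def semantic_chunks(sentences: list[str], threshold: float) -> list[list[str]]:
    """Start a new chunk whenever adjacent sentences are further apart than `threshold`."""
    chunks: list[list[str]] = []
    previous = None
    for sentence in sentences:
        embedding = toy_embed(sentence)
        if previous is None or cosine_distance(previous, embedding) > threshold:
            chunks.append([])
        chunks[-1].append(sentence)
        previous = embedding
    return chunks


sentences = [
    "Qdrant stores vectors.",
    "Qdrant searches vectors quickly.",
    "Bananas are yellow.",
]
chunks = semantic_chunks(sentences, threshold=0.8)
```

Here the two Qdrant sentences end up in one chunk and the unrelated sentence starts a new one. The threshold is exactly the kind of parameter that needs fast feedback to tune.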

Chonkit provides APIs to configure how documents get chunked, as well as a preview API for fast iteration.

Vectors

Once the documents are chunked, we have to store the chunks somehow. We do this by embedding them into vectors and storing them in a collection in a vector database. Vector databases are specialised software for efficient storage and retrieval of these vectors.

Chonkit provides APIs to manipulate vector collections and store embeddings into them.
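
Conceptually, a collection maps vectors to the chunks they were computed from and answers nearest-neighbour queries. The sketch below is an in-memory toy, not a Qdrant or Weaviate client: `InMemoryCollection` and its methods are hypothetical, and the two-dimensional vectors are made up for the example.

```python
import math


class InMemoryCollection:
    """Toy stand-in for a vector database collection."""

    def __init__(self) -> None:
        self.items: list[tuple[list[float], str]] = []

    def insert(self, vector: list[float], chunk: str) -> None:
        self.items.append((vector, chunk))

    def search(self, query: list[float], limit: int = 1) -> list[str]:
        """Return the chunks whose vectors are most similar to `query` (cosine similarity)."""
        def similarity(vector: list[float]) -> float:
            dot = sum(q * v for q, v in zip(query, vector))
            norms = math.sqrt(sum(q * q for q in query)) * math.sqrt(sum(v * v for v in vector))
            return dot / norms if norms else 0.0

        ranked = sorted(self.items, key=lambda item: similarity(item[0]), reverse=True)
        return [chunk for _, chunk in ranked[:limit]]


collection = InMemoryCollection()
collection.insert([1.0, 0.0], "chunk about parsers")
collection.insert([0.0, 1.0], "chunk about licensing")
hits = collection.search([0.9, 0.1], limit=1)
```

A real vector database does the same job with indexes that avoid comparing the query against every stored vector.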

Providers

Chonkit uses a modular architecture that allows for easy integration of new vector database, embedding, and document storage providers. This section lists the available providers and their corresponding feature flags.

Vector database providers

Provider	Feature	Description
Qdrant	`qdrant`	Enable qdrant as one of the vector database providers.
Weaviate	`weaviate`	Enable weaviate as one of the vector database providers.

Embedding providers

Provider	Feature	Description
OpenAI	`openai`	Enable OpenAI as one of the embedding providers.
Fastembed	`fe-local` / `fe-remote`	Enable Fastembed as one of the embedding providers. The local implementation uses the current machine to embed; the remote implementation uses a remote server and needs a URL to connect to. When running locally, the `cuda` feature flag enables CUDA support and falls back to the CPU if a CUDA capable device is not found.

Document storage providers

Provider	Feature	Capabilities
Local	Always enabled.	read/write
Google Drive	`gdrive`	read

Local

Uses the machine's file system to store documents. Always enabled and cannot be disabled.

Google Drive

When enabled, allows files to be imported from Google Drive.

Google Drive only accepts tokens generated by OAuth clients, therefore you need to set one up with a Google project.

To use any of the routes for importing files, authorization with Google is required beforehand. A code exchange route is exposed for obtaining an access token with read permissions for Drive.

When accessing any of this provider's routes, the access token must either be in the google_drive_access_token cookie, or in the X-Google-Drive-Acess-Token header.

All files imported from Drive will be downloaded into the directory provided on application startup (see table below). This means changes from Drive will not be reflected in Chonkit unless manually refreshed. There is a route that lists all files imported from Drive and compares the local modification time with the current modification time of the file. If the external modification time is newer, the file will be re-downloaded.

Argument	Environment variable	Description	Default
`--google-drive-download-path`	`GOOGLE_DRIVE_DOWNLOAD_PATH`	The directory to download files to when importing from Drive.	`./upload/gdrive`
`--google-oauth-client-id`	`GOOGLE_OAUTH_CLIENT_ID`	The client ID of the OAuth client.	-
`--google-oauth-client-secret`	`GOOGLE_OAUTH_CLIENT_SECRET`	The client secret of the OAuth client.	-
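
The refresh check described above boils down to a timestamp comparison. A minimal sketch, assuming both modification times are already available as timezone-aware datetimes; `needs_redownload` and `outdated_files` are hypothetical names, not Chonkit routes or functions.

```python
from datetime import datetime, timezone


def needs_redownload(local_modified: datetime, drive_modified: datetime) -> bool:
    """A file is stale when Drive reports a newer modification time than the local copy."""
    return drive_modified > local_modified


def outdated_files(files: dict[str, tuple[datetime, datetime]]) -> list[str]:
    """`files` maps a file name to (local modification time, Drive modification time)."""
    return [name for name, (local, remote) in files.items() if needs_redownload(local, remote)]


local = datetime(2024, 1, 1, tzinfo=timezone.utc)
files = {
    "report.pdf": (local, datetime(2024, 2, 1, tzinfo=timezone.utc)),  # changed on Drive
    "notes.txt": (local, local),                                       # unchanged
}
stale = outdated_files(files)
```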

Binaries

This workspace consists of the following binaries:

chonkit: exposes an HTTP API around chonkit's core functionality.
feserver: used to initiate fastembed with CUDA and expose an HTTP API for embeddings.

Building

Prerequisites

Pdfium

Chonkit depends on pdfium_render to parse PDFs. This library depends on libpdfium.so. In order for compilation to succeed, the library must be installed on the system. To download a version of libpdfium compatible with chonkit (6666), run the following (assuming Linux):

mkdir pdfium
wget https://github.com/bblanchon/pdfium-binaries/releases/download/chromium%2F6666/pdfium-linux-x64.tgz -O - | tar -xzvf - -C ./pdfium

The library can be found in ./pdfium/lib/libpdfium.so. To let cargo know of its existence, you have 2 options:

1. Set the LD_LIBRARY_PATH environment variable.
- By default, the GNU linker is set up to search for libraries in /usr/lib. If you copy libpdfium.so into one of its search directories, you do not need to set this variable. However, if you want to use the library from a different location, you need to tell the linker where it is:
```
export LD_LIBRARY_PATH=/path/to/dir/containing/pdfium:$LD_LIBRARY_PATH
```
  Note: You need to pass the directory that contains the libpdfium.so file, not the file itself. This command could also be placed in your shell's rc file.
2. Copy the libpdfium.so file to /usr/lib.

The latter is the preferred option as it is the least involved.
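
A quick way to confirm that one of the two options took effect is to look for the file in the directories involved. The sketch below is a simplification: `find_pdfium` is a hypothetical helper, and it only checks the LD_LIBRARY_PATH entries and /usr/lib, whereas the real library search also consults other locations.

```python
import os
from typing import Optional


def find_pdfium(ld_library_path: str, default_dirs: tuple[str, ...] = ("/usr/lib",)) -> Optional[str]:
    """Return the first path at which libpdfium.so exists, or None.

    Each LD_LIBRARY_PATH entry is treated as a directory, mirroring the note
    above: the variable lists directories, not the library file itself.
    """
    candidates = [d for d in ld_library_path.split(os.pathsep) if d] + list(default_dirs)
    for directory in candidates:
        path = os.path.join(directory, "libpdfium.so")
        if os.path.isfile(path):
            return path
    return None


# Prints the location found for the current environment, or None.
print(find_pdfium(os.environ.get("LD_LIBRARY_PATH", "")))
```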

Fastembed

Required when compiling with fe-local.

Fastembed models require an onnxruntime. This library can be downloaded manually or installed via the system's native package manager.

CUDA

Required when compiling with fe-local and cuda.

If using the cuda feature flag with fastembed, the system will need to have the CUDA toolkit installed. Fastembed, and in turn ort, will then use the CUDA execution provider for the onnxruntime. ort is designed to fail gracefully if it cannot register CUDA as one of the execution providers and the CPU provider will be used as fallback.

Additionally, if running feserver with Docker, GPU support must be enabled in Docker.

Features

The following is a table of the supported build features.

Feature	Configuration	Description
`qdrant`	VectorDb provider	Enable qdrant as one of the vector database providers.
`weaviate`	VectorDb provider	Enable weaviate as one of the vector database providers.
`fe-local`	Embedder provider	Use the implementation of `Embedder` with `LocalFastEmbedder`. Mutually exclusive with `fe-remote`.
`fe-remote`	Embedder provider	Use the implementation of `Embedder` with `RemoteFastEmbedder`. Mutually exclusive with `fe-local`.
`openai`	Embedder provider	Enable openai as one of the embedding providers.
`cuda`	Execution provider	Available when using `fe-local`. When enabled, uses the CUDAExecutionProvider for the onnxruntime.
`gdrive`	Storage provider	Enable Google Drive as one of the document storage providers.

Full build command example

cargo build -F "qdrant weaviate fe-local" --release
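
The constraints from the feature table can be checked before invoking cargo. The sketch below encodes only the two rules stated in the table (`fe-local` and `fe-remote` are mutually exclusive, `cuda` requires `fe-local`); `check_features` and `build_command` are hypothetical helpers, not part of the workspace.

```python
def check_features(features: set[str]) -> list[str]:
    """Return a list of problems with a feature selection (empty if there are none)."""
    problems = []
    if {"fe-local", "fe-remote"} <= features:
        problems.append("fe-local and fe-remote are mutually exclusive")
    if "cuda" in features and "fe-local" not in features:
        problems.append("cuda is only available together with fe-local")
    return problems


def build_command(features: set[str]) -> str:
    """Format the cargo invocation for a feature set (features sorted for stable output)."""
    joined = " ".join(sorted(features))
    return f'cargo build -F "{joined}" --release'


selection = {"qdrant", "weaviate", "fe-local"}
problems = check_features(selection)
command = build_command(selection)
```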

Sqlx 'offline' compilation

By default, Chonkit uses sqlx with Postgres. During compilation, sqlx will use the DATABASE_URL environment variable to connect to the database. In order to prevent this default behaviour, run

cargo sqlx prepare --merged

This will cache the queries needed for 'offline' compilation. The cached queries are stored in the .sqlx directory and are checked into version control. You can check whether the build works by unsetting the DATABASE_URL environment variable.

unset DATABASE_URL

Local quickstart

cp .example.env .env
source setup.sh

The setup script creates the 'data/upload' and 'data/gdrive' directories for storing documents, starts the infrastructure containers (postgres, qdrant, weaviate), and exports the environment variables needed to run chonkit.

Run

source setup.sh -h

to see all the available options for the setup script.

Running

Chonkit accepts the following arguments:

Arg	Env	Feature	Default	Description
`--db-url`	`DATABASE_URL`	*	-	The database URL.
`--log`	`RUST_LOG`	*	`info`	The `RUST_LOG` env filter string to use.
`--upload-path`	`UPLOAD_PATH`	*	`./upload`	If using the `FsDocumentStore`, sets its upload path.
`--address`	`ADDRESS`	*	`0.0.0.0:42069`	The address (host:port) to bind the server to.
`--cors-allowed-origins`	`CORS_ALLOWED_ORIGINS`	*	-	Comma separated list of origins allowed to connect.
`--cors-allowed-headers`	`CORS_ALLOWED_HEADERS`	*	-	Comma separated list of accepted headers.
`--cookie-domain`	`COOKIE_DOMAIN`	*	`localhost`	Which domain to set on cookies.
`--qdrant-url`	`QDRANT_URL`	`qdrant`	-	Qdrant vector database URL.
`--weaviate-url`	`WEAVIATE_URL`	`weaviate`	-	Weaviate vector database URL.
`--fembed-url`	`FEMBED_URL`	`fe-remote`	-	Remote fastembed URL.
-	`OPENAI_KEY`	`openai`	-	OpenAI API key.

The arguments have priority over the environment variables. The --log value follows the RUST_LOG env filter syntax. See Authorization for more information about authz specific arguments.
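
The precedence rule (argument, then environment variable, then default) can be sketched with three of the settings from the table. This mirrors the documented behaviour only; `resolve_config` is a hypothetical function and not how Chonkit parses its arguments.

```python
import argparse


def resolve_config(argv: list[str], env: dict[str, str]) -> dict[str, str]:
    """Resolve settings with the precedence: argument, then environment, then default."""
    parser = argparse.ArgumentParser()
    # The environment value (or the documented default) becomes the argparse default,
    # so an explicit argument always wins.
    parser.add_argument("--address", default=env.get("ADDRESS", "0.0.0.0:42069"))
    parser.add_argument("--upload-path", default=env.get("UPLOAD_PATH", "./upload"))
    parser.add_argument("--log", default=env.get("RUST_LOG", "info"))
    args = parser.parse_args(argv)
    return {"address": args.address, "upload_path": args.upload_path, "log": args.log}


config = resolve_config(
    ["--address", "127.0.0.1:8080"],
    {"ADDRESS": "0.0.0.0:9000", "RUST_LOG": "debug"},
)
```

In this example the address comes from the argument, the log filter from the environment, and the upload path from the default.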

Authorization

By default, Chonkit does not use any authentication mechanisms. This is fine for local deployments, but is problematic when chonkit is exposed to the outside world. The following is a list of supported authorization mechanisms.

Vault JWT authorization

Feature: auth-vault

Required variables

Arg	Env	Description
`--vault-url`	`VAULT_URL`	The endpoint of the vault server.
`--vault-role-id`	`VAULT_ROLE_ID`	Role ID for the application. Used to log in and obtain an access token.
`--vault-secret-id`	`VAULT_SECRET_ID`	Secret ID for the application. Used to log in and obtain an access token.
`--vault-key-name`	`VAULT_KEY_NAME`	Name of the key to use for verifying signatures.

Description

Chonkit can be configured to hook up to HashiCorp's Vault with AppRole authentication. If enabled, Chonkit will log in to the vault at application startup and middleware will be registered on all routes. The middleware checks for a token in the following request parameters:

A cookie with the name chonkit_access_token (for web clients). If using this, the web frontend must be deployed on the same domain as Chonkit.
Authorization request header (Bearer) (for API clients).
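
The token lookup can be sketched as below. This is an illustration rather than the actual middleware: `extract_access_token` is a hypothetical function, and trying the cookie before the header is an assumption, since the two locations are listed without a stated order.

```python
from http.cookies import SimpleCookie
from typing import Optional


def extract_access_token(headers: dict[str, str]) -> Optional[str]:
    """Find the access token in the cookie or in the Authorization header."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    cookie = SimpleCookie()
    cookie.load(lowered.get("cookie", ""))
    if "chonkit_access_token" in cookie:
        return cookie["chonkit_access_token"].value

    scheme, _, token = lowered.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and token.strip():
        return token.strip()
    return None


from_cookie = extract_access_token({"Cookie": "theme=dark; chonkit_access_token=abc.def.ghi"})
from_header = extract_access_token({"Authorization": "Bearer abc.def.ghi"})
```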

The token is expected to be a valid JWT signed by Vault's transit engine. The JWT must contain the version of the key used to sign it, specified by the version claim.

The signature must have the vault:vN: prefix stripped. Chonkit will add it back when verifying, using the version claim.
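
Restoring the prefix from the version claim can be sketched as follows. The helper names are hypothetical, the token in the example is a dummy with a placeholder header and signature, and any re-encoding of the signature that Vault may require is outside the scope of this sketch.

```python
import base64
import json


def encode_segment(data: dict) -> str:
    """Encode a JSON object as an unpadded base64url JWT segment (used for the demo token)."""
    return base64.urlsafe_b64encode(json.dumps(data).encode()).rstrip(b"=").decode()


def decode_segment(segment: str) -> bytes:
    """Decode a base64url JWT segment, restoring the padding that JWTs omit."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def vault_signature(jwt: str) -> str:
    """Rebuild the prefixed signature string from a JWT carrying a `version` claim."""
    _header, payload, signature = jwt.split(".")
    version = json.loads(decode_segment(payload))["version"]
    return f"vault:v{version}:{signature}"


# Dummy token: the header and signature are placeholders, not real Vault output.
token = ".".join([
    encode_segment({"typ": "JWT"}),
    encode_segment({"version": 2, "aud": "chonkit"}),
    "c2lnbmF0dXJl",
])
prefixed = vault_signature(token)  # "vault:v2:c2lnbmF0dXJl"
```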

If the signature is valid, additional claims are checked to ensure the validity of the token (expiration, audience, etc.). Specifically, it checks for the following claims:

aud == chonkit
exp > now
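
The two claim checks translate directly into code. A minimal sketch, assuming the claims have already been decoded into a dictionary, that `aud` is a single string, and that `exp` is a Unix timestamp in seconds; `claims_are_valid` is a hypothetical name.

```python
import time
from typing import Optional


def claims_are_valid(claims: dict, now: Optional[float] = None) -> bool:
    """Check the two claims listed above: aud == chonkit and exp > now."""
    now = time.time() if now is None else now
    if claims.get("aud") != "chonkit":
        return False
    exp = claims.get("exp")
    # A missing or non-numeric expiry is treated as invalid.
    return isinstance(exp, (int, float)) and exp > now


valid = claims_are_valid({"aud": "chonkit", "exp": 2_000}, now=1_000)
expired = claims_are_valid({"aud": "chonkit", "exp": 500}, now=1_000)
```

These checks only make sense after the signature has been verified, since an unverified payload can contain anything.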

To summarize:

1. An authorization server, i.e. an endpoint that generates JWTs intended to be used by Chonkit, is set up on the same Vault as Chonkit.
2. An application that intends to use Chonkit obtains the access token.
3. The authorization server uses the sign endpoint to generate a signature for a JWT payload and constructs the JWT with it.
4. Chonkit uses the verify endpoint to verify the token signature on the same Vault mount where the data was signed.

OpenAPI documentation

OpenAPI documentation is available on any Chonkit instance at http://your-address/swagger-ui.

License

This repository contains Chonkit, a part of Ragu, covered under the Apache License 2.0, except where noted (any Ragu logos or trademarks are not covered under the Apache License, and should be explicitly noted by a LICENSE file.)

Chonkit, a part of Ragu, is a product produced from this open source software, exclusively by Barrage d.o.o. It is distributed under our commercial terms.

Others are allowed to make their own distribution of the software, but they cannot use any of the Ragu trademarks, cloud services, etc.

We explicitly grant permission for you to make a build that includes our trademarks while developing Ragu itself. You may not publish or share the build, and you may not use that build to run Ragu for any other purpose.
