specs: update vector database vep with more explanation #3054

Merged
merged 1 commit on Jan 31, 2024
3 changes: 2 additions & 1 deletion specs/README.md
@@ -16,7 +16,8 @@ communicate and coordinate on new efforts for the Versatile Data Kit project.

1. It is a good idea to socialize an idea with other contributors of VDK first. You can share your idea on the VDK mailing list, Slack, email, etc.
2. Follow the process outlined in the [VEP template](NNNN-template/README.md)

3. It is strongly recommended to build the VEP incrementally (with multiple PRs). As you research and discover new things, the proposal will naturally evolve.
It is a very good idea for the first PR to focus only on describing the motivation and the goals, so that there is agreement and alignment on those before continuing.

## Credits

102 changes: 78 additions & 24 deletions specs/vep-milestone-25-vector-database-ingestion/README.md
@@ -35,6 +35,18 @@ However we might find that it really helps to include the page title along with

## Motivation

Every ML project has some form of data preprocessor that prepares data for input into a model.
This preprocessor is typically created from scratch for each project.
That creates inconsistency and fails to take advantage of prior work.
It means every project needs:
* Data collection: figure out how to connect to different sources, extract the data, parse it, and model it.
* Data cleaning: perform largely the same clean-up for each source (many companies use tools like Confluence, and roughly 80% of the clean-up of Confluence data is likely to be similar across projects).
* Data freshness: track changes incrementally (if the volume is big).
* Data privacy and security: implement measures to protect sensitive data.
* Pipeline setup: set up a data pipeline that chunks the data appropriately and creates embeddings.

In short, there is a need for standardized, modular preprocessing tools that can be easily adapted across different projects.

#### Example problem scenario:
A company has a powerful private LLM chatbot.
However, they want it to be able to answer questions using the latest version of their Confluence docs, Jira tickets, etc.
@@ -56,24 +68,74 @@ We will make requests to the API to create embeddings for us.
Once this data job is running, we will create a template from it, which we think customers will be able to adopt to meet their use cases.

#### Benefits to customers:
They will be able to follow our template to quickly create similar jobs.
This gives customers the ability to get up and running quickly.

## Requirements and goals
1. Provide template-based pipeline construction that is customizable at any stage.
2. It should chunk up the information, embed it, and then save it.
3. The system should be easily configurable (illustrated in the sketch after this list):
    1. Read from different sources, chosen by the user from a catalog of configurable data sources
    2. Facilitate incremental updates and change tracking
    3. Different chunk sizes and strategies
    4. Different embedders
    5. Extra columns saved in the database
4. There should be an example of how to build your own ingestion pipeline or customize this one further.
5. Scraping only new data and removing stale data must be supported.
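
As a purely hypothetical sketch of requirement 3 (none of the class or field names below are an agreed-upon API), the configurable surface of such a pipeline could look roughly like this:

```python
# Hypothetical configuration sketch: none of these class or field names exist in
# VDK today; they only illustrate the knobs from the requirements list above.
from dataclasses import dataclass, field
from typing import List


@dataclass
class VectorIngestionConfig:
    source: str = "confluence"          # picked from a catalog of data sources
    destination: str = "postgres"       # data target / vector database
    chunk_size: int = 1000              # characters per chunk
    chunk_overlap: int = 100            # overlap between consecutive chunks
    chunking_strategy: str = "recursive"
    embedder: str = "my-embedding-api"  # which embedding API to call
    extra_columns: List[str] = field(
        default_factory=lambda: ["source_url", "last_modified"]
    )
    incremental: bool = True            # ingest only new/changed data, remove stale data


config = VectorIngestionConfig(source="confluence", chunk_size=512)
```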



### Non goals
1. This only populates information into a database that could be used by a RAG system. We don't handle constructing (stuffing) the actual prompts.

## High-level design

This is where VDK fits in an ML data pipeline.
The goal is to automate and accelerate the ETL phase of the ML pipeline, as shown in the diagram below.

![pipeline_diagram.png](pipeline_diagram.png)

### Main components

#### Catalog of Data Sources:

[vdk-data-sources](https://github.com/vmware/versatile-data-kit/tree/main/projects/vdk-plugins/vdk-data-sources) already provides a catalog of pre-built data source plugins that developers can select from.
Each data source plugin will come with built-in logic for data extraction and information on incremental updates.
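
As an illustration only (the class and method names below are hypothetical stand-ins, not the actual vdk-data-sources interface), using a catalog data source could look roughly like this:

```python
# Hypothetical sketch of consuming a catalog data source. The ConfluenceDataSource
# class and its methods are illustrative stand-ins, not the real vdk-data-sources API.
from typing import Dict, Iterator, Optional


class ConfluenceDataSource:
    """Illustrative stand-in for a pre-built data source plugin."""

    def __init__(self, url: str, token: str):
        self.url = url
        self.token = token

    def read(self, since: Optional[str] = None) -> Iterator[Dict]:
        # A real plugin would call the Confluence REST API here and yield one
        # payload per page, honoring `since` for incremental extraction.
        yield {
            "id": "123",
            "title": "Example page",
            "body": "...",
            "last_modified": "2024-01-30T00:00:00Z",
        }


source = ConfluenceDataSource(url="https://wiki.example.com", token="<api-token>")
for payload in source.read(since="2024-01-01T00:00:00Z"):
    print(payload["title"])
```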

#### Catalog of Data Targets
Data targets are already provided with VDK through the [Ingestion Plugins](https://github.com/vmware/versatile-data-kit/blob/main/projects/vdk-core/src/vdk/api/plugin/plugin_input.py#L230).
One limitation that needs to be addressed is that the ingestion plugins do not support more than one configuration per method.
In other words, only one Postgres destination instance or one Oracle destination instance is possible per data job.
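
To illustrate the limitation, here is a hedged sketch: the table name and payload are made up, and the call mirrors how VDK's ingestion API is typically shown in the project's examples.

```python
# Illustration of the single-configuration-per-method limitation, using VDK's
# ingestion API roughly as it appears in the project's examples (hedged sketch).
def run(job_input):
    payload = {"page_id": "123", "chunk_text": "some text", "embedding": [0.1, 0.2, 0.3]}

    # This goes to the one Postgres destination configured for the job.
    job_input.send_object_for_ingestion(
        payload=payload,
        destination_table="vdk_confluence_doc",
        method="postgres",
    )

    # Sending the same payload to a second, differently configured Postgres
    # instance from the same job is the part that is not supported today and
    # would need to be addressed, as described above.
```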

#### Customizable Pipeline Template:
A dynamic template that allows developers to specify data sources, embedding APIs, and the vector database.
The template will automatically generate the pipeline, including logic for tracking changes, chunking the data, and generating embeddings.

#### Extensibility:
Ability for developers to further customize the pipeline by editing or replacing any of the components within the template.
Documentation and examples on how to make such customizations effectively.
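
For example, a hypothetical sketch (the keyword argument and component hooks below are not an existing API, just one way such customization could look; ConfluenceReader, PostgresInstance, and MyEmbeddingApi are the placeholder names used elsewhere in this VEP):

```python
# Hypothetical extensibility sketch: swap out individual components of the template.
from typing import List


def heading_chunker(text: str) -> List[str]:
    # Custom chunking: split on top-level headings instead of fixed-size windows.
    return [part for part in text.split("\n# ") if part.strip()]


class MyEmbeddingApi:
    def embed(self, chunks: List[str]) -> List[List[float]]:
        # Call your own embedding service here; zeros keep the sketch self-contained.
        return [[0.0] * 768 for _ in chunks]


# pipeline = vdk.ToVectorPipeline(
#     ConfluenceReader(credentials),
#     PostgresInstance,
#     MyEmbeddingApi(),
#     chunker=heading_chunker,  # illustrative hook for swapping in a custom component
# )
```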

#### Automatic data change detection and updates

The framework will incorporate an intelligent change detection system that monitors data sources for any additions, deletions, or modifications.
It does so by keeping state after each ingested payload is processed.
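
One way to keep that state is through VDK job properties; the following is a hedged sketch (the property name, table name, and the placeholder source call are illustrative):

```python
from datetime import datetime, timezone


# Hedged sketch of keeping state between runs with VDK job properties so that only
# pages modified since the previous run are re-ingested.
def run(job_input):
    props = job_input.get_all_properties()
    last_sync = props.get("confluence_last_sync", "1970-01-01T00:00:00+00:00")

    for payload in pages_modified_since(last_sync):  # placeholder data-source call
        job_input.send_object_for_ingestion(
            payload=payload, destination_table="vdk_confluence_doc"
        )

    props["confluence_last_sync"] = datetime.now(timezone.utc).isoformat()
    job_input.set_all_properties(props)


def pages_modified_since(since: str):
    # Placeholder: a real implementation would query the data source incrementally.
    return iter(())
```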

#### Chunking strategies

Options for developers to define custom chunking parameters, including chunk size and chunking criteria, to suit the specific needs of their applications.

Users should be able to reuse chunking logic defined in LangChain,
for example the different text splitters in https://js.langchain.com/docs/modules/data_connection/document_transformers/
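
For instance, reusing a LangChain text splitter from Python (the Python counterpart of the linked JS docs) could look like this; the exact import path depends on the installed LangChain version:

```python
# Example of reusing a LangChain text splitter. Depending on the installed LangChain
# version the import may instead be
# `from langchain_text_splitters import RecursiveCharacterTextSplitter`.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # max characters per chunk
    chunk_overlap=100,    # overlap to preserve context across chunk boundaries
    separators=["\n\n", "\n", " ", ""],
)

page_text = "Long Confluence page body goes here..."
chunks = splitter.split_text(page_text)
```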

#### Testing and evaluation

Each pipeline can be tested in a unit-testable way (with pytest) using the [vdk-test-utils](https://github.com/vmware/versatile-data-kit/tree/main/projects/vdk-plugins/vdk-test-utils) library.
An evaluation feature could be added in the future to evaluate the data stored in the vector database against a Q/A set.
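
A sketch of such a test, following the pattern commonly used in VDK's own plugin tests (the data job name is illustrative):

```python
# Sketch of a pytest-style test for the pipeline, in the style of VDK's plugin tests.
from vdk.plugin.test_utils.util_funcs import (
    CliEntryBasedTestRunner,
    cli_assert_equal,
    jobs_path_from_caller_directory,
)


def test_confluence_to_vector_pipeline():
    runner = CliEntryBasedTestRunner()  # plugins under test can be passed here

    result = runner.invoke(
        ["run", jobs_path_from_caller_directory("confluence-to-vector-job")]
    )

    cli_assert_equal(0, result)
    # Further assertions could query the (test) vector database and verify that
    # the expected chunks and embeddings were written.
```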

#### Example

Using the Confluence example, the activities performed by VDK look like this:

![sequence_diagram.png](sequence_diagram.png)

## API design
@@ -90,16 +152,6 @@ The table should have a structure like
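
The table definition itself is collapsed in this diff view. Purely as a hypothetical illustration (not the schema proposed in the VEP), such a table might include columns like these:

```python
# Hypothetical illustration only; the VEP's actual table definition is collapsed in
# this diff view and may differ. A real vector database would store the embedding
# natively rather than as a BLOB.
import sqlite3

ddl = """
CREATE TABLE IF NOT EXISTS vdk_confluence_doc (
    chunk_id      TEXT PRIMARY KEY,   -- unique id of the chunk
    page_id       TEXT NOT NULL,      -- source page, used to replace all of a page's chunks
    source_url    TEXT,               -- link back to the original document
    chunk_text    TEXT NOT NULL,      -- the chunked content
    embedding     BLOB,               -- the embedding vector
    last_modified TEXT                -- used for incremental updates
)
"""

with sqlite3.connect(":memory:") as conn:
    conn.execute(ddl)
```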

### Python code

For the user it may look like this:



```python
# Initialize the Confluence to Vector Database pipeline
pipeline = vdk.ToVectorPipeline(ConfluenceReader(credentials), PostgresInstance, MyEmbeddingApi())
pipeline.update_vector_database(last_timestamp)
```

The goal is to simplify the entire process into two lines of code for typical use cases, covering data extraction, chunking, embedding creation, and saving in the DB.
- Incorporate incremental updates with deduplication and deletion.
- Regularly update the vector DB with the latest content.
- Automate data extraction, chunking, embedding, and DB storage.
- Provide defaults for an easy quick start, with customization options for complex needs.

## Detailed design

The Python code could look something like this. In it we:
1. Delete any files that have been removed since the last scrape.
2. Then, in a transaction, delete all information for a page and, in the same transaction, write all the new information for that page.
3. The embedding API is abstracted into its own class, allowing users to easily provide their own embedding API.


Internally, the ToVectorPipeline may do something like this (very simplified for clarity):
```python
# Very simplified; some lines are collapsed in this diff view.
with postgres_start_transaction as transaction:
    # ... delete the page's previous records, then write the new chunks and embeddings ...
    PostgresInstance.from_documents(documents, MyEmbeddingApi())
```
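
To make the three steps above concrete, here is a slightly more complete, still hypothetical sketch; every name in it is illustrative rather than an agreed-upon implementation:

```python
# Hypothetical, more complete sketch of the internal flow described above.
# All names are illustrative; they are not part of an existing VDK API.
def update_vector_database(source, db, embedder, chunker, last_timestamp):
    # 1. Delete anything removed at the source since the last scrape.
    for page_id in source.deleted_since(last_timestamp):
        db.delete_page(page_id)

    # 2. For each changed page: in one transaction, drop its old chunks and
    #    write the freshly chunked and embedded content.
    for page in source.changed_since(last_timestamp):
        with db.transaction():
            db.delete_page(page.id)
            chunks = chunker(page.body)
            # 3. The embedding API sits behind its own interface, so users can swap it.
            vectors = embedder.embed(chunks)
            db.insert_chunks(page.id, chunks, vectors)
```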

<!--
Dig deeper into each component. The section can be as long or as short as necessary.
Consider at least the below topics but you do not need to cover those that are not applicable.