diff --git a/specs/README.md b/specs/README.md
index dcfecc2e5f..b46dd099f9 100644
--- a/specs/README.md
+++ b/specs/README.md
@@ -16,7 +16,8 @@ communicate and coordinate on new efforts for the Versatile Data Kit project.
 1. It is good idea to socialize an idea with some other contributors of VDK first. You can send your idea to VDK mailing list, slack, mail, etc.
 2. Follow the process outlined in the [VEP template](NNNN-template/README.md)
-
+3. It is strongly recommended to build the VEP incrementally (with multiple PRs). As you research and discover new things, the proposal will naturally evolve.
+   It is a very good idea for the first PR to focus only on describing the motivation and the goals, so that there is agreement and alignment on those before continuing.
 
 ## Credits
diff --git a/specs/vep-milestone-25-vector-database-ingestion/README.md b/specs/vep-milestone-25-vector-database-ingestion/README.md
index d81230ca00..01630e9a81 100644
--- a/specs/vep-milestone-25-vector-database-ingestion/README.md
+++ b/specs/vep-milestone-25-vector-database-ingestion/README.md
@@ -35,6 +35,18 @@ However we might find that it really helps to include the page title along with
 
 ## Motivation
 
+Every ML project has some form of data preprocessor that is used to prepare data for input into a model.
+This preprocessor is typically created from scratch for each project.
+That creates inconsistency and fails to take advantage of prior work.
+This means every project needs:
+* Data collection: figure out how to connect to different sources, extract data, parse it, and model it.
+* Data cleaning: do the clean-up that is largely the same for a given source (most companies use tools like Confluence, and likely 80% of the clean-up of Confluence data is similar across them).
+* Ensuring data freshness: track changes incrementally (important when the data volume is big).
+* Ensuring data privacy and security: implement measures to protect sensitive data.
+* Setting up a data pipeline that chunks the data appropriately and creates embeddings.
+
+In short, there is a need for standardized, modular preprocessing tools that can easily be adapted across different projects.
+
 #### Example problem scenario:
 A company has a powerful private LLM chatbot. However they want it to be able to answer questions using the latest version of confluence docs/jira tickets etc...
@@ -56,24 +68,74 @@ We will make requests to the API to create embeddings for us.
 After this datajob is running we will create a template from this in which we think customers will be able to adopt to meet their use cases.
 
 #### Benefits to customers:
-They will be able to follow our template to quickly create similar jobs. This gives customers the ability to get up and runnning quickly.
+They will be able to follow our template to quickly create similar jobs.
+This gives customers the ability to get up and running quickly.
 
 ## Requirements and goals
-1. There should be a single pipelines which given jira/confluence credentials can scrape the source
-2. it should chunk up the information, embed it and then save it
+1. Provide template-based pipeline construction that is customizable at any stage.
+2. It should chunk the information, embed it, and then save it.
 3. The systems should be easily configurable
-   1. Read from different sources
-   2. Different chunks sizes
-   3. Different embedders
-   4. Extra columns saved in database
-4. There should be an example on how to build your own ingestion pipeline
+   1. Read from different sources chosen by the user from a catalog of configurable data sources
+   2. Facilitate incremental updates and change tracking
+   3. Different chunk sizes and strategies
+   4. Different embedders
+   5. Extra columns saved in the database
+4. There should be an example of how to build your own ingestion pipeline or further customize this one.
 5. Only scraping new data and removing old data must be supported
+
 
 ### Non goals
 1. This only populates information into a database that could be used by as RAG system. We don't handle stuffing the actual prompts.
 
 ## High-level design
+
+This is where VDK fits in an ML data pipeline.
+The goal is to automate and accelerate the ETL phase of the ML pipeline, as shown in the diagram below.
+
+![pipeline_diagram.png](pipeline_diagram.png)
+
+### Main components
+
+#### Catalog of Data Sources
+
+[vdk-data-sources](https://github.com/vmware/versatile-data-kit/tree/main/projects/vdk-plugins/vdk-data-sources) already provides a catalog of pre-built data source plugins that developers can select from.
+Each data source plugin will come with built-in logic for data extraction and information on incremental updates.
+
+#### Catalog of Data Targets
+
+Already provided with VDK through the [Ingestion Plugins](https://github.com/vmware/versatile-data-kit/blob/main/projects/vdk-core/src/vdk/api/plugin/plugin_input.py#L230).
+One limitation that needs to be addressed is that the ingestion plugins do not support more than one configuration per method.
+In other words, only one Postgres destination instance or one Oracle destination instance is possible per data job.
+
+#### Customizable Pipeline Template
+
+A dynamic template that allows developers to specify data sources, embedding APIs, and the vector database.
+The template will automatically generate the pipeline, including logic for tracking changes, chunking the data, and generating embeddings.
+
+#### Extensibility
+
+Ability for developers to further customize the pipeline by editing or replacing any of the components within the template.
+Documentation and examples on how to make such customizations effectively.
+
+#### Automatic data change detection and updates
+
+The framework will incorporate an intelligent change detection system that monitors data sources for any additions, deletions, or modifications.
+It does so by keeping state after each ingested payload is processed.
+
+#### Chunking strategies
+
+Options for developers to define custom chunking parameters, including chunk size and chunking criteria, to suit the specific needs of their applications.
+
+Users should be able to reuse chunking logic defined in LangChain.
+For example, the different text splitters in https://js.langchain.com/docs/modules/data_connection/document_transformers/
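+
+As an illustration only, the chunking step inside the template could reuse one of the existing LangChain text splitters (the Python equivalent of the splitters linked above); the chunk size and overlap values below are placeholders, not committed defaults:
+
+```python
+from langchain.text_splitter import RecursiveCharacterTextSplitter
+
+# Placeholder parameters; in the template these would come from the pipeline configuration.
+splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
+
+# page_text stands in for the cleaned text of a single source document (e.g. a Confluence page).
+page_text = "..."
+chunks = splitter.split_text(page_text)
+```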
+
+#### Testing and evaluation
+
+Each pipeline can be tested in a "unit-testable" (pytest) way using the [vdk-test-utils](https://github.com/vmware/versatile-data-kit/tree/main/projects/vdk-plugins/vdk-test-utils) library.
+An evaluation feature could be added in the future to evaluate the data stored in the vector database against a Q/A set.
+
+#### Example
+
+Using the Confluence example, the activities done by VDK look like:
+
 ![sequence_diagram.png](sequence_diagram.png)
 
 ## API design
@@ -90,16 +152,6 @@ The table should have a structure like
 
 ### Python code
 
-I think the python code could look something like this.
-In it we:
-1. delete any files that have been removed since the last scrape
-2. Then in a transaction delete all information for a page and in the same transaction write all the new information page
-3. The embedding api is abstracted into it own class allowing users to easily provide their own embedding API
-
-For user it may look like that:
-
-
-
 ```python
 # Initialize the Confluence to Vector Database pipeline
 pipeline = vdk.ToVectorPipeline(ConfluenceReader(credentials), PostgresInstance, MyEmbeddingApi())
@@ -108,11 +160,14 @@ pipeline = vdk.ToVectorPipeline(ConfluenceReader(credentials), PostgresInstance,
 pipeline.update_vector_database(last_timestamp)
 ```
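+
+The call above is a sketch, not a committed API. To illustrate how the configurability goals (different sources, chunking strategies, embedders, extra saved columns) might surface in the same template, a more customized invocation could look roughly like this; every parameter name below is hypothetical:
+
+```python
+# Hypothetical illustration only: parameter names are not part of a committed API.
+pipeline = vdk.ToVectorPipeline(
+    source=ConfluenceReader(credentials),                # any source from the data source catalog
+    destination=PostgresInstance,                        # any supported destination from the ingestion plugins
+    embedder=MyEmbeddingApi(),                           # user-provided embedding API wrapper
+    chunking={"chunk_size": 500, "chunk_overlap": 50},   # chunking strategy parameters
+    extra_columns=["page_title", "space_key"],           # extra metadata columns saved in the database
+)
+pipeline.update_vector_database(last_timestamp)
+```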
 
-The goal is to simplify the entire process into two lines of code for typical use cases, covering data extraction, chunking, embedding creation, and saving in the DB.
-- Incorporates incremental updates with deduplication and deletion
-- Regular updates to the Vector DB with the latest content.
-- Automate data extraction, chunking, embedding, and DB storage.
-- Provide defaults for easy quick start, with customization options for complex needs
+## Detailed design
+
+The Python code could look something like this.
+In it we:
+1. Delete any files that have been removed since the last scrape.
+2. Then, in a transaction, delete all information for a page and, in the same transaction, write all the new information for that page.
+3. The embedding API is abstracted into its own class, allowing users to easily provide their own embedding API.
+
 Internally the ToVectorPipeline may do something like that (very simplified for bringing more clarity purpose):
 
 ```python
 with postgres_start_transaction as transaction:
     delete_removed_files()
     delete_updated_pages()
     ...
     PostgresInstance.from_documents(documents, MyEmbeddingApi())
 ```
 
-## Detailed design
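+
+Expanding the simplified snippet above, a rough sketch of the internal update flow might look like this; all helper and parameter names here are hypothetical, not a committed design:
+
+```python
+# Hypothetical sketch only: helper and parameter names are illustrative.
+def update_vector_database(source, destination, chunker, embedding_api, last_timestamp):
+    # 1. Drop data for pages deleted at the source since the last scrape.
+    removed_ids = source.fetch_removed_page_ids(since=last_timestamp)
+    destination.delete_chunks(page_ids=removed_ids)
+
+    # 2. For every changed page, delete the old chunks and write the new ones
+    #    inside one transaction, so readers never see a half-updated page.
+    for page in source.fetch_updated_pages(since=last_timestamp):
+        with destination.start_transaction():
+            destination.delete_chunks(page_ids=[page.id])
+            chunks = chunker.split(page.text)
+            # 3. The embedding API sits behind a user-replaceable class.
+            embeddings = embedding_api.embed(chunks)
+            destination.write_chunks(page.id, chunks, embeddings)
+```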