Add WeaviateDocumentIngestOperator #36402

Merged
merged 1 commit into from
Dec 24, 2023

Conversation

utkarsharma2
Contributor

@utkarsharma2 utkarsharma2 commented Dec 24, 2023

This PR exposes create_or_replace_document_objects as an operator. It handles a very common scenario: ingesting objects derived from unique documents, where the stored objects have to keep up with changes to those documents.

Example

Suppose we have the document https://en.wikipedia.org/wiki/Taj_Mahal. The entire document is split into smaller chunks because LLMs limit how much data they can handle in a single call.

Assume that the document is split into two chunks:

Chunk 1:

The Taj Mahal ('Crown of the Palace') is an ivory-white marble mausoleum on the right bank of the river Yamuna in Agra, Uttar Pradesh, India. It was commissioned in 1631 by the fifth Mughal emperor, Shah Jahan, to house the tomb of his beloved wife, Mumtaz Mahal; it also houses the tomb of Shah Jahan himself.

Chunk 2:

The tomb is the centerpiece of a 17-hectare (42-acre) complex, which includes a mosque and a guest house, and is set in formal gardens bounded on three sides by a crenellated wall.

Changes:

For an LLM to answer questions correctly, it must see only up-to-date information. That is why only the latest set of chunks should be kept in the database.

Suppose, for example, that we later learn the Taj Mahal was actually commissioned in 1593. The document is updated, and the chunking/tokenizing strategy has changed as well, so we now have a different set of chunks:

Chunk 1:

The Taj Mahal ('Crown of the Palace') is an ivory-white marble mausoleum on the right bank of the river Yamuna in Agra, Uttar Pradesh, India.

Chunk 2:

It was commissioned in 1593 by the fifth Mughal emperor, Shah Jahan, to house the tomb of his beloved wife, Mumtaz Mahal; it also houses the tomb of Shah Jahan himself.

Chunk 3:

The tomb is the centerpiece of a 17-hectare (42-acre) complex, which includes a mosque and a guest house, and is set in formal gardens bounded on three sides by a crenellated wall.

With these new chunks, we have no way of knowing exactly which old chunk each one replaces, because a document can be chunked/tokenized in many ways and each strategy may split it at different boundaries. Our best bet is therefore to drop all the objects belonging to a document and re-create them from the new chunks.
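
The drop-and-recreate strategy described above can be sketched with an in-memory list standing in for the database. This is only an illustration, not the operator's implementation: replace_document, the store list, and the "doc" and "text" keys are hypothetical names chosen for the example.

```python
from typing import Dict, List

Obj = Dict[str, str]


def replace_document(
    store: List[Obj],
    doc_key: str,
    new_chunks: List[str],
    document_column: str = "doc",
) -> List[Obj]:
    """Drop every object derived from doc_key, then re-create them from new_chunks.

    Hypothetical helper: `store` is a plain list standing in for the database.
    """
    # Keep only the objects that belong to other documents.
    kept = [obj for obj in store if obj[document_column] != doc_key]
    # Re-create the document's objects from the latest chunks.
    created = [{document_column: doc_key, "text": chunk} for chunk in new_chunks]
    return kept + created


store = [
    {"doc": "Taj_Mahal", "text": "old chunk 1 (commissioned in 1631)"},
    {"doc": "Taj_Mahal", "text": "old chunk 2"},
    {"doc": "Red_Fort", "text": "unrelated chunk"},
]
updated = replace_document(
    store, "Taj_Mahal", ["new chunk 1", "new chunk 2 (1593)", "new chunk 3"]
)
```

Because no attempt is made to match old chunks to new ones, the result is the same no matter how the chunk boundaries moved between versions.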

WeaviateDocumentIngestOperator handles these complexities. It operates at the document level and offers an existing param with the following possible values:

  1. replace: replace the existing objects with new objects. This option requires identifying the
    objects that belong to a document, which by default is done using the document_column field.
  2. skip: skip the existing objects and only add the missing objects of a document.
  3. error: raise an error if an attempt is made to create an object belonging to an existing document.
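
As a rough illustration of how the three modes differ, here is a minimal in-memory sketch. It is not the operator's code. The ingest function and the list-based store are hypothetical, and treating an object as "existing" when an identical object (same document key and text) is already stored is an assumption made for this example; the real operator may identify objects differently.

```python
from typing import Dict, List

Obj = Dict[str, str]


def ingest(
    store: List[Obj],
    incoming: List[Obj],
    existing: str = "skip",
    document_column: str = "doc",
) -> List[Obj]:
    """Return the new store contents after ingesting `incoming` (hypothetical helper)."""
    if existing not in ("replace", "skip", "error"):
        raise ValueError(f"invalid value for existing: {existing!r}")

    incoming_docs = {obj[document_column] for obj in incoming}
    stored_docs = {obj[document_column] for obj in store}

    if existing == "error":
        # Refuse to touch documents that already have objects in the store.
        clash = incoming_docs & stored_docs
        if clash:
            raise ValueError(f"documents already exist: {sorted(clash)}")
        return store + incoming

    if existing == "replace":
        # Drop every stored object of the incoming documents, then add the new ones.
        kept = [obj for obj in store if obj[document_column] not in incoming_docs]
        return kept + incoming

    # existing == "skip": add only objects that are not stored yet
    # (assumption: "existing" means an identical object is already present).
    missing = [obj for obj in incoming if obj not in store]
    return store + missing


store = [{"doc": "Taj_Mahal", "text": "chunk A"}]
incoming = [
    {"doc": "Taj_Mahal", "text": "chunk A"},
    {"doc": "Taj_Mahal", "text": "chunk B"},
]

replaced = ingest(store, incoming, existing="replace")
skipped = ingest(store, incoming, existing="skip")
try:
    ingest(store, incoming, existing="error")
    raised = False
except ValueError:
    raised = True
```

In this sketch, replace and skip end up with the same two objects only because chunk A is unchanged. If the stored chunk were stale, skip would leave it in place, whereas replace would remove it.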

@potiuk potiuk merged commit 97d2266 into apache:main Dec 24, 2023
52 checks passed
@kaxil kaxil deleted the AddWeaviateDocuemntIngestOperator branch December 24, 2023 14:44