Add WeaviateDocumentIngestOperator #36402
Merged
This PR exposes create_or_replace_document_objects as an operator. It handles a very common scenario: ingesting objects derived from unique documents while keeping up with changes to those documents.
Example
Suppose we have the document https://en.wikipedia.org/wiki/Taj_Mahal. The entire document is split into smaller chunks because LLMs limit how much data they can handle in a single call.
Assume the document is converted into two chunks:
Chunk 1:
The Taj Mahal 'Crown of the Palace' is an ivory-white marble mausoleum on the right bank of the river Yamuna Agra Uttar Pradesh India. It was commissioned in 1631 by the fifth Mughal emperor, Shah Jahan to house the tomb of his beloved wife, Mumtaz Mahal it also houses the tomb of Shah Jahan himself.
Chunk 2:
The tomb is the centerpiece of a 17-hectare (42-acre) complex, which includes a mosque and a guest house, and is set in formal gardens bounded on three sides by a crenellated wall.
Changes:
For LLMs to answer questions correctly, they must see only up-to-date information, which is why only the latest set of chunks should be kept in the database.
Suppose, for example, that we later learn the Taj Mahal was actually commissioned in 1593. The document changes, and the chunking/tokenizing strategy changes as well, so we now have a different set of chunks:
Chunk 1:
The Taj Mahal 'Crown of the Palace' is an ivory-white marble mausoleum on the right bank of the river Yamuna Agra Uttar Pradesh India.
Chunk 2:
It was commissioned in 1593 by the fifth Mughal emperor, Shah Jahan to house the tomb of his beloved wife, Mumtaz Mahal it also houses the tomb of Shah Jahan himself.
Chunk 3:
The tomb is the centerpiece of a 17-hectare (42-acre) complex, which includes a mosque and a guest house, and is set in formal gardens bounded on three sides by a crenellated wall.
With these new chunks, we have no way of knowing which exact chunk to replace, because a document can be chunked/tokenized in multiple ways, each of which may split it differently. So our best bet is to drop all the objects belonging to the document and re-create the document entirely.
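The drop-and-re-create idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the hook's implementation: the store is an in-memory list, and the function name replace_document and the "doc" and "text" keys are hypothetical.

```python
def replace_document(store, doc_url, new_chunks):
    """Drop every stored chunk of doc_url, then insert the new chunks.

    `store` is a plain list of dicts standing in for the vector database
    (an assumption made for illustration only).
    """
    # Remove all objects that belong to the document, whatever their content.
    store[:] = [obj for obj in store if obj["doc"] != doc_url]
    # Re-create the document from the fresh set of chunks.
    store.extend({"doc": doc_url, "text": text} for text in new_chunks)
    return store


taj = "https://en.wikipedia.org/wiki/Taj_Mahal"
store = [
    {"doc": taj, "text": "... It was commissioned in 1631 ..."},
    {"doc": taj, "text": "The tomb is the centerpiece ..."},
    {"doc": "other-document", "text": "unrelated chunk"},
]
replace_document(
    store,
    taj,
    [
        "The Taj Mahal ... India.",
        "It was commissioned in 1593 ...",
        "The tomb is the centerpiece ...",
    ],
)
```

Because the old chunks are removed by document key rather than matched one by one, the outdated 1631 chunk disappears even though no new chunk lines up with it, and objects from other documents are left untouched.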
WeaviateDocumentIngestOperator handles these complexities. It operates at the document level and offers an existing param with the following possible values:
replace: replace the existing objects with new objects. This option requires identifying the objects that belong to a document, which by default is done using the document_column field.
skip: skip the existing objects and only add the missing objects of a document.
error: raise an error on an attempt to create an object that belongs to an existing document.
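To make the three modes concrete, here is a minimal in-memory sketch of how they might behave. It is not the operator's actual code: the ingest function, the dict-based store, and the ID scheme (a UUID5 derived from the document key and chunk text) are all assumptions made for illustration.

```python
import uuid


def object_id(obj, document_column):
    # Hypothetical ID scheme: deterministic UUID from document key + chunk text.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{obj[document_column]}|{obj['text']}"))


def ingest(store, objects, document_column="doc", existing="skip"):
    """Add `objects` to `store` (a dict of id -> object) per the `existing` mode."""
    if existing not in {"replace", "skip", "error"}:
        raise ValueError(f"invalid value for existing: {existing!r}")

    incoming_docs = {obj[document_column] for obj in objects}
    stored_docs = {obj[document_column] for obj in store.values()}

    if existing == "error":
        clash = incoming_docs & stored_docs
        if clash:
            raise ValueError(f"documents already exist: {sorted(clash)}")
    elif existing == "replace":
        # Drop every object belonging to a document that is being re-ingested.
        stale = [k for k, o in store.items() if o[document_column] in incoming_docs]
        for key in stale:
            del store[key]

    for obj in objects:
        # In "skip" mode, objects whose ID is already stored are left untouched.
        store.setdefault(object_id(obj, document_column), obj)
    return store
```

For example, ingesting chunks x and y of document a, then calling ingest again with chunks y and z and existing="skip", leaves three objects in the store, whereas existing="replace" with a single new chunk leaves only that chunk for document a.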