Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Misc] MongoDB Query Engine build on top of VectorDBQueryEngine #1092

Draft
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

sitloboi2012
Copy link
Collaborator

Why are these changes needed?

Follow-up discussion with @AgentGenie and @Eric-Shang in #983 related to the usage of VectorDBQueryEngine protocol

Related issue number

Closed #983 , #1004 , #688 , #941 , Open #950

Checks

@sitloboi2012 sitloboi2012 marked this pull request as draft February 22, 2025 15:10
logger.error("Failed to initialize the database: %s", e)
return False

def add_records( # type: ignore[no-untyped-def, override]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not override this method. You can either put chunk_size and chunck_overlap to init or put those in args/kwargs.

And why init_db method does not take these two parameters?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yeah that's right, I think I over-engineer this part, let me fix that again, found an easier way to do this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I think for the current approach, it does allow us more flexibility in adapting new document and transformation to get the best result, I think we could discuss more. So using the method in init_db func would be a go-to solution but lack of customizability while with the current previous approach (using SimpleDirectoryLoader + SentenceSplitter + input to Indexer) would allow us to have more room to customize

Copy link
Collaborator

@Eric-Shang Eric-Shang Feb 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think we don't need to override the add_records method. The current add_docs is not much different in loading documents from what init_db does, except allowing single url or path, right? I feel it's not necessary.

@AgentGenie
Copy link
Collaborator

For unit test, I think most unit tests in the project are using pytest. Please pytest in the future.

Copy link

codecov bot commented Feb 24, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

❗ There is a different number of reports uploaded between BASE (1792140) and HEAD (9c661ff). Click for more details.

HEAD has 1286 uploads less than BASE
Flag BASE (1792140) HEAD (9c661ff)
3.10 99 0
ubuntu-latest 145 1
falkordb 2 0
3.11 66 1
3.13 88 0
lmm 4 0
teachable 4 0
core-llm 4 0
3.9 80 0
gemini 15 0
anthropic 16 0
core-without-llm 14 1
autobuild 1 0
macos-latest 108 0
integration 11 0
browser-use 6 0
3.12 39 0
neo4j 2 0
windows-latest 119 0
crawl4ai 13 0
deepseek 1 0
interop 13 0
agent-eval 1 0
retrievechat 15 0
retrievechat-qdrant 14 0
gpt-assistant-agent 3 0
openai 1 0
cerebras 14 0
optional-deps 141 0
retrievechat-couchbase 3 0
retrievechat-mongodb 10 0
retrievechat-pgvector 10 0
commsagent-discord 9 0
llama-index-agent 3 0
commsagent-slack 9 0
commsagent-telegram 9 0
long-context 3 0
jupyter-executor 9 0
websurfer 15 0
cohere 15 0
graph-rag-falkor-db 6 0
twilio 9 0
groq 14 0
mistral 14 0
ollama 14 0
together 14 0
bedrock 14 0
swarm 14 0
reasoning 14 0
docs 6 0
interop-langchain 9 0
interop-crewai 9 0
interop-pydantic-ai 9 0
websockets 9 0

see 62 files with indirect coverage changes

except Exception as e:
logger.error("Error inserting documents into the index: %s", e)

def query(self, question: str, llm: Union[str, "BaseLanguageModel"], *args: Any, **kwargs: Any) -> Any: # type: ignore[no-any-unimported, type-arg]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

llm should be in the constructor.
And use llama_index.core.llms.llm.LLM instead. We want to reduce the dependencies not add dependencies.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @AgentGenie I remove the BaseLanguageModel and replaced it with the LLM from from llama_index.llms.langchain.base import LLM

Copy link
Collaborator

@AgentGenie AgentGenie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested the notebook which is not working.
Please

  1. add instructions on how to set up mongo db for testing in the notebook.
  2. make sure people could follow the instruction and run through the notebook.
  3. list extra dependencies with version. They should also be included in pyproject.toml under rag

…nnect_db to align with reviewer's comment and expectation, update the jupyter notebook again to align with the new mongodb_query_engine code, add on llm arg into the query function

Signed-off-by: sitloboi2012 <[email protected]>
…x-llms-langchain, update notebook for instruction usage

Signed-off-by: sitloboi2012 <[email protected]>
@sitloboi2012 sitloboi2012 force-pushed the feat/mongodb-query-engine branch from 9c661ff to a6efd9e Compare February 25, 2025 15:13
vector_db (MongoDBAtlasVectorDB): The MongoDB vector database instance.
vector_search_engine (MongoDBAtlasVectorSearch): The vector search engine.
storage_context (StorageContext): The storage context for the vector store.
indexer (Optional[VectorStoreIndex]): The index built from the documents.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index

self,
connection_string: str = "",
database_name: str = "vector_db",
embedding_function: Optional[Callable[..., Any]] = None,
Copy link
Collaborator

@Eric-Shang Eric-Shang Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be better to let users know what the default embedding function is.

def add_records(
self,
new_doc_dir: Optional[Union[str, Path]] = None,
new_doc_paths_or_urls: Optional[Union[List[Union[str, Path]], Union[str, Path]]] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow the protocol and remove singe string/path

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants