-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support oracle database as VectorStore - Fix:#764 #765
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Auto Pull Request Review from LlamaPReview
1. Overview
1.1 PR Summary
- Business value and requirements alignment: This PR introduces support for Oracle Database as a VectorStore in the
vanna-ai/vanna
project. This enhancement expands the compatibility of the project with different database systems, specifically leveraging Oracle Database 23ai's vector capabilities. - Key components modified: The primary modifications include updates to
README.md
,pyproject.toml
, and the addition of new filessrc/vanna/oracle/__init__.py
andsrc/vanna/oracle/oracle_vector.py
. - Impact assessment: The addition of Oracle Database support broadens the project's database integration capabilities, potentially attracting users who prefer Oracle Database. It introduces a new dependency on
oracledb
andchromadb
. - System dependencies and integration impacts: The changes involve integrating with Oracle Database 23ai's vector capabilities and ensuring consistent implementation compared to existing VectorStores. The PR also introduces a new dependency on
oracledb
and an unexpected dependency onchromadb
, which needs further investigation.
1.2 Architecture Changes
- System design modifications: The architectural change involves adding a new VectorStore implementation for Oracle Database. This includes creating new tables and collections in Oracle Database to store embeddings and related metadata.
- Component interactions: The
Oracle_VectorStore
class interacts with Oracle Database using theoracledb
library. It manages connections, creates necessary tables and collections, and performs vector operations. - Integration points: The integration points include the Oracle Database connection setup, vector embedding generation, and CRUD operations on the embeddings stored in Oracle Database.
2. Detailed Technical Analysis
2.1 Code Logic Deep-Dive
Core Logic Changes
- src/vanna/oracle/oracle_vector.py - init
- Submitted PR Code:
from chromadb.utils import embedding_functions
default_ef = embedding_functions.DefaultEmbeddingFunction()
class Oracle_VectorStore(VannaBase):
def __init__(self, config=None):
VannaBase.__init__(self, config=config)
if config is not None:
self.embedding_function = config.get(
"embedding_function",
default_ef
)
self.pre_delete_collection = config.get("pre_delete_collection",
False)
self.cmetadata = config.get("cmetadata", {"created_by": "oracle"})
else:
self.embedding_function = default_ef
self.pre_delete_collection = False
self.cmetadata = {"created_by": "oracle"}
self.oracle_conn = oracledb.connect(dsn=config.get("dsn"))
self.oracle_conn.call_timeout = 30000
self.documentation_collection = "documentation"
self.ddl_collection = "ddl"
self.sql_collection = "sql"
self.n_results = config.get("n_results", 10)
self.n_results_ddl = config.get("n_results_ddl", self.n_results)
self.n_results_sql = config.get("n_results_sql", self.n_results)
self.n_results_documentation = config.get("n_results_documentation",
self.n_results)
self.create_tables_if_not_exists()
self.create_collections_if_not_exists(self.documentation_collection)
self.create_collections_if_not_exists(self.ddl_collection)
self.create_collections_if_not_exists(self.sql_collection)
- Analysis:
- The code initializes
Oracle_VectorStore
, inheriting fromVannaBase
. It useschromadb.utils.embedding_functions.DefaultEmbeddingFunction
as the default embedding function. This explains the unexpected dependency onchromadb
mentioned in the initial review. - The
chromadb
dependency is solely for the default embedding function. This introduces an unnecessary external dependency if the user intends to use other embedding functions or if a simpler default is sufficient. - The constructor sets up Oracle connection using
oracledb.connect
and configures timeouts and collection names. - It initializes
n_results
related parameters, which control the number of results returned by search queries. It's good to see these are configurable. - It calls methods to create tables and default collections if they don't exist. This is a standard setup procedure for vector stores.
- The code initializes
- LlamaPReview Suggested Improvements:
import json
import uuid
from typing import List, Optional, Tuple
import oracledb
import pandas as pd
from ..base import VannaBase
class DefaultEmbeddingFunction: # LlamaPReview: Define a simple default embedding function to remove chromadb dependency
def __call__(self, texts):
# In a real scenario, this would use an actual embedding model.
# For now, return a placeholder embedding (e.g., based on text length).
return [[float(len(text))] * 128 for text in texts] # Placeholder embedding of size 128
default_ef = DefaultEmbeddingFunction()
class Oracle_VectorStore(VannaBase):
def __init__(self, config=None):
VannaBase.__init__(self, config=config)
if config is not None:
self.embedding_function = config.get(
"embedding_function",
default_ef
)
self.pre_delete_collection = config.get("pre_delete_collection",
False)
self.cmetadata = config.get("cmetadata", {"created_by": "oracle"})
else:
self.embedding_function = default_ef
self.pre_delete_collection = False
self.cmetadata = {"created_by": "oracle"}
self.oracle_conn = oracledb.connect(dsn=config.get("dsn"))
self.oracle_conn.call_timeout = 30000
self.documentation_collection = "documentation"
self.ddl_collection = "ddl"
self.sql_collection = "sql"
self.n_results = config.get("n_results", 10)
self.n_results_ddl = config.get("n_results_ddl", self.n_results)
self.n_results_sql = config.get("n_results_sql", self.n_results)
self.n_results_documentation = config.get("n_results_documentation",
self.n_results)
self.create_tables_if_not_exists()
self.create_collections_if_not_exists(self.documentation_collection)
self.create_collections_if_not_exists(self.ddl_collection)
self.create_collections_if_not_exists(self.sql_collection)
- Improvement rationale:
- Technical benefits: Removes the unnecessary
chromadb
dependency, simplifying the project and reducing potential dependency conflicts. Introduces a lightweight, self-contained default embedding function. - Business value: Reduces the complexity of setup and deployment. Users who don't need
chromadb
's default embedding function won't have to install it. - Risk assessment: Lowers dependency risk. The placeholder embedding function is a starting point and can be replaced with a more sophisticated default or user-provided function as needed.
- Technical benefits: Removes the unnecessary
Core Logic Changes
- src/vanna/oracle/oracle_vector.py - search_tables_metadata
- Submitted PR Code:
def search_tables_metadata(self,
engine: str = None,
catalog: str = None,
schema: str = None,
table_name: str = None,
ddl: str = None,
size: int = 10,
**kwargs) -> list:
pass
- Analysis:
- The
search_tables_metadata
method is implemented withpass
, meaning it does nothing. The PR description mentions that@abstractmethod search_tables_metadata in VannaBase is not implemented
and suggests a potential bug inopensearch_vector.py
. - This implementation confirms that this method is indeed not implemented in
Oracle_VectorStore
. It's crucial to understand if this is intentional, a temporary placeholder, or an oversight. - The initial review correctly points out the need to analyze this method and its relation to the potential bug mentioned in the PR description.
- If
search_tables_metadata
is intended to be part of theVannaBase
contract and is required for the functionality ofvanna-ai/vanna
, then thispass
implementation is incomplete and needs to be addressed.
- The
- LlamaPReview Suggested Improvements: [Skip this section as the suggested improvement depends on the intended functionality of
search_tables_metadata
which is not clear from the PR. Further investigation is needed to understand the expected behavior of this method inVannaBase
and implement it accordingly for Oracle.] - Improvement rationale:
- Technical benefits: Addresses the incomplete implementation of
search_tables_metadata
. Fulfills the interface contract defined byVannaBase
(if intended). - Business value: Enables the functionality related to searching table metadata, which might be crucial for the overall features of
vanna-ai/vanna
. - Risk assessment: High risk if
search_tables_metadata
is a required method. Leaving it aspass
will lead to unexpected behavior or errors when this method is called. Further investigation is needed to determine the correct implementation.
- Technical benefits: Addresses the incomplete implementation of
Core Logic Changes
- src/vanna/oracle/oracle_vector.py - get_related_ddl
- Submitted PR Code:
def get_related_ddl(self, question: str, **kwargs) -> list:
collection = self.get_collection(self.ddl_collection)
cursor = self.oracle_conn.cursor()
cursor.setinputsizes(None, oracledb.DB_TYPE_VECTOR)
cursor.execute(
"""
SELECT
document
FROM oracle_embedding
WHERE
collection_id = :1
ORDER BY VECTOR_DISTANCE(embedding, TO_VECTOR(:2), COSINE)
FETCH FIRST :top_k ROWS ONLY
""", [
collection["uuid"],
self.generate_embedding(question),
100
]
)
results = cursor.fetchall()
self.oracle_conn.commit()
cursor.close()
return Oracle_VectorStore._extract_documents(results)
- Analysis:
- In
get_related_ddl
, theFETCH FIRST
clause uses a hardcoded value100
instead of using the configuredn_results_ddl
orn_results
. This inconsistency can lead to unexpected behavior and makes the number of returned results less configurable than intended. - Similar hardcoding is also present in
get_related_documentation
. - The initial review did not explicitly mention this hardcoding issue.
- In
- LlamaPReview Suggested Improvements:
def get_related_ddl(self, question: str, **kwargs) -> list:
collection = self.get_collection(self.ddl_collection)
cursor = self.oracle_conn.cursor()
cursor.setinputsizes(None, oracledb.DB_TYPE_VECTOR)
cursor.execute(
"""
SELECT
document
FROM oracle_embedding
WHERE
collection_id = :1
ORDER BY VECTOR_DISTANCE(embedding, TO_VECTOR(:2), COSINE)
FETCH FIRST :top_k ROWS ONLY
""", [
collection["uuid"],
self.generate_embedding(question),
self.n_results_ddl # LlamaPReview: Use n_results_ddl instead of hardcoded 100
]
)
results = cursor.fetchall()
self.oracle_conn.commit()
cursor.close()
return Oracle_VectorStore._extract_documents(results)
- Improvement rationale:
- Technical benefits: Fixes inconsistency by using the configurable
n_results_ddl
parameter, making the behavior consistent with other similar methods and respecting user configurations. - Business value: Provides expected configurability for users who want to control the number of results returned by DDL related queries.
- Risk assessment: Low risk. Ensures consistent behavior and better configurability.
- Technical benefits: Fixes inconsistency by using the configurable
Core Logic Changes
- src/vanna/oracle/oracle_vector.py - update_training_data
- Submitted PR Code:
def update_training_data(self, id: str, train_type: str, question: str,
**kwargs) -> bool:
print(f"{train_type=}")
update_content = kwargs["content"]
if train_type == 'sql':
update_json = json.dumps(
{
"question": question,
"sql": update_content,
}
)
elif train_type == 'ddl':
update_json = json.dumps(
{
"question": None,
"ddl": update_content,
}
)
elif train_type == 'documentation':
update_json = json.dumps(
{
"question": None,
"documentation": update_content,
}
)
else:
update_json = json.dumps(
{
"question": question,
"sql": update_content,
}
)
cursor = self.oracle_conn.cursor()
cursor.setinputsizes(oracledb.DB_TYPE_VECTOR, oracledb.DB_TYPE_JSON)
cursor.execute(
"""
UPDATE
oracle_embedding
SET
embedding = TO_VECTOR(:1),
document = JSON_MERGEPATCH(document, :2)
WHERE
uuid = :3
""", [
self.generate_embedding(update_content),
update_json,
id
]
)
self.oracle_conn.commit()
cursor.close()
return True
- Analysis:
- In
update_training_data
, the logic for constructingupdate_json
is somewhat repetitive. Theelse
condition duplicates the 'sql' case. This could be simplified. - More importantly, the code uses
JSON_MERGEPATCH
to update thedocument
column. It's essential to understand the behavior ofJSON_MERGEPATCH
in OracleDB, especially regarding how it handles nested JSON structures and potential data loss or corruption if the existingdocument
structure is not as expected. It's crucial to ensure thatJSON_MERGEPATCH
is the correct function for the intended update semantics. If the goal is to replace the entire document content based ontrain_type
andupdate_content
, thenJSON_MERGEPATCH
might be unnecessarily complex and potentially error-prone compared to simply replacing the entiredocument
JSON. - The initial review did not delve into the specifics of
JSON_MERGEPATCH
usage.
- In
- LlamaPReview Suggested Improvements:
def update_training_data(self, id: str, train_type: str, question: str,
**kwargs) -> bool:
print(f"{train_type=}")
update_content = kwargs["content"]
if train_type == 'sql':
update_json = json.dumps(
{
"question": question,
"sql": update_content,
}
)
elif train_type == 'ddl':
update_json = json.dumps(
{
"question": None,
"ddl": update_content,
}
)
elif train_type == 'documentation':
update_json = json.dumps(
{
"question": None,
"documentation": update_content,
}
)
else: # LlamaPReview: Consider raising an exception for unknown train_type instead of defaulting to 'sql'
raise ValueError(f"Unknown train_type: {train_type}") # LlamaPReview: Raise ValueError for unknown train_type
cursor = self.oracle_conn.cursor()
cursor.setinputsizes(oracledb.DB_TYPE_VECTOR, oracledb.DB_TYPE_JSON)
cursor.execute(
"""
UPDATE
oracle_embedding
SET
embedding = TO_VECTOR(:1),
document = :2 # LlamaPReview: Replace entire document instead of using JSON_MERGEPATCH for simplicity and clarity
WHERE
uuid = :3
""", [
self.generate_embedding(update_content),
update_json, # LlamaPReview: Use update_json directly to replace the document
id
]
)
self.oracle_conn.commit()
cursor.close()
return True
- Improvement rationale:
- Technical benefits: Simplifies the
UPDATE
query by replacing the entiredocument
JSON, making the update logic clearer and potentially more efficient. Removes the ambiguity and potential complexity ofJSON_MERGEPATCH
. Adds error handling for unknowntrain_type
values, improving robustness. - Business value: Reduces potential data corruption risks associated with incorrect
JSON_MERGEPATCH
usage. Makes the code easier to understand and maintain. - Risk assessment: Low risk. Replacing the entire document is a safer and more predictable operation in this context, assuming the intention is to update the content based on
train_type
. Raising an error for unknowntrain_type
prevents unexpected behavior.
- Technical benefits: Simplifies the
2.2 Implementation Quality
- Code organization and structure: The code is well-organized, with clear separation of concerns. The
Oracle_VectorStore
class is well-structured, and methods are logically grouped. - Design patterns usage: The class follows the design patterns established in
VannaBase
, ensuring consistency with other VectorStore implementations. - Error handling approach: The code uses try-except blocks where necessary, and parameterized queries are used to prevent SQL injection risks. However, more robust error handling could be added, especially for database connection issues.
- Resource management: The code properly manages database connections and cursors, ensuring they are closed after use. However, more explicit error handling for resource management would improve robustness.
3. Critical Findings
3.1 Potential Issues
-
🔴 Critical Issues
- Incomplete Implementation of
search_tables_metadata
: Thesearch_tables_metadata
method is implemented withpass
, meaning it does nothing. This is a critical issue if this method is part of theVannaBase
contract and is required for the functionality ofvanna-ai/vanna
.- Impact: Leaving this method as
pass
will lead to unexpected behavior or errors when this method is called. - Recommendation: Investigate the intended functionality of
search_tables_metadata
inVannaBase
and implement it accordingly for Oracle.
- Impact: Leaving this method as
- Incomplete Implementation of
-
🟡 Warnings
- Unnecessary Dependency on
chromadb
: Thechromadb
dependency is solely for the default embedding function. This introduces an unnecessary external dependency if the user intends to use other embedding functions or if a simpler default is sufficient.- Potential risks: Increases the complexity of setup and deployment. Users who don't need
chromadb
's default embedding function will have to install it. - Suggested improvements: Remove the
chromadb
dependency and introduce a lightweight, self-contained default embedding function.
- Potential risks: Increases the complexity of setup and deployment. Users who don't need
- Hardcoded Values in
get_related_ddl
andget_related_documentation
: TheFETCH FIRST
clause uses a hardcoded value100
instead of using the configuredn_results_ddl
orn_results
. This inconsistency can lead to unexpected behavior and makes the number of returned results less configurable than intended.- Potential risks: Inconsistent behavior and reduced configurability.
- Suggested improvements: Use the configurable
n_results_ddl
parameter inget_related_ddl
and similar parameters in other methods to ensure consistent behavior and better configurability.
- Unnecessary Dependency on
3.2 Code Quality Concerns
- Maintainability aspects: The code is generally maintainable, but the use of
JSON_MERGEPATCH
inupdate_training_data
introduces complexity and potential risks. Simplifying this logic would improve maintainability. - Readability issues: The code is readable, but some repetitive logic in
update_training_data
could be simplified for better readability. - Performance bottlenecks: The use of
JSON_MERGEPATCH
inupdate_training_data
could introduce performance bottlenecks, especially if the existingdocument
structure is not as expected. Replacing the entiredocument
JSON would be more efficient.
4. Security Assessment
- Authentication/Authorization impacts: The code does not directly handle authentication or authorization, but it is crucial to ensure secure database connection practices are implemented, especially regarding credential management.
- Data handling concerns: The code handles embedding data and related metadata. It is essential to ensure that this data is handled securely and that sensitive information is not exposed.
- Input validation: The code uses parameterized queries, which helps prevent SQL injection risks. However, additional input validation could be added to ensure the robustness of the code.
- Security best practices: Follow security best practices for database connections, including secure credential management and input validation.
- Potential security risks: The primary security risk is related to database connection practices and credential management. Ensure that these are handled securely.
- Mitigation strategies: Implement secure credential management practices and ensure that database connections are handled securely.
- Security testing requirements: Conduct security testing to ensure that the code handles database connections and data securely.
5. Testing Strategy
5.1 Test Coverage
- Unit test analysis: Ensure that unit tests are implemented for all methods in
Oracle_VectorStore
. This includes testing the initialization, embedding generation, and CRUD operations on the embeddings stored in Oracle Database. - Integration test requirements: Implement integration tests with a live Oracle 23ai database instance to ensure real-world compatibility and functionality.
- Edge cases coverage: Ensure that edge cases are covered in the tests, including handling of different
train_type
values inupdate_training_data
and different configurations forn_results
parameters.
5.2 Test Recommendations
Suggested Test Cases
# Sample test code for update_training_data
def test_update_training_data_sql(self):
store = Oracle_VectorStore(config={"dsn": "your_dsn"})
id = store.add_question_sql("test question", "test sql")
store.update_training_data(id, "sql", "test question", content="updated sql")
result = store.get_similar_question_sql("test question")
assert result[0]["sql"] == "updated sql"
def test_update_training_data_ddl(self):
store = Oracle_VectorStore(config={"dsn": "your_dsn"})
id = store.add_ddl("test ddl")
store.update_training_data(id, "ddl", None, content="updated ddl")
result = store.get_related_ddl("test question")
assert result[0]["ddl"] == "updated ddl"
- Coverage improvements: Ensure that all methods and edge cases are covered in the tests.
- Performance testing needs: Conduct performance testing to ensure that the code handles large datasets efficiently.
6. Documentation & Maintenance
- Documentation updates needed (API, architecture, configuration): Update the documentation to include the new
Oracle_VectorStore
class and its methods. Ensure that the configuration options are clearly documented. - Long-term maintenance considerations: Ensure that the code is maintainable and that any dependencies are well-documented. Remove unnecessary dependencies to simplify maintenance.
- Technical debt and monitoring requirements: Monitor the performance and security of the code. Address any technical debt related to the use of
JSON_MERGEPATCH
and hardcoded values.
7. Deployment & Operations
- Deployment impact and strategy: Ensure that the deployment process includes the installation of the
oracledb
dependency. Remove thechromadb
dependency to simplify deployment. - Key operational considerations: Monitor the performance and security of the database connections. Ensure that the database is properly configured to handle the vector operations.
8. Summary & Recommendations
8.1 Key Action Items
- Investigate and implement
search_tables_metadata
: Determine the intended functionality ofsearch_tables_metadata
inVannaBase
and implement it accordingly for Oracle. - Remove
chromadb
dependency: Introduce a lightweight, self-contained default embedding function to remove the unnecessarychromadb
dependency. - Use configurable parameters in
get_related_ddl
and similar methods: Replace hardcoded values with configurable parameters to ensure consistent behavior and better configurability. - Simplify
update_training_data
: Replace the use ofJSON_MERGEPATCH
with a simpler approach to update the entiredocument
JSON, improving maintainability and performance.
8.2 Future Considerations
- Technical evolution path: Continuously improve the implementation of
Oracle_VectorStore
based on feedback and testing results. - Business capability evolution: Expand the capabilities of
vanna-ai/vanna
to support more database systems and improve the overall functionality. - System integration impacts: Ensure that the integration with Oracle Database is robust and secure, and that it meets the needs of the users.
💡 Help Shape LlamaPReview
How's this review format working for you? Vote in our Github Discussion Polls to help us improve your review experience!
zainhoda
approved these changes
Feb 8, 2025
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Support oracle database as VectorStore - Fix:#764
oracle database 23ai is supporting vector type, so add support for oracle database.
@AbstractMethod search_tables_metadata in VannaBase is not implemented.
I think this maybe a bug from opensearch_vector.py, all other vector stores should be error now.