Support oracle database as VectorStore - Fix:#764 #765

Merged
merged 6 commits into vanna-ai:main on Feb 8, 2025

Conversation

@engchina engchina commented Feb 4, 2025

Support oracle database as VectorStore - Fix:#764

Oracle Database 23ai supports a native VECTOR type, so this adds support for Oracle Database as a VectorStore.

The @abstractmethod search_tables_metadata declared in VannaBase is not implemented.
I think this may be a bug introduced along with opensearch_vector.py; all other vector stores should currently raise errors because of it.
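As context for the point above: assuming VannaBase declares search_tables_metadata with @abstractmethod on an abc.ABC base (which the wording above suggests, but this PR does not show), any vector store that does not override it can no longer be instantiated. A minimal, self-contained illustration with hypothetical class names:

  # Hypothetical classes for illustration only -- not vanna's actual code.
  from abc import ABC, abstractmethod

  class Base(ABC):
      @abstractmethod
      def search_tables_metadata(self, **kwargs) -> list:
          ...

  class StoreWithOverride(Base):
      def search_tables_metadata(self, **kwargs) -> list:
          return []

  class StoreWithoutOverride(Base):
      pass

  StoreWithOverride()  # instantiates fine
  try:
      StoreWithoutOverride()
  except TypeError as exc:
      # e.g. "Can't instantiate abstract class StoreWithoutOverride ..."
      print(exc)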

@llamapreview llamapreview bot left a comment

Auto Pull Request Review from LlamaPReview

1. Overview

1.1 PR Summary

  • Business value and requirements alignment: This PR introduces support for Oracle Database as a VectorStore in the vanna-ai/vanna project. This enhancement expands the compatibility of the project with different database systems, specifically leveraging Oracle Database 23ai's vector capabilities.
  • Key components modified: The primary modifications include updates to README.md, pyproject.toml, and the addition of new files src/vanna/oracle/__init__.py and src/vanna/oracle/oracle_vector.py.
  • Impact assessment: The addition of Oracle Database support broadens the project's database integration capabilities, potentially attracting users who prefer Oracle Database.
  • System dependencies and integration impacts: The changes integrate with Oracle Database 23ai's vector capabilities while keeping the implementation consistent with the existing VectorStores. The PR introduces a new dependency on oracledb and an unexpected dependency on chromadb, which needs further investigation.

1.2 Architecture Changes

  • System design modifications: The architectural change involves adding a new VectorStore implementation for Oracle Database. This includes creating new tables and collections in Oracle Database to store embeddings and related metadata.
  • Component interactions: The Oracle_VectorStore class interacts with Oracle Database using the oracledb library. It manages connections, creates necessary tables and collections, and performs vector operations.
  • Integration points: The integration points include the Oracle Database connection setup, vector embedding generation, and CRUD operations on the embeddings stored in Oracle Database.
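To make the storage model above concrete, the following is a hypothetical sketch of what the backing tables could look like. The column names are inferred from the queries quoted later in this review (oracle_embedding with uuid, collection_id, document, and embedding) and from the cmetadata config option; the actual DDL produced by create_tables_if_not_exists in the PR may differ.

  # Hypothetical schema sketch -- inferred from the queries in this review,
  # not copied from the PR's create_tables_if_not_exists implementation.
  CREATE_COLLECTION_TABLE = """
      CREATE TABLE IF NOT EXISTS oracle_collection (
          uuid      VARCHAR2(64)  PRIMARY KEY,
          name      VARCHAR2(128) NOT NULL,
          cmetadata JSON
      )
  """

  CREATE_EMBEDDING_TABLE = """
      CREATE TABLE IF NOT EXISTS oracle_embedding (
          uuid          VARCHAR2(64) PRIMARY KEY,
          collection_id VARCHAR2(64) REFERENCES oracle_collection (uuid),
          document      JSON,
          embedding     VECTOR
      )
  """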

2. Detailed Technical Analysis

2.1 Code Logic Deep-Dive

Core Logic Changes

  • src/vanna/oracle/oracle_vector.py - init
    • Submitted PR Code:
    from chromadb.utils import embedding_functions

    default_ef = embedding_functions.DefaultEmbeddingFunction()

    class Oracle_VectorStore(VannaBase):
      def __init__(self, config=None):
        VannaBase.__init__(self, config=config)

        if config is not None:
          self.embedding_function = config.get(
            "embedding_function",
            default_ef
          )
          self.pre_delete_collection = config.get("pre_delete_collection",
                                                  False)
          self.cmetadata = config.get("cmetadata", {"created_by": "oracle"})
        else:
          self.embedding_function = default_ef
          self.pre_delete_collection = False
          self.cmetadata = {"created_by": "oracle"}

        self.oracle_conn = oracledb.connect(dsn=config.get("dsn"))
        self.oracle_conn.call_timeout = 30000
        self.documentation_collection = "documentation"
        self.ddl_collection = "ddl"
        self.sql_collection = "sql"
        self.n_results = config.get("n_results", 10)
        self.n_results_ddl = config.get("n_results_ddl", self.n_results)
        self.n_results_sql = config.get("n_results_sql", self.n_results)
        self.n_results_documentation = config.get("n_results_documentation",
                                                  self.n_results)
        self.create_tables_if_not_exists()
        self.create_collections_if_not_exists(self.documentation_collection)
        self.create_collections_if_not_exists(self.ddl_collection)
        self.create_collections_if_not_exists(self.sql_collection)
  • Analysis:
    • The code initializes Oracle_VectorStore, inheriting from VannaBase. It uses chromadb.utils.embedding_functions.DefaultEmbeddingFunction as the default embedding function. This explains the unexpected dependency on chromadb mentioned in the initial review.
    • The chromadb dependency is solely for the default embedding function. This introduces an unnecessary external dependency if the user intends to use other embedding functions or if a simpler default is sufficient.
    • The constructor sets up Oracle connection using oracledb.connect and configures timeouts and collection names.
    • It initializes the n_results-related parameters, which control the number of results returned by search queries; it is good that these are configurable (a usage sketch follows at the end of this item).
    • It calls methods to create tables and default collections if they don't exist. This is a standard setup procedure for vector stores.
  • LlamaPReview Suggested Improvements:
    import json
    import uuid
    from typing import List, Optional, Tuple

    import oracledb
    import pandas as pd

    from ..base import VannaBase

    class DefaultEmbeddingFunction: # LlamaPReview: Define a simple default embedding function to remove chromadb dependency
        def __call__(self, texts):
            # In a real scenario, this would use an actual embedding model.
            # For now, return a placeholder embedding (e.g., based on text length).
            return [[float(len(text))] * 128 for text in texts] # Placeholder embedding of size 128

    default_ef = DefaultEmbeddingFunction()

    class Oracle_VectorStore(VannaBase):
      def __init__(self, config=None):
        VannaBase.__init__(self, config=config)

        if config is not None:
          self.embedding_function = config.get(
            "embedding_function",
            default_ef
          )
          self.pre_delete_collection = config.get("pre_delete_collection",
                                                  False)
          self.cmetadata = config.get("cmetadata", {"created_by": "oracle"})
        else:
          self.embedding_function = default_ef
          self.pre_delete_collection = False
          self.cmetadata = {"created_by": "oracle"}

        self.oracle_conn = oracledb.connect(dsn=config.get("dsn"))
        self.oracle_conn.call_timeout = 30000
        self.documentation_collection = "documentation"
        self.ddl_collection = "ddl"
        self.sql_collection = "sql"
        self.n_results = config.get("n_results", 10)
        self.n_results_ddl = config.get("n_results_ddl", self.n_results)
        self.n_results_sql = config.get("n_results_sql", self.n_results)
        self.n_results_documentation = config.get("n_results_documentation",
                                                  self.n_results)
        self.create_tables_if_not_exists()
        self.create_collections_if_not_exists(self.documentation_collection)
        self.create_collections_if_not_exists(self.ddl_collection)
        self.create_collections_if_not_exists(self.sql_collection)
  • Improvement rationale:
    • Technical benefits: Removes the unnecessary chromadb dependency, simplifying the project and reducing potential dependency conflicts. Introduces a lightweight, self-contained default embedding function.
    • Business value: Reduces the complexity of setup and deployment. Users who don't need chromadb's default embedding function won't have to install it.
    • Risk assessment: Lowers dependency risk. The placeholder embedding function is a starting point and can be replaced with a more sophisticated default or user-provided function as needed.
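To make the configuration surface described above concrete, here is a minimal usage sketch; the DSN and the embedding callable are placeholders, not values taken from the PR.

  # Minimal usage sketch -- all values are placeholders.
  def my_embedding_function(texts):
      # A user-supplied callable returning one embedding (list of floats) per text.
      return [[0.0] * 128 for _ in texts]

  vn = Oracle_VectorStore(config={
      "dsn": "user/password@host:1521/FREEPDB1",    # placeholder connect string
      "embedding_function": my_embedding_function,  # overrides the default
      "pre_delete_collection": False,
      "cmetadata": {"created_by": "oracle"},
      "n_results": 10,                              # global default
      "n_results_ddl": 5,                           # per-collection overrides
      "n_results_sql": 10,
      "n_results_documentation": 3,
  })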

Core Logic Changes

  • src/vanna/oracle/oracle_vector.py - search_tables_metadata
    • Submitted PR Code:
    def search_tables_metadata(self,
                                 engine: str = None,
                                 catalog: str = None,
                                 schema: str = None,
                                 table_name: str = None,
                                 ddl: str = None,
                                 size: int = 10,
                                 **kwargs) -> list:
      pass
  • Analysis:
    • The search_tables_metadata method is implemented with pass, meaning it does nothing. The PR description mentions that @abstractmethod search_tables_metadata in VannaBase is not implemented and suggests a potential bug in opensearch_vector.py.
    • This implementation confirms that this method is indeed not implemented in Oracle_VectorStore. It's crucial to understand if this is intentional, a temporary placeholder, or an oversight.
    • The initial review correctly points out the need to analyze this method and its relation to the potential bug mentioned in the PR description.
    • If search_tables_metadata is intended to be part of the VannaBase contract and is required for the functionality of vanna-ai/vanna, then this pass implementation is incomplete and needs to be addressed.
  • LlamaPReview Suggested Improvements: [Skip this section as the suggested improvement depends on the intended functionality of search_tables_metadata which is not clear from the PR. Further investigation is needed to understand the expected behavior of this method in VannaBase and implement it accordingly for Oracle.]
  • Improvement rationale:
    • Technical benefits: Addresses the incomplete implementation of search_tables_metadata. Fulfills the interface contract defined by VannaBase (if intended).
    • Business value: Enables the functionality related to searching table metadata, which might be crucial for the overall features of vanna-ai/vanna.
    • Risk assessment: High risk if search_tables_metadata is a required method. Leaving it as pass will lead to unexpected behavior or errors when this method is called. Further investigation is needed to determine the correct implementation.
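Pending that investigation, one low-risk interim option, shown here only as a sketch and not as the intended implementation, is to fail loudly instead of silently returning None, so that callers discover the gap immediately:

  def search_tables_metadata(self,
                             engine: str = None,
                             catalog: str = None,
                             schema: str = None,
                             table_name: str = None,
                             ddl: str = None,
                             size: int = 10,
                             **kwargs) -> list:
      # Interim placeholder: surface the missing implementation instead of doing nothing.
      raise NotImplementedError(
          "search_tables_metadata is not yet implemented for Oracle_VectorStore"
      )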

Core Logic Changes

  • src/vanna/oracle/oracle_vector.py - get_related_ddl
    • Submitted PR Code:
    def get_related_ddl(self, question: str, **kwargs) -> list:
      collection = self.get_collection(self.ddl_collection)
      cursor = self.oracle_conn.cursor()
      cursor.setinputsizes(None, oracledb.DB_TYPE_VECTOR)
      cursor.execute(
        """
            SELECT
                document
            FROM oracle_embedding
            WHERE
                collection_id = :1
            ORDER BY VECTOR_DISTANCE(embedding, TO_VECTOR(:2), COSINE)
            FETCH FIRST :top_k ROWS ONLY
        """, [
          collection["uuid"],
          self.generate_embedding(question),
          100
        ]
      )
      results = cursor.fetchall()

      self.oracle_conn.commit()
      cursor.close()
      return Oracle_VectorStore._extract_documents(results)
  • Analysis:
    • In get_related_ddl, the FETCH FIRST clause uses a hardcoded value 100 instead of using the configured n_results_ddl or n_results. This inconsistency can lead to unexpected behavior and makes the number of returned results less configurable than intended.
    • Similar hardcoding is also present in get_related_documentation.
    • The initial review did not explicitly mention this hardcoding issue.
  • LlamaPReview Suggested Improvements:
    def get_related_ddl(self, question: str, **kwargs) -> list:
      collection = self.get_collection(self.ddl_collection)
      cursor = self.oracle_conn.cursor()
      cursor.setinputsizes(None, oracledb.DB_TYPE_VECTOR)
      cursor.execute(
        """
            SELECT
                document
            FROM oracle_embedding
            WHERE
                collection_id = :1
            ORDER BY VECTOR_DISTANCE(embedding, TO_VECTOR(:2), COSINE)
            FETCH FIRST :top_k ROWS ONLY
        """, [
          collection["uuid"],
          self.generate_embedding(question),
          self.n_results_ddl # LlamaPReview: Use n_results_ddl instead of hardcoded 100
        ]
      )
      results = cursor.fetchall()

      self.oracle_conn.commit()
      cursor.close()
      return Oracle_VectorStore._extract_documents(results)
  • Improvement rationale:
    • Technical benefits: Fixes inconsistency by using the configurable n_results_ddl parameter, making the behavior consistent with other similar methods and respecting user configurations.
    • Business value: Provides expected configurability for users who want to control the number of results returned by DDL related queries.
    • Risk assessment: Low risk. Ensures consistent behavior and better configurability.

Core Logic Changes

  • src/vanna/oracle/oracle_vector.py - update_training_data
    • Submitted PR Code:
    def update_training_data(self, id: str, train_type: str, question: str,
                             **kwargs) -> bool:
      print(f"{train_type=}")
      update_content = kwargs["content"]
      if train_type == 'sql':
        update_json = json.dumps(
          {
            "question": question,
            "sql": update_content,
          }
        )
      elif train_type == 'ddl':
        update_json = json.dumps(
          {
            "question": None,
            "ddl": update_content,
          }
        )
      elif train_type == 'documentation':
        update_json = json.dumps(
          {
            "question": None,
            "documentation": update_content,
          }
        )
      else:
        update_json = json.dumps(
          {
            "question": question,
            "sql": update_content,
          }
        )
      cursor = self.oracle_conn.cursor()
      cursor.setinputsizes(oracledb.DB_TYPE_VECTOR, oracledb.DB_TYPE_JSON)
      cursor.execute(
        """
                    UPDATE
                        oracle_embedding
                    SET
                        embedding = TO_VECTOR(:1),
                        document = JSON_MERGEPATCH(document, :2)
                    WHERE
                    uuid  = :3
                    """, [
          self.generate_embedding(update_content),
          update_json,
          id
        ]
      )

      self.oracle_conn.commit()
      cursor.close()
      return True
  • Analysis:
    • In update_training_data, the logic for constructing update_json is somewhat repetitive. The else condition duplicates the 'sql' case. This could be simplified.
    • More importantly, the code uses JSON_MERGEPATCH to update the document column. It is essential to understand how JSON_MERGEPATCH behaves in Oracle Database, especially how it handles nested JSON structures and whether it risks data loss or corruption when the existing document structure is not as expected, and to confirm that it matches the intended update semantics. If the goal is to replace the entire document content based on train_type and update_content, then JSON_MERGEPATCH is unnecessarily complex and potentially error-prone compared to simply replacing the entire document JSON (see the sketch at the end of this item).
    • The initial review did not delve into the specifics of JSON_MERGEPATCH usage.
  • LlamaPReview Suggested Improvements:
    def update_training_data(self, id: str, train_type: str, question: str,
                             **kwargs) -> bool:
      print(f"{train_type=}")
      update_content = kwargs["content"]
      if train_type == 'sql':
        update_json = json.dumps(
          {
            "question": question,
            "sql": update_content,
          }
        )
      elif train_type == 'ddl':
        update_json = json.dumps(
          {
            "question": None,
            "ddl": update_content,
          }
        )
      elif train_type == 'documentation':
        update_json = json.dumps(
          {
            "question": None,
            "documentation": update_content,
          }
        )
      else: # LlamaPReview: Consider raising an exception for unknown train_type instead of defaulting to 'sql'
          raise ValueError(f"Unknown train_type: {train_type}") # LlamaPReview: Raise ValueError for unknown train_type

      cursor = self.oracle_conn.cursor()
      cursor.setinputsizes(oracledb.DB_TYPE_VECTOR, oracledb.DB_TYPE_JSON)
      cursor.execute(
        """
                    UPDATE
                        oracle_embedding
                    SET
                        embedding = TO_VECTOR(:1),
                        document = :2 -- LlamaPReview: Replace entire document instead of using JSON_MERGEPATCH for simplicity and clarity
                    WHERE
                    uuid  = :3
                    """, [
          self.generate_embedding(update_content),
          update_json, # LlamaPReview: Use update_json directly to replace the document
          id
        ]
      )

      self.oracle_conn.commit()
      cursor.close()
      return True
  • Improvement rationale:
    • Technical benefits: Simplifies the UPDATE query by replacing the entire document JSON, making the update logic clearer and potentially more efficient. Removes the ambiguity and potential complexity of JSON_MERGEPATCH. Adds error handling for unknown train_type values, improving robustness.
    • Business value: Reduces potential data corruption risks associated with incorrect JSON_MERGEPATCH usage. Makes the code easier to understand and maintain.
    • Risk assessment: Low risk. Replacing the entire document is a safer and more predictable operation in this context, assuming the intention is to update the content based on train_type. Raising an error for unknown train_type prevents unexpected behavior.
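One concrete reason for the caution above, assuming Oracle's JSON_MERGEPATCH follows RFC 7396 merge-patch semantics as its documentation states: a null value in the patch removes that key from the stored document, so the update_json built for the 'ddl' and 'documentation' cases would delete the question field while leaving any other existing keys in place, whereas a full replacement stores exactly the new JSON. A small pure-Python sketch of the difference:

  # Pure-Python illustration of RFC 7396 merge-patch semantics (what JSON_MERGEPATCH
  # is documented to implement) versus replacing the whole document.
  def merge_patch(target, patch):
      if not isinstance(patch, dict):
          return patch
      result = dict(target) if isinstance(target, dict) else {}
      for key, value in patch.items():
          if value is None:
              result.pop(key, None)  # null in the patch deletes the key
          else:
              result[key] = merge_patch(result.get(key), value)
      return result

  stored = {"question": "old question", "sql": "SELECT 1"}
  patch = {"question": None, "ddl": "CREATE TABLE t (c NUMBER)"}

  print(merge_patch(stored, patch))
  # {'sql': 'SELECT 1', 'ddl': 'CREATE TABLE t (c NUMBER)'}  -- question removed, sql kept

  print(patch)
  # A full replacement would store exactly this dict, explicit None/null included.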

2.2 Implementation Quality

  • Code organization and structure: The code is well-organized, with clear separation of concerns. The Oracle_VectorStore class is well-structured, and methods are logically grouped.
  • Design patterns usage: The class follows the design patterns established in VannaBase, ensuring consistency with other VectorStore implementations.
  • Error handling approach: The code uses try-except blocks where necessary, and parameterized queries are used to prevent SQL injection risks. However, more robust error handling could be added, especially for database connection issues.
  • Resource management: The code properly manages database connections and cursors, ensuring they are closed after use. However, more explicit error handling for resource management would improve robustness.
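As a sketch of the more explicit error and resource handling suggested above, and assuming the imports and helpers already present in the PR's oracle_vector.py, the cursor work could be wrapped so the cursor is always closed and database errors surface with context (python-oracledb cursors support the context-manager protocol); whether and how to translate or re-raise errors is a design choice for the PR author:

  # Sketch only -- not the PR's code. Reuses oracledb, Oracle_VectorStore and the
  # class attributes defined in oracle_vector.py.
  def get_related_ddl(self, question: str, **kwargs) -> list:
      collection = self.get_collection(self.ddl_collection)
      try:
          # The cursor is closed even if execute() or fetchall() raises.
          with self.oracle_conn.cursor() as cursor:
              cursor.setinputsizes(None, oracledb.DB_TYPE_VECTOR)
              cursor.execute(
                  """
                  SELECT document
                  FROM oracle_embedding
                  WHERE collection_id = :1
                  ORDER BY VECTOR_DISTANCE(embedding, TO_VECTOR(:2), COSINE)
                  FETCH FIRST :3 ROWS ONLY
                  """,
                  [collection["uuid"], self.generate_embedding(question), self.n_results_ddl],
              )
              results = cursor.fetchall()
      except oracledb.DatabaseError as exc:
          raise RuntimeError("Oracle vector search for related DDL failed") from exc
      return Oracle_VectorStore._extract_documents(results)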

3. Critical Findings

3.1 Potential Issues

  • 🔴 Critical Issues

    • Incomplete Implementation of search_tables_metadata: The search_tables_metadata method is implemented with pass, meaning it does nothing. This is a critical issue if this method is part of the VannaBase contract and is required for the functionality of vanna-ai/vanna.
      • Impact: Leaving this method as pass will lead to unexpected behavior or errors when this method is called.
      • Recommendation: Investigate the intended functionality of search_tables_metadata in VannaBase and implement it accordingly for Oracle.
  • 🟡 Warnings

    • Unnecessary Dependency on chromadb: The chromadb dependency is solely for the default embedding function. This introduces an unnecessary external dependency if the user intends to use other embedding functions or if a simpler default is sufficient.
      • Potential risks: Increases the complexity of setup and deployment. Users who don't need chromadb's default embedding function will have to install it.
      • Suggested improvements: Remove the chromadb dependency and introduce a lightweight, self-contained default embedding function.
    • Hardcoded Values in get_related_ddl and get_related_documentation: The FETCH FIRST clause uses a hardcoded value 100 instead of using the configured n_results_ddl or n_results. This inconsistency can lead to unexpected behavior and makes the number of returned results less configurable than intended.
      • Potential risks: Inconsistent behavior and reduced configurability.
      • Suggested improvements: Use the configurable n_results_ddl parameter in get_related_ddl and similar parameters in other methods to ensure consistent behavior and better configurability.

3.2 Code Quality Concerns

  • Maintainability aspects: The code is generally maintainable, but the use of JSON_MERGEPATCH in update_training_data introduces complexity and potential risks. Simplifying this logic would improve maintainability.
  • Readability issues: The code is readable, but some repetitive logic in update_training_data could be simplified for better readability.
  • Performance bottlenecks: The use of JSON_MERGEPATCH in update_training_data could introduce performance bottlenecks, especially if the existing document structure is not as expected. Replacing the entire document JSON would be more efficient.

4. Security Assessment

  • Authentication/Authorization impacts: The code does not directly handle authentication or authorization, but it is crucial to ensure secure database connection practices are implemented, especially regarding credential management.
  • Data handling concerns: The code handles embedding data and related metadata. It is essential to ensure that this data is handled securely and that sensitive information is not exposed.
  • Input validation: The code uses parameterized queries, which helps prevent SQL injection risks. However, additional input validation could be added to ensure the robustness of the code.
  • Security best practices: Follow security best practices for database connections, including secure credential management and input validation.
  • Potential security risks: The primary security risk is related to database connection practices and credential management. Ensure that these are handled securely.
  • Mitigation strategies: Implement secure credential management practices and ensure that database connections are handled securely.
  • Security testing requirements: Conduct security testing to ensure that the code handles database connections and data securely.
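A sketch of one way to keep credentials out of source control and out of the config dict, using python-oracledb's standard user/password/dsn parameters with values read from environment variables (the variable names below are placeholders):

  import os

  import oracledb

  # Placeholder environment variable names -- choose whatever fits the deployment.
  user = os.environ["ORACLE_USER"]
  password = os.environ["ORACLE_PASSWORD"]
  dsn = os.environ["ORACLE_DSN"]  # e.g. "host:1521/FREEPDB1"

  conn = oracledb.connect(user=user, password=password, dsn=dsn)
  conn.call_timeout = 30000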

5. Testing Strategy

5.1 Test Coverage

  • Unit test analysis: Ensure that unit tests are implemented for all methods in Oracle_VectorStore. This includes testing the initialization, embedding generation, and CRUD operations on the embeddings stored in Oracle Database.
  • Integration test requirements: Implement integration tests with a live Oracle 23ai database instance to ensure real-world compatibility and functionality.
  • Edge cases coverage: Ensure that edge cases are covered in the tests, including handling of different train_type values in update_training_data and different configurations for n_results parameters.

5.2 Test Recommendations

Suggested Test Cases

  # Sample test code for update_training_data
  def test_update_training_data_sql(self):
      store = Oracle_VectorStore(config={"dsn": "your_dsn"})
      id = store.add_question_sql("test question", "test sql")
      store.update_training_data(id, "sql", "test question", content="updated sql")
      result = store.get_similar_question_sql("test question")
      assert result[0]["sql"] == "updated sql"

  def test_update_training_data_ddl(self):
      store = Oracle_VectorStore(config={"dsn": "your_dsn"})
      id = store.add_ddl("test ddl")
      store.update_training_data(id, "ddl", None, content="updated ddl")
      result = store.get_related_ddl("test question")
      assert result[0]["ddl"] == "updated ddl"
  • Coverage improvements: Ensure that all methods and edge cases are covered in the tests.
  • Performance testing needs: Conduct performance testing to ensure that the code handles large datasets efficiently.
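If the ValueError suggestion from section 2.1 is adopted, an edge-case test along the following lines (a sketch, using the same placeholder dsn as the samples above and pytest.raises for the expected failure) would cover the unknown train_type path:

  # Sketch: assumes the suggested ValueError for unknown train_type has been adopted.
  import pytest

  def test_update_training_data_unknown_type(self):
      store = Oracle_VectorStore(config={"dsn": "your_dsn"})
      id = store.add_question_sql("test question", "test sql")
      with pytest.raises(ValueError):
          store.update_training_data(id, "unknown", "test question", content="updated")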

6. Documentation & Maintenance

  • Documentation updates needed (API, architecture, configuration): Update the documentation to include the new Oracle_VectorStore class and its methods. Ensure that the configuration options are clearly documented.
  • Long-term maintenance considerations: Ensure that the code is maintainable and that any dependencies are well-documented. Remove unnecessary dependencies to simplify maintenance.
  • Technical debt and monitoring requirements: Monitor the performance and security of the code. Address any technical debt related to the use of JSON_MERGEPATCH and hardcoded values.

7. Deployment & Operations

  • Deployment impact and strategy: Ensure that the deployment process includes the installation of the oracledb dependency. Remove the chromadb dependency to simplify deployment.
  • Key operational considerations: Monitor the performance and security of the database connections. Ensure that the database is properly configured to handle the vector operations.

8. Summary & Recommendations

8.1 Key Action Items

  1. Investigate and implement search_tables_metadata: Determine the intended functionality of search_tables_metadata in VannaBase and implement it accordingly for Oracle.
  2. Remove chromadb dependency: Introduce a lightweight, self-contained default embedding function to remove the unnecessary chromadb dependency.
  3. Use configurable parameters in get_related_ddl and similar methods: Replace hardcoded values with configurable parameters to ensure consistent behavior and better configurability.
  4. Simplify update_training_data: Replace the use of JSON_MERGEPATCH with a simpler approach to update the entire document JSON, improving maintainability and performance.

8.2 Future Considerations

  • Technical evolution path: Continuously improve the implementation of Oracle_VectorStore based on feedback and testing results.
  • Business capability evolution: Expand the capabilities of vanna-ai/vanna to support more database systems and improve the overall functionality.
  • System integration impacts: Ensure that the integration with Oracle Database is robust and secure, and that it meets the needs of the users.


@zainhoda zainhoda merged commit 9388668 into vanna-ai:main Feb 8, 2025