Skip to content

Commit

Permalink
vdk-smarter: introduce simple open ai plugin
Browse files Browse the repository at this point in the history
Very simple plugin that reviews every query and scores it using OpenAI
models. The purpose is to provide easy visibility into queries and their
quality for the user. Though technically they could potentially use
chatGPT themselves or static code analysis , that is less likely and
alsl generally harder

A) because some queries are generated dynamically
B) lots of queries could be part of a job and takes time to check all.
C) the prompt engineering is something that takes time as well and
tuning as well. The prompts could be tuned by community pretty well
over time.

This enables central data team to establish best practices around SQL
with proper review model .

Signed-off-by: Antoni Ivanov <[email protected]>
  • Loading branch information
antoniivanov committed Jul 4, 2023
1 parent eef7721 commit 11829ae
Show file tree
Hide file tree
Showing 5 changed files with 280 additions and 0 deletions.
25 changes: 25 additions & 0 deletions projects/vdk-plugins/vdk-smarter/.plugin-ci.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
# Copyright 2021-2023 VMware, Inc.
# SPDX-License-Identifier: Apache-2.0

gi# Copyright 2021-2023 VMware, Inc.
# SPDX-License-Identifier: Apache-2.0

image: "python:3.7"

.build-vdk-smarter:
variables:
PLUGIN_NAME: vdk-smarter
extends: .build-plugin

build-py37-vdk-smarter:
extends: .build-vdk-smarter
image: "python:3.7"

build-py311-vdk-smarter:
extends: .build-vdk-smarter
image: "python:3.11"

release-vdk-smarter:
variables:
PLUGIN_NAME: vdk-smarter
extends: .release-plugin
100 changes: 100 additions & 0 deletions projects/vdk-plugins/vdk-smarter/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,100 @@
# VDK Smarter

Making VDK smarter by employing ML/AI.



## Usage

```
pip install vdk-smarter
```

### Configuration

(`vdk config-help` is useful command to browse all config options of your installation of vdk)


### Example

TODO# VDK Smarter

Making VDK smarter by employing ML/AI.



## Usage

```
pip install vdk-smarter
```

### Configuration

(`vdk config-help` and search for configuration starting with "openai")

### Example

By default reviews are disabled since they are expensive.

To enable you need to set `openai_review_enabled` to `true` in the configuration and `openai_api_key`.
You can see `vdk config-help` (search for configuration with openai prefix) for more information.

Once enabled on vdk run each query statement executed will be also reviewed and scored.

```
vdk run example
```
```bash
Query:
CREATE TABLE IF NOT EXISTS super_collider.example_table(
vc_id STRING,
esx_id STRING,
vm_count INT
) STORED AS PARQUET;


Review:
{
"score": 5,
"review": "No further changes needed. The query is efficient, readable, and follows best practices.
There are no potential errors, optimization, or security vulnerabilities."}

...
Query:
CREATE TABLE IF NOT EXISTS super_collider.example_fact_snapshot(
vc_id STRING,
esx_count BIGINT,
vm_count BIGINT
) STORED AS PARQUET;


Review:
{"score": 4, "review": "The query is well-structured and follows best practices.
However, there is an opportunity to improve its readability by
adding comments to explain the purpose of the query and the meaning of the parameters.
Additionally, there is potential to optimize the query by providing more precise column types."}
```
At the end a report is generated `queries_reviews_report.md` with all the queries and their reviews.
### Build and testing
```
pip install -r requirements.txt
pip install -e .
pytest
```
In VDK repo [../build-plugin.sh](https://github.com/vmware/versatile-data-kit/tree/main/projects/vdk-plugins/build-plugin.sh) script can be used also.
#### Note about the CICD:
.plugin-ci.yaml is needed only for plugins part of [Versatile Data Kit Plugin repo](https://github.com/vmware/versatile-data-kit/tree/main/projects/vdk-plugins).
The CI/CD is separated in two stages, a build stage and a release stage.
The build stage is made up of a few jobs, all which inherit from the same
job configuration and only differ in the Python version they use (3.7, 3.8, 3.9 and 3.10).
They run according to rules, which are ordered in a way such that changes to a
plugin's directory trigger the plugin CI, but changes to a different plugin does not.
8 changes: 8 additions & 0 deletions projects/vdk-plugins/vdk-smarter/requirements.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
# this file is used to provide testing requirements
# for requirements (dependencies) needed during and after installation of the plugin see (and update) setup.py install_requires section

openai

pytest
vdk-core
vdk-test-utils
37 changes: 37 additions & 0 deletions projects/vdk-plugins/vdk-smarter/setup.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
# Copyright 2021-2023 VMware, Inc.
# SPDX-License-Identifier: Apache-2.0
import pathlib

import setuptools

"""
Builds a package with the help of setuptools in order for this package to be imported in other projects
"""

__version__ = "0.1.0"

setuptools.setup(
name="vdk-smarter",
version=__version__,
url="https://github.com/vmware/versatile-data-kit",
description="Making VDK smarter by employing ML/AI.",
long_description=pathlib.Path("README.md").read_text(),
long_description_content_type="text/markdown",
install_requires=["vdk-core", "openai"],
package_dir={"": "src"},
packages=setuptools.find_namespace_packages(where="src"),
# This is the only vdk plugin specifc part
# Define entry point called "vdk.plugin.run" with name of plugin and module to act as entry point.
entry_points={
"vdk.plugin.run": ["vdk-smarter = vdk.plugin.smarter.openai_plugin_entry"]
},
classifiers=[
"Development Status :: 2 - Pre-Alpha",
"License :: OSI Approved :: Apache Software License",
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
],
)
Original file line number Diff line number Diff line change
@@ -0,0 +1,110 @@
# Copyright 2021-2023 VMware, Inc.
# SPDX-License-Identifier: Apache-2.0
import logging
from collections import OrderedDict
from typing import List

import openai
from vdk.api.plugin.hook_markers import hookimpl
from vdk.api.plugin.plugin_registry import IPluginRegistry
from vdk.internal.builtin_plugins.connection.decoration_cursor import DecorationCursor
from vdk.internal.builtin_plugins.connection.execution_cursor import ExecutionCursor
from vdk.internal.core.config import ConfigurationBuilder
from vdk.internal.core.context import CoreContext

log = logging.getLogger(__name__)


class OpenAiPlugin:
def __init__(self):
self._review_enabled = False
self._queries = OrderedDict()
self._openai_model = "gpt-3.5-turbo"

@hookimpl(tryfirst=True)
def vdk_configure(self, config_builder: ConfigurationBuilder):
# TODO: support non open ai models and make it configurable
config_builder.add(
key="openai_api_key",
default_value="",
description="""
OpenAI API key. You can generete one on your OpenAI account page.
(possibly https://platform.openai.com/account/api-keys)
See best practices for api keys in https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
""",
)
config_builder.add(
key="openai_model",
default_value="text-davinci-003",
description="""
OpenAI model to be used.
See more in https://platform.openai.com/docs/models/overview
""",
)
config_builder.add(
key="openai_review_enabled",
default_value=False,
description="If enabled, it will review each SQL query executed by the job and create summary file at the end.",
)

@hookimpl
def vdk_initialize(self, context: CoreContext) -> None:
openai.api_key = context.configuration.get_value("openai_api_key")
self._review_enabled = context.configuration.get_value("openai_review_enabled")
self._openai_model = context.configuration.get_value("openai_model")

def _review_sql_query(self, sql_query: str):
# Refine the prompt and make configurable
prompt = (
"""Using your extensive knowledge of Impala SQL, analyze the following SQL query and provide a specific feedback.
The feedback should include its efficiency, readability, possible optimization, potential errors,
adherence to best practices, and security vulnerabilities, if any. Provide a score (1 a lot of work needed, 5 - no further changes needed)
Return the answer in format {"score": ?, "review": "?" } Here is the SQL query:
"""
+ sql_query
)

# Generate the review
# TODO: make configurable most things
response = openai.Completion.create(
engine=self._openai_model,
prompt=prompt,
max_tokens=1000,
n=1,
stop=None,
temperature=0.7,
)

# Extract the generated review from the response
review = response.choices[0].text.strip()
self._queries[sql_query] = review

return review

@hookimpl
def db_connection_decorate_operation(
self, decoration_cursor: DecorationCursor
) -> None:
if self._review_enabled:
try:
managed_operation = decoration_cursor.get_managed_operation()
review = self._review_sql_query(managed_operation.get_operation())
log.info(
f"Query:\n{managed_operation.get_operation()}\n\nReview:\n{review}\n"
)
except Exception as e:
log.error(f"Failed to review SQL query: {e}")

@hookimpl
def vdk_exit(self, context: CoreContext, exit_code: int) -> None:
if self._review_enabled:
with open("queries_reviews_report.md", "w") as f:
f.write("# SQL Query Reviews\n")
for query, review in self._queries.items():
f.write(f"## SQL Query\n\n```sql\n{query}\n```\n\n")
f.write(f"### Review\n\n{review}\n\n")


@hookimpl
def vdk_start(plugin_registry: IPluginRegistry, command_line_args: List):
plugin_registry.load_plugin_with_hooks_impl(OpenAiPlugin(), "OpenAiPlugin")

0 comments on commit 11829ae

Please sign in to comment.