-
Notifications
You must be signed in to change notification settings - Fork 59
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
vdk-smarter: introduce simple open ai plugin
Very simple plugin that reviews every query and scores it using OpenAI models. The purpose is to provide easy visibility into queries and their quality for the user. Though technically they could potentially use chatGPT themselves or static code analysis , that is less likely and alsl generally harder A) because some queries are generated dynamically B) lots of queries could be part of a job and takes time to check all. C) the prompt engineering is something that takes time as well and tuning as well. The prompts could be tuned by community pretty well over time. This enables central data team to establish best practices around SQL with proper review model . Signed-off-by: Antoni Ivanov <[email protected]>
- Loading branch information
1 parent
eef7721
commit 11829ae
Showing
5 changed files
with
280 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
# Copyright 2021-2023 VMware, Inc. | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
gi# Copyright 2021-2023 VMware, Inc. | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
image: "python:3.7" | ||
|
||
.build-vdk-smarter: | ||
variables: | ||
PLUGIN_NAME: vdk-smarter | ||
extends: .build-plugin | ||
|
||
build-py37-vdk-smarter: | ||
extends: .build-vdk-smarter | ||
image: "python:3.7" | ||
|
||
build-py311-vdk-smarter: | ||
extends: .build-vdk-smarter | ||
image: "python:3.11" | ||
|
||
release-vdk-smarter: | ||
variables: | ||
PLUGIN_NAME: vdk-smarter | ||
extends: .release-plugin |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,100 @@ | ||
# VDK Smarter | ||
|
||
Making VDK smarter by employing ML/AI. | ||
|
||
|
||
|
||
## Usage | ||
|
||
``` | ||
pip install vdk-smarter | ||
``` | ||
|
||
### Configuration | ||
|
||
(`vdk config-help` is useful command to browse all config options of your installation of vdk) | ||
|
||
|
||
### Example | ||
|
||
TODO# VDK Smarter | ||
|
||
Making VDK smarter by employing ML/AI. | ||
|
||
|
||
|
||
## Usage | ||
|
||
``` | ||
pip install vdk-smarter | ||
``` | ||
|
||
### Configuration | ||
|
||
(`vdk config-help` and search for configuration starting with "openai") | ||
|
||
### Example | ||
|
||
By default reviews are disabled since they are expensive. | ||
|
||
To enable you need to set `openai_review_enabled` to `true` in the configuration and `openai_api_key`. | ||
You can see `vdk config-help` (search for configuration with openai prefix) for more information. | ||
|
||
Once enabled on vdk run each query statement executed will be also reviewed and scored. | ||
|
||
``` | ||
vdk run example | ||
``` | ||
```bash | ||
Query: | ||
CREATE TABLE IF NOT EXISTS super_collider.example_table( | ||
vc_id STRING, | ||
esx_id STRING, | ||
vm_count INT | ||
) STORED AS PARQUET; | ||
|
||
|
||
Review: | ||
{ | ||
"score": 5, | ||
"review": "No further changes needed. The query is efficient, readable, and follows best practices. | ||
There are no potential errors, optimization, or security vulnerabilities."} | ||
|
||
... | ||
Query: | ||
CREATE TABLE IF NOT EXISTS super_collider.example_fact_snapshot( | ||
vc_id STRING, | ||
esx_count BIGINT, | ||
vm_count BIGINT | ||
) STORED AS PARQUET; | ||
|
||
|
||
Review: | ||
{"score": 4, "review": "The query is well-structured and follows best practices. | ||
However, there is an opportunity to improve its readability by | ||
adding comments to explain the purpose of the query and the meaning of the parameters. | ||
Additionally, there is potential to optimize the query by providing more precise column types."} | ||
``` | ||
At the end a report is generated `queries_reviews_report.md` with all the queries and their reviews. | ||
### Build and testing | ||
``` | ||
pip install -r requirements.txt | ||
pip install -e . | ||
pytest | ||
``` | ||
In VDK repo [../build-plugin.sh](https://github.com/vmware/versatile-data-kit/tree/main/projects/vdk-plugins/build-plugin.sh) script can be used also. | ||
#### Note about the CICD: | ||
.plugin-ci.yaml is needed only for plugins part of [Versatile Data Kit Plugin repo](https://github.com/vmware/versatile-data-kit/tree/main/projects/vdk-plugins). | ||
The CI/CD is separated in two stages, a build stage and a release stage. | ||
The build stage is made up of a few jobs, all which inherit from the same | ||
job configuration and only differ in the Python version they use (3.7, 3.8, 3.9 and 3.10). | ||
They run according to rules, which are ordered in a way such that changes to a | ||
plugin's directory trigger the plugin CI, but changes to a different plugin does not. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
# this file is used to provide testing requirements | ||
# for requirements (dependencies) needed during and after installation of the plugin see (and update) setup.py install_requires section | ||
|
||
openai | ||
|
||
pytest | ||
vdk-core | ||
vdk-test-utils |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
# Copyright 2021-2023 VMware, Inc. | ||
# SPDX-License-Identifier: Apache-2.0 | ||
import pathlib | ||
|
||
import setuptools | ||
|
||
""" | ||
Builds a package with the help of setuptools in order for this package to be imported in other projects | ||
""" | ||
|
||
__version__ = "0.1.0" | ||
|
||
setuptools.setup( | ||
name="vdk-smarter", | ||
version=__version__, | ||
url="https://github.com/vmware/versatile-data-kit", | ||
description="Making VDK smarter by employing ML/AI.", | ||
long_description=pathlib.Path("README.md").read_text(), | ||
long_description_content_type="text/markdown", | ||
install_requires=["vdk-core", "openai"], | ||
package_dir={"": "src"}, | ||
packages=setuptools.find_namespace_packages(where="src"), | ||
# This is the only vdk plugin specifc part | ||
# Define entry point called "vdk.plugin.run" with name of plugin and module to act as entry point. | ||
entry_points={ | ||
"vdk.plugin.run": ["vdk-smarter = vdk.plugin.smarter.openai_plugin_entry"] | ||
}, | ||
classifiers=[ | ||
"Development Status :: 2 - Pre-Alpha", | ||
"License :: OSI Approved :: Apache Software License", | ||
"Programming Language :: Python :: 3.7", | ||
"Programming Language :: Python :: 3.8", | ||
"Programming Language :: Python :: 3.9", | ||
"Programming Language :: Python :: 3.10", | ||
"Programming Language :: Python :: 3.11", | ||
], | ||
) |
110 changes: 110 additions & 0 deletions
110
projects/vdk-plugins/vdk-smarter/src/vdk/plugin/smarter/openai_plugin_entry.py
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,110 @@ | ||
# Copyright 2021-2023 VMware, Inc. | ||
# SPDX-License-Identifier: Apache-2.0 | ||
import logging | ||
from collections import OrderedDict | ||
from typing import List | ||
|
||
import openai | ||
from vdk.api.plugin.hook_markers import hookimpl | ||
from vdk.api.plugin.plugin_registry import IPluginRegistry | ||
from vdk.internal.builtin_plugins.connection.decoration_cursor import DecorationCursor | ||
from vdk.internal.builtin_plugins.connection.execution_cursor import ExecutionCursor | ||
from vdk.internal.core.config import ConfigurationBuilder | ||
from vdk.internal.core.context import CoreContext | ||
|
||
log = logging.getLogger(__name__) | ||
|
||
|
||
class OpenAiPlugin: | ||
def __init__(self): | ||
self._review_enabled = False | ||
self._queries = OrderedDict() | ||
self._openai_model = "gpt-3.5-turbo" | ||
|
||
@hookimpl(tryfirst=True) | ||
def vdk_configure(self, config_builder: ConfigurationBuilder): | ||
# TODO: support non open ai models and make it configurable | ||
config_builder.add( | ||
key="openai_api_key", | ||
default_value="", | ||
description=""" | ||
OpenAI API key. You can generete one on your OpenAI account page. | ||
(possibly https://platform.openai.com/account/api-keys) | ||
See best practices for api keys in https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety | ||
""", | ||
) | ||
config_builder.add( | ||
key="openai_model", | ||
default_value="text-davinci-003", | ||
description=""" | ||
OpenAI model to be used. | ||
See more in https://platform.openai.com/docs/models/overview | ||
""", | ||
) | ||
config_builder.add( | ||
key="openai_review_enabled", | ||
default_value=False, | ||
description="If enabled, it will review each SQL query executed by the job and create summary file at the end.", | ||
) | ||
|
||
@hookimpl | ||
def vdk_initialize(self, context: CoreContext) -> None: | ||
openai.api_key = context.configuration.get_value("openai_api_key") | ||
self._review_enabled = context.configuration.get_value("openai_review_enabled") | ||
self._openai_model = context.configuration.get_value("openai_model") | ||
|
||
def _review_sql_query(self, sql_query: str): | ||
# Refine the prompt and make configurable | ||
prompt = ( | ||
"""Using your extensive knowledge of Impala SQL, analyze the following SQL query and provide a specific feedback. | ||
The feedback should include its efficiency, readability, possible optimization, potential errors, | ||
adherence to best practices, and security vulnerabilities, if any. Provide a score (1 a lot of work needed, 5 - no further changes needed) | ||
Return the answer in format {"score": ?, "review": "?" } Here is the SQL query: | ||
""" | ||
+ sql_query | ||
) | ||
|
||
# Generate the review | ||
# TODO: make configurable most things | ||
response = openai.Completion.create( | ||
engine=self._openai_model, | ||
prompt=prompt, | ||
max_tokens=1000, | ||
n=1, | ||
stop=None, | ||
temperature=0.7, | ||
) | ||
|
||
# Extract the generated review from the response | ||
review = response.choices[0].text.strip() | ||
self._queries[sql_query] = review | ||
|
||
return review | ||
|
||
@hookimpl | ||
def db_connection_decorate_operation( | ||
self, decoration_cursor: DecorationCursor | ||
) -> None: | ||
if self._review_enabled: | ||
try: | ||
managed_operation = decoration_cursor.get_managed_operation() | ||
review = self._review_sql_query(managed_operation.get_operation()) | ||
log.info( | ||
f"Query:\n{managed_operation.get_operation()}\n\nReview:\n{review}\n" | ||
) | ||
except Exception as e: | ||
log.error(f"Failed to review SQL query: {e}") | ||
|
||
@hookimpl | ||
def vdk_exit(self, context: CoreContext, exit_code: int) -> None: | ||
if self._review_enabled: | ||
with open("queries_reviews_report.md", "w") as f: | ||
f.write("# SQL Query Reviews\n") | ||
for query, review in self._queries.items(): | ||
f.write(f"## SQL Query\n\n```sql\n{query}\n```\n\n") | ||
f.write(f"### Review\n\n{review}\n\n") | ||
|
||
|
||
@hookimpl | ||
def vdk_start(plugin_registry: IPluginRegistry, command_line_args: List): | ||
plugin_registry.load_plugin_with_hooks_impl(OpenAiPlugin(), "OpenAiPlugin") |