SQLDeps: SQL Dependency Extractor

SQLDeps Logo


A tool that automatically extracts, maps, and visualizes SQL dependencies and outputs using Large Language Models (LLMs).



Overview

SQLDeps analyzes SQL scripts to identify:

  1. Dependencies: Tables and columns that must exist BEFORE query execution
  2. Outputs: Tables and columns permanently CREATED or MODIFIED by the query

It intelligently filters out temporary constructs like CTEs and derived tables, focusing only on the real database objects that matter.

Benefits

  • ๐Ÿ› ๏ธ Change Management: Safely modify schemas by identifying true dependencies
  • ๐Ÿ’พ Storage Optimization: Focus resources on essential tables and columns
  • ๐Ÿšข Migration Planning: Precisely determine what needs to be migrated
  • ๐Ÿ“ Project Documentation: Create comprehensive maps of database dependencies

Installation

pip install sqldeps

For additional functionality:

# Install with web app dependencies
pip install "sqldeps[app]"

# Install with data visualization dependencies
pip install "sqldeps[dataviz]"

# Install all optional dependencies
pip install "sqldeps[app,dataviz]"

Quick Start

SQLDeps provides both API and CLI interfaces:

  • API: Flexible for Python developers to integrate into scripts, notebooks, or applications.
  • CLI: Fast and user-friendly for analyzing files or folders directly from the command line.

API Usage

from sqldeps.llm_parsers import create_extractor

# Create extractor with default settings (framework="groq", model="llama-3.3-70b-versatile")
extractor = create_extractor()

# Extract dependencies and outputs from a SQL query
sql_query = """
WITH user_orders AS (
    SELECT o.user_id, COUNT(*) AS order_count
    FROM orders o
    JOIN users u ON o.user_id = u.id
    WHERE u.status = 'active'
    GROUP BY o.user_id
)

CREATE TABLE transactions.user_order_summary AS
SELECT * FROM user_orders;
"""
result = extractor.extract_from_query(sql_query)

# Print the results
print("Dependencies:")
print(result.dependencies)
print("\nOutputs:")
print(result.outputs)

# Or extract from a file
result = extractor.extract_from_file('path/to/query.sql')

# Convert to dictionary or DataFrame
dict_format = result.to_dict()
df_format = result.to_dataframe()

CLI Usage

# Basic example with default settings
sqldeps extract path/to/query.sql

# Specify framework and output format
sqldeps extract path/to/query.sql --framework=openai --model=gpt-4o -o results.json

# Scan a folder recursively with intelligent parallelization:
# the path may be a file or a folder, --recursive scans folders recursively,
# --framework selects the provider, --rpm caps requests per minute,
# --n-workers -1 uses all available processors, and -o results.csv
# writes the results as a CSV dataframe instead of JSON
sqldeps extract data/sql_folder \
    --recursive \
    --framework=deepseek \
    --rpm 50 \
    --n-workers -1 \
    -o results.csv

# Get help on available commands
sqldeps --help

# Get help on extract - the main command
sqldeps extract --help

Web Application

SQLDeps also includes a Streamlit-based web interface:

# Run the web app
sqldeps app

Note: The web application is designed for single-file extraction and demonstration purposes. For processing multiple files or entire folders, use the API or CLI instead.

Example

Given this SQL query:

-- Common Table Expression (CTE) to count user orders for active users
WITH user_orders AS (
    SELECT o.user_id, COUNT(*) AS order_count
    FROM orders o
    JOIN users u ON o.user_id = u.id
    WHERE u.status = 'active'
    GROUP BY o.user_id
)

-- Create a new table from the CTE
CREATE TABLE transactions.user_order_summary AS
SELECT * FROM user_orders;

SQLDeps will extract:

{
  "dependencies": {
    "orders": ["user_id"],
    "users": ["id", "status"]
  },
  "outputs": {
    "transactions.user_order_summary": ["*"]
  }
}

Notice how:

  • CTE (user_orders) is correctly excluded
  • Real source tables (orders, users) are included as dependencies
  • Target table (transactions.user_order_summary) is correctly identified as output
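Because dependencies and outputs are plain table-to-columns mappings, downstream tooling can consume a saved JSON result with the standard library alone. A minimal sketch using the output above (in practice, the JSON would be read from a results file such as one produced with `-o results.json`):

```python
import json

# The JSON shown above, inlined as a string for illustration
raw = """
{
  "dependencies": {
    "orders": ["user_id"],
    "users": ["id", "status"]
  },
  "outputs": {
    "transactions.user_order_summary": ["*"]
  }
}
"""

result = json.loads(raw)
source_tables = sorted(result["dependencies"])  # tables that must exist before the query runs
target_tables = sorted(result["outputs"])       # tables the query creates or modifies
print(source_tables)
print(target_tables)
```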

Supported Models

All models available on Groq, OpenAI, and DeepSeek are supported.
For up-to-date pricing, check each provider's pricing page.

API Keys / Configuration

You'll need to set up API keys for your chosen LLM provider. Create a .env file in your project root:

# LLM API Keys
GROQ_API_KEY=your_groq_api_key
OPENAI_API_KEY=your_openai_api_key
DEEPSEEK_API_KEY=your_deepseek_api_key

# Database credentials (for schema validation)
DB_HOST=localhost
DB_PORT=5432
DB_NAME=mydatabase
DB_USER=username
DB_PASSWORD=password

Tip: Groq offers free tokens without requiring payment details, making it ideal for getting started quickly.

Advanced Usage

Database Schema Matching

SQLDeps can match its extraction results (table/column dependencies and outputs) against a live database schema to retrieve column data types.

from sqldeps.database import PostgreSQLConnector
from sqldeps.llm_parsers import create_extractor

# Extract dependencies
extractor = create_extractor(framework="openai", model="gpt-4o")
result = extractor.extract_from_file('query.sql')

# Connect to database and validate
conn = PostgreSQLConnector(
    host="localhost",
    port=5432,
    database="mydatabase",
    username="username"
)

# Match extracted dependencies against database schema
matching_schema = extractor.match_database_schema(
    result,
    db_connection=conn,
    target_schemas=["public", "sales"]
)

# View validation results as a pandas DataFrame
print(matching_schema)

Optionally, database credentials can also be supplied via a YAML configuration file:

# database.yml
database:
  host: localhost
  port: 5432
  database: mydatabase
  username: username
  password: password
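One way to consume such a file is to load the `database` section and pass its fields to the connector. A sketch assuming PyYAML is installed and that `PostgreSQLConnector` accepts the keyword arguments shown earlier (the YAML is inlined here for illustration; normally you would read it from `database.yml`):

```python
import yaml  # PyYAML, assumed available

config_text = """
database:
  host: localhost
  port: 5432
  database: mydatabase
  username: username
  password: password
"""

cfg = yaml.safe_load(config_text)["database"]

# The keys line up with the connector parameters used above, e.g.:
# conn = PostgreSQLConnector(host=cfg["host"], port=cfg["port"],
#                            database=cfg["database"], username=cfg["username"])
print(cfg["host"], cfg["port"])
```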

Using Custom Prompts

You can customize the prompts used to instruct the LLM:

# Create extractor with custom prompt
extractor = create_extractor(
    framework="groq",
    model="llama-3.3-70b-versatile",
    prompt_path="path/to/custom_prompt.yml"
)

The custom prompt YAML should include:

system_prompt: |
  You are a SQL analyzer that extracts two key elements from SQL queries:

  1. DEPENDENCIES: Tables and columns that must exist BEFORE query execution.
  2. OUTPUTS: Tables and columns permanently CREATED or MODIFIED by the query.

  # Add detailed instructions for the LLM here...

user_prompt: |
  Extract SQL dependencies and outputs from this query:
  {sql}
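The `{sql}` placeholder in `user_prompt` is a standard Python format field: conceptually, the extractor substitutes the query text into it before sending the request. A sketch of that substitution (illustrative only, not SQLDeps' internal code):

```python
# The user_prompt template from the YAML above, as a Python string
user_prompt = "Extract SQL dependencies and outputs from this query:\n{sql}"

query = "SELECT id FROM users;"
rendered = user_prompt.format(sql=query)  # fill the {sql} placeholder
print(rendered)
```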

Interactive Visualization of SQL Dependency Graphs

SQLDeps provides built-in visualization capabilities to help you understand complex SQL dependencies:

from sqldeps.llm_parsers import create_extractor
from sqldeps.visualization import visualize_sql_dependencies

# Create an interactive network graph from multiple SQL files
extractor = create_extractor()
sql_profiles = extractor.extract_from_folder("path/to/folder", recursive=False)

# Generate an interactive visualization (saving output to an HTML file)
figure = visualize_sql_dependencies(sql_profiles, output_path="dependencies.html")

# Show figure
figure.show()

Documentation

For comprehensive documentation, including API reference and examples, visit https://sqldeps.readthedocs.io.

Contributing

Contributions are welcome!

Please check out the Contributing Guide for details.

License

MIT
