A tool that automatically extracts and maps SQL dependencies and outputs using Large Language Models (LLMs).
- Documentation: https://sqldeps.readthedocs.io/
- Code repository: https://github.com/glue-lab/sqldeps
SQLDeps analyzes SQL scripts to identify:
- Dependencies: Tables and columns that must exist BEFORE query execution
- Outputs: Tables and columns permanently CREATED or MODIFIED by the query
It intelligently filters out temporary constructs like CTEs and derived tables, focusing only on the real database objects that matter.
- Change Management: Safely modify schemas by identifying true dependencies
- Storage Optimization: Focus resources on essential tables and columns
- Migration Planning: Precisely determine what needs to be migrated
- Project Documentation: Create comprehensive maps of database dependencies
pip install sqldeps
For additional functionality:
# Install with web app dependencies
pip install "sqldeps[app]"
# Install with data visualization dependencies
pip install "sqldeps[dataviz]"
# Install all optional dependencies
pip install "sqldeps[app,dataviz]"
SQLDeps provides both API and CLI interfaces:
- API: Flexible for Python developers to integrate into scripts, notebooks, or applications.
- CLI: Fast and user-friendly for analyzing files or folders directly from the command line.
from sqldeps.llm_parsers import create_extractor
# Create extractor with default settings (framework="groq", model="llama-3.3-70b-versatile")
extractor = create_extractor()
# Extract dependencies and outputs from a SQL query
sql_query = """
CREATE TABLE transactions.user_order_summary AS
WITH user_orders AS (
    SELECT o.user_id, COUNT(*) AS order_count
    FROM orders o
    JOIN users u ON o.user_id = u.id
    WHERE u.status = 'active'
    GROUP BY o.user_id
)
SELECT * FROM user_orders;
"""
result = extractor.extract_from_query(sql_query)
# Print the results
print("Dependencies:")
print(result.dependencies)
print("\nOutputs:")
print(result.outputs)
# Or extract from a file
result = extractor.extract_from_file('path/to/query.sql')
# Convert to dictionary or DataFrame
dict_format = result.to_dict()
df_format = result.to_dataframe()
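To persist results, the DataFrame form plugs directly into standard pandas I/O. A minimal sketch (the output filename is illustrative, not part of the SQLDeps API):
# Save the extracted dependencies/outputs as CSV using pandas I/O
df_format.to_csv("extraction_results.csv", index=False)  # illustrative filename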
# Basic example with default settings
sqldeps extract path/to/query.sql
# Specify framework and output format
sqldeps extract path/to/query.sql --framework=openai --model=gpt-4o -o results.json
# Scan a folder recursively with intelligent parallelization:
# the path is auto-detected as a file or folder, --recursive scans it
# recursively, --rpm caps requests at 50 per minute, --n-workers -1 uses
# all available processors, and -o results.csv writes a DataFrame as CSV
# instead of JSON
sqldeps extract data/sql_folder \
    --recursive \
    --framework=deepseek \
    --rpm 50 \
    --n-workers -1 \
    -o results.csv
# Get help on available commands
sqldeps --help
# Get help on extract - the main command
sqldeps extract --help
SQLDeps also includes a Streamlit-based web interface:
# Run the web app
sqldeps app
Note: The web application is designed for single-file extraction and demonstration purposes. For processing multiple files or entire folders, use the API or CLI instead.
Given this SQL query:
-- Create a new table from a Common Table Expression (CTE)
CREATE TABLE transactions.user_order_summary AS
-- CTE counting orders for active users
WITH user_orders AS (
    SELECT o.user_id, COUNT(*) AS order_count
    FROM orders o
    JOIN users u ON o.user_id = u.id
    WHERE u.status = 'active'
    GROUP BY o.user_id
)
SELECT * FROM user_orders;
SQLDeps will extract:
{
  "dependencies": {
    "orders": ["user_id"],
    "users": ["id", "status"]
  },
  "outputs": {
    "transactions.user_order_summary": ["*"]
  }
}
Notice how:
- The CTE (`user_orders`) is correctly excluded
- The real source tables (`orders`, `users`) are included as dependencies
- The target table (`transactions.user_order_summary`) is correctly identified as an output
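The same structure is available programmatically. A minimal sketch, reusing `result` from the API example above and assuming `result.to_dict()` returns the mapping shown (as the JSON output suggests):
import json
# Serialize the extraction result; this should reproduce the JSON above
print(json.dumps(result.to_dict(), indent=2))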
SQLDeps supports all models available through Groq, OpenAI, and DeepSeek.
For up-to-date model availability and pricing, check each provider's website.
You'll need to set up API keys for your chosen LLM provider. Create a .env
file in your project root:
# LLM API Keys
GROQ_API_KEY=your_groq_api_key
OPENAI_API_KEY=your_openai_api_key
DEEPSEEK_API_KEY=your_deepseek_api_key
# Database credentials (for schema validation)
DB_HOST=localhost
DB_PORT=5432
DB_NAME=mydatabase
DB_USER=username
DB_PASSWORD=password
Tip: Groq offers free tokens without requiring payment details, making it ideal for getting started quickly.
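SQLDeps reads these keys from the environment. If your entry point does not load `.env` automatically, you can load it explicitly; a minimal sketch using python-dotenv (an assumption here, not a documented SQLDeps dependency):
from dotenv import load_dotenv  # pip install python-dotenv
from sqldeps.llm_parsers import create_extractor

load_dotenv()  # exports GROQ_API_KEY etc. from .env into the environment
extractor = create_extractor()  # the provider client reads its key from the environment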
SQLDeps can match extraction results (table/column dependencies and outputs) against a live database schema to retrieve column data types.
from sqldeps.database import PostgreSQLConnector
from sqldeps.llm_parsers import create_extractor
# Extract dependencies
extractor = create_extractor(framework="openai", model="gpt-4o")
result = extractor.extract_from_file('query.sql')
# Connect to database and validate
conn = PostgreSQLConnector(
    host="localhost",
    port=5432,
    database="mydatabase",
    username="username"
)
# Match extracted dependencies against database schema
matching_schema = extractor.match_database_schema(
    result,
    db_connection=conn,
    target_schemas=["public", "sales"]
)
# View validation results as a pandas DataFrame
print(matching_schema)
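Because the result is a pandas DataFrame, standard filtering applies. A minimal sketch for spotting unmatched dependencies; the `exact_match` column name is hypothetical, so inspect the actual columns first:
# Inspect the columns actually returned by match_database_schema
print(matching_schema.columns.tolist())
# Hypothetical filter: keep rows whose dependency was not found in the schema
# (replace "exact_match" with the real flag column from the output above)
missing = matching_schema[~matching_schema["exact_match"]]
print(missing)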
Optionally, database credentials can be provided via a YAML configuration file:
# database.yml
database:
  host: localhost
  port: 5432
  database: mydatabase
  username: username
  password: password
You can customize the prompts used to instruct the LLM:
# Create extractor with custom prompt
extractor = create_extractor(
    framework="groq",
    model="llama-3.3-70b-versatile",
    prompt_path="path/to/custom_prompt.yml"
)
The custom prompt YAML should include both a `system_prompt` and a `user_prompt`; the `{sql}` placeholder in `user_prompt` is replaced with the query text at extraction time:
system_prompt: |
  You are a SQL analyzer that extracts two key elements from SQL queries:
  1. DEPENDENCIES: Tables and columns that must exist BEFORE query execution.
  2. OUTPUTS: Tables and columns permanently CREATED or MODIFIED by the query.
  # Add detailed instructions for the LLM here...

user_prompt: |
  Extract SQL dependencies and outputs from this query:
  {sql}
SQLDeps provides built-in visualization capabilities to help you understand complex SQL dependencies:
from sqldeps.llm_parsers import create_extractor
from sqldeps.visualization import visualize_sql_dependencies
# Create an interactive network graph from multiple SQL files
extractor = create_extractor()
sql_profiles = extractor.extract_from_folder("path/to/folder", recursive=False)
# Generate an interactive visualization (saving output to an HTML file)
figure = visualize_sql_dependencies(sql_profiles, output_path="dependencies.html")
# Show figure
figure.show()
For comprehensive documentation, including API reference and examples, visit https://sqldeps.readthedocs.io.
Contributions are welcome!
- Found a bug? Please open an issue with detailed information.
- Missing a feature? Feel free to suggest enhancements or submit a pull request.
Please check out the Contributing Guide for details.
MIT