Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix regex in extract_sql to correctly handle SQL containing CTEs (without matching WITHIN) #512

Merged
merged 1 commit into from
Jun 21, 2024

Conversation

mahendrasinghbora
Copy link
Contributor

Pull Request Description:

This pull request addresses an issue in the extract_sql function where the regex used to extract SQL queries from LLM responses incorrectly matches the keyword "WITHIN". This results in incomplete and invalid SQL queries being extracted. The regex has been updated to use word boundaries to ensure accurate matching of the "WITH" keyword at the start of CTEs. This fix resolves the syntax errors encountered when running the extracted SQL queries.

Related Previous Pull Requests:

Issue:
When a response contains a SQL query with the keyword "WITHIN", the regex mistakenly captures partial SQL statements, resulting in invalid SQL. An example of such an SQL query is provided below:

Example SQL:

SELECT
    CASE
        WHEN EXTRACT(YEAR FROM AGE(CURRENT_DATE, dob)) < 20 THEN '< 20'
        WHEN EXTRACT(YEAR FROM AGE(CURRENT_DATE, dob)) BETWEEN 20 AND 29 THEN '20-29'
        WHEN EXTRACT(YEAR FROM AGE(CURRENT_DATE, dob)) BETWEEN 30 AND 39 THEN '30-39'
        WHEN EXTRACT(YEAR FROM AGE(CURRENT_DATE, dob)) BETWEEN 40 AND 49 THEN '40-49'
        ELSE '50+'
    END AS age_group,
    COUNT(DISTINCT customer.customer_id) AS customer_count,
    SUM(invoice.invoice_amount) AS total_spent,
    COUNT(invoice.invoice_id) AS total transactions,
    AVG(invoice.invoice_amount) AS average spent per transaction,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY invoice.invoice_amount) AS median_spent
FROM
    customer
    LEFT JOIN invoice ON customer.customer_id = invoice.customer_id
GROUP BY
    age_group;

Extracted SQL before fix:

WITHIN GROUP (ORDER BY invoice.invoice_amount) AS median_spent
FROM
    customer
    LEFT JOIN invoice ON customer.customer_id = invoice.customer_id
GROUP BY
    age_group;

Solution:
The regex was updated to use word boundaries to correctly match the keyword "WITH" at the start of a CTE, ensuring accurate extraction of SQL statements.

Updated regex:

sqls = re.findall(r"\bWITH\b .*?;", llm_response, re.DOTALL)

The current fix ensures that SQL extraction handles CTEs properly without misinterpreting "WITHIN" as a CTE starter.

@zainhoda zainhoda merged commit dce3186 into vanna-ai:main Jun 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants