bugfix: just allow having use expr in groupby or aggr #4579

Merged
2 commits merged into apache:master on Dec 12, 2022

Conversation

jackwener
Member

@jackwener jackwener commented Dec 10, 2022

Which issue does this PR close?

Part of #4556 .

HAVING may only use expressions that appear in GROUP BY or in an aggregate function.

Rationale for this change

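For reference, PostgreSQL rejects a HAVING clause that references a non-grouped, non-aggregated column: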

-- create
CREATE TABLE EMPLOYEE (
  empId INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  dept TEXT NOT NULL
);

-- insert
INSERT INTO EMPLOYEE VALUES (0001, 'Clark', 'Sales');

-- fetch 
SELECT empid FROM EMPLOYEE having empid = 0;

psql:commands.sql:15: ERROR:  column "employee.empid" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT empid FROM EMPLOYEE having empid = 0;
               ^

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate sql SQL Planner labels Dec 10, 2022
Contributor

@alamb alamb left a comment


Thank you @jackwener

I am not sure I would call this an "API break" -- more like a bugfix where DataFusion was previously allowing malformed SQL. 😆

I had one negative case I think we may need to check but otherwise the code looks good to me 👍

@@ -194,7 +194,7 @@ async fn csv_query_group_by_and_having_and_where() -> Result<()> {
 async fn csv_query_having_without_group_by() -> Result<()> {
     let ctx = SessionContext::new();
     register_aggregate_csv(&ctx).await?;
-    let sql = "SELECT c1, c2, c3 FROM aggregate_test_100 HAVING c2 >= 4 AND c3 > 90";
+    let sql = "SELECT c1, c2, c3 FROM aggregate_test_100 where c2 >= 4 AND c3 > 90";
Contributor


I think we can remove the whole test: we definitely have coverage for selecting with predicates in the WHERE clause, and the test name csv_query_having_without_group_by no longer makes sense.

\n Filter: person.age > Int64(100) AND person.age < Int64(200)\
\n TableScan: person";
quick_test(sql, expected);
let err = logical_plan(sql).expect_err("query should have failed");
Contributor


I recommend adding one more test (which should also fail) where there is a GROUP BY clause but the HAVING clause refers to an invalid column, such as

SELECT id, MAX(age)
FROM PERSON
GROUP BY id
-- first_name is a valid column but does not appear in the grouping output (neither a group column nor an aggregate)
HAVING first_name = 'M'

Member Author


Nice suggestion ❤️
I have added it.

.is_empty()
|| !aggr_exprs.is_empty()
{
self.aggregate(
Contributor


I think we have to check here that the HAVING expr only uses columns available in the logical aggregation phase (GROUP BY columns or aggregate exprs).

Member Author

@jackwener jackwener Dec 11, 2022


That is already checked by check_columns_satisfy_exprs().
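
For readers unfamiliar with the rule under discussion, here is a minimal sketch of the kind of check involved. It uses a toy expression type and a hypothetical `check_having` function. It is not DataFusion's `Expr`, and it does not reproduce the signature or internals of `check_columns_satisfy_exprs()`. The assumption is simply that a HAVING expression is valid when every column it references is either a GROUP BY expression or sits inside an aggregate call.

```rust
/// Toy expression tree. Hypothetical: this is not DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(String),
    /// An aggregate call such as MAX(age).
    Max(Box<Expr>),
    /// A binary operation such as `a = b` or `a AND b`.
    Binary(Box<Expr>, &'static str, Box<Expr>),
}

// Small constructors to keep the examples readable.
fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}
fn lit(value: &str) -> Expr {
    Expr::Literal(value.to_string())
}
fn max(arg: Expr) -> Expr {
    Expr::Max(Box::new(arg))
}
fn binary(left: Expr, op: &'static str, right: Expr) -> Expr {
    Expr::Binary(Box::new(left), op, Box::new(right))
}

/// Checks that a HAVING expression depends only on GROUP BY expressions
/// and aggregates (hypothetical helper for illustration).
fn check_having(expr: &Expr, group_by: &[Expr]) -> Result<(), String> {
    // A sub-expression that matches a GROUP BY expression is available as a whole.
    if group_by.contains(expr) {
        return Ok(());
    }
    match expr {
        Expr::Literal(_) => Ok(()),
        // Aggregates are computed per group, so their arguments may use any input column.
        Expr::Max(_) => Ok(()),
        // A bare column that is not grouped cannot be resolved after aggregation.
        Expr::Column(name) => Err(format!(
            "HAVING clause references non-aggregate column: {}",
            name
        )),
        Expr::Binary(left, _, right) => {
            check_having(left, group_by)?;
            check_having(right, group_by)
        }
    }
}
```

With `group_by = [col("id")]`, the predicate `first_name = 'M'` is rejected, while `id = '1' AND MAX(age) > '30'` is accepted. With an empty `group_by`, a predicate such as `empid = 0` is rejected, which matches the PostgreSQL behaviour quoted in the PR description.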

@jackwener jackwener changed the title from "API-break: just allow having use expr in groupby or aggr" to "bugfix: just allow having use expr in groupby or aggr" on Dec 11, 2022
@jackwener
Member Author

Thanks for the review, @alamb, very helpful advice ❤️.
I have resolved the comments.

HAVING first_name = 'M'";
let err = logical_plan(sql).expect_err("query should have failed");
assert_eq!(
"Plan(\"HAVING clause references non-aggregate values: Expression person.first_name could not be resolved from available columns: person.id, MAX(person.age)\")",
Contributor


Very nice!

@alamb
Contributor

alamb commented Dec 12, 2022

Thanks @jackwener

@alamb alamb merged commit d33457c into apache:master Dec 12, 2022
@mingmwang
Contributor

@jackwener Could you please add a unit test to cover this case?

SELECT id FROM person GROUP BY id ORDER BY MAX(age);

It looks like this is allowed in PostgreSQL and Spark SQL.
