-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement actual count wildcard in physical layer and fix duplicated schema name error from count wildcard #14824
base: main
Are you sure you want to change the base?
Conversation
// handle count() case | ||
if expr.is_empty() { | ||
return Ok(vec![ | ||
Arc::new(Int64Array::from(vec![1; batch.num_rows()])) as ArrayRef |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is equivalent to count(1)
case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that this function is not only used by count
. I'm not quite sure about the impact of this change.
Ideally, this function should not involve the logic of any specific aggregation function.
.collect::<Result<Vec<_>>>()?; | ||
// Handle count(*) case | ||
let values = if expr.is_empty() { | ||
vec![Arc::new(Int64Array::from(vec![1; n_rows])) as ArrayRef] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is equivalent to count(1) case
fix the extended test in main branch |
I filed #14853 and added to what this PR closes |
@@ -148,6 +155,15 @@ impl AggregateUDFImpl for Count { | |||
"count" | |||
} | |||
|
|||
// In AggregateFunctionPlanner, wildcard is converted to count(1) | |||
// | |||
// count() -> count(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still can't run select count(), count(*)
.
> select count(), count(*);
Error during planning: Projections require unique expression names but the expression "count(*)" at position 0 and "count(*)" at position 1 have the same name. Consider aliasing ("AS") one of them
I suspect that using aliases to restore the original names is a simpler fix. I tried doing this on jonahgao@08206fd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the issue here is quite different than the test covered in extended test.
duplicated schema case is executable now
query error DataFusion error: Schema error: Schema contains duplicate unqualified field name "count\(\*\)"
select count(1) * count(2);
select count(), count(*)
duplicated name in projection is another issue
But I agree, this query should be executable too, and I think the way to fix it is different from the duplicated schema name one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW I verified that both of those queries run in datafusion
44 and 45 but does not run on main. Thus this is a regression.
I agree with @jayzhan211 that the issue is different than what is causing the sqlite tests to fail in main
I have filed a ticket to track this:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they have the same root cause, which is the rewriting by AggregateFunctionPlanner
and Count::schema_name()
introducing duplicate names, and they could all be fixed by using aliases. The old CountWildcardRule
used NamePreserver
to achieve a similar effect.
I ran the sqllogictests locally and verified this patch fixes them: INCLUDE_SQLITE=true cargo test --profile release-nonlto --test sqllogictests
...
...
Finished `release-nonlto` profile [optimized] target(s) in 1m 48s
Running bin/sqllogictests.rs (target/release-nonlto/deps/sqllogictests-f643b09b33355b16)
Completed 705 test files in 6 minutes
andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jayzhan211 -- I think this PR is an improvement (it fixes the extended tests)
However, I agree with @jonahgao that it might be better to not change the physical implementation of count()
and instead rewrite count(*)
to count(1) as "count(*)"
That would likely also fix #14855
@@ -550,8 +571,6 @@ impl AggregateUDFImpl for Count { | |||
fn is_count_wildcard(args: &[Expr]) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function now feels a bit redundant as it is just checking for .empty()
2 | ||
|
||
query I | ||
select count(1) * count(2) from t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also please add a test that shows just the values of count(2)
For example
select count(1), count(2), count(1) * count(2) from t;
---- | ||
4 | ||
|
||
query I |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise here it would be nice to have count(1)
and count(2)
individually tested
@@ -1460,13 +1460,13 @@ fn select_simple_aggregate_with_groupby_and_column_is_in_aggregate_and_groupby() | |||
#[test] | |||
fn select_simple_aggregate_with_groupby_can_use_positions() { | |||
quick_test( | |||
"SELECT state, age AS b, count(1) FROM person GROUP BY 1, 2", | |||
"SELECT state, age AS b, count() FROM person GROUP BY 1, 2", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this test changed?
@@ -80,12 +80,12 @@ query TT | |||
EXPLAIN SELECT a, COUNT() OVER (PARTITION BY a) AS count_a FROM t1; | |||
---- | |||
logical_plan | |||
01)Projection: t1.a, count(*) PARTITION BY [t1.a] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS count_a | |||
02)--WindowAggr: windowExpr=[[count(*) PARTITION BY [t1.a] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]] | |||
01)Projection: t1.a, count(Int64(1)) PARTITION BY [t1.a] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS count_a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given you implemented support for count()
I don't understand why this is this changed to count(1)
(why isn't it count()?`
Can we just give names to generated projections to avoid duplicated schema name error? or is the problem solvable only at the physical planning level? |
I think we can fix this with the generated projections (and I think it is what @jonahgao is implemented) |
Which issue does this PR close?
We convert count(constant) i.e. count(2) to count(*) in previous PR
so
select count(1) * count(2)
produces duplicated schema name error given both arecount(*)
in schema name.Rationale for this change
Instead of converting
count()
andcount(*)
tocount(1)
. We makescount()
possible as a replacement of count wildcard. In this case,count(1)
can be treated as the normal case (although it is equivalent to wildcard), without this we need to handle many different complex case forcount(1)
such ascount(cast(1 as i32))
. The schema name is much more consistent with DuckDB too.What changes are included in this PR?
Implement count with zero arg in aggregate function level.
count()
-> count()count(*)
-> count()count(1)
-> count(1)count(2)
-> count(2)Are these changes tested?
Are there any user-facing changes?