Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement actual count wildcard in physical layer and fix duplicated schema name error from count wildcard #14824

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Feb 22, 2025

Which issue does this PR close?

We convert count(constant) i.e. count(2) to count(*) in previous PR
so select count(1) * count(2) produces duplicated schema name error given both are count(*) in schema name.

Rationale for this change

Instead of converting count() and count(*) to count(1). We makes count() possible as a replacement of count wildcard. In this case, count(1) can be treated as the normal case (although it is equivalent to wildcard), without this we need to handle many different complex case for count(1) such as count(cast(1 as i32)). The schema name is much more consistent with DuckDB too.

What changes are included in this PR?

Implement count with zero arg in aggregate function level.

count() -> count()
count(*) -> count()
count(1) -> count(1)
count(2) -> count(2)

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions labels Feb 22, 2025
@jayzhan211 jayzhan211 changed the title Fix duplicated schema name of count wildcard issue Fix duplicated schema name error from count wildcard Feb 22, 2025
@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions labels Feb 22, 2025
// handle count() case
if expr.is_empty() {
return Ok(vec![
Arc::new(Int64Array::from(vec![1; batch.num_rows()])) as ArrayRef
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is equivalent to count(1) case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that this function is not only used by count. I'm not quite sure about the impact of this change.
Ideally, this function should not involve the logic of any specific aggregation function.

.collect::<Result<Vec<_>>>()?;
// Handle count(*) case
let values = if expr.is_empty() {
vec![Arc::new(Int64Array::from(vec![1; n_rows])) as ArrayRef]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is equivalent to count(1) case

@jayzhan211 jayzhan211 marked this pull request as ready for review February 22, 2025 13:51
@jayzhan211 jayzhan211 changed the title Fix duplicated schema name error from count wildcard Implement actual count wildcard in physical layer and fix duplicated schema name error from count wildcard Feb 22, 2025
@jayzhan211 jayzhan211 marked this pull request as draft February 22, 2025 14:14
@github-actions github-actions bot added sql SQL Planner optimizer Optimizer rules substrait labels Feb 22, 2025
@jayzhan211 jayzhan211 marked this pull request as ready for review February 23, 2025 03:15
@jayzhan211 jayzhan211 requested a review from alamb February 24, 2025 01:20
@jayzhan211
Copy link
Contributor Author

fix the extended test in main branch

@alamb
Copy link
Contributor

alamb commented Feb 24, 2025

I filed #14853 and added to what this PR closes

@@ -148,6 +155,15 @@ impl AggregateUDFImpl for Count {
"count"
}

// In AggregateFunctionPlanner, wildcard is converted to count(1)
//
// count() -> count(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still can't run select count(), count(*).

> select count(), count(*);
Error during planning: Projections require unique expression names but the expression "count(*)" at position 0 and "count(*)" at position 1 have the same name. Consider aliasing ("AS") one of them

I suspect that using aliases to restore the original names is a simpler fix. I tried doing this on jonahgao@08206fd.

Copy link
Contributor Author

@jayzhan211 jayzhan211 Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue here is quite different than the test covered in extended test.

duplicated schema case is executable now

query error DataFusion error: Schema error: Schema contains duplicate unqualified field name "count\(\*\)"
select count(1) * count(2);

select count(), count(*) duplicated name in projection is another issue

But I agree, this query should be executable too, and I think the way to fix it is different from the duplicated schema name one

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I verified that both of those queries run in datafusion 44 and 45 but does not run on main. Thus this is a regression.

I agree with @jayzhan211 that the issue is different than what is causing the sqlite tests to fail in main

I have filed a ticket to track this:

Copy link
Member

@jonahgao jonahgao Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they have the same root cause, which is the rewriting by AggregateFunctionPlanner and Count::schema_name() introducing duplicate names, and they could all be fixed by using aliases. The old CountWildcardRule used NamePreserver to achieve a similar effect.

@alamb
Copy link
Contributor

alamb commented Feb 24, 2025

I ran the sqllogictests locally and verified this patch fixes them:

INCLUDE_SQLITE=true cargo test --profile release-nonlto --test sqllogictests
...
...
    Finished `release-nonlto` profile [optimized] target(s) in 1m 48s
     Running bin/sqllogictests.rs (target/release-nonlto/deps/sqllogictests-f643b09b33355b16)
Completed 705 test files in 6 minutes
andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jayzhan211 -- I think this PR is an improvement (it fixes the extended tests)

However, I agree with @jonahgao that it might be better to not change the physical implementation of count() and instead rewrite count(*) to count(1) as "count(*)"

That would likely also fix #14855

@@ -550,8 +571,6 @@ impl AggregateUDFImpl for Count {
fn is_count_wildcard(args: &[Expr]) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function now feels a bit redundant as it is just checking for .empty()

2

query I
select count(1) * count(2) from t;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also please add a test that shows just the values of count(2)

For example

select count(1), count(2), count(1) * count(2) from t;

----
4

query I
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise here it would be nice to have count(1) and count(2) individually tested

@@ -1460,13 +1460,13 @@ fn select_simple_aggregate_with_groupby_and_column_is_in_aggregate_and_groupby()
#[test]
fn select_simple_aggregate_with_groupby_can_use_positions() {
quick_test(
"SELECT state, age AS b, count(1) FROM person GROUP BY 1, 2",
"SELECT state, age AS b, count() FROM person GROUP BY 1, 2",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this test changed?

@@ -80,12 +80,12 @@ query TT
EXPLAIN SELECT a, COUNT() OVER (PARTITION BY a) AS count_a FROM t1;
----
logical_plan
01)Projection: t1.a, count(*) PARTITION BY [t1.a] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS count_a
02)--WindowAggr: windowExpr=[[count(*) PARTITION BY [t1.a] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]]
01)Projection: t1.a, count(Int64(1)) PARTITION BY [t1.a] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS count_a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given you implemented support for count() I don't understand why this is this changed to count(1) (why isn't it count()?`

@findepi
Copy link
Member

findepi commented Feb 24, 2025

We convert count(constant) i.e. count(2) to count(*) in previous PR
so select count(1) * count(2) produces duplicated schema name error given both are count(*) in schema name.

Can we just give names to generated projections to avoid duplicated schema name error?

or is the problem solvable only at the physical planning level?

@alamb
Copy link
Contributor

alamb commented Feb 24, 2025

We convert count(constant) i.e. count(2) to count(*) in previous PR
so select count(1) * count(2) produces duplicated schema name error given both are count(*) in schema name.

Can we just give names to generated projections to avoid duplicated schema name error?

or is the problem solvable only at the physical planning level?

I think we can fix this with the generated projections (and I think it is what @jonahgao is implemented)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt) substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extended sqllite tests are failing on main
4 participants