Implement actual count wildcard in physical layer and fix duplicated schema name error from count wildcard #14824

jayzhan211 · 2025-02-22T11:53:30Z

Which issue does this PR close?

A previous PR converts count(constant), e.g. count(2), to count(*),
so select count(1) * count(2) produces a duplicated schema name error: both aggregates end up with the schema name count(*).

Closes Extended sqllite tests are failing on main #14853

Rationale for this change

Instead of converting count() and count(*) to count(1), we make count() itself usable as the replacement for the count wildcard. count(1) can then be treated as a normal case (although it is equivalent to the wildcard). Without this, we would need to handle many complex variants of count(1), such as count(cast(1 as i32)). The resulting schema name is also much more consistent with DuckDB.

What changes are included in this PR?

Implement count with zero arguments at the aggregate function level.

count() -> count()
count(*) -> count()
count(1) -> count(1)
count(2) -> count(2)
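
The mapping above can be sketched as a small naming function. This is only an illustration under stated assumptions: `count_schema_name` is a hypothetical helper, arguments are modeled as plain strings, and it is not the `schema_name` implementation in this PR.

```rust
/// Hypothetical helper (not DataFusion API): derive the schema name of a
/// `count` call from its argument list, following the mapping listed above.
fn count_schema_name(args: &[&str]) -> String {
    match args {
        // `count()` and `count(*)` both collapse to the zero-argument form.
        [] | ["*"] => "count()".to_string(),
        // Any other argument list keeps its arguments, so `count(1)` and
        // `count(2)` get distinct names.
        _ => format!("count({})", args.join(", ")),
    }
}
```

Because count(1) and count(2) keep their own names in this sketch, an expression such as count(1) * count(2) no longer refers to two aggregates with the same schema name.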

Are these changes tested?

Are there any user-facing changes?

jayzhan211 · 2025-02-22T13:50:11Z

datafusion/physical-plan/src/aggregates/mod.rs

+    // handle count() case
+    if expr.is_empty() {
+        return Ok(vec![
+            Arc::new(Int64Array::from(vec![1; batch.num_rows()])) as ArrayRef


This is equivalent to the count(1) case.

It seems that this function is not only used by count. I'm not quite sure about the impact of this change.
Ideally, this function should not involve the logic of any specific aggregation function.
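
To make the idea under discussion concrete, here is a minimal sketch of the "empty expression list means a column of ones" handling. It assumes plain `Vec<Option<i64>>` columns in place of Arrow arrays, and `evaluate_inputs` and `count_non_null` are hypothetical names, not functions from the physical-plan crate.

```rust
/// Hypothetical stand-in for evaluating an aggregate's input expressions
/// against a batch of `n_rows` rows. Each inner Vec is one evaluated column.
fn evaluate_inputs(columns: &[Vec<Option<i64>>], n_rows: usize) -> Vec<Vec<Option<i64>>> {
    if columns.is_empty() {
        // Zero-argument count(): synthesize one column of non-null ones, one
        // per row, so the accumulator sees the same input as for count(1).
        return vec![vec![Some(1); n_rows]];
    }
    columns.to_vec()
}

/// Count the non-null values of a column, which is what `count` computes.
fn count_non_null(column: &[Option<i64>]) -> usize {
    column.iter().filter(|v| v.is_some()).count()
}
```

The sketch also shows the concern raised above: the special case lives in a generic evaluation helper, so any aggregate with an empty argument list would receive the synthesized column, not only count.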

jayzhan211 · 2025-02-22T13:50:19Z

datafusion/physical-plan/src/aggregates/no_grouping.rs

-                .collect::<Result<Vec<_>>>()?;
+            // Handle count(*) case
+            let values = if expr.is_empty() {
+                vec![Arc::new(Int64Array::from(vec![1; n_rows])) as ArrayRef]


This is equivalent to the count(1) case.

jayzhan211 · 2025-02-24T01:21:01Z

This fixes the extended tests on the main branch.

alamb · 2025-02-24T14:21:34Z

I filed #14853 and added it to the issues this PR closes.

jonahgao · 2025-02-24T14:24:11Z

datafusion/functions-aggregate/src/count.rs

@@ -148,6 +155,15 @@ impl AggregateUDFImpl for Count {
        "count"
    }

+    // In AggregateFunctionPlanner, wildcard is converted to count(1)
+    //
+    // count() -> count(1)


We still can't run select count(), count(*).

> select count(), count(*);
Error during planning: Projections require unique expression names but the expression "count(*)" at position 0 and "count(*)" at position 1 have the same name. Consider aliasing ("AS") one of them

I suspect that using aliases to restore the original names is a simpler fix. I tried doing this on jonahgao@08206fd.

I think the issue here is quite different from the one covered by the extended test.

The duplicated schema name case is executable now:

query error DataFusion error: Schema error: Schema contains duplicate unqualified field name "count\(\*\)"
select count(1) * count(2);

The duplicated name in the projection for select count(), count(*) is another issue.

But I agree, this query should be executable too, and I think the way to fix it is different from the duplicated schema name one.

BTW I verified that both of those queries run in datafusion 44 and 45 but do not run on main. Thus this is a regression.

I agree with @jayzhan211 that the issue is different from what is causing the sqlite tests to fail on main.

I have filed a ticket to track this:

Regression since 45.0.0: select count(), count(*) does not work #14855

I think they have the same root cause, which is the rewriting by AggregateFunctionPlanner and Count::schema_name() introducing duplicate names, and they could all be fixed by using aliases. The old CountWildcardRule used NamePreserver to achieve a similar effect.
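
The alias-based approach suggested here can be sketched with a toy expression type. Everything below is hypothetical (the `Expr` enum, `display_name`, and `rewrite_count_wildcard` are illustrations, not DataFusion's types or the code in jonahgao@08206fd): the rewrite replaces a wildcard count with count(1) and wraps it in an alias carrying the original name.

```rust
/// Toy expression type for illustration only (not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Literal(i64),
    Wildcard,
    Count(Vec<Expr>),
    Alias(Box<Expr>, String),
}

/// Name an expression the way a projection would display it.
fn display_name(expr: &Expr) -> String {
    match expr {
        Expr::Literal(v) => v.to_string(),
        Expr::Wildcard => "*".to_string(),
        Expr::Count(args) => {
            let inner: Vec<String> = args.iter().map(display_name).collect();
            format!("count({})", inner.join(", "))
        }
        // An alias hides the inner expression's name.
        Expr::Alias(_, name) => name.clone(),
    }
}

/// Rewrite `count()` and `count(*)` to `count(1)`, aliased back to the
/// original name so the output column name is unchanged by the rewrite.
fn rewrite_count_wildcard(expr: Expr) -> Expr {
    let original_name = display_name(&expr);
    match expr {
        Expr::Count(args) if args.is_empty() || args == [Expr::Wildcard] => Expr::Alias(
            Box::new(Expr::Count(vec![Expr::Literal(1)])),
            original_name,
        ),
        other => other,
    }
}
```

In this sketch, count() and count(*) are both evaluated as count(1) but keep the distinct output names count() and count(*), which is the effect the old CountWildcardRule obtained through NamePreserver.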

alamb · 2025-02-24T15:03:19Z

I ran the sqllogictests locally and verified this patch fixes them:

INCLUDE_SQLITE=true cargo test --profile release-nonlto --test sqllogictests
...
...
    Finished `release-nonlto` profile [optimized] target(s) in 1m 48s
     Running bin/sqllogictests.rs (target/release-nonlto/deps/sqllogictests-f643b09b33355b16)
Completed 705 test files in 6 minutes
andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$

alamb

Thanks @jayzhan211 -- I think this PR is an improvement (it fixes the extended tests)

However, I agree with @jonahgao that it might be better to not change the physical implementation of count() and instead rewrite count(*) to count(1) as "count(*)"

That would likely also fix #14855

alamb · 2025-02-24T15:12:59Z

datafusion/functions-aggregate/src/count.rs

@@ -550,8 +571,6 @@ impl AggregateUDFImpl for Count {
 fn is_count_wildcard(args: &[Expr]) -> bool {


This function now feels a bit redundant, as it just checks .is_empty().

alamb · 2025-02-24T15:15:08Z

datafusion/sqllogictest/test_files/aggregate.slt

+2
+
+query I
+select count(1) * count(2) from t;


Could you also please add a test that shows just the values of count(1) and count(2)?

For example

select count(1), count(2), count(1) * count(2) from t;

alamb · 2025-02-24T15:15:28Z

datafusion/sqllogictest/test_files/aggregate.slt

+----
+4
+
+query I


Likewise here it would be nice to have count(1) and count(2) individually tested

alamb · 2025-02-24T15:16:33Z

datafusion/sql/tests/sql_integration.rs

@@ -1460,13 +1460,13 @@ fn select_simple_aggregate_with_groupby_and_column_is_in_aggregate_and_groupby()
 #[test]
 fn select_simple_aggregate_with_groupby_can_use_positions() {
    quick_test(
-        "SELECT state, age AS b, count(1) FROM person GROUP BY 1, 2",
+        "SELECT state, age AS b, count() FROM person GROUP BY 1, 2",


Why is this test changed?

alamb · 2025-02-24T15:18:31Z

datafusion/sqllogictest/test_files/count_star_rule.slt

@@ -80,12 +80,12 @@ query TT
 EXPLAIN SELECT a, COUNT() OVER (PARTITION BY a) AS count_a FROM t1;
 ----
 logical_plan
-01)Projection: t1.a, count(*) PARTITION BY [t1.a] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS count_a
-02)--WindowAggr: windowExpr=[[count(*) PARTITION BY [t1.a] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]]
+01)Projection: t1.a, count(Int64(1)) PARTITION BY [t1.a] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS count_a


Given you implemented support for count(), I don't understand why this is changed to count(1). Why isn't it count()?

findepi · 2025-02-24T16:03:42Z

> We convert count(constant) i.e. count(2) to count(*) in previous PR
> so select count(1) * count(2) produces duplicated schema name error given both are count(*) in schema name.

Can we just give names to generated projections to avoid duplicated schema name error?

or is the problem solvable only at the physical planning level?

alamb · 2025-02-24T16:59:25Z

> > We convert count(constant) i.e. count(2) to count(*) in previous PR
> > so select count(1) * count(2) produces duplicated schema name error given both are count(*) in schema name.
>
> Can we just give names to generated projections to avoid duplicated schema name error?
>
> or is the problem solvable only at the physical planning level?

I think we can fix this with the generated projections (and I think that is what @jonahgao has implemented).

fix name

0af4ab9

github-actions bot added sqllogictest SQL Logic Tests (.slt) functions labels Feb 22, 2025

jayzhan211 changed the title ~~Fix duplicated schema name of count wildcard issue~~ Fix duplicated schema name error from count wildcard Feb 22, 2025

upd doc

7f18e05

jayzhan211 mentioned this pull request Feb 22, 2025

Remove CountWildcardRule in Analyzer and move the functionality in ExprPlanner, add plan_aggregate and plan_window to planner #14689

Merged

jayzhan211 requested a review from jonahgao February 22, 2025 12:08

jayzhan211 marked this pull request as draft February 22, 2025 12:20

jayzhan211 added 3 commits February 22, 2025 20:24

drop table

3ef7ddd

real count()

40385aa

clippy

a456792

github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions labels Feb 22, 2025

jayzhan211 commented Feb 22, 2025

jayzhan211 marked this pull request as ready for review February 22, 2025 13:51

jayzhan211 changed the title ~~Fix duplicated schema name error from count wildcard~~ Implement actual count wildcard in physical layer and fix duplicated schema name error from count wildcard Feb 22, 2025

jayzhan211 marked this pull request as draft February 22, 2025 14:14

jayzhan211 added 3 commits February 22, 2025 22:22

fix tests

3497965

fix test

d956307

fix other tests

e24cf29

github-actions bot added sql SQL Planner optimizer Optimizer rules substrait labels Feb 22, 2025

jayzhan211 added 3 commits February 23, 2025 07:57

fix proto test

e54d4b8

fix substrait test

6ee5a35

fnt

2a2d0d3

jayzhan211 marked this pull request as ready for review February 23, 2025 03:15

jayzhan211 requested a review from alamb February 24, 2025 01:20

alamb mentioned this pull request Feb 24, 2025

Extended sqllite tests are failing on main #14853

Open

alamb mentioned this pull request Feb 24, 2025

Weekly Plan (Andrew Lamb) Feb 24, 2025 #14850

Open

9 tasks

jonahgao reviewed Feb 24, 2025

alamb mentioned this pull request Feb 24, 2025

Regression since 45.0.0: select count(), count(*) does not work #14855

Open

alamb approved these changes Feb 24, 2025
