Support join-filter pushdown for semi/anti join #4923

ygf11 · 2023-01-16T12:35:36Z

Which issue does this PR close?

Relative to #4413

Rationale for this change

What changes are included in this PR?

Enable filter pushdown semi/anti join.
Add tests.

Are these changes tested?

Are there any user-facing changes?

mingmwang · 2023-01-17T06:38:30Z

is this PR ready for review now?

I'm working on another issue which might be related to the things that you are working on. When decorrelates the where Exists/IN Subquery to Joins, there is some bug here and the filter was added to wrong side of the plan

#4366

ygf11 · 2023-01-17T12:07:02Z

cc: @mingmwang @alamb

alamb

Thanks @ygf11

alamb · 2023-01-17T20:40:50Z

datafusion/core/tests/sqllogictests/test_files/subquery.slt

@@ -0,0 +1,62 @@
+# Licensed to the Apache Software Foundation (ASF) under one


alamb · 2023-01-17T21:38:35Z

datafusion/optimizer/src/push_down_filter.rs

-                // filter_push_down does not yet support SEMI/ANTI joins with join conditions
-                Ok((false, false))
-            }
+            JoinType::LeftSemi | JoinType::RightSemi => Ok((true, true)),


I thought a Semi join had the same semantics as Left join -- namely that the left/right side were preserved but the others weren't

So I would expect this to be something like

Suggested change

JoinType::LeftSemi | JoinType::RightSemi => Ok((true, true)),

JoinType::LeftSemi => Ok((false, true)),

JoinType::RightSemi => Ok((true, false)),

Maybe I am mis understanding the semi join

I played around with it in postgres:

postgres=# create table bar as values (2, 'a'), (3, 'b'); SELECT 2 postgres=# create table foo as values (1, 'X'), (2, 'Y'), (3, 'Z'); SELECT 3 postgres=# explain select * from foo SEMI JOIN bar USING (column1) where bar.column2 != 'b' ; QUERY PLAN ------------------------------------------------------------------------- Merge Join (cost=179.17..305.88 rows=8026 width=68) Merge Cond: (bar.column1 = semi.column1) -> Sort (cost=90.99..94.15 rows=1264 width=36) Sort Key: bar.column1 -> Seq Scan on bar (cost=0.00..25.88 rows=1264 width=36) Filter: (column2 <> 'b'::text) -> Sort (cost=88.17..91.35 rows=1270 width=36) Sort Key: semi.column1 -> Seq Scan on foo semi (cost=0.00..22.70 rows=1270 width=36) (9 rows) postgres=# select * from foo SEMI JOIN bar USING (column1); column1 | column2 | column2 ---------+---------+--------- 2 | Y | a 3 | Z | b (2 rows) postgres=# select * from foo SEMI JOIN bar USING (column1) where bar.column2 != 'b' ; column1 | column2 | column2 ---------+---------+--------- 2 | Y | a (1 row)

And this this looks correct to me

Update 🤦 the postgres foo SEMI JOIN bar is being interpreted as though SEMI is an alias, not a different join type

alamb · 2023-01-17T21:49:18Z

datafusion/optimizer/src/push_down_filter.rs

+        \n  Filter: test1.b > UInt32(1)\
+        \n    TableScan: test1\
+        \n  Projection: test2.a, test2.b\
+        \n    Filter: test2.b > UInt32(2)\


as above I don't think this is correct -- but I think we would need to review the implementation in spark

I think this method's name on_lr_is_preserved() is very miss leading. It means whether the Join condition can be changed to a Filter condition and push down. For example, for the Full Out Join, both sides should be preserved in the join result, but join conditions can not be changed to Filter and pushed down, and the method returns (false, false).

And per my understanding, LeftSemi/Right Semi join are similar to the Inner join case, so the implement should return (true, true).

mingmwang · 2023-01-18T08:15:42Z

@alamb @ygf11

Just share you the SparkSQL's result:

explain extended
SELECT t1_id, t1_name FROM t1 LEFT SEMI JOIN t2 ON (t1_id = t2_id and t2_id >= 100 and t1_id >= 100);

== Optimized Logical Plan ==
Join LeftSemi, (t1_id#1987050 = t2_id#1987346), Statistics(sizeInBytes=1.0 B)
:- Filter (isnotnull(t1_id#1987050) AND (t1_id#1987050 >= 100)), Statistics(sizeInBytes=1.0 B)
:  +- Relation access_views.t1[t1_id#1987050,t1_name#1987051] parquet, Statistics(sizeInBytes=0.0 B)
+- Project [t2_id#1987346], Statistics(sizeInBytes=1.0 B)
   +- Filter (isnotnull(t2_id#1987346) AND (t2_id#1987346 >= 100)), Statistics(sizeInBytes=1.0 B)
      +- Relation access_views.t2[t2_id#1987346,t2_name#1987347] parquet, Statistics(sizeInBytes=0.0 B)

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [t1_id#1987050], [t2_id#1987346], LeftSemi, BuildRight, false
   :- Project [t1_id#1987050, t1_name#1987051]
   :  +- Filter (isnotnull(t1_id#1987050) AND (t1_id#1987050 >= 100))
   :     +- FileScan parquet access_views.t1[t1_id#1987050,t1_name#1987051] Batched: true, DataFilters: [isnotnull(t1_id#1987050), (t1_id#1987050 >= 100)], Format: Parquet, Location: InMemoryFileIndex[viewfs://hermes-rno/tmp/spark/[email protected]/temp-d6be7dc2-736c-4205-8b..., PartitionFilters: [], PushedFilters: [IsNotNull(t1_id), GreaterThanOrEqual(t1_id,100)], ReadSchema: struct<t1_id:int,t1_name:string>, UsedIndexes: []
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#301830]
      +- Project [t2_id#1987346]
         +- Filter (isnotnull(t2_id#1987346) AND (t2_id#1987346 >= 100))
            +- FileScan parquet access_views.t2[t2_id#1987346] Batched: true, DataFilters: [isnotnull(t2_id#1987346), (t2_id#1987346 >= 100)], Format: Parquet, Location: InMemoryFileIndex[viewfs://hermes-rno/tmp/spark/[email protected]/temp-d6be7dc2-736c-4205-8b..., PartitionFilters: [], PushedFilters: [IsNotNull(t2_id), GreaterThanOrEqual(t2_id,100)], ReadSchema: struct<t2_id:int>, UsedIndexes: []

mingmwang · 2023-01-18T08:20:32Z

Except for the method name on_lr_is_preserved, this PR LGTM.

ygf11 · 2023-01-18T10:08:45Z

@alamb I guess you are mislead by the title of this pr.
This pr is to support pushdown for filters in join, not filter before join, and pushdown filter before join has already done. The title is fixed.

I also run this in the spark and get the same result with @mingmwang.
Thanks @mingmwang for review.

mingmwang · 2023-01-18T10:23:01Z

@ygf11
BTW, just let you know that SparkSQL also had some issues within the InferFilters and Filter push down logic, the LeftSemi/LeftAnti/IN_EXISTS Subquery push down logic is not consistent.

You can have a try.

-- left semi join, t2.id filter push down, t1 filter push down
explain extended 
SELECT t1.id, t1.name FROM t1 LEFT SEMI JOIN t2 ON (t1.id = t2.id and t2.id > 11);

-- left semi join, t1 filter push down, t2.id filter not push down ???
explain extended 
SELECT t1.id, t1.name FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t1.id = t2.id and t1.id > 11);

-- left semi join, t2.id filter push down, t1 filter not push down ???
explain extended 
SELECT t1.id, t1.name FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t1.id = t2.id and t2.id > 11);

alamb · 2023-01-18T15:25:28Z

If @mingmwang is happy with these semantics I think it would be fine to merge

I think my confusion stems from the fact that a SEMI join semantically only outputs columns from one relation and also the WHERE clause is evaluated after all joins (semantically)

So a query like this makes sense:

SELECT t1.id FROM t1 SEMI JOIN t2 ON (t1.id = t2.id) WHERE t1.value = 5

But a query like this (predicate on t2 after the semi join)

SELECT t1.id FROM t1 SEMI JOIN t2 ON (t1.id = t2.id) WHERE t2.value = 5

Does not make sense as t2.value does not appear in the query output

Maybe we can add some error cases shpwing that SELECT t1.id FROM t1 SEMI JOIN t2 ON (t1.id = t2.id) WHERE t2.value = 5 generates an error

It would also be good to cover such a predicate in the ON clause (which does make sense and could be pushed down I think):

SELECT t1.id FROM t1 SEMI JOIN t2 ON (t1.id = t2.id AND t2.value = 5)

Thanks for this pr @ygf11

ygf11 · 2023-01-19T11:22:08Z

Maybe we can add some error cases shpwing that SELECT t1.id FROM t1 SEMI JOIN t2 ON (t1.id = t2.id) WHERE t2.value = 5 generates an error

It would also be good to cover such a predicate in the ON clause (which does make sense and could be pushed down I think):

SELECT t1.id FROM t1 SEMI JOIN t2 ON (t1.id = t2.id AND t2.value = 5)

Added.

alamb

Looks good -- thank you @ygf11

alamb · 2023-01-19T14:43:38Z

datafusion/core/tests/sqllogictests/test_files/join.slt

+(44, 'x', 3),
+(55, 'w', 3);
+
+# left semi with wrong where clause


ursabot · 2023-01-19T14:52:18Z

Benchmark runs are scheduled for baseline = 19f6f19 and contender = dde23ef. dde23ef is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Support filter pushdown for semi/anti join

3a2d778

github-actions bot added core Core DataFusion crate optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Jan 16, 2023

ygf11 added 2 commits January 17, 2023 05:27

improve comment

9882109

add subquery.slt for sqllogictests

0c72aab

ygf11 marked this pull request as ready for review January 17, 2023 12:05

alamb reviewed Jan 17, 2023

View reviewed changes

ygf11 changed the title ~~Support filter pushdown for semi/anti join~~ Support join-filter pushdown for semi/anti join Jan 18, 2023

ygf11 added 3 commits January 19, 2023 04:25

Merge remote-tracking branch 'upstream/master' into filter-pushdown

85b8fdb

add tests

f1e6c17

improve test

f66a492

alamb approved these changes Jan 19, 2023

View reviewed changes

datafusion/core/tests/sqllogictests/test_files/join.slt

(44, 'x', 3),

(55, 'w', 3);

# left semi with wrong where clause

Copy link

Contributor

alamb Jan 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

alamb merged commit dde23ef into apache:master Jan 19, 2023

ygf11 deleted the filter-pushdown branch January 20, 2023 10:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support join-filter pushdown for semi/anti join #4923

Support join-filter pushdown for semi/anti join #4923

ygf11 commented Jan 16, 2023

mingmwang commented Jan 17, 2023

ygf11 commented Jan 17, 2023

alamb left a comment

alamb Jan 17, 2023

alamb Jan 17, 2023

alamb Jan 17, 2023

alamb Jan 18, 2023

alamb Jan 17, 2023

mingmwang Jan 18, 2023

mingmwang commented Jan 18, 2023

mingmwang commented Jan 18, 2023

ygf11 commented Jan 18, 2023 •

edited

Loading

mingmwang commented Jan 18, 2023

alamb commented Jan 18, 2023

ygf11 commented Jan 19, 2023

alamb left a comment

alamb Jan 19, 2023

ursabot commented Jan 19, 2023

		@@ -0,0 +1,62 @@
		# Licensed to the Apache Software Foundation (ASF) under one

	JoinType::LeftSemi \| JoinType::RightSemi => Ok((true, true)),
	JoinType::LeftSemi => Ok((false, true)),
	JoinType::RightSemi => Ok((true, false)),

Support join-filter pushdown for semi/anti join #4923

Support join-filter pushdown for semi/anti join #4923

Conversation

ygf11 commented Jan 16, 2023

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

mingmwang commented Jan 17, 2023

ygf11 commented Jan 17, 2023

alamb left a comment

Choose a reason for hiding this comment

alamb Jan 17, 2023

Choose a reason for hiding this comment

alamb Jan 17, 2023

Choose a reason for hiding this comment

alamb Jan 17, 2023

Choose a reason for hiding this comment

alamb Jan 18, 2023

Choose a reason for hiding this comment

alamb Jan 17, 2023

Choose a reason for hiding this comment

mingmwang Jan 18, 2023

Choose a reason for hiding this comment

mingmwang commented Jan 18, 2023

mingmwang commented Jan 18, 2023

ygf11 commented Jan 18, 2023 • edited Loading

mingmwang commented Jan 18, 2023

alamb commented Jan 18, 2023

ygf11 commented Jan 19, 2023

alamb left a comment

Choose a reason for hiding this comment

alamb Jan 19, 2023

Choose a reason for hiding this comment

ursabot commented Jan 19, 2023

ygf11 commented Jan 18, 2023 •

edited

Loading