Address performance/execution plan of TPCH query 9 #77

Dandandan · 2021-04-25T20:13:18Z

Describe the bug
TPC-H Query 9 consumes a lot of memory and takes ~8s in memory on a 8 core machine on SF=1 in memory (with 100%cpu usage across cores) to execute. That likely has to do with the plan misses a equi-join and turns it into a cross join.

To Reproduce
Run query 9, check plan and execution time.

Expected behavior
The query is expected to finish faster (<1s) and shouldn't need a cross join.

Additional context
N/A

pjmore · 2021-05-13T18:19:50Z

So I took a look at this and I have two solutions one which I believe always finds all possible inner joins but runs in N^2 and one that works for this case and should work for most others that runs in N LogN where N is the number of plans in the select statement. Is the N^2 runtime and the added petgraph dependency alright? I don't want datafusion to show up on Accidentally quadratic because of me haha. I've attached a brief description of each method incase there are any obvious improvements that can be made.

Thorough version

Code

Create undirected graph of plans where each edge represents a join connection.
Find the connected components of the graph.
For each connected component find the plan with the schema that has the fewest number of columns.
Cross join all of the previously selected plans into a new node, removing the selected nodes from the graph.
Starting from an arbitrary node, traverse the graph performing inner joins on all remaining nodes.

Less thorough version

Code

Iterate over plans counting the number of join key hits in each schema.
Sort plans so that plans that have no key hits are at the beginning and then the vector is sorted in descending order. E.g. the array (0,0,1,2,3,4) would be sorted to (0,0, 4,3,2,1). The rationale being that cross joins should prefer to run early to limit the amount of data that needs to be duplicated and that there are more likely to be joins to plans with many join keys.
Build the plan as was done previously.

The less thorough version was able to find the inner join that was missed for query 9 but there is still the possibility to miss inner joins. For example if the tables have the join relations as seen below, the order of the plans after sorting would be (B,F,D,A,C,E,G). Which means that the plan would generate a cross join between B and F as they share no join key pairs, missing the inner join through D.

┌─┐                ┌─┐
│A│                │E│
└┬┘                └┬┘
 │                  │
┌┴┐      ┌─┐       ┌┴┐
│B├──────┤D├───────┤F│
└┬┘      └─┘       └┬┘
 │                  │
┌┴┐                ┌┴┐
│C│                │G│
└─┘                └─┘

andygrove · 2023-01-04T14:45:53Z

I believe this is now resolved so will close this. Feel free to reopen if I am mistaken.

Here is the logical plan for q9 in master:

Sort: profit.nation ASC NULLS LAST, profit.o_year DESC NULLS FIRST
  Projection: profit.nation, profit.o_year, SUM(profit.amount) AS sum_profit
    Aggregate: groupBy=[[profit.nation, profit.o_year]], aggr=[[SUM(profit.amount)]]
      SubqueryAlias: profit
        Projection: nation.n_name AS nation, datepart(Utf8("YEAR"), orders.o_orderdate) AS o_year, CAST(lineitem.l_extendedprice AS Decimal128(38, 4)) * CAST(Decimal128(Some(100),23,2) - CAST(lineitem.l_discount AS Decimal128(23, 2)) AS Decimal128(38, 4)) - CAST(partsupp.ps_supplycost * lineitem.l_quantity AS Decimal128(38, 4)) AS amount
          Projection: lineitem.l_quantity, lineitem.l_extendedprice, lineitem.l_discount, partsupp.ps_supplycost, orders.o_orderdate, nation.n_name
            Inner Join: supplier.s_nationkey = nation.n_nationkey
              Inner Join: lineitem.l_orderkey = orders.o_orderkey
                Inner Join: lineitem.l_suppkey = partsupp.ps_suppkey, lineitem.l_partkey = partsupp.ps_partkey
                  Inner Join: lineitem.l_suppkey = supplier.s_suppkey
                    Inner Join: part.p_partkey = lineitem.l_partkey
                      Filter: part.p_name LIKE Utf8("%green%")
                        TableScan: part projection=[p_partkey, p_name]
                      TableScan: lineitem projection=[l_orderkey, l_partkey, l_suppkey, l_quantity, l_extendedprice, l_discount]
                    TableScan: supplier projection=[s_suppkey, s_nationkey]
                  TableScan: partsupp projection=[ps_partkey, ps_suppkey, ps_supplycost]
                TableScan: orders projection=[o_orderkey, o_orderdate]
              TableScan: nation projection=[n_nationkey, n_name]

Dandandan added the bug Something isn't working label Apr 25, 2021

pjmore mentioned this issue Nov 28, 2021

sql planner creates cross join instead of inner join from select predicates #1293

Closed

andygrove closed this as completed Jan 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Address performance/execution plan of TPCH query 9 #77

Address performance/execution plan of TPCH query 9 #77

Dandandan commented Apr 25, 2021 •

edited

Loading

pjmore commented May 13, 2021

andygrove commented Jan 4, 2023

Address performance/execution plan of TPCH query 9 #77

Address performance/execution plan of TPCH query 9 #77

Comments

Dandandan commented Apr 25, 2021 • edited Loading

pjmore commented May 13, 2021

Thorough version

Less thorough version

andygrove commented Jan 4, 2023

Dandandan commented Apr 25, 2021 •

edited

Loading