Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address performance/execution plan of TPCH query 9 #77

Closed
Dandandan opened this issue Apr 25, 2021 · 2 comments
Closed

Address performance/execution plan of TPCH query 9 #77

Dandandan opened this issue Apr 25, 2021 · 2 comments
Labels
bug Something isn't working

Comments

@Dandandan
Copy link
Contributor

Dandandan commented Apr 25, 2021

Describe the bug
TPC-H Query 9 consumes a lot of memory and takes ~8s in memory on a 8 core machine on SF=1 in memory (with 100%cpu usage across cores) to execute. That likely has to do with the plan misses a equi-join and turns it into a cross join.

To Reproduce
Run query 9, check plan and execution time.

Expected behavior
The query is expected to finish faster (<1s) and shouldn't need a cross join.

Additional context
N/A

@Dandandan Dandandan added the bug Something isn't working label Apr 25, 2021
@pjmore
Copy link
Contributor

pjmore commented May 13, 2021

So I took a look at this and I have two solutions one which I believe always finds all possible inner joins but runs in N^2 and one that works for this case and should work for most others that runs in N LogN where N is the number of plans in the select statement. Is the N^2 runtime and the added petgraph dependency alright? I don't want datafusion to show up on Accidentally quadratic because of me haha. I've attached a brief description of each method incase there are any obvious improvements that can be made.

Thorough version

Code

  1. Create undirected graph of plans where each edge represents a join connection.
  2. Find the connected components of the graph.
  3. For each connected component find the plan with the schema that has the fewest number of columns.
  4. Cross join all of the previously selected plans into a new node, removing the selected nodes from the graph.
  5. Starting from an arbitrary node, traverse the graph performing inner joins on all remaining nodes.

Less thorough version

Code

  1. Iterate over plans counting the number of join key hits in each schema.
  2. Sort plans so that plans that have no key hits are at the beginning and then the vector is sorted in descending order. E.g. the array (0,0,1,2,3,4) would be sorted to (0,0, 4,3,2,1). The rationale being that cross joins should prefer to run early to limit the amount of data that needs to be duplicated and that there are more likely to be joins to plans with many join keys.
  3. Build the plan as was done previously.

The less thorough version was able to find the inner join that was missed for query 9 but there is still the possibility to miss inner joins. For example if the tables have the join relations as seen below, the order of the plans after sorting would be (B,F,D,A,C,E,G). Which means that the plan would generate a cross join between B and F as they share no join key pairs, missing the inner join through D.

┌─┐                ┌─┐
│A│                │E│
└┬┘                └┬┘
 │                  │
┌┴┐      ┌─┐       ┌┴┐
│B├──────┤D├───────┤F│
└┬┘      └─┘       └┬┘
 │                  │
┌┴┐                ┌┴┐
│C│                │G│
└─┘                └─┘

@andygrove
Copy link
Member

I believe this is now resolved so will close this. Feel free to reopen if I am mistaken.

Here is the logical plan for q9 in master:

Sort: profit.nation ASC NULLS LAST, profit.o_year DESC NULLS FIRST
  Projection: profit.nation, profit.o_year, SUM(profit.amount) AS sum_profit
    Aggregate: groupBy=[[profit.nation, profit.o_year]], aggr=[[SUM(profit.amount)]]
      SubqueryAlias: profit
        Projection: nation.n_name AS nation, datepart(Utf8("YEAR"), orders.o_orderdate) AS o_year, CAST(lineitem.l_extendedprice AS Decimal128(38, 4)) * CAST(Decimal128(Some(100),23,2) - CAST(lineitem.l_discount AS Decimal128(23, 2)) AS Decimal128(38, 4)) - CAST(partsupp.ps_supplycost * lineitem.l_quantity AS Decimal128(38, 4)) AS amount
          Projection: lineitem.l_quantity, lineitem.l_extendedprice, lineitem.l_discount, partsupp.ps_supplycost, orders.o_orderdate, nation.n_name
            Inner Join: supplier.s_nationkey = nation.n_nationkey
              Inner Join: lineitem.l_orderkey = orders.o_orderkey
                Inner Join: lineitem.l_suppkey = partsupp.ps_suppkey, lineitem.l_partkey = partsupp.ps_partkey
                  Inner Join: lineitem.l_suppkey = supplier.s_suppkey
                    Inner Join: part.p_partkey = lineitem.l_partkey
                      Filter: part.p_name LIKE Utf8("%green%")
                        TableScan: part projection=[p_partkey, p_name]
                      TableScan: lineitem projection=[l_orderkey, l_partkey, l_suppkey, l_quantity, l_extendedprice, l_discount]
                    TableScan: supplier projection=[s_suppkey, s_nationkey]
                  TableScan: partsupp projection=[ps_partkey, ps_suppkey, ps_supplycost]
                TableScan: orders projection=[o_orderkey, o_orderdate]
              TableScan: nation projection=[n_nationkey, n_name]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants