Use `DESCRIBE ANALYZE` / `EXPLAIN ANALYZE` to display stats data about joins. #2248

nicktobey · 2024-01-10T00:06:39Z

(In MySQL, DESCRIBE and EXPLAIN are aliases. GMS pretty consistently uses DESCRIBE internally, so I prefer that here.)

This PR has a couple of different pieces:

`sql.DescribeStats`

The most important part of this PR is the mixin struct sql.DescribeStats, defined in sql/describe.go. A Node or Expression that embeds this struct stores the cost estimate and row count estimate of the plan that was used to generate it. This information can then be included in the Node's or Expression's string representation by calling the method DescribeStats.GetDescribeStatsString.

Currently, only Join nodes embed this mixin.
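To make the mixin idea concrete, here is a minimal sketch in plain Go. It does not use the real sql.DescribeStats: the type describeStats, its fields, the method statsString, and the output format are all hypothetical stand-ins chosen for illustration.

```go
package main

import "fmt"

// describeStats is a hypothetical stand-in for sql.DescribeStats.
// The real field names and method signatures in GMS may differ.
type describeStats struct {
	EstimatedRows float64
	Cost          float64
	ActualRows    int
}

// statsString renders the estimates, plus the actual row count when
// analyze is true. The format is an assumption, not GMS's real output.
func (s describeStats) statsString(analyze bool) string {
	out := fmt.Sprintf("(estimated cost=%.3f) (estimated rows=%.3f)", s.Cost, s.EstimatedRows)
	if analyze {
		out += fmt.Sprintf(" (actual rows=%d)", s.ActualRows)
	}
	return out
}

// joinNode embeds the mixin, so the mixin's fields and methods are
// promoted onto joinNode without any forwarding code.
type joinNode struct {
	describeStats
	Op string
}

func main() {
	j := joinNode{
		describeStats: describeStats{EstimatedRows: 100, Cost: 12.5, ActualRows: 97},
		Op:            "LookupJoin",
	}
	fmt.Println(j.Op, j.statsString(true))
}
```

Because Go promotes the embedded struct's methods, a node type picks up the stats storage and formatting just by listing the mixin as an anonymous field.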

There are two ways to surface this information during a query:

DESCRIBE format=estimates SELECT ... will include the row estimates and cost estimates of every node that supports this.
DESCRIBE ANALYZE SELECT ... will execute the plan, and then display the actual row counts alongside the estimates.

Note that DESCRIBE ANALYZE is only permitted for queries that don't have side effects. For instance, DESCRIBE ANALYZE INSERT ... will fail with an error.
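As a rough illustration of how executing a plan can yield actual row counts, the sketch below wraps an iterator in a counter and drains it. Everything here is a simplified assumption: rowIter, sliceIter, countingIter, and drain are hypothetical names, and the real GMS iterator interface takes a context and has additional methods.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// rowIter is a simplified, hypothetical iterator interface.
// Next returns io.EOF once the rows are exhausted.
type rowIter interface {
	Next() ([]any, error)
}

// sliceIter serves rows from an in-memory slice.
type sliceIter struct {
	rows [][]any
	pos  int
}

func (s *sliceIter) Next() ([]any, error) {
	if s.pos >= len(s.rows) {
		return nil, io.EOF
	}
	row := s.rows[s.pos]
	s.pos++
	return row, nil
}

// countingIter wraps another iterator and counts the rows it yields.
type countingIter struct {
	inner rowIter
	rows  int
}

func (c *countingIter) Next() ([]any, error) {
	row, err := c.inner.Next()
	if err == nil {
		c.rows++
	}
	return row, err
}

// drain reads the iterator to exhaustion. It treats io.EOF as success
// and returns any other error to the caller.
func drain(it rowIter) error {
	for {
		_, err := it.Next()
		if errors.Is(err, io.EOF) {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	counted := &countingIter{inner: &sliceIter{rows: [][]any{{1}, {2}, {3}}}}
	if err := drain(counted); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("actual rows:", counted.rows)
}
```

Once the wrapped iterator has been drained, the count it accumulated can be printed next to the estimate for the same node.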

`sql.Describable`

Previously, each Node or Expression had two different string representations, implemented as String() and DebugString(). DESCRIBE could emit either by using format=tree (the default) or format=debug. This isn't really a scalable solution, and the number of levers we want to control plan formatting may increase further over time.

So this PR also adds a new interface for controlling how plan trees are displayed: sql.Describable. It has one method: Describe(options sql.DescribeOptions) string. DescribeOptions is a struct that contains all options that control how plans are displayed, and more options can be added as needed.

These options are set via the value of the format= parameter in the describe query, which now accepts an underscore-separated list of options. For example, format=debug_estimates.
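A minimal sketch of how an underscore-separated format value could be turned into an options struct. This is not the parser from this PR: describeOptions, its fields, and parseFormat are hypothetical, and rejecting unknown options with an error is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// describeOptions is a hypothetical stand-in for sql.DescribeOptions.
type describeOptions struct {
	Debug     bool
	Estimates bool
}

// parseFormat splits the format value on underscores and switches on
// each part. "tree" (or an empty value) leaves the defaults in place.
func parseFormat(format string) (describeOptions, error) {
	var opts describeOptions
	for _, part := range strings.Split(strings.ToLower(format), "_") {
		switch part {
		case "", "tree":
			// Default formatting: nothing to set.
		case "debug":
			opts.Debug = true
		case "estimates":
			opts.Estimates = true
		default:
			return describeOptions{}, fmt.Errorf("unknown format option %q", part)
		}
	}
	return opts, nil
}

func main() {
	opts, err := parseFormat("debug_estimates")
	fmt.Println(opts.Debug, opts.Estimates, err)
}
```

Keeping every flag in one struct means a new option only needs a new field and a new case, rather than another String()-style method on every node.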

When writing the string representation of a node with children, you should call sql.Describe(node, options) instead of calling node.String() or sql.DebugString(node). This function calls Describe on the node if it exists, and otherwise falls back on DebugString or String. This allows for incremental support of the sql.Describable interface as needed instead of needing to add support for every node type right out of the gate.
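The fallback behavior can be sketched with ordinary interface assertions. The types below are hypothetical stand-ins, and choosing between DebugString and String based on a Debug option is an assumption about how the fallback is selected.

```go
package main

import "fmt"

// describeOptions is a hypothetical stand-in for sql.DescribeOptions.
type describeOptions struct {
	Debug     bool
	Estimates bool
}

// describable mirrors the shape of the interface described above.
type describable interface {
	Describe(options describeOptions) string
}

// debugStringer is a hypothetical name for nodes with a debug form.
type debugStringer interface {
	DebugString() string
}

// describe prefers Describe when the node implements it, then falls
// back to DebugString (assumed: only when Debug is set), then String.
func describe(n fmt.Stringer, options describeOptions) string {
	if d, ok := n.(describable); ok {
		return d.Describe(options)
	}
	if options.Debug {
		if d, ok := n.(debugStringer); ok {
			return d.DebugString()
		}
	}
	return n.String()
}

// legacyNode only has String.
type legacyNode struct{}

func (legacyNode) String() string { return "Legacy" }

// debugNode has String and DebugString, but not Describe.
type debugNode struct{}

func (debugNode) String() string      { return "Node" }
func (debugNode) DebugString() string { return "Node(debug)" }

// joinNode implements Describe and passes the options to its child.
type joinNode struct{ child fmt.Stringer }

func (j joinNode) String() string { return j.Describe(describeOptions{}) }

func (j joinNode) Describe(o describeOptions) string {
	s := "Join"
	if o.Estimates {
		s += " (with estimates)"
	}
	return s + "[" + describe(j.child, o) + "]"
}

func main() {
	opts := describeOptions{Debug: true, Estimates: true}
	fmt.Println(describe(joinNode{child: debugNode{}}, opts))
}
```

The parent passes the same options down through describe, so a child that has not been migrated yet still renders through its older methods.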

How to add support for additional node types

While only Join nodes are currently supported, it's very easy to add support for other types. Literally all you have to do is embed the sql.DescribeStats struct in the node, and implement the Describe method, like so:

type DemoNode struct {
    sql.DescribeStats
    ...
}

func (d DemoNode) Describe(options sql.DescribeOptions) string {
	p := sql.NewTreePrinter()
	_ = p.WriteNode("Demo")
	_ = p.WriteChildren(sql.Describe(d.Child, options))
	return p.String()
}

Testing

In order to test this functionality, every plan test generated by plangen.go now has three different expected plans: One for debug strings, one for debug + estimates, and one for debug + estimates + analyze.

This may turn out to be too noisy when we make changes to the coster, but the plan tests are already noisy with respect to costing changes. If this turns out to be too much of a hassle we can limit this to just query_plans.go instead of all the plan tests.

max-hoffman · 2024-01-10T20:18:27Z

enginetest/queries/tpcc_plans.go

@@ -49,6 +49,62 @@ SELECT c_discount, c_last, c_credit, w_tax FROM customer2, warehouse2 WHERE w_id
 			"                 ├─ name: customer2\n" +
 			"                 └─ columns: [c_id c_d_id c_w_id c_last c_credit c_discount]\n" +
 			"",
+		ExpectedEstimates: "Project\n" +


The main thing I have opinions on is how much info EXPLAIN ANALYZE should include. Most of the time, fine-grained details can be provided by a regular explain, and we want the information that helps us understand where queries are expensive or making expensive mistakes.

So for this query (picked this one just because it's been a pain in my ass), other DBs might do something like discard the project node, remove most relation and filter details, and probably give cardinality estimates for filters and table relations:

LookupJoin (estimated rows=0.000 rows=22500)
 ├─ IndexedTableAccess(warehouse2) (estimated rows=0.000 rows=1)
 │   └─ index: [warehouse2.w_id]
 └─ Filter (estimated rows=0.000 rows=22500)
     ├─ (non-debug expression string one-liner)
     └─ IndexedTableAccess(customer2) (estimated rows=0.000 rows=100000)
         └─ index: [customer2.c_w_id,customer2.c_d_id,customer2.c_id]

If the EXPLAIN ANALYZE flags a bad index or weird join order, we can always dig into the verbose EXPLAIN to understand better what's going on.

Do you think we should include row count estimates and the incremental processing cost? We don't really have a notion of total processing cost for a query yet.

I think the majority of our plan tests could be replaced with the simpler format. I get annoyed editing join_op_tests.go as often as I do, but that suite has been enough for me to not make mistakes refactoring join costing so far. So all of the TPC-X plans, and any join plans in the regular suite, would be better in a simpler format. I like the duplication for integration plans, but I'd maybe put them in a different file to make them easier to review in GitHub? The differences just get so long that they are hard to review.

Good point. I previously had it so that these tests used the full debug output, but that's not really necessary for the EXPLAIN tests, so I fixed it.

I'll look into paring down info that's in the non-debug plans but might not be needed here, like Projects and columns and ranges and whatnot.

zachmu

LGTM, just a few comments

zachmu · 2024-01-10T19:43:59Z

enginetest/enginetests.go

+func ExecuteNode(ctx *sql.Context, engine QueryEngine, node sql.Node) error {
+	iter, err := engine.EngineAnalyzer().ExecBuilder.Build(ctx, node, nil)
+	if err != nil {
+		return nil


return err?

If not, comment why we're ignoring the error

zachmu · 2024-01-10T19:44:41Z

enginetest/enginetests.go

@@ -52,6 +52,24 @@ import (
 	"github.com/dolthub/go-mysql-server/test"
 )

+// ExecuteNode iterates over a node's iterator until it's exhausted.
+// This is useful for populating actual row counts for `DESCRIBE ANALYZE`.
+func ExecuteNode(ctx *sql.Context, engine QueryEngine, node sql.Node) error {


We call this DrainIter in other places

I refactored this to call DrainIter. It's not quite the same because this method also builds the iter, and it's named after its purpose: to resolve all side effects from the query. The fact that we do that by building and draining an iter is an implementation detail.

zachmu · 2024-01-10T21:34:34Z

sql/describe.go

+	Describe(options DescribeOptions) string
+}
+
+func Describe(n fmt.Stringer, options DescribeOptions) string {


Method doc here, describe what the method is for

nicktobey and others added 21 commits January 8, 2024 09:55

Create ExplainStats struct for storing stats on nodes that could be p…

04b4497

…rinted in an `Explain` output.

Store whether the EXPLAIN query is an EXPLAIN ANALYZE query in th…

ecb1daa

…e DescribeQuery node.

When building nodes, populate the Explain stats if the node has one.

0a5ff7e

Include runtime stats in ExplainStats.

df4a8c0

Create new iterator wrapper that counts rows and number of times the …

4d6ce5a

…node is iterated over.

For EXPLAIN ANALYZE, iterate over the rows of the query in order to…

6c55bd2

… populate actual row counts.

Create new Describable interface for more fine-grained control over p…

8b48e9c

…lan descriptions.

Implmement Describable and Explainable for Join.

0496770

DescribeQuery now holds DescribeOptions and calls Describe on its…

e740165

… children.

Implmement Describable for Project and TransactionCommittingNode. Thi…

1329918

…s guarentees that the options are passed to their child nodes.

TestQueryPlan takes a DescribeOptions.

0558ace

Add the Describable interface to more node types.

0932ae0

Improve EngineTests to check output for format=estimate and `EXPLAI…

b46c146

…N ANALYZE`

Update QueryPlans with estimates and analysis.

d69a1f3

AMEND Only display actual counts if they're actually requested.

8011749

Add estimates for other auto-generated plans.

51ae2ad

Rename types to consistently use "Describe" over "Explain"

a1e2fe9

Update test.

94a0060

Rename explain.go to describe.go

c3e4822

[ga-format-pr] Run ./format_repo.sh to fix formatting

822b62c

Update memo_gen_test.go

74e2c46

max-hoffman reviewed Jan 10, 2024

zachmu approved these changes Jan 10, 2024

nicktobey added 3 commits January 11, 2024 11:12

For plan tests, don't print debug info for Explain/Analysis plan tests.

f7f9246

ExecuteNode should call DrainIterator

7ad5a7b

Add docstring for sql.Describe.

e60538c

nicktobey merged commit 94a67fb into main Jan 18, 2024
7 checks passed

nicktobey deleted the nicktobey/analyze branch January 18, 2024 18:27

timsutton mentioned this pull request Jan 19, 2024

dolt 1.32.0 Homebrew/homebrew-core#160356

Merged
