Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use DESCRIBE ANALYZE / EXPLAIN ANALYZE to display stats data about joins. #2248

Merged
merged 24 commits into from
Jan 18, 2024

Conversation

nicktobey
Copy link
Contributor

@nicktobey nicktobey commented Jan 10, 2024

(In MySQL, DESCRIBE and EXPLAIN are aliases. GMS pretty consistently uses DESCRIBE internally, so I prefer that here.)

This PR has a couple different pieces:

sql.DescribeStats

The most important part of this PR is the mixin struct sql.DescribeStats, defined in sql/describe.go. By embedding this struct within a Node or Expression, the node will store information about the cost estimate and row count estimate of the plan that was used to generate the node. This information can then be included in the Node/Expressions's string representation by calling the method DescribeStats.GetDescribeStatsString.

Currently, only Join nodes embed this mixin.

There's two ways to surface this information during a query.

  1. DESCRIBE format=estimates SELECT ... will include the row estimates and cost estimates of every node that supports this.
  2. DESCRIBE ANALYZE SELECT ... will execute the plan, and then display the actual row counts alongside the estimates.

Note that DESCRIBE ANALYZE is only permitted for queries that don't have side effects. For instance, DESCRIBE ANALYZE INSERT ... will fail with an error.

sql.Describable

Previously, each Node or Expression had two different string representations, implemented as String() and DebugString(). DESCRIBE could emit either by using format=tree (the default) or format=debug. This isn't really a scalable solution, and the number of levers we want to control plan formatting may increase further over time.

So this PR also adds a new interface for controlling how plans trees are displayed: sql.Describable. It has one method: Describe(options sql.DescribeOptions) string. DescribeOptions is a struct that contains all options that control how plans are displayed, and more options can be added as needed.

These options are set via the value of the format= parameter in the describe query, which now accepts an underscore-separated list of options. Eg, format=debug_estimates.

When writing the string representation of a node with children, you should call sql.Describe(node, options) instead of calling node.String() or sql.DebugString(node). This function calls Describe on the node if it exists, and otherwise falls back on DebugString or String. This allows for incremental support of the sql.Describable interface as needed instead of needing to add support for every node type right out of the gate.

How to add support for additional node types

While only Join nodes are currently supported, it's very easy to add support for other types. Literally all you have to do is embed the sql.DescribeStats struct in the node, and implement the Describe method, like so:

type DemoNode struct {
    sql.DescribeStats,
    ...
}

func (d DemoNode) Describe(options sql.DescribeOptions) string {
	p := sql.NewTreePrinter()
	_ = p.WriteNode("Demo")
	_ = p.WriteChildren(sql.Describe(d.Child, options))
	return p.String()
}

Testing

In order to test this functionality, every plan test generated by plangen.go now has three different expected plans: One for debug strings, one for debug + estimates, and one for debug + estimates + analyze.

This may turn out to be too noisy when we make changes to the coster, but the plan tests are already noisy wrt to costing changes. If this turns out to be too much of a hassle we can limit this to just query_plans.go instead of all the plan tests.

nicktobey and others added 21 commits January 8, 2024 09:55
…s guarentees that the options are passed to their child nodes.
@@ -49,6 +49,62 @@ SELECT c_discount, c_last, c_credit, w_tax FROM customer2, warehouse2 WHERE w_id
" ├─ name: customer2\n" +
" └─ columns: [c_id c_d_id c_w_id c_last c_credit c_discount]\n" +
"",
ExpectedEstimates: "Project\n" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main thing I have opinions on is how much info EXPLAIN ANALYZE should include. Most of the time, fine grained details can be provided by a regular explain, and we want the information that helps us understand where queries are expensive/making expensive mistakes.

So for this query (picked this one just because it's been a pain in my ass), other DBs might do something like discard the project node, remove most relation and filter details, and probably give cardinality estimates for filters and table relations:

LookupJoin (estimated rows=0.000 rows=22500)
├─ IndexedTableAccess(warehouse2) (estimated rows=0.000 rows=1)
│   └─  index: [warehouse2.w_id]
└─ Filter (estimated rows=0.000 rows=22500)
    ├─ (non-debug expression string one-liner)
    └─ IndexedTableAccess(customer2) (estimated rows=0.000 rows=100000)
       └─ index: [customer2.c_w_id,customer2.c_d_id,customer2.c_id]

If the EXPLAIN ANALYZE flags a bad index or weird join order, we can always dig into the verbose EXPLAIN to understand better what's going on.

Do you think we should include row count estimates and the incremental processing cost? We don't really have a notion of total processing cost for a query yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the majority of our plan tests could be replaced with the simpler format imo. I get annoyed editing join_op_tests.go as often as I do, but that suite has been enough for me to not make mistakes refactoring join costing so far. So all of the TPC-X plans, and any join plans in the regular suite as a simpler format would be better. I like the duplication for integration plans, but I'd maybe put them in a different file to make them easier to review in GitHub? The differences just get so long that they are hard to review.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I previously had it so that these tests used the full debug output, but that's not really necessary for the EXPLAIN tests, so I fixed it.

I'll look into paring down info that's in the non-debug plans but might not be needed her, like Projects and columns and ranges and whatnot.

Copy link
Member

@zachmu zachmu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a few comments

func ExecuteNode(ctx *sql.Context, engine QueryEngine, node sql.Node) error {
iter, err := engine.EngineAnalyzer().ExecBuilder.Build(ctx, node, nil)
if err != nil {
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return err?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not, comment why we're ignoring the error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -52,6 +52,24 @@ import (
"github.com/dolthub/go-mysql-server/test"
)

// ExecuteNode iterates over a node's iterator until it's exhausted.
// This is useful for populating actual row counts for `DESCRIBE ANALYZE`.
func ExecuteNode(ctx *sql.Context, engine QueryEngine, node sql.Node) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We call this DrainIter in other places

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored this to call DrainIter. It's not quite the same because this method also builds the iter, and it's named after it's purpose: to resolve all side effects from the query. The fact that we do that by building and draining an iter is an implementation detail.

Describe(options DescribeOptions) string
}

func Describe(n fmt.Stringer, options DescribeOptions) string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method doc here, describe what the method is for

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

@nicktobey nicktobey merged commit 94a67fb into main Jan 18, 2024
7 checks passed
@nicktobey nicktobey deleted the nicktobey/analyze branch January 18, 2024 18:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants