Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge joins populate join stats #2247

Merged
merged 10 commits into from
Jan 20, 2024
Merged

Merge joins populate join stats #2247

merged 10 commits into from
Jan 20, 2024

Conversation

max-hoffman
Copy link
Contributor

@max-hoffman max-hoffman commented Jan 9, 2024

When merge joins are a join operator for a memo group, use the two indexes in the merge to estimate the join cardinality. Small updates so that join cardinality estimates work in the coster. A few tests that make use of join statistics. The tests are affected both by stat estimates and costing methodology. It's a bit hard to separate the two, since more accurate stat estimates so often identify issues with costing. The join statistic tests are subject to shifting based on whether the smallest table is estimated to be smaller than the smallest join cardinality estimate. Better tests would be less subject to noise. Tests for avoiding anti-patterns for specific join operators would also be useful.

@max-hoffman max-hoffman changed the title Merge join's populate join stats Merge joins populate join stats Jan 12, 2024
Base automatically changed from max/join-estimation to main January 17, 2024 02:24
Joins with a merge join option and table statistics for
both merge join indexes will use the histogram merging
logic for estimating join cardinality.
@max-hoffman max-hoffman marked this pull request as ready for review January 18, 2024 20:40
if grp == nil {
return m.NewExprGroup(rel)
grp = m.NewExprGroup(rel)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be simplified as:

rel.Injective = isInjectiveMerge(rel, leftCompareExprs, rightCompareExprs)
if grp == nil {
	return m.NewExprGroup(rel)
}
rel.CmpCnt = len(leftCompareExprs)
...

func floatMax(i, j float64) float64 {
if i > j {
return i
func InterpolateNewCounts(from, to sql.Statistic) sql.Statistic {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a docstring for this funciton?

var mergeStats sql.Statistic
var n RelExpr = rel
var done bool
for n != nil && !done {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this end up being O(N^2) for a group of N expressions? (Like, if it gets called for each expression as it gets added to the group, then the first will have 1 iteration, the second 2, and so on...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CardMemoGroups is called once per join planning I think, so at most this should only traverse the group once. It could be shorter even, if we find an injective lookup there's no more info to check.

if lIdx, ok := n.Left.First.(*IndexScan); ok {
leftChildStats = lIdx.Stats
}
if rIdx, ok := n.Right.First.(*IndexScan); ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these aren't index scans, we just don't use their stats?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The merge join children have index scans when we convert a filter into a range scan. If there are costed range scans, we either use the filtered index scan (if it's the same as the merge join), or apply the selectivity estimate from the range scan to the merge index buckets.

rightChildStats = rIdx.Stats
}

// single prefix always safe for merge join
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this comment elaborate? It's not clear how it relates to the following line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More of a todo maybe. I had a hard time finding how many index columns were equality prefix comparable between the two indexes, and recording MergeJoin.CmpCnt had some issues when I was testing. One is always safe so I went with that.

switch n := n.(type) {
case *LookupJoin:
injective = injective || n.Injective
case *MergeJoin:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section could use some more comments explaining why certain checks / assignments are done.

@max-hoffman max-hoffman merged commit f795b05 into main Jan 20, 2024
7 checks passed
@Hydrocharged Hydrocharged deleted the max/merge-join-stats branch February 7, 2024 13:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants