Join estimation algos #2212
Conversation
Compare bbfc312 to b239ec0 (…ver into max/join-estimation)
enginetest/histogram_test.go (Outdated)
	runSuite(t, tests, 1000, 10, false)
}

func runSuite(t *testing.T, tests []statsTest, rowCnt, bucketCnt int, debug bool) {
Can you add a docstring explaining what runSuite is evaluating? I gather that it's getting the histogram of the join between the tables, but what is it asserting about the histogram?
	}
}

func testHistogram(ctx *sql.Context, table *plan.ResolvedTable, fields []int, buckets int) ([]sql.HistogramBucket, error) {
This feels like a lot of logic for a test. I'm worried we're going to spend a lot of time testing the test. Is there not some equivalent code that GMS uses to compute histograms?
The Dolt side is unique, and I made the GMS side reservoir sampling to skip steps while bootstrapping. We could probably use this one for the GMS implementation as a default; I doubt anyone will earnestly use the memory stats soon.
	expRight []sql.HistogramBucket
}{
	{
		left: []sql.HistogramBucket{
What determines how many buckets in the expected output? I was hoping the tests would make it clear but I'm not sure why, for instance, there's no 0 upper bound in the output of the second test.
Related to the other comment: I don't know how to split (-infinity, 10) into (-infinity, 0) and (0, 10). So I extend (-infinity, 0) to (-infinity, 10) by stealing (0, 10) from (0, 50).
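The stealing step described here can be sketched with a simplified bucket type (upper bound only, uniform distribution within a bucket assumed). This is illustrative, not the PR's actual code:

```go
package main

import "fmt"

// bucket is a simplified stand-in for sql.HistogramBucket: rows counted
// up to (and including) Bound, with no explicit lower bound.
type bucket struct {
	RowCnt float64
	Bound  float64
}

// extendFirst moves the fraction of the second bucket that falls at or
// below newBound into the first bucket, so the first bucket's upper
// bound becomes newBound — i.e. stealing (0, 10) from (0, 50) to turn
// (-infinity, 0) into (-infinity, 10). Uniformity within the second
// bucket is assumed when computing the transferred share.
func extendFirst(h []bucket, newBound float64) []bucket {
	if len(h) < 2 || newBound <= h[0].Bound {
		return h
	}
	lo, hi := h[0].Bound, h[1].Bound
	frac := (newBound - lo) / (hi - lo) // uniformity assumption
	moved := frac * h[1].RowCnt
	h[0].RowCnt += moved
	h[0].Bound = newBound
	h[1].RowCnt -= moved
	return h
}

func main() {
	h := []bucket{{RowCnt: 10, Bound: 0}, {RowCnt: 10, Bound: 50}}
	h = extendFirst(h, 10)
	fmt.Println(h[0].RowCnt, h[0].Bound, h[1].RowCnt) // 12 10 8
}
```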
Hmm... could we avoid this by requiring that the first bucket has a count of 0 (and thus establishes a lower bound for all buckets)? Or by using a struct { lowerBound sql.Row, buckets []sql.HistogramBucket }?
I'm not against tracking a lower bound, I just didn't see that in other DBs' implementations, so I was suspicious of whether it was practical. Pretty easy improvement later if so.
I ended up having to implement lower bounds for the blog demo to work, it makes a big difference. On the Dolt side I'm going to leave the bound in-memory, storing it as a row with the rest of histograms was a bit awkward.
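The suggested shape — a histogram that carries its lower bound alongside the bucket list — could look like the following sketch. Types and field names are illustrative, not the real sql.Row / sql.HistogramBucket:

```go
package main

import "fmt"

// bucket is a simplified stand-in for sql.HistogramBucket.
type bucket struct {
	RowCnt float64
	Bound  float64 // upper bound of this bucket
}

// histogram pairs an explicit lower bound with the bucket list, along
// the lines of the struct proposed in the review. Keeping the bound
// in-memory (rather than stored as a row) matches the approach
// described above.
type histogram struct {
	lowerBound float64
	buckets    []bucket
}

// width is well-defined for every bucket once the histogram has an
// explicit lower edge — including the first bucket, which otherwise
// starts at -infinity.
func (h histogram) width(i int) float64 {
	if i == 0 {
		return h.buckets[0].Bound - h.lowerBound
	}
	return h.buckets[i].Bound - h.buckets[i-1].Bound
}

func main() {
	h := histogram{lowerBound: 0, buckets: []bucket{{10, 10}, {10, 50}}}
	fmt.Println(h.width(0), h.width(1)) // 10 40
}
```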
	exp sql.Histogram
}{
	{
		left: []sql.HistogramBucket{
I want to make sure I'm understanding this example right.
On the left side of the merge, there are 20 distinct values, assumed to be uniformly distributed between 0 and 10.
On the right side of the merge, there are 10 distinct values, assumed to be uniformly distributed between 0 and 20.
We expect the histogram for the merge to say that there are 10 distinct values uniformly distributed between 0 and 10.
This doesn't square with my intuition, since I would expect only half of the values from the right would be possible join candidates, so the expected number of rows should be at most 5.
I may not be properly understanding what the histogram represents or how it's being computed though.
I think what you are suggesting is that we could truncate the RHS to a bound value of 10, and delete half of the contents. But the valid ranges are negative infinity to bound value. Most of the keys might be negative, all of the keys could overlap, or none could. Different databases handle the first bucket special case differently, I haven't had a lot of time to test alternatives.
Ah right, these have -infinity as the minimum bound. The result here makes sense then.
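For reference, a common textbook estimate for the rows an inner join produces from two aligned histogram buckets divides the product of the row counts by the larger distinct count. A hedged sketch of that formula — not necessarily the exact one this PR implements, and ignoring MCV adjustments:

```go
package main

import "fmt"

// joinBucketRows estimates inner-join output rows for two aligned
// buckets using rowsL * rowsR / max(distinctL, distinctR). The
// assumption is that the smaller distinct set is contained in the
// larger one, with values uniformly distributed.
func joinBucketRows(rowsL, distinctL, rowsR, distinctR float64) float64 {
	nd := distinctL
	if distinctR > nd {
		nd = distinctR
	}
	if nd == 0 {
		return 0
	}
	return rowsL * rowsR / nd
}

func main() {
	// 20 rows over 10 distinct values joined with 10 distinct rows.
	fmt.Println(joinBucketRows(20, 10, 10, 10)) // 20
}
```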
sql/stats/join_test.go (Outdated)
},
expLeft: []sql.HistogramBucket{
	&Bucket{RowCnt: 12, DistinctCnt: 12, BoundVal: sql.Row{10}, BoundCnt: 1},
	&Bucket{RowCnt: 6, DistinctCnt: 6, BoundVal: sql.Row{20}, BoundCnt: 1},
Shouldn't this be 2 for the row bound by 20, and 6 for the row bound by 50? If I'm reading left correctly, then the range from 0 to 50 has a density of 2 rows per 10 values. So the bucket with a range of 10 (10-20) should have 2, and the bucket with a range of 30 (20-50) should have 6.
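The arithmetic in that comment can be checked with a small uniform-density helper (illustrative only; not code from the PR):

```go
package main

import "fmt"

// rowsInRange estimates how many of total rows fall in [lo, hi) when
// the rows are spread uniformly over [min, max) — the per-bucket
// arithmetic walked through above (2 rows per 10 values over 0..50).
func rowsInRange(total, min, max, lo, hi float64) float64 {
	return total * (hi - lo) / (max - min)
}

func main() {
	// 10 rows uniform over (0, 50): density is 2 rows per 10 values.
	fmt.Println(rowsInRange(10, 0, 50, 10, 20)) // 2
	fmt.Println(rowsInRange(10, 0, 50, 20, 50)) // 6
}
```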
Yeah, you're right, I'll see what's going on there.
Looks like I swapped the cut fractions, good eye.
	expRight []sql.HistogramBucket
}{
	{
		left: []sql.HistogramBucket{
	&Bucket{RowCnt: 10, DistinctCnt: 10, BoundVal: sql.Row{30}, BoundCnt: 1},
	&Bucket{RowCnt: 10, DistinctCnt: 10, BoundVal: sql.Row{50}, BoundCnt: 1},
},
expLeft: []sql.HistogramBucket{
Similar to above, couldn't the result of this be:
&Bucket{RowCnt: 10, DistinctCnt: 10, BoundVal: sql.Row{0}, BoundCnt: 1},
&Bucket{RowCnt: 10, DistinctCnt: 10, BoundVal: sql.Row{10}, BoundCnt: 1},
&Bucket{RowCnt: 23, DistinctCnt: 3, BoundVal: sql.Row{20}, BoundCnt: 1},
&Bucket{RowCnt: 3, DistinctCnt: 3, BoundVal: sql.Row{30}, BoundCnt: 1},
&Bucket{RowCnt: 3, DistinctCnt: 3, BoundVal: sql.Row{40}, BoundCnt: 1},
sql/stats/join_test.go (Outdated)
	&Bucket{RowCnt: 20, DistinctCnt: 10, BoundVal: sql.Row{10}, McvVals: []sql.Row{{1}, {2}}, McvsCnt: []uint64{5, 5}, BoundCnt: 1},
},
right: []sql.HistogramBucket{
	&Bucket{RowCnt: 10, DistinctCnt: 10, BoundVal: sql.Row{10}, McvVals: []sql.Row{{2}}, McvsCnt: []uint64{4}, BoundCnt: 1},
My first thought looking at this test was: "Every row in the right is distinct, so there can't be more than 20 rows in the join." Then I was surprised that the expected result was 30.
I think the bucket here is impossible, since it's not possible to have 10 values, all distinct, and also have one value with a count of 4.
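A consistency check along these lines would catch the impossible bucket — a fully distinct bucket cannot contain an MCV with a count greater than 1, and MCV rows cannot exceed total rows. This uses simplified, hypothetical types, not the real Bucket struct:

```go
package main

import "fmt"

// mcvBucket is a simplified stand-in for the fields under discussion.
type mcvBucket struct {
	RowCnt, DistinctCnt uint64
	McvCnts             []uint64
}

// valid reports whether the bucket's MCV counts are internally
// consistent: the MCV row total must fit within RowCnt, and when
// RowCnt == DistinctCnt (every row distinct) no MCV may repeat.
func valid(b mcvBucket) bool {
	var mcvTotal uint64
	for _, c := range b.McvCnts {
		if b.RowCnt == b.DistinctCnt && c > 1 {
			// 10 rows, all distinct, but one value appears 4 times:
			// exactly the contradiction flagged in the review.
			return false
		}
		mcvTotal += c
	}
	return mcvTotal <= b.RowCnt
}

func main() {
	bad := mcvBucket{RowCnt: 10, DistinctCnt: 10, McvCnts: []uint64{4}}
	fmt.Println(valid(bad)) // false
}
```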
Yeah this tracks, will fix
Code and tests related to estimating output distributions for joins. Limited to numeric types.