Join estimation algos #2212

Merged — merged 27 commits into main from max/join-estimation on Jan 17, 2024
Conversation

max-hoffman (Contributor) commented Dec 21, 2023

Code and tests related to estimating output distributions for joins. Limited to numeric types.

@max-hoffman max-hoffman marked this pull request as ready for review December 28, 2023 16:21
runSuite(t, tests, 1000, 10, false)
}

func runSuite(t *testing.T, tests []statsTest, rowCnt, bucketCnt int, debug bool) {
Contributor:
Can you add a docstring explaining what runSuite is evaluating? I gather that it's getting the histogram of the join between the tables, but what is it asserting about the histogram?

}
}

func testHistogram(ctx *sql.Context, table *plan.ResolvedTable, fields []int, buckets int) ([]sql.HistogramBucket, error) {
Contributor:
This feels like a lot of logic for a test. I'm worried we're going to spend a lot of time testing the test. Is there not some equivalent code that GMS uses to compute histograms?

max-hoffman (Author):

The Dolt side is unique, and I made the GMS side use reservoir sampling to skip steps while bootstrapping. We could probably use this one as the default for the GMS implementation; I doubt anyone will earnestly use the in-memory stats soon.

expRight []sql.HistogramBucket
}{
{
left: []sql.HistogramBucket{
Contributor:

What determines how many buckets are in the expected output? I was hoping the tests would make it clear, but I'm not sure why, for instance, there's no 0 upper bound in the output of the second test.

max-hoffman (Author) commented Jan 9, 2024:

Related to the other comment: I don't know how to split (-infinity, 10) into (-infinity, 0) and (0, 10). So I extend (-infinity, 0) to (-infinity, 10) by stealing the (0, 10) portion from (0, 50).
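The "stealing" step described here amounts to splitting a bucket at an interior cut point under a uniform-distribution assumption. A hypothetical helper sketching that interpolation (not the PR's actual code):

```go
package main

// cutBucket splits a bucket spanning (prev, bound] at cut, assuming rows
// are uniformly distributed across the bucket's value range. It returns
// the row counts attributed to (prev, cut] and (cut, bound]. Hypothetical
// helper illustrating the step described above, not the PR's code.
func cutBucket(prev, bound, cut float64, rowCnt uint64) (below, above uint64) {
	frac := (cut - prev) / (bound - prev)
	below = uint64(float64(rowCnt)*frac + 0.5)
	above = rowCnt - below
	return below, above
}
```

For example, stealing (0, 10] from a 10-row bucket over (0, 50] attributes one fifth of its rows to the stolen range.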

Contributor:

Hmm... could we avoid this by requiring that the first bucket has a count of 0 (and thus establishes a lower bound for all buckets)? Or by using a struct { lowerBound sql.Row; buckets []sql.HistogramBucket }?

max-hoffman (Author) commented Jan 12, 2024:

I'm not against tracking a lower bound; I just didn't see that in other DBs' implementations, so I was suspicious of whether it was practical. It's a pretty easy improvement later if so.

max-hoffman (Author):

I ended up having to implement lower bounds for the blog demo to work; it makes a big difference. On the Dolt side I'm going to leave the bound in-memory; storing it as a row with the rest of the histograms was a bit awkward.
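The shape the reviewer suggested can be sketched as below, with stand-in types for go-mysql-server's sql.Row and sql.HistogramBucket (the real types live in the sql package and carry more fields):

```go
package main

// Row and HistogramBucket stand in for go-mysql-server's sql.Row and
// sql.HistogramBucket; the fields here are a simplified subset.
type Row []interface{}

type HistogramBucket struct {
	RowCnt, DistinctCnt uint64
	BoundVal            Row
}

// boundedHistogram sketches the struct the reviewer suggested: carrying an
// explicit lower bound means the first bucket covers (lowerBound, BoundVal]
// instead of (-infinity, BoundVal], which avoids the first-bucket splitting
// ambiguity discussed above.
type boundedHistogram struct {
	lowerBound Row
	buckets    []HistogramBucket
}
```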

exp sql.Histogram
}{
{
left: []sql.HistogramBucket{
Contributor:

I want to make sure I'm understanding this example right.

On the left side of the merge, there are 20 distinct values, assumed to be uniformly distributed between 0 and 10.

On the right side of the merge, there are 10 distinct values, assumed to be uniformly distributed between 0 and 20.

We expect the histogram for the merge to say that there are 10 distinct values uniformly distributed between 0 and 10.

This doesn't square with my intuition, since I would expect only half of the values from the right would be possible join candidates, so the expected number of rows should be at most 5.

I may not be properly understanding what the histogram represents or how it's being computed though.

max-hoffman (Author):

I think what you are suggesting is that we could truncate the RHS to a bound value of 10, and delete half of the contents. But the valid ranges are negative infinity to bound value. Most of the keys might be negative, all of the keys could overlap, or none could. Different databases handle the first bucket special case differently, I haven't had a lot of time to test alternatives.

Contributor:

Ah right, these have -infinity as the minimum bound. The result here makes sense then.
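The numbers in this exchange are consistent with the textbook per-bucket equi-join estimate, rows_L * rows_R / max(ndv_L, ndv_R): left with 20 rows over 20 distinct values and right with 10 rows over 10 distinct values yields 10. A sketch of that formula, with illustrative names rather than the PR's API:

```go
package main

// joinBucketCard sketches the common per-bucket equi-join cardinality
// estimate: for two buckets over the same value range,
// |L join R| ~= rowsL * rowsR / max(ndvL, ndvR).
// Illustrative names, not the PR's API.
func joinBucketCard(rowsL, ndvL, rowsR, ndvR uint64) uint64 {
	maxNdv := ndvL
	if ndvR > maxNdv {
		maxNdv = ndvR
	}
	if maxNdv == 0 {
		return 0
	}
	return rowsL * rowsR / maxNdv
}
```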

},
expLeft: []sql.HistogramBucket{
&Bucket{RowCnt: 12, DistinctCnt: 12, BoundVal: sql.Row{10}, BoundCnt: 1},
&Bucket{RowCnt: 6, DistinctCnt: 6, BoundVal: sql.Row{20}, BoundCnt: 1},
Contributor:

Shouldn't this be 2 for the row bound by 20, and 6 for the row bound by 50?

If I'm reading left correctly, then the range from 0 to 50 has a density of 2 rows per 10 values. So the bucket with a range of 10 (10-20) should have 2, and the bucket with a range of 30 (20-50) should have 6.

max-hoffman (Author):

Yeah, you're right, I'll see what's going on there.

max-hoffman (Author):

Looks like I swapped the cut fractions, good eye.
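The reviewer's arithmetic above assumes uniform density within the bucket: 10 rows over (0, 50] is 0.2 rows per value, so the (10, 20] slice gets 2 rows and (20, 50] gets 6. A hypothetical helper making that calculation explicit (not the PR's code):

```go
package main

// rowsInRange estimates how many of a bucket's rows fall in (lo, hi],
// assuming the bucket spans (bLo, bHi] with uniformly distributed rows.
// Hypothetical helper mirroring the reviewer's arithmetic above.
func rowsInRange(bLo, bHi, lo, hi float64, rowCnt uint64) uint64 {
	density := float64(rowCnt) / (bHi - bLo)
	return uint64(density*(hi-lo) + 0.5)
}
```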

expRight []sql.HistogramBucket
}{
{
left: []sql.HistogramBucket{

&Bucket{RowCnt: 10, DistinctCnt: 10, BoundVal: sql.Row{30}, BoundCnt: 1},
&Bucket{RowCnt: 10, DistinctCnt: 10, BoundVal: sql.Row{50}, BoundCnt: 1},
},
expLeft: []sql.HistogramBucket{
Contributor:

Similar to above, couldn't the result of this be:

&Bucket{RowCnt: 10, DistinctCnt: 10, BoundVal: sql.Row{0}, BoundCnt: 1},
&Bucket{RowCnt: 10, DistinctCnt: 10, BoundVal: sql.Row{10}, BoundCnt: 1},
&Bucket{RowCnt: 23, DistinctCnt: 3, BoundVal: sql.Row{20}, BoundCnt: 1},
&Bucket{RowCnt: 3, DistinctCnt: 3, BoundVal: sql.Row{30}, BoundCnt: 1},
&Bucket{RowCnt: 3, DistinctCnt: 3, BoundVal: sql.Row{40}, BoundCnt: 1},

exp sql.Histogram
}{
{
left: []sql.HistogramBucket{

&Bucket{RowCnt: 20, DistinctCnt: 10, BoundVal: sql.Row{10}, McvVals: []sql.Row{{1}, {2}}, McvsCnt: []uint64{5, 5}, BoundCnt: 1},
},
right: []sql.HistogramBucket{
&Bucket{RowCnt: 10, DistinctCnt: 10, BoundVal: sql.Row{10}, McvVals: []sql.Row{{2}}, McvsCnt: []uint64{4}, BoundCnt: 1},
Contributor:

My first thought looking at this test was: "Every row in the right is distinct, so there can't be more than 20 rows in the join." Then I was surprised that the expected result was 30.

I think the bucket here is impossible, since it's not possible to have 10 values, all distinct, and also have one value with a count of 4.

max-hoffman (Author):

Yeah, this tracks, will fix.
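The impossible bucket the reviewer spotted suggests an invariant worth checking in test fixtures: MCV counts must fit inside the row count, and a bucket with all-distinct rows cannot carry an MCV with count greater than 1. A hypothetical validation helper (not part of the PR):

```go
package main

// mcvsConsistent checks the invariant the reviewer identified: the sum of
// a bucket's MCV counts must not exceed its row count, and if every row is
// distinct (rowCnt == distinctCnt) no MCV may have a count greater than 1.
// Hypothetical helper, not part of the PR.
func mcvsConsistent(rowCnt, distinctCnt uint64, mcvCnts []uint64) bool {
	var total uint64
	for _, c := range mcvCnts {
		if rowCnt == distinctCnt && c > 1 {
			return false
		}
		total += c
	}
	return total <= rowCnt
}
```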

@max-hoffman max-hoffman merged commit 6800629 into main Jan 17, 2024
@max-hoffman max-hoffman deleted the max/join-estimation branch January 17, 2024 02:24