Join estimation algos #2212

Merged — merged 27 commits into main from max/join-estimation on Jan 17, 2024
Conversation

max-hoffman (Contributor) commented Dec 21, 2023

Code and tests related to estimating output distributions for joins. Limited to numeric types.

@max-hoffman max-hoffman marked this pull request as ready for review December 28, 2023 16:21
runSuite(t, tests, 1000, 10, false)
}

func runSuite(t *testing.T, tests []statsTest, rowCnt, bucketCnt int, debug bool) {
Contributor:
Can you add a docstring explaining what runSuite is evaluating? I gather that it's getting the histogram of the join between the tables, but what is it asserting about the histogram?

}
}

func testHistogram(ctx *sql.Context, table *plan.ResolvedTable, fields []int, buckets int) ([]sql.HistogramBucket, error) {
Contributor:
This feels like a lot of logic for a test. I'm worried we're going to spend a lot of time testing the test. Is there not some equivalent code that GMS uses to compute histograms?

max-hoffman (Author):

The Dolt side is unique, and I made the GMS side use reservoir sampling to skip steps while bootstrapping. We could probably use this one as the default for the GMS implementation; I doubt anyone will earnestly use the in-memory stats soon.

expRight []sql.HistogramBucket
}{
{
left: []sql.HistogramBucket{
Contributor:

What determines how many buckets are in the expected output? I was hoping the tests would make it clear, but I'm not sure why, for instance, there's no 0 upper bound in the output of the second test.

max-hoffman (Author) commented Jan 9, 2024:

Related to the other comment: I don't know how to split (-infinity, 10) into (-infinity, 0) and (0, 10). So I extend (-infinity, 0) to (-infinity, 10) by stealing the (0, 10) portion from (0, 50).
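The "stealing" step described here amounts to splitting a bucket at an interior cut point under a uniform-distribution assumption. A hypothetical helper sketching that interpolation (not the PR's actual code):

```go
package main

// cutBucket splits a bucket spanning (prev, bound] at cut, assuming rows
// are uniformly distributed across the bucket's value range. It returns
// the row counts attributed to (prev, cut] and (cut, bound]. Hypothetical
// helper illustrating the step described above, not the PR's code.
func cutBucket(prev, bound, cut float64, rowCnt uint64) (below, above uint64) {
	frac := (cut - prev) / (bound - prev)
	below = uint64(float64(rowCnt)*frac + 0.5)
	above = rowCnt - below
	return below, above
}
```

For example, stealing (0, 10] from a 10-row bucket over (0, 50] attributes one fifth of its rows to the stolen range.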

Contributor:

Hmm... could we avoid this by requiring that the first bucket has a count of 0 (and thus establishes a lower bound for all buckets)? Or by using a struct { lowerBound sql.Row; buckets []sql.HistogramBucket }?

max-hoffman (Author) commented Jan 12, 2024:

I'm not against tracking a lower bound; I just didn't see that in other DBs' implementations, so I was suspicious of whether it was practical. It's a pretty easy improvement later if so.

max-hoffman (Author):

I ended up having to implement lower bounds for the blog demo to work; it makes a big difference. On the Dolt side I'm going to leave the bound in-memory; storing it as a row with the rest of the histograms was a bit awkward.
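The shape the reviewer suggested can be sketched as below, with stand-in types for go-mysql-server's sql.Row and sql.HistogramBucket (the real types live in the sql package and carry more fields):

```go
package main

// Row and HistogramBucket stand in for go-mysql-server's sql.Row and
// sql.HistogramBucket; the fields here are a simplified subset.
type Row []interface{}

type HistogramBucket struct {
	RowCnt, DistinctCnt uint64
	BoundVal            Row
}

// boundedHistogram sketches the struct the reviewer suggested: carrying an
// explicit lower bound means the first bucket covers (lowerBound, BoundVal]
// instead of (-infinity, BoundVal], which avoids the first-bucket splitting
// ambiguity discussed above.
type boundedHistogram struct {
	lowerBound Row
	buckets    []HistogramBucket
}
```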

exp sql.Histogram
}{
{
left: []sql.HistogramBucket{
Contributor:

I want to make sure I'm understanding this example right.

On the left side of the merge, there are 20 distinct values, assumed to be uniformly distributed between 0 and 10.

On the right side of the merge, there are 10 distinct values, assumed to be uniformly distributed between 0 and 20.

We expect the histogram for the merge to say that there are 10 distinct values uniformly distributed between 0 and 10.

This doesn't square with my intuition, since I would expect only half of the values from the right would be possible join candidates, so the expected number of rows should be at most 5.

I may not be properly understanding what the histogram represents or how it's being computed though.

max-hoffman (Author):

I think what you are suggesting is that we could truncate the RHS to a bound value of 10, and delete half of the contents. But the valid ranges are negative infinity to bound value. Most of the keys might be negative, all of the keys could overlap, or none could. Different databases handle the first bucket special case differently, I haven't had a lot of time to test alternatives.

Contributor:

Ah right, these have -infinity as the minimum bound. The result here makes sense then.
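The numbers in this exchange are consistent with the textbook per-bucket equi-join estimate, rows_L * rows_R / max(ndv_L, ndv_R): left with 20 rows over 20 distinct values and right with 10 rows over 10 distinct values yields 10. A sketch of that formula, with illustrative names rather than the PR's API:

```go
package main

// joinBucketCard sketches the common per-bucket equi-join cardinality
// estimate: for two buckets over the same value range,
// |L join R| ~= rowsL * rowsR / max(ndvL, ndvR).
// Illustrative names, not the PR's API.
func joinBucketCard(rowsL, ndvL, rowsR, ndvR uint64) uint64 {
	maxNdv := ndvL
	if ndvR > maxNdv {
		maxNdv = ndvR
	}
	if maxNdv == 0 {
		return 0
	}
	return rowsL * rowsR / maxNdv
}
```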

},
expLeft: []sql.HistogramBucket{
&Bucket{RowCnt: 12, DistinctCnt: 12, BoundVal: sql.Row{10}, BoundCnt: 1},
&Bucket{RowCnt: 6, DistinctCnt: 6, BoundVal: sql.Row{20}, BoundCnt: 1},
Contributor:

Shouldn't this be 2 for the row bound by 20, and 6 for the row bound by 50?

If I'm reading left correctly, then the range from 0 to 50 has a density of 2 rows per 10 values. So the bucket with a range of 10 (10-20) should have 2, and the bucket with a range of 30 (20-50) should have 6.

max-hoffman (Author):

Yeah, you're right, I'll see what's going on there.

max-hoffman (Author):

Looks like I swapped the cut fractions, good eye.
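The reviewer's arithmetic above assumes uniform density within the bucket: 10 rows over (0, 50] is 0.2 rows per value, so the (10, 20] slice gets 2 rows and (20, 50] gets 6. A hypothetical helper making that calculation explicit (not the PR's code):

```go
package main

// rowsInRange estimates how many of a bucket's rows fall in (lo, hi],
// assuming the bucket spans (bLo, bHi] with uniformly distributed rows.
// Hypothetical helper mirroring the reviewer's arithmetic above.
func rowsInRange(bLo, bHi, lo, hi float64, rowCnt uint64) uint64 {
	density := float64(rowCnt) / (bHi - bLo)
	return uint64(density*(hi-lo) + 0.5)
}
```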

expRight []sql.HistogramBucket
}{
{
left: []sql.HistogramBucket{

&Bucket{RowCnt: 10, DistinctCnt: 10, BoundVal: sql.Row{30}, BoundCnt: 1},
&Bucket{RowCnt: 10, DistinctCnt: 10, BoundVal: sql.Row{50}, BoundCnt: 1},
},
expLeft: []sql.HistogramBucket{
Contributor:

Similar to above, couldn't the result of this be:

&Bucket{RowCnt: 10, DistinctCnt: 10, BoundVal: sql.Row{0}, BoundCnt: 1},
&Bucket{RowCnt: 10, DistinctCnt: 10, BoundVal: sql.Row{10}, BoundCnt: 1},
&Bucket{RowCnt: 23, DistinctCnt: 3, BoundVal: sql.Row{20}, BoundCnt: 1},
&Bucket{RowCnt: 3, DistinctCnt: 3, BoundVal: sql.Row{30}, BoundCnt: 1},
&Bucket{RowCnt: 3, DistinctCnt: 3, BoundVal: sql.Row{40}, BoundCnt: 1},

exp sql.Histogram
}{
{
left: []sql.HistogramBucket{

&Bucket{RowCnt: 20, DistinctCnt: 10, BoundVal: sql.Row{10}, McvVals: []sql.Row{{1}, {2}}, McvsCnt: []uint64{5, 5}, BoundCnt: 1},
},
right: []sql.HistogramBucket{
&Bucket{RowCnt: 10, DistinctCnt: 10, BoundVal: sql.Row{10}, McvVals: []sql.Row{{2}}, McvsCnt: []uint64{4}, BoundCnt: 1},
Contributor:

My first thought looking at this test was: "Every row in the right is distinct, so there can't be more than 20 rows in the join." Then I was surprised that the expected result was 30.

I think the bucket here is impossible, since it's not possible to have 10 values, all distinct, and also have one value with a count of 4.

max-hoffman (Author):

Yeah, this tracks, will fix.
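The impossible bucket the reviewer spotted suggests an invariant worth checking in test fixtures: MCV counts must fit inside the row count, and a bucket with all-distinct rows cannot carry an MCV with count greater than 1. A hypothetical validation helper (not part of the PR):

```go
package main

// mcvsConsistent checks the invariant the reviewer identified: the sum of
// a bucket's MCV counts must not exceed its row count, and if every row is
// distinct (rowCnt == distinctCnt) no MCV may have a count greater than 1.
// Hypothetical helper, not part of the PR.
func mcvsConsistent(rowCnt, distinctCnt uint64, mcvCnts []uint64) bool {
	var total uint64
	for _, c := range mcvCnts {
		if rowCnt == distinctCnt && c > 1 {
			return false
		}
		total += c
	}
	return total <= rowCnt
}
```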

@max-hoffman max-hoffman merged commit 6800629 into main Jan 17, 2024
@max-hoffman max-hoffman deleted the max/join-estimation branch January 17, 2024 02:24