sql/stats: add simple linear regression over float64s and quantiles #84659

michae2 · 2022-07-19T17:42:59Z

Statistics forecasting will initially use simple linear regression over
time to predict all table statistics (only keeping predictions that fit
the linear model well). Most table statistics are scalar float64s, for
which we can use a textbook ordinary least squares (OLS) method with x
as time and y as the table statistic.

For histograms, we use a variant of OLS based on the 2010 paper
"Ordinary Least Squares for Histogram Data Based on Wasserstein
Distance" by Verde and Irpino. The paper outlines an OLS method when
both x and y are histograms. In our case, x is a scalar, so we adjust
the math slightly.

Assists: #79872

Release note: None

cockroach-teamcity · 2022-07-19T17:43:10Z

This change is

michae2 · 2022-07-19T17:44:26Z

This has been pulled out of #77070 and cleaned up a bit.

This is the math-heavy PR. After this, no more math, I swear! 🙂

michae2 · 2022-07-19T19:15:48Z

Made a linter happy.

yuzefovich

Kudos for great documentation! I have some nits, but I didn't review the math too closely, so will defer to others for approval 😄

Reviewed 1 of 1 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @rytaft)

pkg/sql/stats/quantile.go line 508 at r2 (raw file):

		if q[i].p < r[j].p {
			// Find the value of r within the line segment at q[i].p.
			m, b := quantilePiece(r[j-1], r[j])

nit: the calculation in these two lines of rv is repeated several times - maybe extract it out in another helper?

pkg/sql/stats/quantile.go line 526 at r2 (raw file):

	// Handle any trailing p=1 points.
	for ; i < len(q); i++ {

nit: maybe use len(r)-1 and len(q)-1 in these loops instead of j-1 and i-1 respectively to make things more clear?

pkg/sql/stats/quantile.go line 688 at r2 (raw file):

	// as the original malformed quantile function.

	fixed := make(quantile, 0, len(q))

nit: should we introduce a fast-path where we check whether q is malformed and short-circuit if it is well-formed already?

pkg/sql/stats/quantile.go line 709 at r2 (raw file):

			// Right point of pieces which have one endpoint <= val and one endpoint >
			// val.
			crossingPieces []quantileIndex

nit: I think we can reduce the number of allocations here. IIUC having only a single slice for crossingPieces would be sufficient for all values of i, i.e. we'd do something like crossingPieces := prevCrossingPieces[:0] and call it day. Ditto for eqPieces and for intersectionPs where we can use a "global" scratch. It's probably better to not optimize the code right away, so maybe leave a TODO about this?

mgartner

Kudos for great documentation!

+:100:

I left a few nits, but overall looks great!

Reviewed 1 of 1 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)

pkg/sql/stats/quantile.go line 489 at r2 (raw file):

// called on a discontinuity (vertical line where p₁ = p₂).
func quantilePiece(c, d quantilePoint) (m, b float64) {
	m = (d.v - c.v) / (d.p - c.p)

nit: should we return an assertion error if d.p - c.p == 0?

pkg/sql/stats/quantile.go line 508 at r2 (raw file):

		if q[i].p < r[j].p {
			// Find the value of r within the line segment at q[i].p.
			m, b := quantilePiece(r[j-1], r[j])

Is there something implicit here that prevents r[j-1] to r[j] from being a discontinuity? I have the same question for the other usages of quantilePiece below.

pkg/sql/stats/quantile.go line 540 at r2 (raw file):

// on how their points align. The new quantile function might be malformed, even
// if both q and r were well-formed.
func (q quantile) sub(r quantile) quantile {

nit: It looks like sub is identical to add except for the operator used to combine values. You could combine them into a single implementation that takes either func(x, y float64) float64 { return x + y } or func(x, y float64) float64 { return x - y }. (Too bad + and - aren't first class functions in Go.)

pkg/sql/stats/simple_linear_regression.go line 81 at r2 (raw file):

		r2 = 1 - Σεε/Σyy
	}
	return

nit: return yₙ, r2

pkg/sql/stats/simple_linear_regression.go line 184 at r2 (raw file):

		r2 = 1 - Σεε/Σyy
	}
	return

nit: return yₙ, r2

rytaft

I love this!

Reviewed 1 of 1 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2)

pkg/sql/stats/quantile.go line 491 at r2 (raw file):

	m = (d.v - c.v) / (d.p - c.p)
	b = c.v - m*c.p
	return

nit: even with named return params, we usually try to be explicit and say return m, b

pkg/sql/stats/quantile.go line 735 at r2 (raw file):

		// also moves i past all points with the current val.
		for ; i < len(pointsByValue) && q[pointsByValue[i]].v == val; i++ {
			b := pointsByValue[i]

nit: can you use a diff variable name than b? I think of b as the intercept now. Maybe j?

pkg/sql/stats/quantile.go line 744 at r2 (raw file):

				crossingPieces = append(crossingPieces, b+1)
			}
			// Also look for flat pieces = val. We only need to check to the left.

nit: left -> right?

pkg/sql/stats/quantile.go line 788 at r2 (raw file):

		lessEqP := intersectionPs[0]
		for j := 1; j < len(intersectionPs); j += 2 {
			lessEqP += intersectionPs[j+1] - intersectionPs[j]

Looks like this could cause index out of bounds

pkg/sql/stats/simple_linear_regression.go line 34 at r2 (raw file):

	// * x̅ = (1/n) ∑ xᵢ
	// * y̅ = (1/n) ∑ yᵢ
	// * ŷᵢ = α̂ + β̂xᵢ

love all the math symbols! (and the comment is great too!) 😄

pkg/sql/stats/simple_linear_regression.go line 40 at r2 (raw file):

	// background.

	n := len(x)

Should you add an assertion that y has the same length?

pkg/sql/stats/simple_linear_regression.go line 136 at r2 (raw file):

	// https://en.wikipedia.org/wiki/Earth_mover%27s_distance for some background.

	n := len(x)

ditto, should we assert y has the same length?

pkg/sql/stats/quantile_test.go line 1254 at r2 (raw file):

		},
	}
	for i, tc := range testCases {

What do you think about also testing a few random quantiles and making sure the invariants hold?

Make a linter happy. Release note: None

michae2

TFTRs!

In addition to responding to various comments, I added random tests of the two simple linear regression functions. This made me realize that fixing malformed quantiles inside quantileSimpleLinearRegression made the returned R² value overly-pessimistic, and also made the function really hard to test. So now I'll call fixMalformed on the return value of quantileSimpleLinearRegression instead of relying on quantileSimpleLinearRegression to call it for me.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, @rytaft, and @yuzefovich)

pkg/sql/stats/quantile.go line 489 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: should we return an assertion error if d.p - c.p == 0?

It kills me a little to add error handling or allocation in a little pure math function like this. (But maybe I am being a crotchety old unsafe C programmer?)

pkg/sql/stats/quantile.go line 491 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: even with named return params, we usually try to be explicit and say return m, b

Done.

pkg/sql/stats/quantile.go line 508 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: the calculation in these two lines of rv is repeated several times - maybe extract it out in another helper?

I moved it into quantilePieceV (and also added a quantilePieceP used by another place).

pkg/sql/stats/quantile.go line 508 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Is there something implicit here that prevents r[j-1] to r[j] from being a discontinuity? I have the same question for the other usages of quantilePiece below.

Yes, there is (great question):

We visited r[j-1] before q[i] implying r[j-1].p <= q[i].p.
Now we're in the q[i].p < r[j].p case.
So putting those together, we know that r[j-1].p <= q[i].p < r[j].p and therefore r[j-1].p < r[j].p meaning the piece from r[j-1] to r[j] isn't a discontinuity.

pkg/sql/stats/quantile.go line 526 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: maybe use len(r)-1 and len(q)-1 in these loops instead of j-1 and i-1 respectively to make things more clear?

Done.

pkg/sql/stats/quantile.go line 540 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: It looks like sub is identical to add except for the operator used to combine values. You could combine them into a single implementation that takes either func(x, y float64) float64 { return x + y } or func(x, y float64) float64 { return x - y }. (Too bad + and - aren't first class functions in Go.)

Good point. I factored out a binaryOp method, LMK what you think.

pkg/sql/stats/quantile.go line 688 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: should we introduce a fast-path where we check whether q is malformed and short-circuit if it is well-formed already?

Oh, great point. Yes, absolutely. Done.

pkg/sql/stats/quantile.go line 709 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: I think we can reduce the number of allocations here. IIUC having only a single slice for crossingPieces would be sufficient for all values of i, i.e. we'd do something like crossingPieces := prevCrossingPieces[:0] and call it day. Ditto for eqPieces and for intersectionPs where we can use a "global" scratch. It's probably better to not optimize the code right away, so maybe leave a TODO about this?

Good point. I moved the allocations outside the loop, and this also inspired me to only use a single crossingPieces slice.

pkg/sql/stats/quantile.go line 735 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: can you use a diff variable name than b? I think of b as the intercept now. Maybe j?

Ah yeah, great point. I tried using qi, WDYT?

pkg/sql/stats/quantile.go line 744 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: left -> right?

Oh! Good catch! Done.

pkg/sql/stats/quantile.go line 788 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Looks like this could cause index out of bounds

Oh, I meant to leave a comment about this: intersectionPs is guaranteed to always have an odd number of items at this point. (Is it too tricky to rely on that? WDYT?)

pkg/sql/stats/simple_linear_regression.go line 34 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

love all the math symbols! (and the comment is great too!) 😄

I'm glad, I hope it isn't too silly to use all the symbols 😛

pkg/sql/stats/simple_linear_regression.go line 81 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: return yₙ, r2

Done.

pkg/sql/stats/simple_linear_regression.go line 184 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: return yₙ, r2

Done.

pkg/sql/stats/quantile_test.go line 1254 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

What do you think about also testing a few random quantiles and making sure the invariants hold?

Oh, great idea. I added a randQuantile function for some tests in the other file. I'll swing around and do this tomorrow.

mgartner

Reviewed 5 of 5 files at r3, 5 of 5 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)

pkg/sql/stats/quantile.go line 489 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

It kills me a little to add error handling or allocation in a little pure math function like this. (But maybe I am being a crotchety old unsafe C programmer?)

Feel free to leave as-is.

pkg/sql/stats/quantile.go line 508 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Yes, there is (great question):

We visited r[j-1] before q[i] implying r[j-1].p <= q[i].p.

Now we're in the q[i].p < r[j].p case.

So putting those together, we know that r[j-1].p <= q[i].p < r[j].p and therefore r[j-1].p < r[j].p meaning the piece from r[j-1] to r[j] isn't a discontinuity.

Thanks for the explanation!

For the case when i=1, j=1, there cannot be a discontinuity because it is impossible for the points x[0] and x[1] of any quantile function x to be a discontinuity, correct? Is that a fourth point that we should mention in the comment you added?

pkg/sql/stats/quantile.go line 540 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Good point. I factored out a binaryOp method, LMK what you think.

👍

nit: add a comment explaining the op argument.

rytaft

Reviewed 5 of 5 files at r3, 5 of 5 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)

pkg/sql/stats/quantile.go line 491 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Done.

Doesn't look like you pushed the change (now line 494)

pkg/sql/stats/quantile.go line 735 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Ah yeah, great point. I tried using qi, WDYT?

LGTM

pkg/sql/stats/quantile.go line 514 at r4 (raw file):

}

func (q quantile) binaryOp(r quantile, op func(qv, rv float64) float64) quantile {

nit: add a function comment

pkg/sql/stats/quantile_test.go line 250 at r4 (raw file):

	}
	sort.Float64s(ps)
	sort.Float64s(vs)

Does this ensure that the resulting quantile is well-formed? Should we produce a malformed quantile some fraction of the time?

michae2

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @michae2, and @rytaft)

pkg/sql/stats/quantile.go line 491 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Doesn't look like you pushed the change (now line 494)

Gah, sorry! Done for real this time.

pkg/sql/stats/quantile.go line 508 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Thanks for the explanation!

For the case when i=1, j=1, there cannot be a discontinuity because it is impossible for the points x[0] and x[1] of any quantile function x to be a discontinuity, correct? Is that a fourth point that we should mention in the comment you added?

The piece from r[0] to r[1] actually could be a discontinuity in a quantile, but we still won't call quantilePiece on this as long as both r[0].p == 0 and q[0].p == 0 (because then r[1].p will also be zero, and we'll either fall into the r[j].p < q[i].p case or the r[j].p == q[i].p case).

So the assumption we're relying on for the i=1, j=1 case is that both r[0].p == 0 and q[0].p == 0 (or more generally r[0].p == q[0].p). I added this to the comment.

pkg/sql/stats/quantile.go line 540 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

👍

nit: add a comment explaining the op argument.

Done.

pkg/sql/stats/quantile.go line 514 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: add a function comment

Done.

pkg/sql/stats/quantile_test.go line 1254 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Oh, great idea. I added a randQuantile function for some tests in the other file. I'll swing around and do this tomorrow.

Ok, added TestQuantileOpsRandom.

pkg/sql/stats/quantile_test.go line 250 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Does this ensure that the resulting quantile is well-formed? Should we produce a malformed quantile some fraction of the time?

Oh, good point. Now we produce a malformed quantile some of the time.

rytaft

Reviewed 2 of 2 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2)

pkg/sql/stats/quantile.go line 529 at r5 (raw file):

			// This piece of r will never be a discontinuity. Proof:
			// 1. We visited r[j-1] before q[i], implying r[j-1].p <= q[i].p. (For the
			//    i=1 j=1 case, this relies on the assumption that q[0].p = r[0].p.)

nit: Maybe make it explicit that q[0].p = r[0].p = 0?

rytaft

Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2)

pkg/sql/stats/simple_linear_regression.go line 40 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Should you add an assertion that y has the same length?

Just want to make sure this comment didn't get lost

pkg/sql/stats/simple_linear_regression.go line 136 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

ditto, should we assert y has the same length?

ditto

michae2

Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2 and @rytaft)

pkg/sql/stats/simple_linear_regression.go line 40 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Just want to make sure this comment didn't get lost

How would you feel about a panic if the lengths are different? Initially this function will only be called from one place (ignoring testing), so I'd rather not add an error to the function signature for this.

Statistics forecasting will initially use simple linear regression over time to predict all table statistics (only keeping predictions that fit the linear model well). Most table statistics are scalar float64s, for which we can use a textbook ordinary least squares (OLS) method with x as time and y as the table statistic. For histograms, we use a variant of OLS based on the 2010 paper "Ordinary Least Squares for Histogram Data Based on Wasserstein Distance" by Verde and Irpino. The paper outlines an OLS method when both x and y are histograms. In our case, x is a scalar, so we adjust the math slightly. Assists: cockroachdb#79872 Release note: None

michae2

TFTRs! I really appreciate the detailed reads!!!

bors r+

Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @mgartner and @rytaft)

pkg/sql/stats/quantile.go line 529 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: Maybe make it explicit that q[0].p = r[0].p = 0?

Added to the comment.

pkg/sql/stats/simple_linear_regression.go line 136 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

ditto

Added a panic.

rytaft

Reviewed 2 of 2 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @michae2)

pkg/sql/stats/simple_linear_regression.go line 40 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

How would you feel about a panic if the lengths are different? Initially this function will only be called from one place (ignoring testing), so I'd rather not add an error to the function signature for this.

SGTM

craig · 2022-07-25T22:13:35Z

Build succeeded:

GitHub CI (Cockroach)

mgartner · 2022-07-25T22:20:59Z

pkg/sql/stats/quantile.go line 508 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

The piece from r[0] to r[1] actually could be a discontinuity in a quantile, but we still won't call quantilePiece on this as long as both r[0].p == 0 and q[0].p == 0 (because then r[1].p will also be zero, and we'll either fall into the r[j].p < q[i].p case or the r[j].p == q[i].p case).

So the assumption we're relying on for the i=1, j=1 case is that both r[0].p == 0 and q[0].p == 0 (or more generally r[0].p == q[0].p). I added this to the comment.

I see. Thanks for explaining!

michae2 requested a review from a team as a code owner July 19, 2022 17:43

michae2 requested review from mgartner, rytaft and yuzefovich July 19, 2022 17:43

michae2 mentioned this pull request Jul 19, 2022

opt: predict future statistics based on historical stats #79872

Closed

michae2 force-pushed the quantile branch from 2c1e49a to 90e2aed Compare July 19, 2022 19:15

yuzefovich reviewed Jul 19, 2022

View reviewed changes

mgartner approved these changes Jul 19, 2022

View reviewed changes

rytaft reviewed Jul 24, 2022

View reviewed changes

sql/stats: rename receiver in quantile methods from qf to q

e63324c

Make a linter happy. Release note: None

michae2 force-pushed the quantile branch from 90e2aed to f1c7c68 Compare July 25, 2022 05:37

michae2 commented Jul 25, 2022

View reviewed changes

mgartner approved these changes Jul 25, 2022

View reviewed changes

rytaft reviewed Jul 25, 2022

View reviewed changes

michae2 force-pushed the quantile branch from f1c7c68 to 98332c6 Compare July 25, 2022 17:54

michae2 commented Jul 25, 2022

View reviewed changes

rytaft approved these changes Jul 25, 2022

View reviewed changes

rytaft reviewed Jul 25, 2022

View reviewed changes

michae2 commented Jul 25, 2022

View reviewed changes

michae2 force-pushed the quantile branch from 98332c6 to bf2c151 Compare July 25, 2022 19:50

michae2 commented Jul 25, 2022

View reviewed changes

rytaft approved these changes Jul 25, 2022

View reviewed changes

craig bot merged commit 69e5187 into cockroachdb:master Jul 25, 2022

michae2 deleted the quantile branch July 26, 2022 04:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql/stats: add simple linear regression over float64s and quantiles #84659

sql/stats: add simple linear regression over float64s and quantiles #84659

michae2 commented Jul 19, 2022 •

edited

Loading

cockroach-teamcity commented Jul 19, 2022

michae2 commented Jul 19, 2022

michae2 commented Jul 19, 2022

yuzefovich left a comment

mgartner left a comment

rytaft left a comment

michae2 left a comment

mgartner left a comment

rytaft left a comment

michae2 left a comment

rytaft left a comment

rytaft left a comment

michae2 left a comment

michae2 left a comment

rytaft left a comment

craig bot commented Jul 25, 2022

mgartner commented Jul 25, 2022

sql/stats: add simple linear regression over float64s and quantiles #84659

sql/stats: add simple linear regression over float64s and quantiles #84659

Conversation

michae2 commented Jul 19, 2022 • edited Loading

cockroach-teamcity commented Jul 19, 2022

michae2 commented Jul 19, 2022

michae2 commented Jul 19, 2022

yuzefovich left a comment

Choose a reason for hiding this comment

mgartner left a comment

Choose a reason for hiding this comment

rytaft left a comment

Choose a reason for hiding this comment

michae2 left a comment

Choose a reason for hiding this comment

mgartner left a comment

Choose a reason for hiding this comment

rytaft left a comment

Choose a reason for hiding this comment

michae2 left a comment

Choose a reason for hiding this comment

rytaft left a comment

Choose a reason for hiding this comment

rytaft left a comment

Choose a reason for hiding this comment

michae2 left a comment

Choose a reason for hiding this comment

michae2 left a comment

Choose a reason for hiding this comment

rytaft left a comment

Choose a reason for hiding this comment

craig bot commented Jul 25, 2022

mgartner commented Jul 25, 2022

michae2 commented Jul 19, 2022 •

edited

Loading