Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/stats: add simple linear regression over float64s and quantiles #84659

Merged
merged 2 commits into from
Jul 25, 2022

Conversation

michae2
Copy link
Collaborator

@michae2 michae2 commented Jul 19, 2022

Statistics forecasting will initially use simple linear regression over
time to predict all table statistics (only keeping predictions that fit
the linear model well). Most table statistics are scalar float64s, for
which we can use a textbook ordinary least squares (OLS) method with x
as time and y as the table statistic.

For histograms, we use a variant of OLS based on the 2010 paper
"Ordinary Least Squares for Histogram Data Based on Wasserstein
Distance" by Verde and Irpino. The paper outlines an OLS method when
both x and y are histograms. In our case, x is a scalar, so we adjust
the math slightly.

Assists: #79872

Release note: None

@michae2 michae2 requested a review from a team as a code owner July 19, 2022 17:43
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@michae2
Copy link
Collaborator Author

michae2 commented Jul 19, 2022

This has been pulled out of #77070 and cleaned up a bit.

This is the math-heavy PR. After this, no more math, I swear! 🙂

@michae2
Copy link
Collaborator Author

michae2 commented Jul 19, 2022

Made a linter happy.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kudos for great documentation! I have some nits, but I didn't review the math too closely, so will defer to others for approval 😄

Reviewed 1 of 1 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @rytaft)


pkg/sql/stats/quantile.go line 508 at r2 (raw file):

		if q[i].p < r[j].p {
			// Find the value of r within the line segment at q[i].p.
			m, b := quantilePiece(r[j-1], r[j])

nit: the calculation in these two lines of rv is repeated several times - maybe extract it out in another helper?


pkg/sql/stats/quantile.go line 526 at r2 (raw file):

	// Handle any trailing p=1 points.
	for ; i < len(q); i++ {

nit: maybe use len(r)-1 and len(q)-1 in these loops instead of j-1 and i-1 respectively to make things more clear?


pkg/sql/stats/quantile.go line 688 at r2 (raw file):

	// as the original malformed quantile function.

	fixed := make(quantile, 0, len(q))

nit: should we introduce a fast-path where we check whether q is malformed and short-circuit if it is well-formed already?


pkg/sql/stats/quantile.go line 709 at r2 (raw file):

			// Right point of pieces which have one endpoint <= val and one endpoint >
			// val.
			crossingPieces []quantileIndex

nit: I think we can reduce the number of allocations here. IIUC having only a single slice for crossingPieces would be sufficient for all values of i, i.e. we'd do something like crossingPieces := prevCrossingPieces[:0] and call it day. Ditto for eqPieces and for intersectionPs where we can use a "global" scratch. It's probably better to not optimize the code right away, so maybe leave a TODO about this?

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kudos for great documentation!

+:100:

I left a few nits, but overall looks great!

Reviewed 1 of 1 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)


pkg/sql/stats/quantile.go line 489 at r2 (raw file):

// called on a discontinuity (vertical line where p₁ = p₂).
func quantilePiece(c, d quantilePoint) (m, b float64) {
	m = (d.v - c.v) / (d.p - c.p)

nit: should we return an assertion error if d.p - c.p == 0?


pkg/sql/stats/quantile.go line 508 at r2 (raw file):

		if q[i].p < r[j].p {
			// Find the value of r within the line segment at q[i].p.
			m, b := quantilePiece(r[j-1], r[j])

Is there something implicit here that prevents r[j-1] to r[j] from being a discontinuity? I have the same question for the other usages of quantilePiece below.


pkg/sql/stats/quantile.go line 540 at r2 (raw file):

// on how their points align. The new quantile function might be malformed, even
// if both q and r were well-formed.
func (q quantile) sub(r quantile) quantile {

nit: It looks like sub is identical to add except for the operator used to combine values. You could combine them into a single implementation that takes either func(x, y float64) float64 { return x + y } or func(x, y float64) float64 { return x - y }. (Too bad + and - aren't first class functions in Go.)


pkg/sql/stats/simple_linear_regression.go line 81 at r2 (raw file):

		r2 = 1 - Σεε/Σyy
	}
	return

nit: return yₙ, r2


pkg/sql/stats/simple_linear_regression.go line 184 at r2 (raw file):

		r2 = 1 - Σεε/Σyy
	}
	return

nit: return yₙ, r2

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this!

Reviewed 1 of 1 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2)


pkg/sql/stats/quantile.go line 491 at r2 (raw file):

	m = (d.v - c.v) / (d.p - c.p)
	b = c.v - m*c.p
	return

nit: even with named return params, we usually try to be explicit and say return m, b


pkg/sql/stats/quantile.go line 735 at r2 (raw file):

		// also moves i past all points with the current val.
		for ; i < len(pointsByValue) && q[pointsByValue[i]].v == val; i++ {
			b := pointsByValue[i]

nit: can you use a diff variable name than b? I think of b as the intercept now. Maybe j?


pkg/sql/stats/quantile.go line 744 at r2 (raw file):

				crossingPieces = append(crossingPieces, b+1)
			}
			// Also look for flat pieces = val. We only need to check to the left.

nit: left -> right?


pkg/sql/stats/quantile.go line 788 at r2 (raw file):

		lessEqP := intersectionPs[0]
		for j := 1; j < len(intersectionPs); j += 2 {
			lessEqP += intersectionPs[j+1] - intersectionPs[j]

Looks like this could cause index out of bounds


pkg/sql/stats/simple_linear_regression.go line 34 at r2 (raw file):

	// * x̅ = (1/n) ∑ xᵢ
	// * y̅ = (1/n) ∑ yᵢ
	// * ŷᵢ = α̂ + β̂xᵢ

love all the math symbols! (and the comment is great too!) 😄


pkg/sql/stats/simple_linear_regression.go line 40 at r2 (raw file):

	// background.

	n := len(x)

Should you add an assertion that y has the same length?


pkg/sql/stats/simple_linear_regression.go line 136 at r2 (raw file):

	// https://en.wikipedia.org/wiki/Earth_mover%27s_distance for some background.

	n := len(x)

ditto, should we assert y has the same length?


pkg/sql/stats/quantile_test.go line 1254 at r2 (raw file):

		},
	}
	for i, tc := range testCases {

What do you think about also testing a few random quantiles and making sure the invariants hold?

Make a linter happy.

Release note: None
Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

In addition to responding to various comments, I added random tests of the two simple linear regression functions. This made me realize that fixing malformed quantiles inside quantileSimpleLinearRegression made the returned R² value overly-pessimistic, and also made the function really hard to test. So now I'll call fixMalformed on the return value of quantileSimpleLinearRegression instead of relying on quantileSimpleLinearRegression to call it for me.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, @rytaft, and @yuzefovich)


pkg/sql/stats/quantile.go line 489 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: should we return an assertion error if d.p - c.p == 0?

It kills me a little to add error handling or allocation in a little pure math function like this. (But maybe I am being a crotchety old unsafe C programmer?)


pkg/sql/stats/quantile.go line 491 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: even with named return params, we usually try to be explicit and say return m, b

Done.


pkg/sql/stats/quantile.go line 508 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: the calculation in these two lines of rv is repeated several times - maybe extract it out in another helper?

I moved it into quantilePieceV (and also added a quantilePieceP used by another place).


pkg/sql/stats/quantile.go line 508 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Is there something implicit here that prevents r[j-1] to r[j] from being a discontinuity? I have the same question for the other usages of quantilePiece below.

Yes, there is (great question):

  1. We visited r[j-1] before q[i] implying r[j-1].p <= q[i].p.
  2. Now we're in the q[i].p < r[j].p case.
  3. So putting those together, we know that r[j-1].p <= q[i].p < r[j].p and therefore r[j-1].p < r[j].p meaning the piece from r[j-1] to r[j] isn't a discontinuity.

pkg/sql/stats/quantile.go line 526 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: maybe use len(r)-1 and len(q)-1 in these loops instead of j-1 and i-1 respectively to make things more clear?

Done.


pkg/sql/stats/quantile.go line 540 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: It looks like sub is identical to add except for the operator used to combine values. You could combine them into a single implementation that takes either func(x, y float64) float64 { return x + y } or func(x, y float64) float64 { return x - y }. (Too bad + and - aren't first class functions in Go.)

Good point. I factored out a binaryOp method, LMK what you think.


pkg/sql/stats/quantile.go line 688 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: should we introduce a fast-path where we check whether q is malformed and short-circuit if it is well-formed already?

Oh, great point. Yes, absolutely. Done.


pkg/sql/stats/quantile.go line 709 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: I think we can reduce the number of allocations here. IIUC having only a single slice for crossingPieces would be sufficient for all values of i, i.e. we'd do something like crossingPieces := prevCrossingPieces[:0] and call it day. Ditto for eqPieces and for intersectionPs where we can use a "global" scratch. It's probably better to not optimize the code right away, so maybe leave a TODO about this?

Good point. I moved the allocations outside the loop, and this also inspired me to only use a single crossingPieces slice.


pkg/sql/stats/quantile.go line 735 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: can you use a diff variable name than b? I think of b as the intercept now. Maybe j?

Ah yeah, great point. I tried using qi, WDYT?


pkg/sql/stats/quantile.go line 744 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: left -> right?

Oh! Good catch! Done.


pkg/sql/stats/quantile.go line 788 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Looks like this could cause index out of bounds

Oh, I meant to leave a comment about this: intersectionPs is guaranteed to always have an odd number of items at this point. (Is it too tricky to rely on that? WDYT?)


pkg/sql/stats/simple_linear_regression.go line 34 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

love all the math symbols! (and the comment is great too!) 😄

I'm glad, I hope it isn't too silly to use all the symbols 😛


pkg/sql/stats/simple_linear_regression.go line 81 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: return yₙ, r2

Done.


pkg/sql/stats/simple_linear_regression.go line 184 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: return yₙ, r2

Done.


pkg/sql/stats/quantile_test.go line 1254 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

What do you think about also testing a few random quantiles and making sure the invariants hold?

Oh, great idea. I added a randQuantile function for some tests in the other file. I'll swing around and do this tomorrow.

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 5 of 5 files at r3, 5 of 5 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)


pkg/sql/stats/quantile.go line 489 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

It kills me a little to add error handling or allocation in a little pure math function like this. (But maybe I am being a crotchety old unsafe C programmer?)

Feel free to leave as-is.


pkg/sql/stats/quantile.go line 508 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Yes, there is (great question):

  1. We visited r[j-1] before q[i] implying r[j-1].p <= q[i].p.
  2. Now we're in the q[i].p < r[j].p case.
  3. So putting those together, we know that r[j-1].p <= q[i].p < r[j].p and therefore r[j-1].p < r[j].p meaning the piece from r[j-1] to r[j] isn't a discontinuity.

Thanks for the explanation!

For the case when i=1, j=1, there cannot be a discontinuity because it is impossible for the points x[0] and x[1] of any quantile function x to be a discontinuity, correct? Is that a fourth point that we should mention in the comment you added?


pkg/sql/stats/quantile.go line 540 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Good point. I factored out a binaryOp method, LMK what you think.

👍

nit: add a comment explaining the op argument.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r3, 5 of 5 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)


pkg/sql/stats/quantile.go line 491 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Done.

Doesn't look like you pushed the change (now line 494)


pkg/sql/stats/quantile.go line 735 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Ah yeah, great point. I tried using qi, WDYT?

LGTM


pkg/sql/stats/quantile.go line 514 at r4 (raw file):

}

func (q quantile) binaryOp(r quantile, op func(qv, rv float64) float64) quantile {

nit: add a function comment


pkg/sql/stats/quantile_test.go line 250 at r4 (raw file):

	}
	sort.Float64s(ps)
	sort.Float64s(vs)

Does this ensure that the resulting quantile is well-formed? Should we produce a malformed quantile some fraction of the time?

Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @michae2, and @rytaft)


pkg/sql/stats/quantile.go line 491 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Doesn't look like you pushed the change (now line 494)

Gah, sorry! Done for real this time.


pkg/sql/stats/quantile.go line 508 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Thanks for the explanation!

For the case when i=1, j=1, there cannot be a discontinuity because it is impossible for the points x[0] and x[1] of any quantile function x to be a discontinuity, correct? Is that a fourth point that we should mention in the comment you added?

The piece from r[0] to r[1] actually could be a discontinuity in a quantile, but we still won't call quantilePiece on this as long as both r[0].p == 0 and q[0].p == 0 (because then r[1].p will also be zero, and we'll either fall into the r[j].p < q[i].p case or the r[j].p == q[i].p case).

So the assumption we're relying on for the i=1, j=1 case is that both r[0].p == 0 and q[0].p == 0 (or more generally r[0].p == q[0].p). I added this to the comment.


pkg/sql/stats/quantile.go line 540 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

👍

nit: add a comment explaining the op argument.

Done.


pkg/sql/stats/quantile.go line 514 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: add a function comment

Done.


pkg/sql/stats/quantile_test.go line 1254 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Oh, great idea. I added a randQuantile function for some tests in the other file. I'll swing around and do this tomorrow.

Ok, added TestQuantileOpsRandom.


pkg/sql/stats/quantile_test.go line 250 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Does this ensure that the resulting quantile is well-formed? Should we produce a malformed quantile some fraction of the time?

Oh, good point. Now we produce a malformed quantile some of the time.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2)


pkg/sql/stats/quantile.go line 529 at r5 (raw file):

			// This piece of r will never be a discontinuity. Proof:
			// 1. We visited r[j-1] before q[i], implying r[j-1].p <= q[i].p. (For the
			//    i=1 j=1 case, this relies on the assumption that q[0].p = r[0].p.)

nit: Maybe make it explicit that q[0].p = r[0].p = 0?

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2)


pkg/sql/stats/simple_linear_regression.go line 40 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Should you add an assertion that y has the same length?

Just want to make sure this comment didn't get lost


pkg/sql/stats/simple_linear_regression.go line 136 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

ditto, should we assert y has the same length?

ditto

Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2 and @rytaft)


pkg/sql/stats/simple_linear_regression.go line 40 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Just want to make sure this comment didn't get lost

How would you feel about a panic if the lengths are different? Initially this function will only be called from one place (ignoring testing), so I'd rather not add an error to the function signature for this.

Statistics forecasting will initially use simple linear regression over
time to predict all table statistics (only keeping predictions that fit
the linear model well). Most table statistics are scalar float64s, for
which we can use a textbook ordinary least squares (OLS) method with x
as time and y as the table statistic.

For histograms, we use a variant of OLS based on the 2010 paper
"Ordinary Least Squares for Histogram Data Based on Wasserstein
Distance" by Verde and Irpino. The paper outlines an OLS method when
both x and y are histograms. In our case, x is a scalar, so we adjust
the math slightly.

Assists: cockroachdb#79872

Release note: None
Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs! I really appreciate the detailed reads!!!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @mgartner and @rytaft)


pkg/sql/stats/quantile.go line 529 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: Maybe make it explicit that q[0].p = r[0].p = 0?

Added to the comment.


pkg/sql/stats/simple_linear_regression.go line 136 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

ditto

Added a panic.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @michae2)


pkg/sql/stats/simple_linear_regression.go line 40 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

How would you feel about a panic if the lengths are different? Initially this function will only be called from one place (ignoring testing), so I'd rather not add an error to the function signature for this.

SGTM

@craig
Copy link
Contributor

craig bot commented Jul 25, 2022

Build succeeded:

@craig craig bot merged commit 69e5187 into cockroachdb:master Jul 25, 2022
@mgartner
Copy link
Collaborator

pkg/sql/stats/quantile.go line 508 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

The piece from r[0] to r[1] actually could be a discontinuity in a quantile, but we still won't call quantilePiece on this as long as both r[0].p == 0 and q[0].p == 0 (because then r[1].p will also be zero, and we'll either fall into the r[j].p < q[i].p case or the r[j].p == q[i].p case).

So the assumption we're relying on for the i=1, j=1 case is that both r[0].p == 0 and q[0].p == 0 (or more generally r[0].p == q[0].p). I added this to the comment.

I see. Thanks for explaining!

@michae2 michae2 deleted the quantile branch July 26, 2022 04:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants