-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql/stats: add simple linear regression over float64s and quantiles #84659
Conversation
This has been pulled out of #77070 and cleaned up a bit. This is the math-heavy PR. After this, no more math, I swear! 🙂 |
Made a linter happy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kudos for great documentation! I have some nits, but I didn't review the math too closely, so will defer to others for approval 😄
Reviewed 1 of 1 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @rytaft)
pkg/sql/stats/quantile.go
line 508 at r2 (raw file):
if q[i].p < r[j].p { // Find the value of r within the line segment at q[i].p. m, b := quantilePiece(r[j-1], r[j])
nit: the calculation in these two lines of rv
is repeated several times - maybe extract it out in another helper?
pkg/sql/stats/quantile.go
line 526 at r2 (raw file):
// Handle any trailing p=1 points. for ; i < len(q); i++ {
nit: maybe use len(r)-1
and len(q)-1
in these loops instead of j-1
and i-1
respectively to make things more clear?
pkg/sql/stats/quantile.go
line 688 at r2 (raw file):
// as the original malformed quantile function. fixed := make(quantile, 0, len(q))
nit: should we introduce a fast-path where we check whether q
is malformed and short-circuit if it is well-formed already?
pkg/sql/stats/quantile.go
line 709 at r2 (raw file):
// Right point of pieces which have one endpoint <= val and one endpoint > // val. crossingPieces []quantileIndex
nit: I think we can reduce the number of allocations here. IIUC having only a single slice for crossingPieces
would be sufficient for all values of i
, i.e. we'd do something like crossingPieces := prevCrossingPieces[:0]
and call it day. Ditto for eqPieces
and for intersectionPs
where we can use a "global" scratch. It's probably better to not optimize the code right away, so maybe leave a TODO about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kudos for great documentation!
+:100:
I left a few nits, but overall looks great!
Reviewed 1 of 1 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)
pkg/sql/stats/quantile.go
line 489 at r2 (raw file):
// called on a discontinuity (vertical line where p₁ = p₂). func quantilePiece(c, d quantilePoint) (m, b float64) { m = (d.v - c.v) / (d.p - c.p)
nit: should we return an assertion error if d.p - c.p == 0
?
pkg/sql/stats/quantile.go
line 508 at r2 (raw file):
if q[i].p < r[j].p { // Find the value of r within the line segment at q[i].p. m, b := quantilePiece(r[j-1], r[j])
Is there something implicit here that prevents r[j-1]
to r[j]
from being a discontinuity? I have the same question for the other usages of quantilePiece
below.
pkg/sql/stats/quantile.go
line 540 at r2 (raw file):
// on how their points align. The new quantile function might be malformed, even // if both q and r were well-formed. func (q quantile) sub(r quantile) quantile {
nit: It looks like sub
is identical to add
except for the operator used to combine values. You could combine them into a single implementation that takes either func(x, y float64) float64 { return x + y }
or func(x, y float64) float64 { return x - y }
. (Too bad +
and -
aren't first class functions in Go.)
pkg/sql/stats/simple_linear_regression.go
line 81 at r2 (raw file):
r2 = 1 - Σεε/Σyy } return
nit: return yₙ, r2
pkg/sql/stats/simple_linear_regression.go
line 184 at r2 (raw file):
r2 = 1 - Σεε/Σyy } return
nit: return yₙ, r2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love this!
Reviewed 1 of 1 files at r1, 5 of 5 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2)
pkg/sql/stats/quantile.go
line 491 at r2 (raw file):
m = (d.v - c.v) / (d.p - c.p) b = c.v - m*c.p return
nit: even with named return params, we usually try to be explicit and say return m, b
pkg/sql/stats/quantile.go
line 735 at r2 (raw file):
// also moves i past all points with the current val. for ; i < len(pointsByValue) && q[pointsByValue[i]].v == val; i++ { b := pointsByValue[i]
nit: can you use a diff variable name than b
? I think of b
as the intercept now. Maybe j
?
pkg/sql/stats/quantile.go
line 744 at r2 (raw file):
crossingPieces = append(crossingPieces, b+1) } // Also look for flat pieces = val. We only need to check to the left.
nit: left -> right?
pkg/sql/stats/quantile.go
line 788 at r2 (raw file):
lessEqP := intersectionPs[0] for j := 1; j < len(intersectionPs); j += 2 { lessEqP += intersectionPs[j+1] - intersectionPs[j]
Looks like this could cause index out of bounds
pkg/sql/stats/simple_linear_regression.go
line 34 at r2 (raw file):
// * x̅ = (1/n) ∑ xᵢ // * y̅ = (1/n) ∑ yᵢ // * ŷᵢ = α̂ + β̂xᵢ
love all the math symbols! (and the comment is great too!) 😄
pkg/sql/stats/simple_linear_regression.go
line 40 at r2 (raw file):
// background. n := len(x)
Should you add an assertion that y has the same length?
pkg/sql/stats/simple_linear_regression.go
line 136 at r2 (raw file):
// https://en.wikipedia.org/wiki/Earth_mover%27s_distance for some background. n := len(x)
ditto, should we assert y has the same length?
pkg/sql/stats/quantile_test.go
line 1254 at r2 (raw file):
}, } for i, tc := range testCases {
What do you think about also testing a few random quantiles and making sure the invariants hold?
Make a linter happy. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs!
In addition to responding to various comments, I added random tests of the two simple linear regression functions. This made me realize that fixing malformed quantiles inside quantileSimpleLinearRegression
made the returned R² value overly-pessimistic, and also made the function really hard to test. So now I'll call fixMalformed
on the return value of quantileSimpleLinearRegression
instead of relying on quantileSimpleLinearRegression
to call it for me.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, @rytaft, and @yuzefovich)
pkg/sql/stats/quantile.go
line 489 at r2 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit: should we return an assertion error if
d.p - c.p == 0
?
It kills me a little to add error handling or allocation in a little pure math function like this. (But maybe I am being a crotchety old unsafe C programmer?)
pkg/sql/stats/quantile.go
line 491 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
nit: even with named return params, we usually try to be explicit and say
return m, b
Done.
pkg/sql/stats/quantile.go
line 508 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: the calculation in these two lines of
rv
is repeated several times - maybe extract it out in another helper?
I moved it into quantilePieceV
(and also added a quantilePieceP
used by another place).
pkg/sql/stats/quantile.go
line 508 at r2 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Is there something implicit here that prevents
r[j-1]
tor[j]
from being a discontinuity? I have the same question for the other usages ofquantilePiece
below.
Yes, there is (great question):
- We visited
r[j-1]
beforeq[i]
implyingr[j-1].p <= q[i].p
. - Now we're in the
q[i].p < r[j].p
case. - So putting those together, we know that
r[j-1].p <= q[i].p < r[j].p
and thereforer[j-1].p < r[j].p
meaning the piece fromr[j-1]
tor[j]
isn't a discontinuity.
pkg/sql/stats/quantile.go
line 526 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: maybe use
len(r)-1
andlen(q)-1
in these loops instead ofj-1
andi-1
respectively to make things more clear?
Done.
pkg/sql/stats/quantile.go
line 540 at r2 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit: It looks like
sub
is identical toadd
except for the operator used to combine values. You could combine them into a single implementation that takes eitherfunc(x, y float64) float64 { return x + y }
orfunc(x, y float64) float64 { return x - y }
. (Too bad+
and-
aren't first class functions in Go.)
Good point. I factored out a binaryOp
method, LMK what you think.
pkg/sql/stats/quantile.go
line 688 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: should we introduce a fast-path where we check whether
q
is malformed and short-circuit if it is well-formed already?
Oh, great point. Yes, absolutely. Done.
pkg/sql/stats/quantile.go
line 709 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: I think we can reduce the number of allocations here. IIUC having only a single slice for
crossingPieces
would be sufficient for all values ofi
, i.e. we'd do something likecrossingPieces := prevCrossingPieces[:0]
and call it day. Ditto foreqPieces
and forintersectionPs
where we can use a "global" scratch. It's probably better to not optimize the code right away, so maybe leave a TODO about this?
Good point. I moved the allocations outside the loop, and this also inspired me to only use a single crossingPieces
slice.
pkg/sql/stats/quantile.go
line 735 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
nit: can you use a diff variable name than
b
? I think ofb
as the intercept now. Maybej
?
Ah yeah, great point. I tried using qi
, WDYT?
pkg/sql/stats/quantile.go
line 744 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
nit: left -> right?
Oh! Good catch! Done.
pkg/sql/stats/quantile.go
line 788 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Looks like this could cause index out of bounds
Oh, I meant to leave a comment about this: intersectionPs
is guaranteed to always have an odd number of items at this point. (Is it too tricky to rely on that? WDYT?)
pkg/sql/stats/simple_linear_regression.go
line 34 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
love all the math symbols! (and the comment is great too!) 😄
I'm glad, I hope it isn't too silly to use all the symbols 😛
pkg/sql/stats/simple_linear_regression.go
line 81 at r2 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit:
return yₙ, r2
Done.
pkg/sql/stats/simple_linear_regression.go
line 184 at r2 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit:
return yₙ, r2
Done.
pkg/sql/stats/quantile_test.go
line 1254 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
What do you think about also testing a few random quantiles and making sure the invariants hold?
Oh, great idea. I added a randQuantile
function for some tests in the other file. I'll swing around and do this tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r3, 5 of 5 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)
pkg/sql/stats/quantile.go
line 489 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
It kills me a little to add error handling or allocation in a little pure math function like this. (But maybe I am being a crotchety old unsafe C programmer?)
Feel free to leave as-is.
pkg/sql/stats/quantile.go
line 508 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Yes, there is (great question):
- We visited
r[j-1]
beforeq[i]
implyingr[j-1].p <= q[i].p
.- Now we're in the
q[i].p < r[j].p
case.- So putting those together, we know that
r[j-1].p <= q[i].p < r[j].p
and thereforer[j-1].p < r[j].p
meaning the piece fromr[j-1]
tor[j]
isn't a discontinuity.
Thanks for the explanation!
For the case when i=1, j=1
, there cannot be a discontinuity because it is impossible for the points x[0]
and x[1]
of any quantile function x
to be a discontinuity, correct? Is that a fourth point that we should mention in the comment you added?
pkg/sql/stats/quantile.go
line 540 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Good point. I factored out a
binaryOp
method, LMK what you think.
👍
nit: add a comment explaining the op
argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r3, 5 of 5 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)
pkg/sql/stats/quantile.go
line 491 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Done.
Doesn't look like you pushed the change (now line 494)
pkg/sql/stats/quantile.go
line 735 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Ah yeah, great point. I tried using
qi
, WDYT?
LGTM
pkg/sql/stats/quantile.go
line 514 at r4 (raw file):
} func (q quantile) binaryOp(r quantile, op func(qv, rv float64) float64) quantile {
nit: add a function comment
pkg/sql/stats/quantile_test.go
line 250 at r4 (raw file):
} sort.Float64s(ps) sort.Float64s(vs)
Does this ensure that the resulting quantile is well-formed? Should we produce a malformed quantile some fraction of the time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @michae2, and @rytaft)
pkg/sql/stats/quantile.go
line 491 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Doesn't look like you pushed the change (now line 494)
Gah, sorry! Done for real this time.
pkg/sql/stats/quantile.go
line 508 at r2 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Thanks for the explanation!
For the case when
i=1, j=1
, there cannot be a discontinuity because it is impossible for the pointsx[0]
andx[1]
of any quantile functionx
to be a discontinuity, correct? Is that a fourth point that we should mention in the comment you added?
The piece from r[0]
to r[1]
actually could be a discontinuity in a quantile, but we still won't call quantilePiece
on this as long as both r[0].p == 0
and q[0].p == 0
(because then r[1].p
will also be zero, and we'll either fall into the r[j].p < q[i].p
case or the r[j].p == q[i].p
case).
So the assumption we're relying on for the i=1, j=1
case is that both r[0].p == 0
and q[0].p == 0
(or more generally r[0].p == q[0].p
). I added this to the comment.
pkg/sql/stats/quantile.go
line 540 at r2 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
👍
nit: add a comment explaining the
op
argument.
Done.
pkg/sql/stats/quantile.go
line 514 at r4 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
nit: add a function comment
Done.
pkg/sql/stats/quantile_test.go
line 1254 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Oh, great idea. I added a
randQuantile
function for some tests in the other file. I'll swing around and do this tomorrow.
Ok, added TestQuantileOpsRandom
.
pkg/sql/stats/quantile_test.go
line 250 at r4 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Does this ensure that the resulting quantile is well-formed? Should we produce a malformed quantile some fraction of the time?
Oh, good point. Now we produce a malformed quantile some of the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2)
pkg/sql/stats/quantile.go
line 529 at r5 (raw file):
// This piece of r will never be a discontinuity. Proof: // 1. We visited r[j-1] before q[i], implying r[j-1].p <= q[i].p. (For the // i=1 j=1 case, this relies on the assumption that q[0].p = r[0].p.)
nit: Maybe make it explicit that q[0].p = r[0].p = 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2)
pkg/sql/stats/simple_linear_regression.go
line 40 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Should you add an assertion that y has the same length?
Just want to make sure this comment didn't get lost
pkg/sql/stats/simple_linear_regression.go
line 136 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
ditto, should we assert y has the same length?
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2 and @rytaft)
pkg/sql/stats/simple_linear_regression.go
line 40 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Just want to make sure this comment didn't get lost
How would you feel about a panic
if the lengths are different? Initially this function will only be called from one place (ignoring testing), so I'd rather not add an error to the function signature for this.
Statistics forecasting will initially use simple linear regression over time to predict all table statistics (only keeping predictions that fit the linear model well). Most table statistics are scalar float64s, for which we can use a textbook ordinary least squares (OLS) method with x as time and y as the table statistic. For histograms, we use a variant of OLS based on the 2010 paper "Ordinary Least Squares for Histogram Data Based on Wasserstein Distance" by Verde and Irpino. The paper outlines an OLS method when both x and y are histograms. In our case, x is a scalar, so we adjust the math slightly. Assists: cockroachdb#79872 Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs! I really appreciate the detailed reads!!!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @mgartner and @rytaft)
pkg/sql/stats/quantile.go
line 529 at r5 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
nit: Maybe make it explicit that
q[0].p = r[0].p = 0
?
Added to the comment.
pkg/sql/stats/simple_linear_regression.go
line 136 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
ditto
Added a panic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @michae2)
pkg/sql/stats/simple_linear_regression.go
line 40 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
How would you feel about a
panic
if the lengths are different? Initially this function will only be called from one place (ignoring testing), so I'd rather not add an error to the function signature for this.
SGTM
Build succeeded: |
Previously, michae2 (Michael Erickson) wrote…
I see. Thanks for explaining! |
Statistics forecasting will initially use simple linear regression over
time to predict all table statistics (only keeping predictions that fit
the linear model well). Most table statistics are scalar float64s, for
which we can use a textbook ordinary least squares (OLS) method with x
as time and y as the table statistic.
For histograms, we use a variant of OLS based on the 2010 paper
"Ordinary Least Squares for Histogram Data Based on Wasserstein
Distance" by Verde and Irpino. The paper outlines an OLS method when
both x and y are histograms. In our case, x is a scalar, so we adjust
the math slightly.
Assists: #79872
Release note: None