sql: add SHOW STATISTICS WITH FORECAST #77070
Conversation
This is now compiling and not crashing on some example statistics, so I wanted to open a draft PR as a sneak preview. It still needs tests before it is ready for review. PRs after this will wire the forecasts into the stats_cache.
Added some tests, fixed some bugs. A few tests still need to be written.
This is RFAL. I need to write two more tests, but that can happen concurrently with reviewing.
Make some linters happy. 🙂
I did a first pass review, but need to look at forecast and forecast_test in more detail. Looks good so far!
Reviewed 4 of 13 files at r1, 3 of 9 files at r2, 6 of 8 files at r3, 2 of 4 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)
pkg/sql/show_stats.go, line 154 at r3 (raw file):
}
if _, withForecast := opts[showTableStatsOptForecast]; withForecast {
I think this chunk should get moved below the defer func()
for recovery.
pkg/sql/sem/tree/datum.go, line 784 at r4 (raw file):
// DMaxInt is the maximum DInt.
DMaxInt = NewDInt(math.MaxInt64)
// DMinInt is the maximum DInt.
s/maximum/minimum
pkg/sql/sem/tree/datum.go, line 2466 at r4 (raw file):
}
if t.After(MaxSupportedTime) {
	return &DTimestamp{Time: MaxSupportedTime.Round(precision)}
I was playing with this a bit, and if we round MaxSupportedTime to a precision of say a minute, it's going to round 294276-12-31 23:59:59.999999 up to 294277-01-01 00:00:00, which is outside of our timestamp bounds. Is this behavior acceptable? If not, I wonder if time.Truncate
would be better here.
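To illustrate the difference with just the standard library (a minimal sketch, not the datum code itself; the bound below mirrors the 294276-12-31 23:59:59.999999 value mentioned above):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Stand-in for the maximum supported timestamp.
	max := time.Date(294276, time.December, 31, 23, 59, 59, 999999000, time.UTC)

	// Round goes to the nearest multiple and can round up, past the bound.
	fmt.Println(max.Round(time.Minute)) // 294277-01-01 00:00:00 +0000 UTC

	// Truncate always rounds down, so the result stays within the bound.
	fmt.Println(max.Truncate(time.Minute)) // 294276-12-31 23:59:00 +0000 UTC
}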
pkg/sql/sem/tree/datum.go, line 2763 at r4 (raw file):
}
if t.After(MaxSupportedTime) {
	return &DTimestampTZ{Time: MaxSupportedTime.Round(precision)}
Same issue here with rounding up over MaxSupportedTime.
pkg/sql/sem/tree/decimal.go, line 96 at r4 (raw file):
// MinDecimalForWidth returns the minimum decimal that fits within the given
// precision and scale. If precision is zero it returns negative infinite.
minor nit: s/infinite/infinity
pkg/sql/sem/tree/decimal.go, line 116 at r4 (raw file):
// MaxDecimalForWidth returns the maximum decimal that fits within the given
// precision and scale. If precision is zero it returns positive infinite.
minor nit: s/infinite/infinity
pkg/sql/stats/automatic_stats.go, line 533 at r4 (raw file):
func avgRefreshTime(tableStats []*TableStatistic) time.Duration {
	var reference *TableStatistic
	var protos []*TableStatisticProto
It seems like we only need the CreatedAt
field at this time, maybe we should just store and pass a slice of timestamps instead of the whole proto.
pkg/sql/stats/forecast.go, line 41 at r4 (raw file):
// We only use a predicted value in a forecast if the R² (goodness of fit)
// measurement of the predictive model is high enough. Otherwise we use the
// latest observed value. This means our forecasts are a mixture of predicted
For debugging purposes, how do we differentiate what type of prediction a forecast has?
pkg/sql/stats/forecast.go, line 43 at r4 (raw file):
// latest observed value. This means our forecasts are a mixture of predicted
// parts and observed parts.
const minGoodnessOfFit = 0.95
How did you pick .95?
pkg/sql/stats/forecast.go, line 49 at r4 (raw file):
// valuable than others. We call the weighted number of predicted parts the
// score of a forecast.
const minForecastScore = 1
Does this mean that 100% of the parts need to be predicted via the forecast? I don't understand from the comment how this const will actually be used.
pkg/sql/logictest/testdata/logic_test/distsql_stats, line 635 at r4 (raw file):
__auto__ {e}
__auto__ {e}
__forecast__ {a}
Why is there no forecast for the other columns?
pkg/sql/stats/forecast_test.go, line 50 at r4 (raw file):
testCases := []struct {
	x, y []float64
	xn, yn, r2 float64
What does r2 mean?
pkg/sql/stats/forecast_test.go, line 672 at r4 (raw file):
typ: types.Decimal,
dat: makeDDecimal("42"),
val: 42,
The answer to everything!
pkg/sql/stats/forecast_test.go, line 799 at r4 (raw file):
val: 0,
},
// Hmm... tree.DMinInterval and tree.DMaxInterval can't be encoded. Skip
Is this a TODO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool! I'll need some more time to get through all of this, but I've left some initial comments for now.
I'm curious to know your thoughts behind the utility of the WITH FORECAST
option? Do you see a need for users or us to see stats both with and without forecasts? If not, I think we can remove the additional syntax. If so, I'm assuming the optimizer will use the forecasted stats by default, so it might be more clear to show the forecasted stats by default and add a WITHOUT FORECAST
option to SHOW STATS
. If we plan to turn on/off forecasting in the optimizer with a session setting, then maybe it makes the most sense to have SHOW STATS
use the same session setting to determine whether or not to show forecasted stats.
Reviewed 3 of 13 files at r1, 1 of 9 files at r2, 3 of 8 files at r3, 1 of 4 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rharding6373)
pkg/sql/sem/tree/decimal.go, line 103 at r4 (raw file):
// Use +1 here because it is inverted later.
if scale < math.MinInt32+1 || scale > math.MaxInt32 {
	return nil, errScaleOutOfRange
These bounds are checked by types.MakeDecimal
. So rather than make these functions take an arbitrary precision
and scale
, you could make these functions take a *types.T
and pull the scale and precision out of the type. Then you wouldn't need to repeat these checks. You'd want to error/panic if the type was not a Decimal.
pkg/sql/stats/automatic_stats.go, line 563 at r4 (raw file):
}
func avgStatTime(stats []*TableStatisticProto) time.Duration {
nit: avgStatTime
sounds more like the time it took to collect the stats. What about avgStatPeriod
("period" as in the reciprocal of frequency).
pkg/sql/stats/forecast.go, line 53 at r4 (raw file):
// forecastColumnStatistics, makeQuantile, and quantile.toHistogramData will
// need to be changed if we introduce a new histogram version.
const _ uint = 1 - uint(histVersion)
Neat trick!
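For readers unfamiliar with the pattern, a minimal standalone sketch of how the compile-time check works (the constant name here is illustrative, not the real package's):

package main

// histVersion stands in for the real histogram version constant.
const histVersion = 1

// This declaration compiles only while histVersion == 1: bumping the version
// makes the right-hand side a negative constant, which overflows uint, so the
// build fails on this line and forces the dependent code to be revisited.
const _ uint = 1 - uint(histVersion)

func main() {}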
pkg/sql/stats/forecast.go, line 76 at r4 (raw file):
// finding exactly that v is p₂ - p₁. To put it in terms of our histograms:
// NumRange = 0 becomes a vertical line, NumRange > 0 becomes a slanted line,
// NumEq = 0 goes away, and NumEq > 0 becomes a horizontal line.
nit: some concrete examples in this description might be helpful. you can also cite external resources if you think any would be helpful for building an understanding of quantile functions.
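For reference, a toy sketch of the bucket-to-quantile correspondence described above (the quantilePoint fields p and v and the helper are hypothetical stand-ins, not the actual pkg/sql/stats code):

package stats_sketch

// p is the cumulative fraction of rows (the quantile function's input) and v
// is the column value at that fraction.
type quantilePoint struct {
	p, v float64
}

type quantile []quantilePoint

// appendBucket adds the segments contributed by one histogram bucket, given
// the cumulative fraction p reached so far and a row count used to normalize
// bucket counts into fractions.
func appendBucket(q quantile, p, rowCount, numRange, numEq, upperBound float64) (quantile, float64) {
	// NumRange == 0: a vertical jump to upperBound at the same p.
	// NumRange > 0: a slanted line rising to upperBound over numRange/rowCount.
	p += numRange / rowCount
	q = append(q, quantilePoint{p: p, v: upperBound})
	if numEq > 0 {
		// NumEq > 0: a horizontal run at upperBound; NumEq == 0 adds no point.
		p += numEq / rowCount
		q = append(q, quantilePoint{p: p, v: upperBound})
	}
	return q, p
}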
pkg/sql/stats/forecast.go, line 124 at r4 (raw file):
// Group observed statistics by columnset.
var forecastCols []string
observedByCols := make(map[string][]*TableStatisticProto)
An alternative to this columnset->string hack would be to use a slice of column sets where the index of each column set in the slice is a key into a map[int][]*TableStatisticsProto
. Getting and setting values would have to scan the slice looking for an equal column set, but it's probably not a problem in this use case. You could encapsulate this in a struct and make the API nice and clean.
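A rough sketch of that suggestion, with toy stand-ins for the real types (not the actual pkg/sql/stats API):

package stats_sketch

type ColumnID int
type TableStatisticProto struct{ /* ... */ }

// statsByColumnSet keys observed statistics by column set without building a
// string key: the index of a column set within colSets is the key into stats.
type statsByColumnSet struct {
	colSets [][]ColumnID
	stats   map[int][]*TableStatisticProto
}

// keyFor returns the index of cols, appending it if it hasn't been seen. A
// linear scan is fine because tables have few distinct statistics column sets.
// (Assumes cols is already in a canonical order.)
func (s *statsByColumnSet) keyFor(cols []ColumnID) int {
	for i, cs := range s.colSets {
		if equalColumnSets(cs, cols) {
			return i
		}
	}
	s.colSets = append(s.colSets, cols)
	return len(s.colSets) - 1
}

func (s *statsByColumnSet) add(cols []ColumnID, stat *TableStatisticProto) {
	if s.stats == nil {
		s.stats = make(map[int][]*TableStatisticProto)
	}
	k := s.keyFor(cols)
	s.stats[k] = append(s.stats[k], stat)
}

func equalColumnSets(a, b []ColumnID) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}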
pkg/sql/stats/forecast.go, line 163 at r4 (raw file):
// statistics + the average time between collections, which should be roughly
// when the next stats collection will occur.
at := latest.Add(avgStatTime(observedCol0[:n]))
nit: It might be safer to forecast to latest statistics + the maximum time between collections. For example, if traffic ebbs and floods, collections will follow the pattern, like:
C1..........C2..............................C3..........C4
With average stat time we'd forecast like:
C1..........C2..............................C3..........C4................F
With maximum stat time we'd forecast like:
C1..........C2..............................C3..........C4..............................F
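A small sketch of the two horizons being compared (the helper and package names are illustrative, not the real code):

package stats_sketch

import "time"

// avgAndMaxGap computes both candidate horizons from the CreatedAt times of
// the observed statistics, assumed sorted newest first as in the code above.
func avgAndMaxGap(createdAt []time.Time) (avg, max time.Duration) {
	if len(createdAt) < 2 {
		return 0, 0
	}
	var total time.Duration
	for i := 0; i+1 < len(createdAt); i++ {
		gap := createdAt[i].Sub(createdAt[i+1])
		total += gap
		if gap > max {
			max = gap
		}
	}
	return total / time.Duration(len(createdAt)-1), max
}

// The forecast time would then be either latest.Add(avg) (the average-gap
// behavior quoted above) or latest.Add(max) (the alternative in the diagrams).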
pkg/sql/stats/forecast.go, line 166 at r4 (raw file):
// If any columnset has a valuable forecast, then we must also use forecasts
// for the rest of the columnsets in the table (even if those forecasts are
Why must we use forecasts for all column sets? Is there a reason to calculate the forecasts for all column sets together? Could we avoid the need for the observedByCols
map and the columnset->string key hack by forecasting for each column set independently?
pkg/sql/stats/forecast.go, line 189 at r4 (raw file):
cols := strconv.FormatUint(uint64(stat.ColumnIDs[0]), 10)
for i := 1; i < len(stat.ColumnIDs); i++ {
	cols += " " + strconv.FormatUint(uint64(stat.ColumnIDs[i]), 10)
Do stats.ColumnIDs
have a guaranteed ordering?
pkg/sql/stats/forecast.go, line 399 at r4 (raw file):
yHat := α + β*x[i]
ε := y[i] - yHat
Σεε += ε * ε
I don't have too strong of an opinion on the usage of these special characters in code, but it's not something I've seen us use elsewhere. For example, in statistics_builder.go
we spell out epsilon
. But I suppose it makes it easy to map these variables to the equations in comments, which is worth the extra work to type these characters when we update this code. Was that your main motivation?
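For context, a scalar toy version of the fit being discussed (the real code also fits quantile-valued y; this sketch assumes plain float64 observations with at least two distinct x values):

package stats_sketch

// simpleOLS estimates y ≈ α + βx by ordinary least squares and reports R²
// from the residuals ε.
func simpleOLS(x, y []float64) (α, β, r2 float64) {
	n := float64(len(x))
	var Σx, Σy float64
	for i := range x {
		Σx += x[i]
		Σy += y[i]
	}
	xMean, yMean := Σx/n, Σy/n // x̅ and y̅

	var Σxx, Σxy, Σyy float64
	for i := range x {
		Σxx += (x[i] - xMean) * (x[i] - xMean)
		Σxy += (x[i] - xMean) * (y[i] - yMean)
		Σyy += (y[i] - yMean) * (y[i] - yMean)
	}
	β = Σxy / Σxx
	α = yMean - β*xMean

	// R² = 1 - Σε² / Σ(y-y̅)², where ε = y - (α + βx) is the residual.
	var Σεε float64
	for i := range x {
		ε := y[i] - (α + β*x[i])
		Σεε += ε * ε
	}
	r2 = 1 - Σεε/Σyy
	return α, β, r2
}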
Thank you for the first passes! 😁
There was an issue with forecastColumnStatistics
I had to fix (we were not calling histogram.adjustCounts
when using the latest observed histogram instead of the predicted histogram). So I've made some changes, but have not yet addressed all your excellent comments. I will do that tomorrow.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, @rharding6373, and @rytaft)
pkg/sql/show_stats.go, line 154 at r3 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
I think this chunk should get moved below the
defer func()
for recovery.
Good point, done.
pkg/sql/sem/tree/datum.go, line 784 at r4 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
s/maximum/minimum
Done.
pkg/sql/sem/tree/datum.go, line 2466 at r4 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
I was playing with this a bit, and if we round MaxSupportedTime to a precision of say a minute, it's going to round 294276-12-31 23:59:59.999999 up to 294277-01-01 00:00:00, which is outside of our timestamp bounds. Is this behavior acceptable? If not, I wonder if
time.Truncate
would be better here.
Ahhh good catch! Let me work on this.
pkg/sql/sem/tree/decimal.go, line 96 at r4 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
minor nit: s/infinite/infinity
Done.
pkg/sql/sem/tree/decimal.go, line 103 at r4 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
These bounds are checked by
types.MakeDecimal
. So rather than make these functions take an arbitrary precision
and scale
, you could make these functions take a *types.T
and pull the scale and precision out of the type. Then you wouldn't need to repeat these checks. You'd want to error/panic if the type was not a Decimal.
Oh, good point! I'll try this.
pkg/sql/sem/tree/decimal.go, line 116 at r4 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
minor nit: s/infinite/infinity
Done.
pkg/sql/stats/forecast.go, line 41 at r4 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
For debugging purposes, how do we differentiate what type of prediction a forecast has?
Currently the logging is the only way. Maybe we need something better?
pkg/sql/stats/forecast.go, line 43 at r4 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
How did you pick .95?
Just a random hunch, nothing scientific.
pkg/sql/stats/forecast.go, line 53 at r4 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Neat trick!
🙂
pkg/sql/stats/forecast.go, line 76 at r4 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit: some concrete examples in this description might be helpful. you can also cite external resources if you think any would be helpful for building an understanding of quantile functions.
I've added an example and a link to Wikipedia, hopefully that helps a bit.
pkg/sql/stats/forecast.go, line 124 at r4 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
An alternative to this columnset->string hack would be to use a slice of column sets where the index of each column set in the slice is a key into a
map[int][]*TableStatisticsProto
. Getting and setting values would have to scan the slice looking for an equal column set, but it's probably not a problem in this use case. You could encapsulate this in a struct and make the API nice and clean.
Oh, nice. Let me try this tomorrow.
pkg/sql/stats/forecast.go, line 163 at r4 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit: It might be safer to forecast to
latest statistics + the maximum time between collections
. For example, if traffic ebbs and floods, collections will follow the pattern, like:
C1..........C2..............................C3..........C4
With average stat time we'd forecast like:
C1..........C2..............................C3..........C4................F
With maximum stat time we'd forecast like:
C1..........C2..............................C3..........C4..............................F
Hmm, good point. Maybe when @rytaft is back we can discuss this. I don't have any intuition about which would be better.
pkg/sql/stats/forecast.go, line 166 at r4 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Why must we use forecasts for all column sets? Is there a reason to calculate the forecasts for all column sets together? Could we avoid the need for the
observedByCols
map and the columnset->string key hack by forecasting for each column set independently?
We need all of the RowCounts for a table to match. And then, related to that, if the forecast has a new row count, we need all of the histogram counts to add up to the new row count.
pkg/sql/stats/forecast.go, line 189 at r4 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Do
stats.ColumnIDs
have a guaranteed ordering?
🤔
pkg/sql/stats/forecast.go, line 399 at r4 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
I don't have too strong of an opinion on the usage of these special characters in code, but it's not something I've seen us use elsewhere. For example, in
statistics_builder.go
we spell out epsilon
. But I suppose it makes it easy to map these variables to the equations in comments, which is worth the extra work to type these characters when we update this code. Was that your main motivation?
Yes, it just seemed more... direct? simple? to use ε
instead of epsilon
. And it was fun. 😇 But if you hate it I'll change it back to ASCII, no problem.
pkg/sql/logictest/testdata/logic_test/distsql_stats, line 635 at r4 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
Why is there no forecast for the other columns?
Because the most recent stats only contain a
(they were created with CREATE STATISTICS s8 ON a FROM [$t_id]
).
Hmm. The trouble is that we need to forecast stats for all current columns to get the row counts to match. But we base the "current columns" on the most recent stats. Hmm. Let me think about this.
pkg/sql/stats/forecast_test.go, line 50 at r4 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
What does r2 mean?
It's R², a.k.a. the coefficient of determination from the linear regression model. Tells us how well the model fits the observations.
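(Concretely, with fitted values ŷᵢ = α + βxᵢ, R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²: a value of 1 means the regression reproduces the observations exactly, and values near 0 mean it explains almost none of their variation.)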
pkg/sql/stats/forecast_test.go, line 672 at r4 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
The answer to everything!
"Don't Panic" 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left some more comments, but in general this looks great!
Reviewed 1 of 13 files at r1, 2 of 9 files at r2, 3 of 8 files at r3, 1 of 4 files at r4, 11 of 11 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, @rharding6373, and @rytaft)
pkg/sql/stats/forecast.go, line 49 at r4 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
Does this mean that 100% of the parts need to be predicted via the forecast? I don't understand from the comment how this const will actually be used.
+1 The meaning of the score is not clear to me. I think it deserves some more explanation here, and maybe in other places where the score is computed.
pkg/sql/stats/forecast.go, line 76 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
I've added an example and a link to Wikipedia, hopefully that helps a bit.
Very helpful, thanks!
pkg/sql/stats/forecast.go, line 107 at r5 (raw file):
// 0 .2 .4 .6 .8 1
//
type quantile []quantilePoint
It'd be nice to define the quantile type near its methods, which are currently at the bottom of the file. You could consider moving the quantile struct and methods into a separate file, and the tests to a corresponding test file - it'd be a nice encapsulated API that way.
pkg/sql/stats/forecast.go, line 125 at r5 (raw file):
// table, given the observed statistics. If a forecast cannot be produced (or is
// not worth using) an error is returned. ForecastTableStatistics is
// deterministic, given the same observations.
nit: Add some more explanation of the general approach to forecasting here. I don't think it needs to be too detailed, but since this is the entry-point to forecasting, a summary in its comment, with pointers to more specific comments would be helpful.
pkg/sql/stats/forecast.go, line 183 at r5 (raw file):
} } }
nit: might be nice to move this check into a helper function
pkg/sql/stats/forecast.go, line 201 at r5 (raw file):
//Forecasted RowCounts will be the same for all columnsets
// assuming that all observations have the same RowCounts, by virtue of
// passing the same minGoodnessOfFit each time.
Isn't this only true if the latest n
collections for each columnset have the same row counts and creation times? The row counts and creation times should be consistent for all column sets if all the collections are automatic, but they might not be if there are manual collections for some subset of columns.
In any case, can we avoid the unnecessary work of calculating the linear regression for row counts, distinct counts, etc. multiple times if we expect the result to be the same for each column set?
pkg/sql/stats/forecast.go, line 206 at r5 (raw file):
for i := range forecastCols {
	observedCol := observedByCols[forecastCols[i]]
	forecast, score := forecastColumnStatistics(ctx, observedCol, at, n, minGoodnessOfFit)
nit: if you passed observedCol[:n]
then you wouldn't need to pass n
too
pkg/sql/stats/forecast.go, line 228 at r5 (raw file):
// sorted by CreatedAt, descending (latest observation first). Parts of the
// forecast with R² (goodness of fit) < minRequiredFit copy the equivalent part
// from the latest observed statistics instead of the prediction. A score is
Can you explain the score
a little more here. It looks like it's either 1
if a histogram was forecasted and 0
otherwise. Can we return a bool instead?
pkg/sql/stats/forecast.go, line 297 at r5 (raw file):
}
var histData *HistogramData = observed[0].HistogramData
nit: use :=
instead of var
pkg/sql/stats/forecast.go, line 321 at r5 (raw file):
tableID, columnIDs, observed[i].CreatedAt, err,
)
break
Should forecasting be aborted entirely if there is an error here, rather than continuing?
pkg/sql/stats/forecast.go, line 323 at r5 (raw file):
break
}
quantiles = append(quantiles, q)
nit: you could reduce some code nesting by making a helper function that builds all the quantiles.
pkg/sql/stats/forecast.go, line 337 at r5 (raw file):
ctx, 2, "forecast for table %v columns %v quantile.toHistogram err %v", tableID, columnIDs, err, )
Same question as above, why continue if there is an error?
pkg/sql/stats/forecast.go, line 372 at r5 (raw file):
nnDistinctCount--
}
hist.adjustCounts(nil, nnRowCount, nnDistinctCount)
It's not obvious to me why these counts exclude nulls? Is this a requirement of adjustCounts
that isn't mentioned in the function comment?
pkg/sql/stats/forecast.go, line 523 at r5 (raw file):
// Calculate x̅ and y̅(t) (means of x and y).
var Σx float64
var Σy quantile = zeroQuantile
nit: use :=
pkg/sql/stats/forecast.go, line 532 at r5 (raw file):
// Estimate α(t) and β(t) given our observations.
var Σxx, Σyy float64
var Σxy quantile = zeroQuantile
nit: use :=
pkg/sql/stats/forecast.go, line 540 at r5 (raw file):
}
var α quantile = zeroQuantile
var β quantile = zeroQuantile
nit: use :=
pkg/sql/stats/forecast.go, line 932 at r5 (raw file):
// quantile function will usually have more points than either q or r, depending
// on how their points align. The new quantile function might be malformed if
// either q or r were malformed.
nit: Explain a bit more about what it means to add two quantiles. Maybe an example would be helpful.
pkg/sql/stats/forecast.go, line 1016 at r5 (raw file):
// Create a new quantile function which represents q multiplied by a
// constant. The new quantile function will be malformed if c is negative.
Just to be clear: we expect c
to be negative in some cases, correct?
pkg/sql/stats/forecast.go, line 1017 at r5 (raw file):
// Create a new quantile function which represents q multiplied by a
// constant. The new quantile function will be malformed if c is negative.
func (q quantile) scale(c float64) quantile {
nit: should this be called mult
to match add
and sub
?
pkg/sql/stats/forecast.go, line 1135 at r5 (raw file):
// pieces with one endpoint < val and one endpoint > val also crossed the
// previous val and are in (a). All pieces with one endpoint == val and one
// endpoint > val have the current val as an endpoint. ∎
nit: remove ∎
pkg/sql/stats/forecast_test.go, line 260 at r4 (raw file):
},
xn: -1,
yn: quantile{{0, 400}, {0, 405}, {0.25, 407}, {0.25, 409}, {0.5, 413}, {0.5, 425}, {0.75, 431}, {0.75, 434}, {1, 438}},
It would be cool in the future to format quantiles as ASCII graphs and put these in a datadriven test. I couldn't find any library for plotting ASCII graphs though.
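In the absence of a library, a hand-rolled formatter could be quite small; a rough, purely illustrative sketch (not proposed for this PR):

package stats_sketch

import (
	"fmt"
	"strings"
)

// asciiPlot renders samples of a quantile-like function (value v at position p)
// as rows of '*' scaled to width, one row per sample.
func asciiPlot(ps, vs []float64, width int) string {
	if len(vs) == 0 {
		return ""
	}
	minV, maxV := vs[0], vs[0]
	for _, v := range vs {
		if v < minV {
			minV = v
		}
		if v > maxV {
			maxV = v
		}
	}
	var sb strings.Builder
	for i := range vs {
		n := 0
		if maxV > minV {
			n = int(float64(width) * (vs[i] - minV) / (maxV - minV))
		}
		fmt.Fprintf(&sb, "p=%.2f |%s %g\n", ps[i], strings.Repeat("*", n), vs[i])
	}
	return sb.String()
}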
pkg/sql/stats/forecast_test.go, line 799 at r4 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
Is this a TODO?
FWIW I doubt that there are few use cases in the wild of an INTERVAL
ascending key. We could leave out INTERVALs entirely here and leave as future work if it makes it easier to get this PR finished.
pkg/sql/stats/forecast_test.go, line 166 at r5 (raw file):
score: 0,
},
// Test consistency adjustments to predicted values.
What are "consistency adjustments"? This test case looks identical to the case above.
Finished looking over the forecasts and only have a few additional comments. Appreciate all the code comments explaining the linear regressions etc.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, @rharding6373, and @rytaft)
pkg/sql/stats/forecast.go, line 41 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Currently the logging is the only way. Maybe we need something better?
I guess it's easy enough to see if the last observed stats were used in a forecast, so existing logging is fine.
pkg/sql/stats/forecast.go, line 384 at r5 (raw file):
histData = &hd
if predictedHist {
	score++
Is there ever a situation (e.g., if histograms are disabled) where forecasting stats might be valuable even if there is no histogram data?
pkg/sql/stats/forecast.go, line 849 at r5 (raw file):
return nil, err
}
// Clamp instead of erring.
Looks like an error is returned anyway?
pkg/sql/stats/forecast.go, line 897 at r5 (raw file):
// Offset 0 means the TimeTZ will always have UTC as the timestamp.
return tree.NewDTimeTZFromOffset(t.Round(roundTo), 0), nil
case types.TimestampFamily:
If clamping for TimestampFamily
and TimestampTZFamily
are broken, why do we support converting them to/from quantiles and doing forecasts for these column types? It seems like it would make more sense to not support them for now and add them back when the issue is fixed.
At long last, this is RFAL. Thanks for your patience. I broke it into a handful of commits (the meat is in the last two).
Changes from the prototype:
- We're no longer trying to forecast all column sets for a table together. Instead, we're forecasting each column set independently. This makes ForecastTableStatistics much simpler, but could lead to inconsistent row counts for the same table. I don't think this will be a problem in practice, as statistics builder already has code to deal with this (e.g. from manual stats collected on individual columns).
- Got rid of the score thingy. The logic in forecastColumnStatistics is much simpler now: if all models have good fit we return a forecast, otherwise we return an error. The exception to this all-or-nothing logic is histograms: if everything else has a good fit, then we won't disqualify the forecast over a poor-fitting (or nonexistent) histogram model; instead we'll copy and adjust the latest observed histogram to match. This allows us to forecast columns whose type cannot be converted to quantile functions.
- Added a bunch of simple forecastColumnStatistics testcases. I'm planning to add more testing but this seems ready enough to review for now.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, @rharding6373, and @rytaft)
pkg/sql/sem/tree/datum.go
line 2466 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Ahhh good catch! Let me work on this.
Done.
pkg/sql/sem/tree/datum.go
line 2763 at r4 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
Same issue here with rounding up over MaxSupportedTime.
Done.
pkg/sql/sem/tree/decimal.go
line 103 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Oh, good point! I'll try this.
Support for decimals was removed.
pkg/sql/stats/automatic_stats.go
line 533 at r4 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
It seems like we only need the
CreatedAt
field at this time, maybe we should just store and pass a slice of timestamps instead of the whole proto.
I reverted these changes.
pkg/sql/stats/automatic_stats.go
line 563 at r4 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit:
avgStatTime
sounds more like the time it took to collect the stats. What aboutavgStatPeriod
("period" as in the reciprocal of frequency).
Reverted this.
pkg/sql/stats/forecast.go
line 49 at r4 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
+1 The meaning of the score is not clear to me. I think it deserves some more explanation, here, and maybe other places where the score is computed.
Yes, this score business was really confusing, so it has been removed. Now we either predict everything if it all has good fit, or don't produce a forecast*
*with the exception of histograms. If everything else has a good fit, then we try to predict the histogram. If we cannot, we adjust the latest histogram. This is to allow forecasting over types whose histograms we cannot convert to quantile functions.
pkg/sql/stats/forecast.go
line 124 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Oh, nice. Let me try this tomorrow.
I kept the columnset->string hack. If you hate it, I can change it.
pkg/sql/stats/forecast.go
line 163 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Hmm, good point. Maybe when @rytaft is back we can discuss this. I don't have any intuition about which would be better.
I left this as the average, but now bounded by 1 week past the latest collection. I'm nervous that if we go too far in the future (months? years? decades?) forecasts will become silly.
pkg/sql/stats/forecast.go
line 166 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
We need all of the RowCounts for a table to match. And then, related to that, if the forecast has a new row count, we need all of the histogram counts to add up to the new row count.
Here's one big change from the prototype: now we're no longer forecasting all column sets. We're forecasting (or not forecasting) each column set independently. This means the row counts won't necessarily match, but statistics builder has code to deal with this already, so it seems fine.
pkg/sql/stats/forecast.go
line 189 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
🤔
I shall investigate this.
pkg/sql/stats/forecast.go
line 107 at r5 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
It'd be nice to define the quantile type near its methods, which are currently at the bottom of the file. You could consider moving the quantile struct and methods into a separate file, and the tests to a corresponding test file - it'd be a nice encapsulated API that way.
Done.
pkg/sql/stats/forecast.go
line 125 at r5 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit: Add some more explanation of the general approach to forecasting here. I don't think it needs to be too detailed, but since this is the entry-point to forecasting, a summary in its comment, with pointers to more specific comments would be helpful.
Done, LMK if the comments are helpful or need more.
pkg/sql/stats/forecast.go
line 201 at r5 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
//Forecasted RowCounts will be the same for all columnsets
// assuming that all observations have the same RowCounts, by virtue of
// passing the same minGoodnessOfFit each time.
Isn't this only true if the latest
n
collections for each columnset have the same row counts and creation times? The row counts and creation times should be consistent for all column sets if all the collections are automatic, but they might not be if there are manual collections for some subset of columns.
In any case, can we avoid the unnecessary work of calculating the linear regression for row counts, distinct counts, etc. multiple times if we expect the result to be the same for each column set?
Yes, the manual collections of a subset of columns really threw a wrench in my plan. So now we're just forecasting each column set independently.
pkg/sql/stats/forecast.go
line 206 at r5 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit: if you passed
observedCol[:n]
then you wouldn't need to pass n
too
Done.
pkg/sql/stats/forecast.go
line 228 at r5 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Can you explain the
score
a little more here. It looks like it's either 1
if a histogram was forecasted and 0
otherwise. Can we return a bool instead?
Score was removed.
pkg/sql/stats/forecast.go
line 297 at r5 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit: use
:=
instead of var
Done.
pkg/sql/stats/forecast.go
line 321 at r5 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Should forecasting be aborted entirely if there is an error here, rather than continuing?
Yes, and now it is.
pkg/sql/stats/forecast.go
line 323 at r5 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit: you could reduce some code nesting by making a helper function that builds all the quantiles.
Done.
pkg/sql/stats/forecast.go
line 337 at r5 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Same question as above, why continue if there is an error?
Good question. Now we are not continuing.
pkg/sql/stats/forecast.go
line 372 at r5 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
It's not obvious to me why these counts exclude nulls? Is this a requirement of
adjustCounts
that isn't mentioned in the function comment?
Yes, this is a requirement of adjustCounts
. I've updated the function comment.
pkg/sql/stats/forecast.go
line 384 at r5 (raw file):
Previously, rharding6373 (Rachael Harding) wrote…
Is there ever a situation (e.g., if histograms are disabled) where forecasting stats might be valuable even if there is no histogram data?
Yes, and now we do.
pkg/sql/stats/forecast_test.go
line 799 at r4 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
FWIW I doubt that there are few use cases in the wild of an
INTERVAL
ascending key. We could leave out INTERVALs entirely here and leave as future work if it makes it easier to get this PR finished.
Interval was removed.
pkg/sql/stats/forecast_test.go
line 166 at r5 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
What are "consistency adjustments"? This test case looks identical to the case above.
Removed.
I'm curious to know your thoughts behind the utility of the WITH FORECAST option? Do you see a need for users or us to see stats both with and without forecasts?
I do think we will need to see both with and without forecasts. (With for debugging, without for getting stats to inject.)
If so, I'm assuming the optimizer will use the forecasted stats by default, so it might be more clear to show the forecasted stats by default and add a WITHOUT FORECAST option to SHOW STATS. If we plan to turn on/off forecasting in the optimizer with a session setting, then maybe it makes the most sense to have SHOW STATS use the same session setting to determine whether or not to show forecasted stats.
This is a good argument. WITH
was handy because it's built into the parser already, but I could also hack WITHOUT
in there.
How about I address this during the PR adding session / table settings? I like the idea of showing the same thing the optimizer is seeing.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, @rharding6373, @rytaft, and @yuzefovich)
Great work! I didn't review this in great detail, so only surface-level nits and questions from me.
Reviewed 17 of 18 files at r7, 5 of 5 files at r8, 2 of 2 files at r9, 3 of 3 files at r10, 6 of 7 files at r11, 10 of 10 files at r12, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, @rharding6373, @rytaft, and @yuzefovich)
-- commits
line 17 at r8:
Seems concerning to me, but I'll defer to others.
pkg/sql/show_stats.go
line 113 at r12 (raw file):
const (
	tableIDIdx = iota
nit: why this change? It doesn't seem that we're using the corresponding datum anywhere.
pkg/sql/show_stats.go
line 184 at r12 (raw file):
forecasts := stats.ForecastTableStatistics(ctx, p.EvalContext(), observed)
// Iterate in reverse order to match the ORDER BY "columnIDs".
I'm a bit confused - I think we only have ORDER BY "createdAt"
.
pkg/sql/stats/histogram.go
line 83 at r7 (raw file):
// []cat.HistogramBucket.
func EquiDepthHistogram(
	evalCtx tree.CompareContext,
nit: should we do s/evalCtx/compareCtx/
or something like that?
pkg/sql/stats/histogram.go
line 320 at r9 (raw file):
}
// removeZeroBuckets reemoves any extra zero buckets if we don't need them
nit: s/reemoves/removes/
.
pkg/sql/stats/quantile.go
line 114 at r8 (raw file):
// TODO(michae2): Add support for DECIMAL, TIME, TIMETZ, and INTERVAL.
func canMakeQuantile(version HistogramVersion, colType *types.T) bool {
	if version > 2 {
nit: is this another safeguard in case someone bumps the histogram version but doesn't actually look at quantile functions? It seems like overkill to me given the compile-time check above, but up to you.
pkg/sql/stats/stats_cache.go
line 648 at r11 (raw file):
func (tabStat *TableStatistic) nonNullHistogram() histogram {
	if len(tabStat.Histogram) > 0 && tabStat.Histogram[0].UpperBound == tree.DNull {
Is it always guaranteed that the null bucket is the first one, regardless of a possible index direction?
TFYR!
Added a new commit to the middle of the stack that makes SHOW STATISTICS
more deterministic.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, @rharding6373, @rytaft, and @yuzefovich)
pkg/sql/show_stats.go
line 113 at r12 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: why this change? It doesn't seem that we're using the corresponding datum anywhere.
The tableID datum is used when we call stats.NewTableStatisticProto(row)
down below.
pkg/sql/logictest/testdata/logic_test/distsql_stats
line 635 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Because the most recent stats only contain
a
(they were created with CREATE STATISTICS s8 ON a FROM [$t_id]
).
Hmm. The trouble is that we need to forecast stats for all current columns to get the row counts to match. But we base the "current columns" on the most recent stats. Hmm. Let me think about this.
Now there are forecasts for the other columns as well.
pkg/sql/stats/forecast.go
line 189 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
I shall investigate this.
Great question. As it turns out, column ID ordering was not guaranteed. So I've done two things:
- Added another commit in the middle of the stack to make column ID ordering guaranteed going forward. This will help make SHOW STATISTICS output more deterministic, and will also fix some of our internal DELETE queries which are matching on column IDs.
- Switched from my string-building function to the same string-key function used in create_stats.go which adds the column IDs to a FastIntMap first.
pkg/sql/stats/histogram.go
line 320 at r9 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit:
s/reemoves/removes/
.
Done.
pkg/sql/stats/quantile.go
line 114 at r8 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: is this an another safe guard in case someone bumps the histogram version but actually doesn't look at quantile functions? It seems like an overkill to me give the compile-time check above, but up to you.
This is a guard for mixed-version clusters. In a mixed-version cluster an old-version node could read a row from system.table_statistics
written by a new-version node, and might not understand what to do with it. (As I alluded to in that one commit message I think we need a similar guard in statistics builder, or at least in the stats cache.)
pkg/sql/stats/stats_cache.go
line 648 at r11 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Is it always guaranteed that the null bucket is the first one, regardless of a possible index direction?
Great question. It looks like we're currently always putting the null bucket first, regardless of NULLS LAST
? This could be a bug. Let me check.
Reviewed 1 of 18 files at r7, 17 of 17 files at r13, 5 of 5 files at r14, 2 of 2 files at r15, 3 of 3 files at r16, 6 of 6 files at r17, 7 of 7 files at r18, 8 of 8 files at r19, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mgartner, @michae2, @rharding6373, @rytaft, and @yuzefovich)
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Seems concerning to me, but I'll defer to others.
Can you elaborate on your concerns with replacing eval.Context with tree.CompareContext?
-- commits
line 19 at r14:
You mention backporting to protect us in mixed version states - a 22.1 node should try to read a forecasted stat? I would think it could read it fine because the histogram format doesn't change. But maybe we don't want to read a forecast histogram so that it operates like it would have before the mixed version state?
Edit: Actually I'm confused about your concern - forecasted stats are not persisted (right?) so they should cause no issues for older nodes. So do we need a new histogram version at all then?
-- commits
line 40 at r16:
Can you clarify in what places you'd potentially see a nil
or zero-length histogram and document the meaning behind both? It looks like EquiDepthHistogram
always returns a non-nil histogram, so I'm a bit confused.
pkg/sql/stats/forecast.go
line 124 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
I kept the columnset->string hack. If you hate it, I can change it.
I think it's fine to leave for now. In general I'm concerned about a set type that provides a way to retrieve values of the set in order - it might limit our ability to further optimize these sets. But that's irrelevant here.
pkg/sql/stats/forecast.go
line 163 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
I left this as the average, but now bounded by 1 week past the latest collection. I'm nervous that if we go too far in the future (months? years? decades?) forecasts will become silly.
An upper bound sounds smart. I wonder if it makes sense for it to be in relation to the current time - if we collected stats 8 days ago, a 1 week upper bound won't help us. But then we could run into the same problem you are worried about... No need to change this now, but something to think about in the future.
pkg/sql/stats/forecast.go
line 184 at r18 (raw file):
if r2 < minRequiredFit {
	return yₙ, errors.Newf(
		"predicted %v R² %v below min required R² %v", name, r2, minRequiredFit,
These nice errors are created here, but they are never used in ForecastTableStatistics
. I think this function could return an ok bool
instead of an error - it doesn't really seem like a low goodness of fit is an error in the normal sense - just that an accurate forecast could not be generated.
pkg/sql/stats/forecast_test.go
line 799 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Interval was removed.
👍 Heh, I realized my comment from 5 months ago had an incorrect double negative, oops.
pkg/sql/stats/stats_cache.go
line 648 at r11 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Great question. It looks like we're currently always putting the null bucket first, regardless of
NULLS LAST
? This could be a bug. Let me check.
Oof. 🤞 this doesn't cause problems because the stats cache is shared by all sessions on a node, and null ordering can change query-to-query and session-to-session.
pkg/sql/stats/stats_cache.go
line 617 at r19 (raw file):
NumRange: 0,
DistinctRange: 0,
UpperBound: tree.DNull,
nit: maybe we can combine this null-bucket adding logic with the function you added below?
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @michae2, @rharding6373, @rytaft, and @yuzefovich)
-- commits
line 17 at r8:
I think @yuzefovich's concern was with this note on the second commit:
(I suspect that during upgrades to 22.2 the 22.1 statistics builder will choke on these statistics, so maybe we should also backport a version check to 22.1.)
My plan is to test this next week to see if this is actually a problem.
Previously, mgartner (Marcus Gartner) wrote…
You mention backporting to protect us in mixed version stated - a 22.1 node should try to read a forecasted stat? I would think it could read it fine because the histogram format doesn't change. But maybe we don't want to read a forecast histogram so that it operates like it would have before the mixed version state?
Edit: Actually I'm confused about your concern - forecasted stats are not persisted (right?) so they should cause no issues for older nodes. So do we need a new histogram version at all then?
The backport concern isn't about forecasted stats. It's about the double-histograms for trigram indexes permitted by 963deb8. (Sorry, it snuck into this PR, I should have made it a separate PR.) Next week I'll test to see if this is actually a problem.
Previously, mgartner (Marcus Gartner) wrote…
Can you clarify in what places you'd potentially see a
nil
or zero-length histogram and document the meaning behind both? It looks like EquiDepthHistogram
always returns a non-nil histogram, so I'm a bit confused.
I added some comments to the fourth commit.
pkg/sql/show_stats.go
line 184 at r12 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I'm a bit confused - I think we only have
ORDER BY "createdAt"
.
Ah, yeah, I was missing a commit. Now the fifth commit adds this ORDER BY
.
pkg/sql/stats/forecast.go
line 163 at r4 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
An upper bound sounds smart. I wonder if it makes sense for it to be in relation to the current time - if we collected stats 8 days ago, a 1 week upper bound won't help us. But then we could run into the same problem you are worried about... No need to change this now, but something to think about in the future.
Using the current time would be nice, but it would make forecasts non-deterministic, meaning we would sometimes get different forecasts when loading the stats cache during a cache miss or when using SHOW STATISTICS ... WITH FORECAST
. To make the forecasts deterministic, they must only use information in the observed statistics.
pkg/sql/stats/forecast.go
line 184 at r18 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
These nice errors are created here, but they are never used in
ForecastTableStatistics
. I think this function could return anok bool
instead of an error - it doesn't really seem like a low goodness of fit is an error in the normal sense - just that an accurate forecast could not be generated.
It's a good point, and I thought about this too, but I'm a little reluctant to remove the error messages as they really helped during debugging and testing. I've added a vmodule log message that uses the error, so that we can find out why forecasts weren't generated if we need to.
Maybe if we ever added some kind of FORECAST STATISTICS FOR COLUMN
command they could be shown to the user? Though that seems unlikely.
pkg/sql/stats/histogram.go
line 83 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: should we do
s/evalCtx/compareCtx/
or something like that?
Done.
pkg/sql/stats/stats_cache.go
line 648 at r11 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Oof. 🤞 this doesn't cause problems because the stats cache is shared by all sessions on a node, and null ordering can change query-to-query and session-to-session.
Phew, looks like we're ok for now:
[email protected]:26257/defaultdb> CREATE INDEX ON t (a NULLS LAST);
invalid syntax: statement ignored: at or near "last": syntax error: unimplemented: this syntax
SQLSTATE: 0A000
DETAIL: source SQL:
CREATE INDEX ON t (a NULLS LAST)
^
HINT: You have attempted to use a feature that is not yet implemented.
See: https://go.crdb.dev/issue-v/6224/v22.2
pkg/sql/stats/stats_cache.go
line 617 at r19 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit: maybe we can combine this null-bucket adding logic with the function you added below?
We could make DecodeHistogramBuckets
call setHistogramBuckets
after it has converted the HistogramData
to []cat.HistogramBucket
but that would incur an extra copy of the buckets which we're normally able to avoid except when forecasting... If it's alright with you, I'll leave it this way for now. (Maybe we'll come back to this when NULLS LAST
can be used on indexes, see below.)
TFYRs!! bors r+
👎 Rejected by code reviews
bors r+
Build failed (retrying...):
Build succeeded:
sql/stats: replace eval.Context with tree.CompareContext
Most uses of eval.Context in the sql/stats package can actually be
tree.CompareContext instead, so make the replacement.
Release note: None
sql/stats: bump histogram version to 2
In 22.2 as of 963deb8 we support multiple histograms for trigram-
indexed strings. Let's bump the histogram version for this change, as we
may want to know whether multiple histograms are possible for a given
row in system.table_statistics.
(I suspect that during upgrades to 22.2 the 22.1 statistics builder will
choke on these statistics, so maybe we should also backport a version
check to 22.1.)
Also update avgRefreshTime to work correctly in multiple-histogram
cases.
Release note: None
sql/stats: teach histogram.adjustCounts to remove empty buckets
Sometimes when adjusting counts down we end up with empty buckets in the
histogram. They don't hurt anything, but they take up some memory (and
some brainpower when examining test results). So, teach adjustCounts to
remove them.
Release note: None
sql/stats: always use non-nil buckets for empty-table histograms
After 82b5926 I've been using the convention that nil histogram
buckets = no histogram, and non-nil-zero-length histogram buckets =
histogram on empty table. This is mostly useful for testing but is also
important for forecasting histograms.
Fix a spot that wasn't following this convention.
Also, add some empty-table testcases and some other testcases for
histogram.adjustCounts.
Release note: None
sql/stats: make ordering of SHOW STATISTICS more deterministic
Make two changes to produce more deterministic SHOW STATISTICS output:
1. Sort column IDs when creating statistics. We already use FastIntSet
   in both create_stats.go and statistics_builder.go to ignore ordering
   when gathering statistics by column set, but the column ordering from
   CREATE STATISTICS leaks into system.table_statistics and can affect
   SQL on that table, such as SHOW STATISTICS and various internal
   DELETE statements.
2. Order by column IDs and statistic IDs when reading from
   system.table_statistics in both SHOW STATISTICS and the stats cache.
   This will ensure SHOW STATISTICS always produces the same output, and
   shows us rows in the same order as the stats cache sees them (well,
   reverse order of the stats cache).
Release note (sql change): Make SHOW STATISTICS output more
deterministic.
sql/stats: forecast table statistics
Add function to forecast table statistics based on observed statistics.
These forecasts are based on linear regression models over time. For
each set of columns with statistics, we construct a linear regression
model over time for each statistic (row count, null count, distinct
count, average row size, and histogram). If all models are good fits
then we produce a statistics forecast for the set of columns.
Assists: #79872
Release note: None
sql: add SHOW STATISTICS WITH FORECAST
Add a new WITH FORECAST option to SHOW STATISTICS which calculates and
displays forecasted statistics along with the existing table statistics.
Also, forbid injecting forecasted stats.
Assists: #79872
Release note (sql change): Add a new WITH FORECAST option to SHOW
STATISTICS which calculates and displays forecasted statistics along
with the existing table statistics.