schedulerlatency: export Go scheduling latency metric #87883
Conversation
Force-pushed from d2dc724 to 91729ce
Reviewed 6 of 12 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @irfansharif)
pkg/util/metric/metric.go
line 86 at r6 (raw file):
// WindowedHistogram represents a histogram with data over recent window of
// time.
So this is not allowed to be cumulative? Or the window could be arbitrary and overlapping? The windowed term is a bit confusing.
pkg/util/metric/metric.go
line 571 at r6 (raw file):
// Should not extrapolate past the upper bound of the largest bucket.
if val > *buckets[len(buckets)-1].UpperBound {
Can you add some invariants as code comments earlier in this function? It is not clear why we need this if-condition, because I was expecting an invariant that rank <= count, and bucketEnd <= buckets[len(buckets)-1].UpperBound. If this is just defensive, please note it in a code comment.
pkg/util/schedulerlatency/histogram.go
line 62 at r6 (raw file):
mult: float64(time.Second.Nanoseconds()),
}
h.mu.buckets = buckets
Is +Inf allowed for Prometheus histograms?
pkg/util/schedulerlatency/histogram.go
line 101 at r6 (raw file):
totalCount += count
if count != 0 {
    // N.B. This computed sum is an underestimate.
underestimate because h.mu.buckets[i] is the lower bound for the bucket?
pkg/util/schedulerlatency/histogram.go
line 161 at r6 (raw file):
// reBucketExpAndTrim takes a list of bucket boundaries (lower bound inclusive)
// and downsamples the buckets to those a multiple of base apart. The end result
I didn't quite understand the base parameter. Can you add a small example as a code comment?
Force-pushed from 859c67e to 17b26d3
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @andrewbaptist, and @sumeerbhola)
pkg/util/metric/metric.go
line 86 at r6 (raw file):
Previously, sumeerbhola wrote…
So this is not allowed to be cumulative? Or the window could be arbitrary and overlapping? The windowed term is a bit confusing.
It's not allowed to be cumulative, no. The windows can also be arbitrary and overlapping. In our histogram, for example, the window is the last sample_duration-ish worth of data; in the other histogram type defined in this pkg/util/metric package, there's a rollover of this window that occurs every few ticks. This PR didn't introduce this vocabulary, though the interface is needed given the new runtimeHistogram type. That type will likely be used to capture other statistics from the Go runtime, like the distribution of individual GC-related stop-the-world pause latencies (/gc/pauses:seconds). I've added a comment explaining some of the motivation behind this windowing.
// [...]. It's used primarily to record histogram data into CRDB's internal
// time-series database, which does not know how to encode cumulative
// histograms. What it does instead is scrape off sample count, sum of values,
// and values at specific quantiles from "windowed" histograms and record that
// data directly. These windows could be arbitrary and overlapping.
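To make the windowing idea concrete, here is a minimal sketch of what an interface along these lines could look like. Aside from ValueAtQuantileWindowed (which appears in the tests below), the method names are illustrative, not necessarily the exact ones in pkg/util/metric.
// WindowedHistogram (sketch) is a histogram that reports data over a recent,
// possibly arbitrary and overlapping, window of time rather than cumulatively
// since process start.
type WindowedHistogram interface {
	// TotalCountWindowed returns the number of samples in the current window.
	TotalCountWindowed() int64
	// TotalSumWindowed returns the (approximate) sum of samples in the current
	// window.
	TotalSumWindowed() float64
	// ValueAtQuantileWindowed returns the approximate value at quantile q
	// (expressed in [0, 100]) over the current window.
	ValueAtQuantileWindowed(q float64) float64
}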
pkg/util/metric/metric.go
line 571 at r6 (raw file):
This is pure code movement from above, not code I introduced. Looks like @aadityasondhi lifted it off of https://github.com/prometheus/prometheus/blob/d9162189/promql/quantile.go#L75, mostly unchanged.
rank <= count
It's not just defensive code, that invariant doesn't hold given this right above:
n := float64(*histogram.SampleCount)
// ...
rank := uint64(((q / 100) * n) + 0.5)
I'm not entirely sure what the 0.5 is doing here, or why it's needed (if at all), but I'll leave it to the @cockroachdb/obs-inf-prs folks to improve commentary here.
pkg/util/schedulerlatency/histogram.go
line 62 at r6 (raw file):
Previously, sumeerbhola wrote…
Is +Inf allowed for Prometheus histograms?
It is allowed, see https://prometheus.io/docs/concepts/metric_types/#histogram. Just sanity checking against /_status/vars
, I see the following:
# HELP go_scheduler_latency Go scheduling latency
# TYPE go_scheduler_latency histogram
go_scheduler_latency_bucket{le="53247.99999999999"} 153
go_scheduler_latency_bucket{le="59392"} 156
go_scheduler_latency_bucket{le="65535.99999999999"} 159
go_scheduler_latency_bucket{le="73727.99999999999"} 160
...
go_scheduler_latency_bucket{le="7.549747199999999e+07"} 167
go_scheduler_latency_bucket{le="8.388607999999999e+07"} 167
go_scheduler_latency_bucket{le="9.646899199999999e+07"} 167
go_scheduler_latency_bucket{le="1.0905190399999999e+08"} 167
go_scheduler_latency_bucket{le="+Inf"} 167
go_scheduler_latency_sum 8.2944e+06
go_scheduler_latency_count 167
pkg/util/schedulerlatency/histogram.go
line 101 at r6 (raw file):
Previously, sumeerbhola wrote…
underestimate because
h.mu.buckets[i]
is the lower bound for the bucket?
Yes, added a comment. This code was lifted off of here BTW: https://github.com/prometheus/client_golang/blob/5b7e8b2e6716df0ceda9df81feb15910c7efa150/prometheus/go_collector_latest.go#L544-L546.
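For readers skimming the thread, a small sketch of the kind of lower-bound-based sum being discussed (illustrative, not the exact code in histogram.go): every sample in a bucket is at least that bucket's lower bound, so count times lower bound undercounts the true sum.
// approxSumFromLowerBounds estimates the sum of all observations from
// per-bucket counts and the buckets' (inclusive) lower bounds. Since every
// sample in a bucket is >= its lower bound, the result is an underestimate.
func approxSumFromLowerBounds(lowerBounds []float64, counts []uint64) float64 {
	var sum float64
	for i, count := range counts {
		if count != 0 {
			sum += float64(count) * lowerBounds[i]
		}
	}
	return sum
}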
pkg/util/schedulerlatency/histogram.go
line 161 at r6 (raw file):
Previously, sumeerbhola wrote…
I didn't quite understand the base parameter. Can you add a small example as a code comment.
This code was lifted off of here BTW: https://github.com/prometheus/client_golang/blob/5b7e8b2e6716df0ceda9df81feb15910c7efa150/prometheus/internal/go_runtime_metrics.go#L110. There's some datadriven test code in this package that has interactive examples, copied a snippet over into this comment.
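As a rough illustration of what "a multiple of base apart" means (made-up boundaries, not the actual reBucketExpAndTrim implementation): starting from the first boundary we keep, the next boundary is retained only once it is at least base times the previously retained one.
// downsampleByBase keeps bucket boundaries that are (at least) a factor of
// base apart. It assumes a non-empty, positive, ascending bounds slice. With
// bounds = [1, 1.5, 2, 3, 4, 6, 8, 12, 16] and base = 2, it retains
// [1, 2, 4, 8, 16].
func downsampleByBase(bounds []float64, base float64) []float64 {
	out := []float64{bounds[0]}
	last := bounds[0]
	for _, b := range bounds[1:] {
		if b >= last*base {
			out = append(out, b)
			last = b
		}
	}
	return out
}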
Force-pushed from 17b26d3 to ff6d42c
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @sumeerbhola)
pkg/util/metric/metric.go
line 571 at r6 (raw file):
Previously, irfansharif (irfan sharif) wrote…
This is pure code movement from above, not code I introduced. Looks like @aadityasondhi lifted it off of https://github.com/prometheus/prometheus/blob/d9162189/promql/quantile.go#L75, mostly unchanged.
rank <= count
It's not just defensive code, that invariant doesn't hold given this right above:
n := float64(*histogram.SampleCount)
// ...
rank := uint64(((q / 100) * n) + 0.5)
I'm not entirely sure what the 0.5 is doing here, or why it's needed (if at all), but I'll leave it to the @cockroachdb/obs-inf-prs folks to improve commentary here.
@irfansharif is correct about the invariant not holding. This is because the SampleCount includes the implicit +Inf bucket but the buckets[len(buckets)-1].UpperBound refers to the largest bucket defined by us. This is the case because the client library doesn't give us access to the +Inf bucket which Prometheus uses under the hood. If the quantile is high enough, the val surpasses the upper bound of the largest bucket. In this case, using that interpolated value seemed wrong as we would be extrapolating the value. Additionally, if for a specific metric we see our q99 values at the bucket boundary, it provides a strong indicator for us to adjust the buckets to provide a more accurate histogram. FWIW the Prometheus client library does the same when the resulting value is in the +Inf bucket and returns the upper bound of the second-to-last bucket: https://github.com/prometheus/prometheus/blob/d9162189/promql/quantile.go#L103. It is cleaner/easier for them since they have access to the +Inf bucket internally.
The 0.5 was added for rounding purposes. I went back and forth on whether to have it at all but thought it made sense for smaller SampleCount cases like in this test:
cockroach/pkg/util/metric/metric_test.go, line 137 in a82aa82:
require.Equal(t, 100.0, h.ValueAtQuantileWindowed(99))
Open to suggestions if you think we should not be rounding and should just go with the truncated value when converted to an int. If we decide to keep this, I can add a line there to explain why we do this.
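To summarize the mechanics being discussed, here is a self-contained sketch of the rank-then-interpolate approach, including the clamp at the largest explicit bucket. It is illustrative only, not the exact pkg/util/metric code.
package main

import (
	"fmt"
	"sort"
)

type bucket struct {
	upperBound      float64 // inclusive upper bound of the bucket
	cumulativeCount uint64  // samples with value <= upperBound
}

// valueAtQuantile sketches the approach under discussion: compute a rank from
// the total sample count (which includes the implicit +Inf bucket), search all
// but the last explicit bucket, linearly interpolate within the bucket that
// was hit, and clamp to the largest explicit upper bound rather than
// extrapolating past it.
func valueAtQuantile(q float64, sampleCount uint64, buckets []bucket) float64 {
	n := float64(sampleCount)
	rank := uint64(((q / 100) * n) + 0.5) // the 0.5 rounds rather than truncates

	// Because the +Inf bucket is missing, the cumulative counts we can see may
	// never reach rank; omitting the last explicit bucket from the search means
	// a failed search lands on that last bucket rather than past the end.
	b := sort.Search(len(buckets)-1, func(i int) bool {
		return buckets[i].cumulativeCount >= rank
	})

	var bucketStart float64 // lower bound of the smallest bucket assumed to be 0
	var countSoFar uint64
	if b > 0 {
		bucketStart = buckets[b-1].upperBound
		countSoFar = buckets[b-1].cumulativeCount
	}
	bucketEnd := buckets[b].upperBound
	bucketCount := buckets[b].cumulativeCount - countSoFar

	val := bucketStart
	if bucketCount > 0 {
		val += (bucketEnd - bucketStart) * float64(rank-countSoFar) / float64(bucketCount)
	}
	// Don't extrapolate past the upper bound of the largest explicit bucket;
	// anything beyond it lives in the +Inf bucket we can't see.
	if last := buckets[len(buckets)-1].upperBound; val > last {
		return last
	}
	return val
}

func main() {
	buckets := []bucket{{1, 10}, {5, 70}, {10, 100}}
	fmt.Println(valueAtQuantile(50, 100, buckets)) // ~3.67, interpolated within (1, 5]
}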
Reviewed 6 of 50 files at r4, 5 of 28 files at r5, 4 of 12 files at r6, 44 of 44 files at r7, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @andrewbaptist, @irfansharif, and @sumeerbhola)
pkg/util/metric/metric.go
line 571 at r6 (raw file):
Previously, aadityasondhi (Aaditya Sondhi) wrote…
@irfansharif is correct about the invariant not holding. This is because the SampleCount includes the implicit +Inf bucket but the buckets[len(buckets)-1].UpperBound refers to the largest bucket defined by us. This is the case because the client library doesn't give us access to the +Inf bucket which Prometheus uses under the hood. If the quantile is high enough, the val surpasses the upper bound of the largest bucket. In this case, using that interpolated value seemed wrong as we would be extrapolating the value. Additionally, if for a specific metric we see our q99 values at the bucket boundary, it provides a strong indicator for us to adjust the buckets to provide a more accurate histogram. FWIW the Prometheus client library does the same when the resulting value is in the +Inf bucket and returns the upper bound of the second-to-last bucket: https://github.com/prometheus/prometheus/blob/d9162189/promql/quantile.go#L103. It is cleaner/easier for them since they have access to the +Inf bucket internally.
The 0.5 was added for rounding purposes. I went back and forth on whether to have it at all but thought it made sense for smaller SampleCount cases like in this test:
cockroach/pkg/util/metric/metric_test.go, line 137 in a82aa82:
require.Equal(t, 100.0, h.ValueAtQuantileWindowed(99))
Open to suggestions if you think we should not be rounding and should just go with the truncated value when converted to an int. If we decide to keep this, I can add a line there to explain why we do this.
@aadityasondhi it would be great to have a long code comment stating what you wrote above.
pkg/util/metric/metric.go
line 550 at r7 (raw file):
}
rank := uint64(((q / 100) * n) + 0.5)
suggestion for @aadityasondhi
// Since we are missing the +Inf bucket, CumulativeCounts may never exceed rank. By omitting the highest bucket we have from the search, the failed search will land on that last bucket and we don't have to do any special checks regarding landing on a non-existent bucket.
pkg/util/metric/metric.go
line 553 at r7 (raw file):
b := sort.Search(len(buckets)-1, func(i int) bool { return *buckets[i].CumulativeCount >= rank })
var (
suggestion for @aadityasondhi
// bucketStart defaults to 0, which we assume is the lower bound of the smallest bucket.
pkg/util/schedulerlatency/histogram.go
line 205 at r7 (raw file):
if i+2 > len(b) {
	break
}
// Invariant: i+2 <= len(b)
so if i+2==len(b) we will index out of bounds in the next if-block?
pkg/util/schedulerlatency/histogram.go
line 212 at r7 (raw file):
// We're looking at the two before after the first one that's crossed the
// min limit. Since we expect recordings at the min value, we don't want
// to have that bucket start at -Inf, so we merge the buckets before.
I had a hard time understanding this comment so I tried to rewrite it in a different way:
b[i+2] > min and is the lower-bound of the i+2 bucket. So min must be contained in bucket i+1 or earlier. Note that since we start with i=0, it is possible that we never checked that b[i+1] <= min, but if we did, min will be in bucket i+1. Let this min containing bucket be bucket j. Bucket j-1 should be set to a lower-bound of -Inf.
Which suggests that the logic is not quite right, since it will miss the case where b[0] contained the min. Though that corner case is not very interesting. But we should either fix that corner case, or explain that we are ignoring it.
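For a concrete (made-up) example of the behavior being described: with lower bounds [0, 1, 2, 4, 8, 16, +Inf] and min = 5, min falls in the [4, 8) bucket, and the bucket just below it is widened to start at -Inf, giving [-Inf, 4, 8, 16, +Inf]. A small sketch, assuming that reading of the comment (not the PR's actual code):
package main

import (
	"fmt"
	"math"
)

// trimToMin keeps exactly one bucket below the one containing min and widens
// that preceding bucket to start at -Inf, so recordings at min never land in a
// bucket whose lower bound is -Inf. b holds inclusive lower bounds in
// ascending order.
func trimToMin(b []float64, min float64) []float64 {
	j := 0 // index of the bucket containing min, i.e. the largest j with b[j] <= min
	for j+1 < len(b) && b[j+1] <= min {
		j++
	}
	if j == 0 {
		// Corner case: min already falls in the first bucket; nothing to trim.
		return b
	}
	return append([]float64{math.Inf(-1)}, b[j:]...)
}

func main() {
	b := []float64{0, 1, 2, 4, 8, 16, math.Inf(+1)}
	fmt.Println(trimToMin(b, 5)) // [-Inf 4 8 16 +Inf]
}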
pkg/util/schedulerlatency/histogram.go
line 236 at r7 (raw file):
// From now on, bucket should always have a non-Inf value because
// Infs are only ever at the ends of the bucket lists, so
// arithmetic operations on it are non-NaN.
How about a comment like the following:
bucket is the lower bound of the lowest bucket that has not been added to newBuckets. We will add it to newBuckets, but we wait to add it until we find the next bucket that is >= bucket*base.
pkg/util/schedulerlatency/histogram.go
line 248 at r7 (raw file):
}
// The +Inf bucket will always be the last one, and we'll always
// end up including it here because bucket
missing end of sentence.
Could use a comment like:
+Inf bucket is guaranteed to be added: When i is at the +Inf bucket we will fall through to here, and add the existing value of bucket and set bucket to this +Inf bucket, and then exit the for loop, and proceed to add the +Inf bucket in the append that follows the loop.
pkg/util/schedulerlatency/testdata/runtime_histogram
line 2 at r7 (raw file):
# Explore how the histogram type that wraps around runtime/metrics behaves.
# Initialize it buckets that are:
... it to buckets ... ?
TFTR!
bors r+
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @andrewbaptist, and @sumeerbhola)
pkg/util/metric/metric.go
line 571 at r6 (raw file):
Previously, sumeerbhola wrote…
@aadityasondhi it would be great to have a long code comment stating what you wrote above.
Just added a version of the above in this PR.
pkg/util/metric/metric.go
line 550 at r7 (raw file):
Previously, sumeerbhola wrote…
suggestion for @aadityasondhi
// Since we are missing the +Inf bucket, CumulativeCounts may never exceed rank. By omitting the highest bucket we have from the search, the failed search will land on that last bucket and we don't have to do any special checks regarding landing on a non-existent bucket.
Just added it in this PR.
pkg/util/metric/metric.go
line 553 at r7 (raw file):
Previously, sumeerbhola wrote…
suggestion for @aadityasondhi
// bucketStart defaults to 0, which we assume is the lower bound of the smallest bucket.
Just added it in this PR.
pkg/util/schedulerlatency/histogram.go
line 205 at r7 (raw file):
Previously, sumeerbhola wrote…
// Invariant: i+2 <= len(b)
so if i+2==len(b) we will index out of bounds in the next if-block?
I've reworked this code to be less sloppy. BTW, I'm really enjoying your airtight reviews -- I should generally be less sloppy, but with reviews like this I feel I can get away with more than my usual level of sloppiness, so I'm grateful either way.
pkg/util/schedulerlatency/histogram.go
line 212 at r7 (raw file):
Previously, sumeerbhola wrote…
I had a hard time understanding this comment so I tried to rewrite it in a different way:
b[i+2] > min and is the lower-bound of the i+2 bucket. So min must be contained in bucket i+1 or earlier. Note that since we start with i=0, it is possible that we never checked that b[i+1] <= min, but if we did, min will be in bucket i+1. Let this min containing bucket be bucket j. Bucket j-1 should be set to a lower-bound of -Inf.
Which suggests that the logic is not quite right, since it will miss the case where b[0] contained the min. Though that corner case is not very interesting. But we should either fix that corner case, or explain that we are ignoring it.
Handled the corner case.
pkg/util/schedulerlatency/histogram.go
line 236 at r7 (raw file):
Previously, sumeerbhola wrote…
How about a comment like the following:
bucket is the lower bound of the lowest bucket that has not been added to newBuckets. We will add it to newBuckets, but we wait to add it until we find the next bucket that is >= bucket*base.
Done.
pkg/util/schedulerlatency/histogram.go
line 248 at r7 (raw file):
Previously, sumeerbhola wrote…
missing end of sentence.
Could use a comment like:
+Inf bucket is guaranteed to be added: When i is at the +Inf bucket we will fall through to here, and add the existing value of bucket and set bucket to this +Inf bucket, and then exit the for loop, and proceed to add the +Inf bucket in the append that follows the loop.
Done.
pkg/util/schedulerlatency/testdata/runtime_histogram
line 2 at r7 (raw file):
Previously, sumeerbhola wrote…
... it to buckets ... ?
Done.
And record data into CRDB's internal time-series database. Informs cockroachdb#82743 and cockroachdb#87823.

To export scheduling latencies to prometheus, we choose an exponential bucketing scheme with base multiple of 1.1, and the output range bounded to [50us, 100ms). This makes for ~70 buckets. It's worth noting that the default histogram buckets used in Go are not fit for our purposes. If we care about improving it, we could consider patching the runtime.

bucket[  0] width=0s boundary=[-Inf, 0s)
bucket[  1] width=1ns boundary=[0s, 1ns)
bucket[  2] width=1ns boundary=[1ns, 2ns)
bucket[  3] width=1ns boundary=[2ns, 3ns)
bucket[  4] width=1ns boundary=[3ns, 4ns)
...
bucket[270] width=16.384µs boundary=[737.28µs, 753.664µs)
bucket[271] width=16.384µs boundary=[753.664µs, 770.048µs)
bucket[272] width=278.528µs boundary=[770.048µs, 1.048576ms)
bucket[273] width=32.768µs boundary=[1.048576ms, 1.081344ms)
bucket[274] width=32.768µs boundary=[1.081344ms, 1.114112ms)
...
bucket[717] width=1h13m18.046511104s boundary=[53h45m14.046488576s, 54h58m32.09299968s)
bucket[718] width=1h13m18.046511104s boundary=[54h58m32.09299968s, 56h11m50.139510784s)
bucket[719] width=1h13m18.046511104s boundary=[56h11m50.139510784s, 57h25m8.186021888s)
bucket[720] width=57h25m8.186021888s boundary=[57h25m8.186021888s, +Inf)

Release note: None
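For a rough sense of the bucketing described above, here is a sketch that generates a pure base-1.1 geometric series over [50us, 100ms). The PR itself downsamples the runtime/metrics boundaries via reBucketExpAndTrim rather than generating fresh ones, which is why its bucket count comes out around 70; a pure geometric series over this range yields roughly 80 boundaries.
package main

import (
	"fmt"
	"time"
)

func main() {
	const base = 1.1
	lo := 50 * time.Microsecond
	hi := 100 * time.Millisecond

	// Multiply the lower bound by base until we reach the upper bound, then
	// close off the range at 100ms.
	var bounds []time.Duration
	for b := float64(lo.Nanoseconds()); b < float64(hi.Nanoseconds()); b *= base {
		bounds = append(bounds, time.Duration(b))
	}
	bounds = append(bounds, hi)

	fmt.Printf("%d boundaries from %s up to %s\n", len(bounds), bounds[0], hi)
}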
Force-pushed from ff6d42c to e397faf
Canceled.
bors r+
Build succeeded:
comment nits
Reviewed 4 of 4 files at r8, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/server/server.go
line 1470 at r8 (raw file):
// We can now add the node registry.
s.recorder.AddNode( // XXX: Has to occur before
leftover partial comment
pkg/util/metric/metric.go
line 600 at r8 (raw file):
// The 0.5 was added for rounding purposes. I went back and forth on whether to
// have it at all but thought it made sense for smaller SampleCount cases like
// in this test:
the reference to the test is missing
pkg/util/schedulerlatency/histogram.go
line 210 at r8 (raw file):
}
// b[j] > min and is the lower-bound of the j-th bucket. The min must be
// contained in the (j-1)-th bucket or earlier. We want to merge 0th bucket
the "or earlier" part is no longer true since you reworked this logic. Which is important since the min being in the (j-1)-th bucket for sure makes the logic here that trims the bucket list to start at the (j-2)-th bucket correct.
pkg/util/schedulerlatency/histogram.go
line 216 at r8 (raw file):
} else {
	// We want trim the bucket list to start at (j-2)-th bucket, so just
	// have bucket before the one containing the min.
... have one bucket before the one ...
And record data into CRDB's internal time-series database. Informs
#82743 and #87823. To export scheduling latencies to prometheus, we
choose an exponential bucketing scheme with base multiple of 1.1, and
the output range bounded to [50us, 100ms). This makes for ~70 buckets.
It's worth noting that the default histogram buckets used in Go are
not fit for our purposes. If we care about improving it, we could
consider patching the runtime.
Release note: None
Release justification: observability-only PR, low-risk high-benefit; would help understand admission control out in the wild