
schedulerlatency: export Go scheduling latency metric #87883

Merged: 1 commit into cockroachdb:master, Sep 21, 2022

Conversation

irfansharif
Contributor

@irfansharif irfansharif commented Sep 12, 2022

And record data into CRDB's internal time-series database. Informs
#82743 and #87823. To export scheduling latencies to prometheus, we
choose an exponential bucketing scheme with a base multiple of 1.1, and
the output range bounded to [50us, 100ms). This makes for ~70 buckets.
It's worth noting that the default histogram buckets used in Go are
not fit for our purposes. If we care about improving them, we could
consider patching the runtime.

  bucket[  0] width=0s boundary=[-Inf, 0s)
  bucket[  1] width=1ns boundary=[0s, 1ns)
  bucket[  2] width=1ns boundary=[1ns, 2ns)
  bucket[  3] width=1ns boundary=[2ns, 3ns)
  bucket[  4] width=1ns boundary=[3ns, 4ns)
  ...
  bucket[270] width=16.384µs boundary=[737.28µs, 753.664µs)
  bucket[271] width=16.384µs boundary=[753.664µs, 770.048µs)
  bucket[272] width=278.528µs boundary=[770.048µs, 1.048576ms)
  bucket[273] width=32.768µs boundary=[1.048576ms, 1.081344ms)
  bucket[274] width=32.768µs boundary=[1.081344ms, 1.114112ms)
  ...
  bucket[717] width=1h13m18.046511104s boundary=[53h45m14.046488576s, 54h58m32.09299968s)
  bucket[718] width=1h13m18.046511104s boundary=[54h58m32.09299968s, 56h11m50.139510784s)
  bucket[719] width=1h13m18.046511104s boundary=[56h11m50.139510784s, 57h25m8.186021888s)
  bucket[720] width=57h25m8.186021888s boundary=[57h25m8.186021888s, +Inf)

Release note: None
Release justification: observability-only PR, low-risk high-benefit; would help understand admission control out in the wild
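For context: the bucket dump above is the Go runtime's own /sched/latencies:seconds histogram from the runtime/metrics package (available since Go 1.17). A minimal standalone sketch of reading it directly, for illustration only (this is not the code in the PR):

  package main

  import (
      "fmt"
      "runtime/metrics"
  )

  func main() {
      // Read the scheduler-latency histogram maintained by the Go runtime.
      samples := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
      metrics.Read(samples)
      h := samples[0].Value.Float64Histogram()
      // Buckets holds len(Counts)+1 boundaries (in seconds); bucket i spans
      // [Buckets[i], Buckets[i+1]); the first and last boundaries may be
      // -Inf/+Inf, matching the bucket dump above.
      for i, count := range h.Counts {
          fmt.Printf("bucket[%3d] boundary=[%v, %v) count=%d\n",
              i, h.Buckets[i], h.Buckets[i+1], count)
      }
  }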

@cockroach-teamcity
Member

This change is Reviewable

@irfansharif irfansharif marked this pull request as ready for review September 15, 2022 21:43
@irfansharif irfansharif requested reviews from a team as code owners September 15, 2022 21:43
@irfansharif irfansharif requested review from rhu713, sumeerbhola and andrewbaptist and removed request for a team and rhu713 September 15, 2022 21:43
@irfansharif
Contributor Author

This is what it looks like in prometheus and within CRDB.

[screenshot: the metric scraped by Prometheus]

[screenshot: the same metric in CRDB's internal timeseries UI]

Hopefully this gives us more data to consider when evaluating the runnable goroutines per-p controller input.

Collaborator

@sumeerbhola sumeerbhola left a comment


Reviewed 6 of 12 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @irfansharif)


pkg/util/metric/metric.go line 86 at r6 (raw file):

// WindowedHistogram represents a histogram with data over recent window of
// time.

So this is not allowed to be cumulative? Or the window could be arbitrary and overlapping? The windowed term is a bit confusing.


pkg/util/metric/metric.go line 571 at r6 (raw file):

	// Should not extrapolate past the upper bound of the largest bucket.
	if val > *buckets[len(buckets)-1].UpperBound {

can you add some invariants as code comments earlier in this function? It is not clear why we need this if-condition because I was expecting an invariant that rank <= count, and bucketEnd <= buckets[len(buckets)-1].UpperBound. If this is just defensive please note it in a code comment.


pkg/util/schedulerlatency/histogram.go line 62 at r6 (raw file):

		mult: float64(time.Second.Nanoseconds()),
	}
	h.mu.buckets = buckets

Is +Inf allowed for Prometheus histograms?


pkg/util/schedulerlatency/histogram.go line 101 at r6 (raw file):

		totalCount += count
		if count != 0 {
			// N.B. This computed sum is an underestimate.

underestimate because h.mu.buckets[i] is the lower bound for the bucket?


pkg/util/schedulerlatency/histogram.go line 161 at r6 (raw file):

// reBucketExpAndTrim takes a list of bucket boundaries (lower bound inclusive)
// and downsamples the buckets to those a multiple of base apart. The end result

I didn't quite understand the base parameter. Can you add a small example as a code comment.

@irfansharif irfansharif force-pushed the 220909.admission-obs branch 2 times, most recently from 859c67e to 17b26d3 Compare September 16, 2022 00:37
Contributor Author

@irfansharif irfansharif left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @andrewbaptist, and @sumeerbhola)


pkg/util/metric/metric.go line 86 at r6 (raw file):

Previously, sumeerbhola wrote…

So this is not allowed to be cumulative? Or the window could be arbitrary and overlapping? The windowed term is a bit confusing.

It's not allowed to be cumulative, no. The windows can also be arbitrary and overlapping. In our histogram, for example, the window is the last sample_duration-ish worth of data; in the other histogram type defined in this pkg/util/metric package, there's a rollover of this window that occurs every few ticks. This PR didn't introduce this vocabulary, though the interface is needed given the new runtimeHistogram type. That type will likely be used to capture other statistics from the Go runtime, like the distribution of individual GC-related stop-the-world pause latencies (/gc/pauses:seconds). I've added a comment explaining some of the motivation behind this windowing.

// [...]. It's used primarily to record histogram data into CRDB's internal
// time-series database, which does not know how to encode cumulative
// histograms. What it does instead is scrape off sample count, sum of values,
// and values at specific quantiles from "windowed" histograms and record that
// data directly. These windows could be arbitrary and overlapping.
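
For shape, here's a minimal sketch of the interface being discussed (method names are illustrative and may not match CRDB's exact API):

  // WindowedHistogram sketches the idea above: consumers read the sample
  // count, sum of values, and values at quantiles over a recent window of
  // recordings, not over the metric's full cumulative history.
  type WindowedHistogram interface {
      // TotalCountWindowed returns the number of samples in the current window.
      TotalCountWindowed() int64
      // TotalSumWindowed returns the sum of sample values in the current window.
      TotalSumWindowed() float64
      // ValueAtQuantileWindowed returns the approximate value at quantile q
      // (q in [0, 100], e.g. 99 for p99) over the current window.
      ValueAtQuantileWindowed(q float64) float64
  }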

pkg/util/metric/metric.go line 571 at r6 (raw file):
This is pure code movement from above, not code I introduced. Looks like @aadityasondhi lifted it off of https://github.com/prometheus/prometheus/blob/d9162189/promql/quantile.go#L75, mostly unchanged.

rank <= count

It's not just defensive code, that invariant doesn't hold given this right above:

	n := float64(*histogram.SampleCount)
	// ...
	rank := uint64(((q / 100) * n) + 0.5)

I'm not entirely sure what the 0.5 is doing here, or why it's needed (if at all), but I'll leave it to the @cockroachdb/obs-inf-prs folks to improve commentary here.


pkg/util/schedulerlatency/histogram.go line 62 at r6 (raw file):

Previously, sumeerbhola wrote…

Is +Inf allowed for Prometheus histograms?

It is allowed, see https://prometheus.io/docs/concepts/metric_types/#histogram. Just sanity checking against /_status/vars, I see the following:

# HELP go_scheduler_latency Go scheduling latency
# TYPE go_scheduler_latency histogram
go_scheduler_latency_bucket{le="53247.99999999999"} 153
go_scheduler_latency_bucket{le="59392"} 156
go_scheduler_latency_bucket{le="65535.99999999999"} 159
go_scheduler_latency_bucket{le="73727.99999999999"} 160
...
go_scheduler_latency_bucket{le="7.549747199999999e+07"} 167
go_scheduler_latency_bucket{le="8.388607999999999e+07"} 167
go_scheduler_latency_bucket{le="9.646899199999999e+07"} 167
go_scheduler_latency_bucket{le="1.0905190399999999e+08"} 167
go_scheduler_latency_bucket{le="+Inf"} 167
go_scheduler_latency_sum 8.2944e+06
go_scheduler_latency_count 167
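
(Note that Prometheus le buckets are cumulative: each counter includes all samples at or below that upper bound, which is why the +Inf bucket matches the total sample count of 167.)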

pkg/util/schedulerlatency/histogram.go line 101 at r6 (raw file):

Previously, sumeerbhola wrote…

underestimate because h.mu.buckets[i] is the lower bound for the bucket?

Yes, added a comment. This code was lifted off of here BTW: https://github.com/prometheus/client_golang/blob/5b7e8b2e6716df0ceda9df81feb15910c7efa150/prometheus/go_collector_latest.go#L544-L546.
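
To make the underestimate concrete, a small sketch of that summation, where counts and lowerBounds stand in for the histogram's per-bucket state (names illustrative):

  // windowedSum approximates the sum of all samples by attributing every
  // sample in a bucket to that bucket's (inclusive) lower bound, so the
  // result can only undershoot the true sum.
  func windowedSum(counts []uint64, lowerBounds []float64) (totalCount uint64, sum float64) {
      for i, count := range counts {
          totalCount += count
          if count != 0 {
              sum += float64(count) * lowerBounds[i]
          }
      }
      return totalCount, sum
  }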


pkg/util/schedulerlatency/histogram.go line 161 at r6 (raw file):

Previously, sumeerbhola wrote…

I didn't quite understand the base parameter. Can you add a small example as a code comment.

This code was lifted off of here BTW: https://github.com/prometheus/client_golang/blob/5b7e8b2e6716df0ceda9df81feb15910c7efa150/prometheus/internal/go_runtime_metrics.go#L110. There's some datadriven test code in this package with interactive examples; I copied a snippet over into this comment.
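
Roughly, base controls how aggressively adjacent buckets get merged: a boundary is retained only once it's at least base times the previously retained one. A simplified sketch of the idea (the real code additionally handles the -Inf/+Inf buckets and the min/max trimming):

  // reBucketExp downsamples sorted bucket boundaries so retained boundaries
  // are at least a factor of base apart; e.g. with base = 2, boundaries
  // [1 2 3 4 5 6 7 8] collapse to [1 2 4 8].
  func reBucketExp(buckets []float64, base float64) []float64 {
      if len(buckets) == 0 {
          return nil
      }
      var out []float64
      cur := buckets[0] // lowest boundary not yet appended
      for _, b := range buckets[1:] {
          if b >= cur*base {
              out = append(out, cur)
              cur = b
          }
      }
      return append(out, cur) // always keep the last boundary
  }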

Collaborator

@aadityasondhi aadityasondhi left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @sumeerbhola)


pkg/util/metric/metric.go line 571 at r6 (raw file):

Previously, irfansharif (irfan sharif) wrote…

This is pure code movement from above, not code I introduced. Looks like @aadityasondhi lifted it off of https://github.com/prometheus/prometheus/blob/d9162189/promql/quantile.go#L75, mostly unchanged.

rank <= count

It's not just defensive code, that invariant doesn't hold given this right above:

	n := float64(*histogram.SampleCount)
	// ...
	rank := uint64(((q / 100) * n) + 0.5)

I'm not entirely sure what the 0.5 is doing here, or why it's needed (if at all), but I'll leave it to the @cockroachdb/obs-inf-prs folks to improve commentary here.

@irfansharif is correct about the invariant not holding. This is because SampleCount includes the implicit +Inf bucket, while buckets[len(buckets)-1].UpperBound refers to the largest bucket we define; the client library doesn't give us access to the +Inf bucket that Prometheus uses under the hood. If the quantile is high enough, val surpasses the upper bound of the largest bucket, and using the interpolated value in that case seemed wrong since we would be extrapolating.

Additionally, if for a specific metric we see our q99 values land at the bucket boundary, that's a strong indicator we should adjust the buckets for a more accurate histogram. FWIW the Prometheus client library does the same when the resulting value is in the +Inf bucket and returns the upper bound of the second-to-last bucket: https://github.com/prometheus/prometheus/blob/d9162189/promql/quantile.go#L103. It's cleaner/easier for them since they have access to the +Inf bucket internally.

The 0.5 was added for rounding purposes. I went back and forth on whether to have it at all but thought it made sense for smaller SampleCount cases like in this test:

require.Equal(t, 100.0, h.ValueAtQuantileWindowed(99))

Open to suggestions if you think we should not be rounding and should just go with the truncated value when converted to an int. If we decide to keep this, I can add a line there to explain why we do this.
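
A quick worked example of the rounding with hypothetical numbers:

  package main

  import "fmt"

  func main() {
      // With n = 10 samples and q = 99, (q/100)*n = 9.9. Plain uint64
      // conversion truncates to rank 9; adding 0.5 first rounds to 10.
      n := 10.0 // stand-in for float64(*histogram.SampleCount)
      q := 99.0
      rankTruncated := uint64((q / 100) * n)
      rankRounded := uint64((q/100)*n + 0.5)
      fmt.Println(rankTruncated, rankRounded) // 9 10
  }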

Collaborator

@sumeerbhola sumeerbhola left a comment


:lgtm:

Reviewed 6 of 50 files at r4, 5 of 28 files at r5, 4 of 12 files at r6, 44 of 44 files at r7, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @andrewbaptist, @irfansharif, and @sumeerbhola)


pkg/util/metric/metric.go line 571 at r6 (raw file):

Previously, aadityasondhi (Aaditya Sondhi) wrote…

@irfansharif is correct about the invariant not holding. This is because SampleCount includes the implicit +Inf bucket, while buckets[len(buckets)-1].UpperBound refers to the largest bucket we define; the client library doesn't give us access to the +Inf bucket that Prometheus uses under the hood. If the quantile is high enough, val surpasses the upper bound of the largest bucket, and using the interpolated value in that case seemed wrong since we would be extrapolating. Additionally, if for a specific metric we see our q99 values land at the bucket boundary, that's a strong indicator we should adjust the buckets for a more accurate histogram. FWIW the Prometheus client library does the same when the resulting value is in the +Inf bucket and returns the upper bound of the second-to-last bucket: https://github.com/prometheus/prometheus/blob/d9162189/promql/quantile.go#L103. It's cleaner/easier for them since they have access to the +Inf bucket internally.

The 0.5 was added for rounding purposes. I went back and forth on whether to have it at all but thought it made sense for smaller SampleCount cases like in this test:

require.Equal(t, 100.0, h.ValueAtQuantileWindowed(99))

Open to suggestions if you think we should not be rounding and should just go with the truncated value when converted to an int. If we decide to keep this, I can add a line there to explain why we do this.

@aadityasondhi it would be great to have a long code comment stating what you wrote above.


pkg/util/metric/metric.go line 550 at r7 (raw file):

	}

	rank := uint64(((q / 100) * n) + 0.5)

suggestion for @aadityasondhi
// Since we are missing the +Inf bucket, CumulativeCounts may never exceed rank. By omitting the highest bucket we have from the search, the failed search will land on that last bucket and we don't have to do any special checks regarding landing on a non-existent bucket.


pkg/util/metric/metric.go line 553 at r7 (raw file):

	b := sort.Search(len(buckets)-1, func(i int) bool { return *buckets[i].CumulativeCount >= rank })

	var (

suggestion for @aadityasondhi
// bucketStart defaults to 0, which we assume is the lower bound of the smallest bucket.


pkg/util/schedulerlatency/histogram.go line 205 at r7 (raw file):

		if i+2 > len(b) {
			break
		}

// Invariant: i+2 <= len(b)

so if i+2==len(b) we will index out of bounds in the next if-block?


pkg/util/schedulerlatency/histogram.go line 212 at r7 (raw file):

		// We're looking at the two before after the first one that's crossed the
		// min limit. Since we expect recordings at the min value, we don't want
		// to have that bucket start at -Inf, so we merge the buckets before.

I had a hard time understanding this comment so I tried to rewrite it in a different way:

b[i+2] > min and is the lower-bound of the i+2 bucket. So min must be contained in bucket i+1 or earlier. Note that since we start with i=0, it is possible that we never checked that b[i+1] <= min, but if we did, min will be in bucket i+1. Let this min containing bucket be bucket j. Bucket j-1 should be set to a lower-bound of -Inf.

Which suggests that the logic is not quite right, since it will miss the case where b[0] contained the min. Though that corner case is not very interesting. But we should either fix that corner case, or explain that we are ignoring it.


pkg/util/schedulerlatency/histogram.go line 236 at r7 (raw file):

	// From now on, bucket should always have a non-Inf value because
	// Infs are only ever at the ends of the bucket lists, so
	// arithmetic operations on it are non-NaN.

How about a comment like the following:

bucket is the lower bound of the lowest bucket that has not been added to newBuckets. We will add it to newBuckets, but we wait to add it until we find the next bucket that is >= bucket*base.


pkg/util/schedulerlatency/histogram.go line 248 at r7 (raw file):

		}
		// The +Inf bucket will always be the last one, and we'll always
		// end up including it here because bucket

missing end of sentence.

Could use a comment like:
+Inf bucket is guaranteed to be added: When i is at the +Inf bucket we will fall through to here, and add the existing value of bucket and set bucket to this +Inf bucket, and then exit the for loop, and proceed to add the +Inf bucket in the append that follows the loop.


pkg/util/schedulerlatency/testdata/runtime_histogram line 2 at r7 (raw file):

# Explore how the histogram type that wraps around runtime/metrics behaves.
# Initialize it buckets that are:

... it to buckets ... ?

Contributor Author

@irfansharif irfansharif left a comment


TFTR!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @andrewbaptist, and @sumeerbhola)


pkg/util/metric/metric.go line 571 at r6 (raw file):

Previously, sumeerbhola wrote…

@aadityasondhi it would be great to have a long code comment stating what you wrote above.

Just added a version of the above in this PR.


pkg/util/metric/metric.go line 550 at r7 (raw file):

Previously, sumeerbhola wrote…

suggestion for @aadityasondhi
// Since we are missing the +Inf bucket, CumulativeCounts may never exceed rank. By omitting the highest bucket we have from the search, the failed search will land on that last bucket and we don't have to do any special checks regarding landing on a non-existent bucket.

Just added it in this PR.


pkg/util/metric/metric.go line 553 at r7 (raw file):

Previously, sumeerbhola wrote…

suggestion for @aadityasondhi
// bucketStart defaults to 0, which we assume is the lower bound of the smallest bucket.

Just added it in this PR.


pkg/util/schedulerlatency/histogram.go line 205 at r7 (raw file):

Previously, sumeerbhola wrote…

// Invariant: i+2 <= len(b)

so if i+2==len(b) we will index out of bounds in the next if-block?

I've reworked this code to be less sloppy. BTW, I'm really enjoying your airtight reviews -- I should generally be less sloppy, but with reviews like this I feel I can get away with more than my usual level of sloppiness, so I'm grateful either way.


pkg/util/schedulerlatency/histogram.go line 212 at r7 (raw file):

Previously, sumeerbhola wrote…

I had a hard time understanding this comment so I tried to rewrite it in a different way:

b[i+2] > min and is the lower-bound of the i+2 bucket. So min must be contained in bucket i+1 or earlier. Note that since we start with i=0, it is possible that we never checked that b[i+1] <= min, but if we did, min will be in bucket i+1. Let this min containing bucket be bucket j. Bucket j-1 should be set to a lower-bound of -Inf.

Which suggests that the logic is not quite right, since it will miss the case where b[0] contained the min. Though that corner case is not very interesting. But we should either fix that corner case, or explain that we are ignoring it.

Handled the corner case.


pkg/util/schedulerlatency/histogram.go line 236 at r7 (raw file):

Previously, sumeerbhola wrote…

How about a comment like the following:

bucket is the lower bound of the lowest bucket that has not been added to newBuckets. We will add it to newBuckets, but we wait to add it until we find the next bucket that is >= bucket*base.

Done.


pkg/util/schedulerlatency/histogram.go line 248 at r7 (raw file):

Previously, sumeerbhola wrote…

missing end of sentence.

Could use a comment like:
+Inf bucket is guaranteed to be added: When i is at the +Inf bucket we will fall through to here, and add the existing value of bucket and set bucket to this +Inf bucket, and then exit the for loop, and proceed to add the +Inf bucket in the append that follows the loop.

Done.


pkg/util/schedulerlatency/testdata/runtime_histogram line 2 at r7 (raw file):

Previously, sumeerbhola wrote…

... it to buckets ... ?

Done.

@craig
Contributor

craig bot commented Sep 21, 2022

Canceled.

@irfansharif
Contributor Author

bors r+

@craig
Contributor

craig bot commented Sep 21, 2022

Build succeeded.

@craig craig bot merged commit 766b62d into cockroachdb:master Sep 21, 2022
@irfansharif irfansharif deleted the 220909.admission-obs branch September 21, 2022 21:19
Collaborator

@sumeerbhola sumeerbhola left a comment


comment nits

Reviewed 4 of 4 files at r8, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/server/server.go line 1470 at r8 (raw file):

	// We can now add the node registry.
	s.recorder.AddNode( // XXX: Has to occur before

leftover partial comment


pkg/util/metric/metric.go line 600 at r8 (raw file):

	// The 0.5 was added for rounding purposes. I went back and forth on whether to
	// have it at all but thought it made sense for smaller SampleCount cases like
	// in this test:

the reference to the test is missing


pkg/util/schedulerlatency/histogram.go line 210 at r8 (raw file):

	}
	// b[j] > min and is the lower-bound of the j-th bucket. The min must be
	// contained in the (j-1)-th bucket or earlier. We want to merge 0th bucket

the "or earlier" part is no longer true since you reworked this logic. That matters: knowing the min is definitely in the (j-1)-th bucket is what makes the logic here, which trims the bucket list to start at the (j-2)-th bucket, correct.


pkg/util/schedulerlatency/histogram.go line 216 at r8 (raw file):

	} else {
		// We want trim the bucket list to start at (j-2)-th bucket, so just
		// have bucket before the one containing the min.

... have one bucket before the one ...

@irfansharif
Contributor Author

Addressed in #88458 + appended to the 22.2 backport #88403.
