
schedulerlatency: export Go scheduling latency metric #87883

Merged: 1 commit into cockroachdb:master, Sep 21, 2022

Conversation

irfansharif
Contributor

@irfansharif irfansharif commented Sep 12, 2022

And record data into CRDB's internal time-series database. Informs
#82743 and #87823. To export scheduling latencies to prometheus, we
choose an exponential bucketing scheme with a base multiple of 1.1, and
the output range bounded to [50us, 100ms). This makes for ~70 buckets.
It's worth noting that the default histogram buckets used in Go are
not fit for our purposes. If we care about improving them, we could
consider patching the runtime.

  bucket[  0] width=0s boundary=[-Inf, 0s)
  bucket[  1] width=1ns boundary=[0s, 1ns)
  bucket[  2] width=1ns boundary=[1ns, 2ns)
  bucket[  3] width=1ns boundary=[2ns, 3ns)
  bucket[  4] width=1ns boundary=[3ns, 4ns)
  ...
  bucket[270] width=16.384µs boundary=[737.28µs, 753.664µs)
  bucket[271] width=16.384µs boundary=[753.664µs, 770.048µs)
  bucket[272] width=278.528µs boundary=[770.048µs, 1.048576ms)
  bucket[273] width=32.768µs boundary=[1.048576ms, 1.081344ms)
  bucket[274] width=32.768µs boundary=[1.081344ms, 1.114112ms)
  ...
  bucket[717] width=1h13m18.046511104s boundary=[53h45m14.046488576s, 54h58m32.09299968s)
  bucket[718] width=1h13m18.046511104s boundary=[54h58m32.09299968s, 56h11m50.139510784s)
  bucket[719] width=1h13m18.046511104s boundary=[56h11m50.139510784s, 57h25m8.186021888s)
  bucket[720] width=57h25m8.186021888s boundary=[57h25m8.186021888s, +Inf)

Release note: None
Release justification: observability-only PR, low-risk high-benefit; would help understand admission control out in the wild
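For context: the bucket dump above is the Go runtime's own /sched/latencies:seconds histogram from the runtime/metrics package (available since Go 1.17). A minimal standalone sketch of reading it directly, for illustration only (this is not the code in the PR):

  package main

  import (
      "fmt"
      "runtime/metrics"
  )

  func main() {
      // Read the scheduler-latency histogram maintained by the Go runtime.
      samples := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
      metrics.Read(samples)
      h := samples[0].Value.Float64Histogram()
      // Buckets holds len(Counts)+1 boundaries (in seconds); bucket i spans
      // [Buckets[i], Buckets[i+1]); the first and last boundaries may be
      // -Inf/+Inf, matching the bucket dump above.
      for i, count := range h.Counts {
          fmt.Printf("bucket[%3d] boundary=[%v, %v) count=%d\n",
              i, h.Buckets[i], h.Buckets[i+1], count)
      }
  }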

@cockroach-teamcity
Member

This change is Reviewable

@irfansharif irfansharif marked this pull request as ready for review September 15, 2022 21:43
@irfansharif irfansharif requested reviews from a team as code owners September 15, 2022 21:43
@irfansharif irfansharif requested review from rhu713, sumeerbhola and andrewbaptist and removed request for a team and rhu713 September 15, 2022 21:43
@irfansharif
Contributor Author

This is what it looks like in prometheus and within CRDB.

[screenshot: the metric scraped by Prometheus]

[screenshot: the same metric in CRDB's internal timeseries UI]

Hopefully this gives us more data to consider when evaluating the runnable goroutines per-p controller input.

Collaborator

@sumeerbhola sumeerbhola left a comment


Reviewed 6 of 12 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @irfansharif)


pkg/util/metric/metric.go line 86 at r6 (raw file):

// WindowedHistogram represents a histogram with data over recent window of
// time.

So this is not allowed to be cumulative? Or the window could be arbitrary and overlapping? The windowed term is a bit confusing.


pkg/util/metric/metric.go line 571 at r6 (raw file):

	// Should not extrapolate past the upper bound of the largest bucket.
	if val > *buckets[len(buckets)-1].UpperBound {

can you add some invariants as code comments earlier in this function? It is not clear why we need this if-condition because I was expecting an invariant that rank <= count, and bucketEnd <= buckets[len(buckets)-1].UpperBound. If this is just defensive please note it in a code comment.


pkg/util/schedulerlatency/histogram.go line 62 at r6 (raw file):

		mult: float64(time.Second.Nanoseconds()),
	}
	h.mu.buckets = buckets

Is +Inf allowed for Prometheus histograms?


pkg/util/schedulerlatency/histogram.go line 101 at r6 (raw file):

		totalCount += count
		if count != 0 {
			// N.B. This computed sum is an underestimate.

underestimate because h.mu.buckets[i] is the lower bound for the bucket?


pkg/util/schedulerlatency/histogram.go line 161 at r6 (raw file):

// reBucketExpAndTrim takes a list of bucket boundaries (lower bound inclusive)
// and downsamples the buckets to those a multiple of base apart. The end result

I didn't quite understand the base parameter. Can you add a small example as a code comment.

@irfansharif irfansharif force-pushed the 220909.admission-obs branch 2 times, most recently from 859c67e to 17b26d3 Compare September 16, 2022 00:37
Contributor Author

@irfansharif irfansharif left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @andrewbaptist, and @sumeerbhola)


pkg/util/metric/metric.go line 86 at r6 (raw file):

Previously, sumeerbhola wrote…

So this is not allowed to be cumulative? Or the window could be arbitrary and overlapping? The windowed term is a bit confusing.

It's not allowed to be cumulative, no. The windows can also be arbitrary and overlapping. In our histogram, for example, the window is the last sample_duration-ish worth of data; in the other histogram type defined in this pkg/util/metric package, there's a rollover of this window that occurs every few ticks. This PR didn't introduce this vocabulary, though the interface is needed given the new runtimeHistogram type. That type will likely be used to capture other statistics from the Go runtime, like the distribution of individual GC-related stop-the-world pause latencies (/gc/pauses:seconds). I've added a comment explaining some of the motivation behind this windowing.

// [...]. It's used primarily to record histogram data into CRDB's internal
// time-series database, which does not know how to encode cumulative
// histograms. What it does instead is scrape off sample count, sum of values,
// and values at specific quantiles from "windowed" histograms and record that
// data directly. These windows could be arbitrary and overlapping.
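
For shape, here's a minimal sketch of the interface being discussed (method names are illustrative and may not match CRDB's exact API):

  // WindowedHistogram sketches the idea above: consumers read the sample
  // count, sum of values, and values at quantiles over a recent window of
  // recordings, not over the metric's full cumulative history.
  type WindowedHistogram interface {
      // TotalCountWindowed returns the number of samples in the current window.
      TotalCountWindowed() int64
      // TotalSumWindowed returns the sum of sample values in the current window.
      TotalSumWindowed() float64
      // ValueAtQuantileWindowed returns the approximate value at quantile q
      // (q in [0, 100], e.g. 99 for p99) over the current window.
      ValueAtQuantileWindowed(q float64) float64
  }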

pkg/util/metric/metric.go line 571 at r6 (raw file):
This is pure code movement from above, not code I introduced. Looks like @aadityasondhi lifted it off of https://github.com/prometheus/prometheus/blob/d9162189/promql/quantile.go#L75, mostly unchanged.

rank <= count

It's not just defensive code, that invariant doesn't hold given this right above:

	n := float64(*histogram.SampleCount)
	// ...
	rank := uint64(((q / 100) * n) + 0.5)

I'm not entirely sure what the 0.5 is doing here, or why it's needed (if at all), but I'll leave it to the @cockroachdb/obs-inf-prs folks to improve commentary here.


pkg/util/schedulerlatency/histogram.go line 62 at r6 (raw file):

Previously, sumeerbhola wrote…

Is +Inf allowed for Prometheus histograms?

It is allowed, see https://prometheus.io/docs/concepts/metric_types/#histogram. Just sanity checking against /_status/vars, I see the following:

# HELP go_scheduler_latency Go scheduling latency
# TYPE go_scheduler_latency histogram
go_scheduler_latency_bucket{le="53247.99999999999"} 153
go_scheduler_latency_bucket{le="59392"} 156
go_scheduler_latency_bucket{le="65535.99999999999"} 159
go_scheduler_latency_bucket{le="73727.99999999999"} 160
...
go_scheduler_latency_bucket{le="7.549747199999999e+07"} 167
go_scheduler_latency_bucket{le="8.388607999999999e+07"} 167
go_scheduler_latency_bucket{le="9.646899199999999e+07"} 167
go_scheduler_latency_bucket{le="1.0905190399999999e+08"} 167
go_scheduler_latency_bucket{le="+Inf"} 167
go_scheduler_latency_sum 8.2944e+06
go_scheduler_latency_count 167
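
(Note that Prometheus le buckets are cumulative: each counter includes all samples at or below that upper bound, which is why the +Inf bucket matches the total sample count of 167.)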

pkg/util/schedulerlatency/histogram.go line 101 at r6 (raw file):

Previously, sumeerbhola wrote…

underestimate because h.mu.buckets[i] is the lower bound for the bucket?

Yes, added a comment. This code was lifted off of here BTW: https://github.com/prometheus/client_golang/blob/5b7e8b2e6716df0ceda9df81feb15910c7efa150/prometheus/go_collector_latest.go#L544-L546.
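
To make the underestimate concrete, a small sketch of that summation, where counts and lowerBounds stand in for the histogram's per-bucket state (names illustrative):

  // windowedSum approximates the sum of all samples by attributing every
  // sample in a bucket to that bucket's (inclusive) lower bound, so the
  // result can only undershoot the true sum.
  func windowedSum(counts []uint64, lowerBounds []float64) (totalCount uint64, sum float64) {
      for i, count := range counts {
          totalCount += count
          if count != 0 {
              sum += float64(count) * lowerBounds[i]
          }
      }
      return totalCount, sum
  }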


pkg/util/schedulerlatency/histogram.go line 161 at r6 (raw file):

Previously, sumeerbhola wrote…

I didn't quite understand the base parameter. Can you add a small example as a code comment.

This code was lifted off of here BTW: https://github.com/prometheus/client_golang/blob/5b7e8b2e6716df0ceda9df81feb15910c7efa150/prometheus/internal/go_runtime_metrics.go#L110. There's some datadriven test code in this package with interactive examples; I copied a snippet over into this comment.
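
Roughly, base controls how aggressively adjacent buckets get merged: a boundary is retained only once it's at least base times the previously retained one. A simplified sketch of the idea (the real code additionally handles the -Inf/+Inf buckets and the min/max trimming):

  // reBucketExp downsamples sorted bucket boundaries so retained boundaries
  // are at least a factor of base apart; e.g. with base = 2, boundaries
  // [1 2 3 4 5 6 7 8] collapse to [1 2 4 8].
  func reBucketExp(buckets []float64, base float64) []float64 {
      if len(buckets) == 0 {
          return nil
      }
      var out []float64
      cur := buckets[0] // lowest boundary not yet appended
      for _, b := range buckets[1:] {
          if b >= cur*base {
              out = append(out, cur)
              cur = b
          }
      }
      return append(out, cur) // always keep the last boundary
  }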

Collaborator

@aadityasondhi aadityasondhi left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @sumeerbhola)


pkg/util/metric/metric.go line 571 at r6 (raw file):

Previously, irfansharif (irfan sharif) wrote…

This is pure code movement from above, not code I introduced. Looks like @aadityasondhi lifted it off of https://github.com/prometheus/prometheus/blob/d9162189/promql/quantile.go#L75, mostly unchanged.

rank <= count

It's not just defensive code, that invariant doesn't hold given this right above:

	n := float64(*histogram.SampleCount)
	// ...
	rank := uint64(((q / 100) * n) + 0.5)

I'm not entirely sure what the 0.5 is doing here, or why it's needed (if at all), but I'll leave it to the @cockroachdb/obs-inf-prs folks to improve commentary here.

@irfansharif is correct about the invariant not holding. This is because SampleCount includes the implicit +Inf bucket, while buckets[len(buckets)-1].UpperBound refers to the largest bucket we define; the client library doesn't give us access to the +Inf bucket that Prometheus uses under the hood. If the quantile is high enough, val surpasses the upper bound of the largest bucket, and using the interpolated value in that case seemed wrong since we would be extrapolating.

Additionally, if for a specific metric we see our q99 values land at the bucket boundary, that's a strong indicator we should adjust the buckets for a more accurate histogram. FWIW the Prometheus client library does the same when the resulting value is in the +Inf bucket and returns the upper bound of the second-to-last bucket: https://github.com/prometheus/prometheus/blob/d9162189/promql/quantile.go#L103. It's cleaner/easier for them since they have access to the +Inf bucket internally.

The 0.5 was added for rounding purposes. I went back and forth on whether to have it at all but thought it made sense for smaller SampleCount cases like in this test:

require.Equal(t, 100.0, h.ValueAtQuantileWindowed(99))

Open to suggestions if you think we should not be rounding and should just go with the truncated value when converted to an int. If we decide to keep this, I can add a line there to explain why we do this.
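
A quick worked example of the rounding with hypothetical numbers:

  package main

  import "fmt"

  func main() {
      // With n = 10 samples and q = 99, (q/100)*n = 9.9. Plain uint64
      // conversion truncates to rank 9; adding 0.5 first rounds to 10.
      n := 10.0 // stand-in for float64(*histogram.SampleCount)
      q := 99.0
      rankTruncated := uint64((q / 100) * n)
      rankRounded := uint64((q/100)*n + 0.5)
      fmt.Println(rankTruncated, rankRounded) // 9 10
  }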

Collaborator

@sumeerbhola sumeerbhola left a comment


:lgtm:

Reviewed 6 of 50 files at r4, 5 of 28 files at r5, 4 of 12 files at r6, 44 of 44 files at r7, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @andrewbaptist, @irfansharif, and @sumeerbhola)


pkg/util/metric/metric.go line 571 at r6 (raw file):

Previously, aadityasondhi (Aaditya Sondhi) wrote…

@irfansharif is correct about the invariant not holding. This is because SampleCount includes the implicit +Inf bucket, while buckets[len(buckets)-1].UpperBound refers to the largest bucket we define; the client library doesn't give us access to the +Inf bucket that Prometheus uses under the hood. If the quantile is high enough, val surpasses the upper bound of the largest bucket, and using the interpolated value in that case seemed wrong since we would be extrapolating. Additionally, if for a specific metric we see our q99 values land at the bucket boundary, that's a strong indicator we should adjust the buckets for a more accurate histogram. FWIW the Prometheus client library does the same when the resulting value is in the +Inf bucket and returns the upper bound of the second-to-last bucket: https://github.com/prometheus/prometheus/blob/d9162189/promql/quantile.go#L103. It's cleaner/easier for them since they have access to the +Inf bucket internally.

The 0.5 was added for rounding purposes. I went back and forth on whether to have it at all but thought it made sense for smaller SampleCount cases like in this test:

require.Equal(t, 100.0, h.ValueAtQuantileWindowed(99))

Open to suggestions if you think we should not be rounding and should just go with the truncated value when converted to an int. If we decide to keep this, I can add a line there to explain why we do this.

@aadityasondhi it would be great to have a long code comment stating what you wrote above.


pkg/util/metric/metric.go line 550 at r7 (raw file):

	}

	rank := uint64(((q / 100) * n) + 0.5)

suggestion for @aadityasondhi
// Since we are missing the +Inf bucket, CumulativeCounts may never exceed rank. By omitting the highest bucket we have from the search, the failed search will land on that last bucket and we don't have to do any special checks regarding landing on a non-existent bucket.


pkg/util/metric/metric.go line 553 at r7 (raw file):

	b := sort.Search(len(buckets)-1, func(i int) bool { return *buckets[i].CumulativeCount >= rank })

	var (

suggestion for @aadityasondhi
// bucketStart defaults to 0, which we assume is the lower bound of the smallest bucket.


pkg/util/schedulerlatency/histogram.go line 205 at r7 (raw file):

		if i+2 > len(b) {
			break
		}

// Invariant: i+2 <= len(b)

so if i+2==len(b) we will index out of bounds in the next if-block?


pkg/util/schedulerlatency/histogram.go line 212 at r7 (raw file):

		// We're looking at the two before after the first one that's crossed the
		// min limit. Since we expect recordings at the min value, we don't want
		// to have that bucket start at -Inf, so we merge the buckets before.

I had a hard time understanding this comment so I tried to rewrite it in a different way:

b[i+2] > min and is the lower-bound of the i+2 bucket. So min must be contained in bucket i+1 or earlier. Note that since we start with i=0, it is possible that we never checked that b[i+1] <= min, but if we did, min will be in bucket i+1. Let this min containing bucket be bucket j. Bucket j-1 should be set to a lower-bound of -Inf.

Which suggests that the logic is not quite right, since it will miss the case where b[0] contained the min. Though that corner case is not very interesting. But we should either fix that corner case, or explain that we are ignoring it.


pkg/util/schedulerlatency/histogram.go line 236 at r7 (raw file):

	// From now on, bucket should always have a non-Inf value because
	// Infs are only ever at the ends of the bucket lists, so
	// arithmetic operations on it are non-NaN.

How about a comment like the following:

bucket is the lower bound of the lowest bucket that has not been added to newBuckets. We will add it to newBuckets, but we wait to add it until we find the next bucket that is >= bucket*base.


pkg/util/schedulerlatency/histogram.go line 248 at r7 (raw file):

		}
		// The +Inf bucket will always be the last one, and we'll always
		// end up including it here because bucket

missing end of sentence.

Could use a comment like:
+Inf bucket is guaranteed to be added: When i is at the +Inf bucket we will fall through to here, and add the existing value of bucket and set bucket to this +Inf bucket, and then exit the for loop, and proceed to add the +Inf bucket in the append that follows the loop.


pkg/util/schedulerlatency/testdata/runtime_histogram line 2 at r7 (raw file):

# Explore how the histogram type that wraps around runtime/metrics behaves.
# Initialize it buckets that are:

... it to buckets ... ?

Contributor Author

@irfansharif irfansharif left a comment


TFTR!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @andrewbaptist, and @sumeerbhola)


pkg/util/metric/metric.go line 571 at r6 (raw file):

Previously, sumeerbhola wrote…

@aadityasondhi it would be great to have a long code comment stating what you wrote above.

Just added a version of the above in this PR.


pkg/util/metric/metric.go line 550 at r7 (raw file):

Previously, sumeerbhola wrote…

suggestion for @aadityasondhi
// Since we are missing the +Inf bucket, CumulativeCounts may never exceed rank. By omitting the highest bucket we have from the search, the failed search will land on that last bucket and we don't have to do any special checks regarding landing on a non-existent bucket.

Just added it in this PR.


pkg/util/metric/metric.go line 553 at r7 (raw file):

Previously, sumeerbhola wrote…

suggestion for @aadityasondhi
// bucketStart defaults to 0, which we assume is the lower bound of the smallest bucket.

Just added it in this PR.


pkg/util/schedulerlatency/histogram.go line 205 at r7 (raw file):

Previously, sumeerbhola wrote…

// Invariant: i+2 <= len(b)

so if i+2==len(b) we will index out of bounds in the next if-block?

I've reworked this code to be less sloppy. BTW, I'm really enjoying your airtight reviews -- I should generally be less sloppy, but with reviews like this I feel I can get away with more than my usual level of sloppiness, so I'm grateful either way.


pkg/util/schedulerlatency/histogram.go line 212 at r7 (raw file):

Previously, sumeerbhola wrote…

I had a hard time understanding this comment so I tried to rewrite it in a different way:

b[i+2] > min and is the lower-bound of the i+2 bucket. So min must be contained in bucket i+1 or earlier. Note that since we start with i=0, it is possible that we never checked that b[i+1] <= min, but if we did, min will be in bucket i+1. Let this min containing bucket be bucket j. Bucket j-1 should be set to a lower-bound of -Inf.

Which suggests that the logic is not quite right, since it will miss the case where b[0] contained the min. Though that corner case is not very interesting. But we should either fix that corner case, or explain that we are ignoring it.

Handled the corner case.


pkg/util/schedulerlatency/histogram.go line 236 at r7 (raw file):

Previously, sumeerbhola wrote…

How about a comment like the following:

bucket is the lower bound of the lowest bucket that has not been added to newBuckets. We will add it to newBuckets, but we wait to add it until we find the next bucket that is >= bucket*base.

Done.


pkg/util/schedulerlatency/histogram.go line 248 at r7 (raw file):

Previously, sumeerbhola wrote…

missing end of sentence.

Could use a comment like:
+Inf bucket is guaranteed to be added: When i is at the +Inf bucket we will fall through to here, and add the existing value of bucket and set bucket to this +Inf bucket, and then exit the for loop, and proceed to add the +Inf bucket in the append that follows the loop.

Done.


pkg/util/schedulerlatency/testdata/runtime_histogram line 2 at r7 (raw file):

Previously, sumeerbhola wrote…

... it to buckets ... ?

Done.

@craig
Contributor

craig bot commented Sep 21, 2022

Canceled.

@irfansharif
Contributor Author

bors r+

@craig
Contributor

craig bot commented Sep 21, 2022

Build succeeded.

@craig craig bot merged commit 766b62d into cockroachdb:master Sep 21, 2022
@irfansharif irfansharif deleted the 220909.admission-obs branch September 21, 2022 21:19
Collaborator

@sumeerbhola sumeerbhola left a comment


comment nits

Reviewed 4 of 4 files at r8, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/server/server.go line 1470 at r8 (raw file):

	// We can now add the node registry.
	s.recorder.AddNode( // XXX: Has to occur before

leftover partial comment


pkg/util/metric/metric.go line 600 at r8 (raw file):

	// The 0.5 was added for rounding purposes. I went back and forth on whether to
	// have it at all but thought it made sense for smaller SampleCount cases like
	// in this test:

the reference to the test is missing


pkg/util/schedulerlatency/histogram.go line 210 at r8 (raw file):

	}
	// b[j] > min and is the lower-bound of the j-th bucket. The min must be
	// contained in the (j-1)-th bucket or earlier. We want to merge 0th bucket

the "or earlier" part is no longer true since you reworked this logic. That matters: knowing the min is definitely in the (j-1)-th bucket is what makes the logic here, which trims the bucket list to start at the (j-2)-th bucket, correct.


pkg/util/schedulerlatency/histogram.go line 216 at r8 (raw file):

	} else {
		// We want trim the bucket list to start at (j-2)-th bucket, so just
		// have bucket before the one containing the min.

... have one bucket before the one ...

@irfansharif
Contributor Author

Addressed in #88458 + appended to the 22.2 backport #88403.
