jobs: add metric for number of paused jobs #85467

amruss · 2022-08-02T15:29:16Z

See for context: https://docs.google.com/document/d/1XZzS6UlfQUVJQPKSttKsXSxwV6bAmzS4D5jDjPvJWjY/edit

We don't have any metrics on the number of paused jobs in the system. This can be an important monitoring primitive for customers, especially in the context of changefeeds.

Jira issue: CRDB-18256

amruss · 2022-10-05T16:11:27Z

Maybe interrupted jobs? That would include paused and intermediate job states (everything except cancelled, failed, running, and completed).

jayshrivastava · 2022-10-06T14:59:03Z

~~Is there something specific we want to add which is not covered in this section of the UI? For example, do we want a metric for paused changefeed jobs only?~~

Yes, we want something for paused changefeed jobs according to the extra context google doc.

jayshrivastava · 2022-10-13T20:40:49Z

Closing for now. We can use the metric jobs.changefeed.currently_idle

cockroach/pkg/ccl/changefeedccl/changefeed_test.go

Line 286 in faa2203

func TestChangefeedIdleness(t *testing.T) {

jayshrivastava · 2022-10-14T18:38:16Z

As it turns out, paused implies idle, but idle does not imply paused, as per Yevgeniy's comments in #89752.

Paused changefeed jobs will now show up as a counter in the debug UI. This counter will also be added to telemetry.

Resolves: cockroachdb#85467

Release note: None

amruss · 2022-10-19T16:00:16Z

Note: may not be backportable

This change adds new metrics to count paused jobs for every job type. For example, the metric for paused changefeed jobs is `jobs.changefeed.currently_paused`. These metrics are counted at an interval defined by the cluster setting `jobs.metrics.interval.poll`.

This is implemented by a job which periodically queries `crdb_internal.jobs` to count the number of paused jobs. This job is of the newly added type `jobspb.TypePollJobsStats`. When a node starts its job registry, it uses a transaction to create an adoptable stats polling job if one does not already exist.

This change adds a test which pauses and resumes changefeeds while asserting the value of the `jobs.changefeed.currently_paused` metric. It also adds a logictest to ensure one instance of the stats polling job is created in a cluster. Finally, this change updates existing tests to handle the fact that there is a new job always running in the background (since a lot of tests assert the state of the jobs table, having a new job can change test results).

Informs: cockroachdb#90453

This change adds a virtual index to the `crdb_internal.jobs` table so that querying for paused jobs requires less work.

Resolves: cockroachdb#85467

Release note (general change): This change adds new metrics to count paused jobs for every job type. For example, the metric for paused changefeed jobs is `jobs.changefeed.currently_paused`. These metrics are incremented at an interval defined by the cluster setting `jobs.metrics.interval.poll`.
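
The create-if-missing step for the stats polling job can be illustrated with a small sketch. This is not the CockroachDB implementation: `jobStore`, `ensureJob`, and the `"poll_jobs_stats"` type string are hypothetical names, and a mutex stands in for the transaction that makes the existence check and the insert atomic.

```go
package main

import (
	"fmt"
	"sync"
)

// jobStore is a hypothetical in-memory stand-in for the jobs table.
type jobStore struct {
	mu   sync.Mutex
	jobs map[string]int // job type -> job ID
	next int
}

func newJobStore() *jobStore {
	return &jobStore{jobs: make(map[string]int), next: 1}
}

// ensureJob creates a job of the given type only if none exists yet.
// Holding the mutex plays the role of the transaction (an assumption of
// this sketch): the check and the insert happen atomically.
func (s *jobStore) ensureJob(jobType string) (int, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if existing, ok := s.jobs[jobType]; ok {
		return existing, false
	}
	id := s.next
	s.next++
	s.jobs[jobType] = id
	return id, true
}

func main() {
	store := newJobStore()
	var wg sync.WaitGroup
	// Three "nodes" start their registries concurrently.
	for node := 0; node < 3; node++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			store.ensureJob("poll_jobs_stats")
		}()
	}
	wg.Wait()
	fmt.Println("polling jobs in store:", len(store.jobs))
}
```

However many nodes race at startup, only the first call creates the job; the others observe the existing one, which is the property the logictest mentioned above checks for.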

Previously, performing a query on the `crdb_internal.jobs` table would require the entire virtual table to be generated, even if a filter was being used. This change makes it so that only relevant rows are generated when filtering on `status`, which reduces the number of rows that need to be processed.

Informs: cockroachdb#85467

Release note: None

90453: sql: add virtual index on status to crdb_internal.jobs table r=jayshrivastava a=jayshrivastava

Previously, performing a query on the `crdb_internal.jobs` table would require the entire virtual table to be generated, even if a filter was being used. This change makes it so that only relevant rows are generated when filtering on `status`, which reduces the number of rows that need to be processed.

Informs: #85467

Release note: None

Epic: None

Co-authored-by: Jayant Shrivastava
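
To make the benefit of filtering on `status` concrete, here is a rough sketch that contrasts generating every row and filtering afterwards with generating only the rows for the requested status. The types and functions (`job`, `fullScan`, `buildIndex`, `indexedScan`) are hypothetical and only illustrate the idea; the actual virtual index in `crdb_internal.jobs` is implemented differently.

```go
package main

import "fmt"

// job is a hypothetical, minimal row of a jobs table.
type job struct {
	ID     int
	Status string
}

// fullScan generates every row and filters afterwards. It returns the
// matching rows and the number of rows that had to be generated.
func fullScan(all []job, status string) ([]job, int) {
	var out []job
	generated := 0
	for _, j := range all {
		generated++
		if j.Status == status {
			out = append(out, j)
		}
	}
	return out, generated
}

// buildIndex groups rows by status, standing in for an index on status.
func buildIndex(all []job) map[string][]job {
	idx := make(map[string][]job)
	for _, j := range all {
		idx[j.Status] = append(idx[j.Status], j)
	}
	return idx
}

// indexedScan generates only the rows with the requested status.
func indexedScan(idx map[string][]job, status string) ([]job, int) {
	rows := idx[status]
	return rows, len(rows)
}

func main() {
	all := []job{{1, "running"}, {2, "paused"}, {3, "paused"}, {4, "failed"}}
	_, scanned := fullScan(all, "paused")
	_, viaIndex := indexedScan(buildIndex(all), "paused")
	fmt.Println("rows generated without index:", scanned)
	fmt.Println("rows generated with index:", viaIndex)
}
```

Both paths return the same paused jobs; the difference is only in how many rows are produced along the way.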

This change adds new metrics to count paused jobs for every job type. For example, the metric for paused changefeed jobs is `jobs.changefeed.currently_paused`. These metrics are counted at an interval defined by the cluster setting `jobs.metrics.interval.poll`.

This is implemented by a job which periodically queries `system.jobs` to count the number of paused jobs. This job is of the newly added type `jobspb.TypePollJobsStats`. When a node starts its job registry, it uses a transaction to create an adoptable stats polling job if one does not already exist.

This change adds a test which pauses and resumes changefeeds while asserting the value of the `jobs.changefeed.currently_paused` metric. It also adds a logictest to ensure one instance of the stats polling job is created in a cluster.

Resolves: cockroachdb#85467

Release note (general change): This change adds new metrics to count paused jobs for every job type. For example, the metric for paused changefeed jobs is `jobs.changefeed.currently_paused`. These metrics are updated at an interval defined by the cluster setting `jobs.metrics.interval.poll`, which defaults to 10 seconds.

Epic: None
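
A single polling pass of the kind described above might look roughly like the following sketch. It is illustrative only: the `job` type, `countPaused`, and `pausedMetricName` are hypothetical, the `backup` job type is just an example, and the assumption that every job type follows the `jobs.<type>.currently_paused` naming pattern is extrapolated from the one changefeed metric named in the commit message.

```go
package main

import (
	"fmt"
	"sort"
)

// job is a hypothetical, minimal view of a row in the jobs table.
type job struct {
	Type   string
	Status string
}

// pausedMetricName assumes the jobs.<type>.currently_paused pattern.
func pausedMetricName(jobType string) string {
	return "jobs." + jobType + ".currently_paused"
}

// countPaused performs one polling pass. Every known job type starts at
// zero so that a metric drops back down once its jobs are resumed.
func countPaused(jobs []job, jobTypes []string) map[string]int64 {
	gauges := make(map[string]int64, len(jobTypes))
	for _, t := range jobTypes {
		gauges[pausedMetricName(t)] = 0
	}
	for _, j := range jobs {
		if j.Status == "paused" {
			gauges[pausedMetricName(j.Type)]++
		}
	}
	return gauges
}

func main() {
	jobs := []job{
		{"changefeed", "paused"},
		{"changefeed", "running"},
		{"changefeed", "paused"},
		{"backup", "running"},
	}
	gauges := countPaused(jobs, []string{"changefeed", "backup"})

	// Print in a stable order.
	names := make([]string, 0, len(gauges))
	for name := range gauges {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Println(name, gauges[name])
	}
}
```

In a real poller this pass would run on a timer driven by `jobs.metrics.interval.poll`, so the reported values can lag the true state by up to one interval.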

89752: jobs/cdc: add metrics for paused jobs r=miretskiy a=jayshrivastava

This change adds new metrics to count paused jobs for every job type. For example, the metric for paused changefeed jobs is `jobs.changefeed.currently_paused`. These metrics are counted at an interval defined by the cluster setting `jobs.metrics.interval.poll`.

This is implemented by a job which periodically queries `system.jobs` to count the number of paused jobs. This job is of the newly added type `jobspb.TypePollJobsStats`. When a node starts its job registry, it uses a transaction to create an adoptable stats polling job if one does not already exist.

This change adds a test which pauses and resumes changefeeds while asserting the value of the `jobs.changefeed.currently_paused` metric. It also adds a logictest to ensure one instance of the stats polling job is created in a cluster.

Resolves: #85467

Release note (general change): This change adds new metrics to count paused jobs for every job type. For example, the metric for paused changefeed jobs is `jobs.changefeed.currently_paused`. These metrics are updated at an interval defined by the cluster setting `jobs.metrics.interval.poll`, which defaults to 10 seconds.

Epic: None

94633: kvserver: document reproposals r=nvanbenschoten a=tbg

Reproposals are a deep rabbit hole and an area in which past changes were all related to subtle bugs. Write it all up and in particular make some simplifications that ought to be possible if my understanding is correct:

- have proposals always enter `(*Replica).propose` without a MaxLeaseIndex or prior encoded command set, i.e. `propose` behaves the same for reproposals as for first proposals.
- assert that after a failed call to tryReproposeWithNewLeaseIndex, the command is not in the proposals map, i.e. check absence of a leak.
- replace code that should be impossible to reach (and had me confused for a good amount of time) with an assertion.
- add long comment on `r.mu.proposals`.

This commit also moves `tryReproposeWithNewLeaseIndex` off `(*Replica)`, which is possible due to recent changes[^1]. In doing so, I realized there was a (small) data race (now fixed): when returning a `NotLeaseholderError` from that method, we weren't acquiring `r.mu`. It may have looked as though we were holding it already since we're accessing `r.mu.propBuf`, however that field has special semantics - it wraps `r.mu` and acquires it when needed.

[^1]: The "below raft" test mentioned in the previous comment was changed in #93785 and no longer causes a false positive.

Epic: CRDB-220

Release note: None

96650: kvserver: extract kvstorage.DestroyReplica r=pavelkalinnikov a=tbg

This series of commits extracts `(*Replica).preDestroyRaftMuLocked` into a free-standing method `kvstorage.DestroyReplica`. In doing so, it separates the in-memory and on-disk steps that are a part of replica removal, and makes the on-disk steps unit testable.

Touches #93241.

Epic: CRDB-220

Release note: None

96659: sql: wrap stacktraceless errors with errors.Wrap r=andreimatei a=ecwall

Fixes #95794

This replaces the previous attempt to add logging here #95797.

The context itself cannot be augmented to add a stack trace to errors because it interferes with grpc timeout logic - gRPC compares errors directly without checking causes https://github.com/grpc/grpc-go/blob/v1.46.0/rpc_util.go#L833.

Although the method signature allows it, `Context.Err()` should not be overridden to customize the error:

```
// If Done is not yet closed, Err returns nil.
// If Done is closed, Err returns a non-nil error explaining why:
// Canceled if the context was canceled
// or DeadlineExceeded if the context's deadline passed.
// After Err returns a non-nil error, successive calls to Err return the same error.
Err() error
```

Additionally, a child context of the augmented context may end up being used which will circumvent the stack trace capture.

This change instead wraps errors with `errors.Wrap` in a few places that might end up helping debug the original problem:

1) Where we call `Context.Err()` directly.
2) Where gRPC returns an error after possibly calling `Context.Err()` internally or returns an error that does not have a stack trace.

Release note: None

96770: storage: don't modify the given cfg.Opts r=RaduBerinde a=RaduBerinde

This change improves the `NewPebble` code to not modify the given `cfg.Opts`. Such behavior is surprising and can trip up tests that reuse the same config.

Also, `ResolveEncryptedEnvOptions` and `wrapFilesystemMiddleware` no longer modify the `Options` directly; and `CheckNoRegistryFile` is now a standalone function.

Release note: None

Epic: none

96793: kvserver: de-flake TestReplicaProbeRequest r=pavelkalinnikov a=tbg

Chanced upon this failure mode in unrelated PR #96781.

Epic: none

Release note: None

Co-authored-by: Jayant Shrivastava
Co-authored-by: Tobias Grieger
Co-authored-by: Evan Wall
Co-authored-by: Radu Berinde

amruss added the C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), A-jobs, and T-jobs labels Aug 2, 2022

amruss assigned jayshrivastava Oct 5, 2022

jayshrivastava closed this as completed Oct 13, 2022

jayshrivastava mentioned this issue Oct 13, 2022

jobs/cdc: add metrics for paused jobs #89752

Merged

jayshrivastava reopened this Oct 14, 2022

jayshrivastava mentioned this issue Oct 21, 2022

sql: add virtual index on status to crdb_internal.jobs table #90453

Merged

amruss mentioned this issue Dec 2, 2022

changefeedccl: add metric for paused changefeeds #86789

Closed

craig bot closed this as completed in 690da3e Feb 8, 2023
