jobs: Jobs Observability tracking issue #64992
Created this issue as an attempt to scope out the work stemming from the comments on the document. I would guess that some of these tasks will require their own design documents once they are better defined.
Summarizing a discussion the BulkIO team had regarding our next steps: the problem can be broken down into three categories, listed below in order of priority for the team.
The first category is debugging a job that is currently executing. Solving this problem is the team's first priority, and tracing can help us here. At a point in time, a user should be able to fetch all inflight spans for a job that is currently executing. This involves fetching inflight spans from the node on which the job was adopted, the nodes on which the job planned its processors, and the nodes serving the KV requests issued by those processors. Doing so will allow engineers to identify what each node is doing at a point in time, and understand whether the job is slow, hanging on a request, or happily executing.
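To make this fan-out concrete, here is a minimal, purely illustrative Go sketch (not CockroachDB code): ask each relevant node (the adopting node, the processor nodes, and the KV nodes) for the in-flight spans it holds for the job's trace ID, then order them by start time. The `inflightSpan` type and the `fetchNodeSpans` helper are assumptions made for the example.

```go
package main

import (
    "fmt"
    "sort"
    "time"
)

type inflightSpan struct {
    NodeID    int
    Operation string
    StartTime time.Time
}

// fetchNodeSpans stands in for the per-node RPC that would return the
// in-flight spans a node holds for the given trace ID.
func fetchNodeSpans(nodeID int, traceID uint64) []inflightSpan {
    // Canned data for illustration; a real implementation would dial the node.
    return []inflightSpan{{NodeID: nodeID, Operation: "backup.processor", StartTime: time.Now()}}
}

// collectJobSpans fans out to every node involved in the job and orders the
// results by start time so a reader can see what each node was doing when.
func collectJobSpans(traceID uint64, nodeIDs []int) []inflightSpan {
    var all []inflightSpan
    for _, n := range nodeIDs {
        all = append(all, fetchNodeSpans(n, traceID)...)
    }
    sort.Slice(all, func(i, j int) bool { return all[i].StartTime.Before(all[j].StartTime) })
    return all
}

func main() {
    for _, s := range collectJobSpans(42, []int{1, 2, 3}) {
        fmt.Printf("n%d %s started at %s\n", s.NodeID, s.Operation, s.StartTime.Format(time.RFC3339Nano))
    }
}
```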
The second category is exposing job-specific progress information to the end-user. This is where we might consider introducing a system table for persisting information that has been aggregated during job execution. We do not believe this feature will rely on tracing directly, because trace events are not meaningful to expose to the end-user. BulkIO will likely run custom aggregation and parsing of the information it receives during job execution, and identify what is notable to expose to the end-user.
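As a rough illustration of the kind of custom aggregation mentioned above, the sketch below folds hypothetical structured progress events into a user-facing summary. The `progressEvent` and `jobSummary` types are invented for the example and are not CockroachDB types.

```go
package main

import "fmt"

// progressEvent is an assumed shape for a structured event emitted by a
// processor during job execution.
type progressEvent struct {
    NodeID        int
    FilesExported int
    BytesExported int64
}

// jobSummary is the user-meaningful rollup a job might persist or display.
type jobSummary struct {
    Files int
    Bytes int64
}

func aggregate(events []progressEvent) jobSummary {
    var s jobSummary
    for _, e := range events {
        s.Files += e.FilesExported
        s.Bytes += e.BytesExported
    }
    return s
}

func main() {
    events := []progressEvent{{1, 10, 1 << 20}, {2, 4, 512 << 10}}
    fmt.Printf("%+v\n", aggregate(events))
}
```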
The third category is historical debugging of past job execution. This would be really nice to have, as it would persist historical information about job execution, but the team does not think that a system table is the answer. We have seen the system.jobs table grow to hundreds of gigabytes, and considering the magnitude of trace events that a job could emit, we might need to think about persisting this information elsewhere (logs?). BulkIO is unsure whether this is in scope for the team's milestones, or whether it would be a valuable feature for CRDB infrastructure in general.
64977: colexec: implement vectorized ntile window function r=DrewKimball a=DrewKimball

This patch implements the `ntile` window function in the vectorized engine. `ntile` takes in an integer argument `num_buckets` and then distributes all rows in a partition equally between the buckets, outputting the bucket number for each row (a standalone sketch of this bucket assignment follows this comment). In the vectorized implementation, batches are buffered until the end of a partition is reached, at which point the `ntile` bucket values can be calculated. The batches are emitted in a streaming fashion; as soon as a batch is fully processed, it is returned and work is paused until the next call to `Next()`. See #37035

Release note (sql change): the vectorized engine now supports the ntile window function.

65322: jobs: surface the trace_id linked to job execution r=adityamaru a=adityamaru

Please see individual commit messages.

Informs: #64992

65415: authors: add annezhu98 to authors r=annezhu98 a=annezhu98

Release note: None

65418: authors: add IsaDash to authors r=IsaDash a=IsaDash

Release note: None

Co-authored-by: Drew Kimball <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Anne Zhu <[email protected]>
Co-authored-by: Isa Dash <[email protected]>
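A standalone sketch of the bucket assignment that PR 64977 describes: rows in a partition are split as evenly as possible across `num_buckets` buckets, with the first (rows % num_buckets) buckets receiving one extra row. This mirrors the SQL `ntile` semantics only; it is not the vectorized operator itself.

```go
package main

import "fmt"

// ntile returns the bucket number (1-based) for each of numRows rows in a
// partition, distributing rows as evenly as possible across numBuckets.
func ntile(numRows, numBuckets int) []int {
    out := make([]int, numRows)
    base := numRows / numBuckets
    extra := numRows % numBuckets
    row := 0
    for b := 1; b <= numBuckets && row < numRows; b++ {
        size := base
        if b <= extra {
            size++ // the first `extra` buckets get one additional row
        }
        for i := 0; i < size; i++ {
            out[row] = b
            row++
        }
    }
    return out
}

func main() {
    fmt.Println(ntile(7, 3)) // [1 1 1 2 2 3 3]
}
```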
65324: cli: add cockroach debug job-trace r=pbardea,knz a=adityamaru

This change adds a new cli command - `cockroach debug job-trace` - that can be used to grab information from inflight trace spans associated with the execution of a particular job. The command takes a `jobID` and file destination where the information will be dumped. It also requires a `--url` of the node from which to pull the inflight spans. In the future we will have a means of pulling inflight spans from the cluster, but at the moment this command must be run against every node to get a complete picture of job execution.

Informs: #64992

Release note (cli change): Adds a `cockroach debug job-trace` command that takes 2 arguments: <jobID> and file destination, along with a `--url` pointing to the node on which to execute this command. The command pulls information about inflight trace spans associated with the job and dumps it to the file destination.

Co-authored-by: Aditya Maru <[email protected]>
66089: cli: order span payloads by start_time for readability r=knz a=adityamaru

During job execution, there can be several root spans on a node (e.g. the resumer and a processor). While the recordings within a span are sorted by start time, the root spans are stored in an unordered map, so the recordings across root spans might not be sorted by start_time. When viewing the output files for a job, this results in the processor span printing its payload before the job payload, which is not intuitive. We might change how we display recordings in the future, but for the time being this fix makes the "ordering" of events deterministic.

Informs: #64992

Release note: None

Co-authored-by: Aditya Maru <[email protected]>
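A minimal sketch of the deterministic ordering PR 66089 describes: recordings collected from an unordered registry are sorted by start time before being printed. The `recording` type here is an assumption for the example, not the tracing package's actual type.

```go
package main

import (
    "fmt"
    "sort"
    "time"
)

type recording struct {
    Operation string
    StartTime time.Time
}

func main() {
    now := time.Now()
    // Recordings arrive in whatever order the registry's map yields them.
    recs := []recording{
        {"export.processor", now.Add(2 * time.Second)},
        {"backup.resumer", now},
    }
    // Sorting by start time makes the printed output deterministic and intuitive.
    sort.Slice(recs, func(i, j int) bool { return recs[i].StartTime.Before(recs[j].StartTime) })
    for _, r := range recs {
        fmt.Println(r.StartTime.Format(time.RFC3339Nano), r.Operation)
    }
}
```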
65559: tracing,tracingservice: adds a trace service to pull clusterwide trace spans r=irfansharif,abarganier a=adityamaru

Previously, every node in the cluster had a local inflight span registry that was aware of all the spans rooted on that particular node. Child spans of a given traceID executing on a remote node would only become visible to the local registry once execution completed and the span pushed its recordings over gRPC to the "client" node.

This change introduces a `tracingservice` package. Package tracingservice contains a gRPC service to be used for remote inflight span access. It is used for pulling inflight spans from all CockroachDB nodes. Each node will run a trace service, which serves the inflight spans from the local span registry on that node. Each node will also have a trace client dialer, which uses the nodedialer to connect to another node's trace service and access its inflight spans. The trace client dialer is backed by a remote trace client or a local trace client, which serves as the point of entry to this service. Both clients support the `TraceClient` interface, which includes the following functionality:
- GetSpanRecordings

The spans for a traceID are sorted by `StartTime` before they are returned. The per-node trace dialer has yet to be hooked up to an appropriate location, depending on where we intend to use it. (A hedged sketch of such a client interface appears after this comment.)

Resolves: #60999
Informs: #64992
Release note: None

66149: cloud: fix gcs to resuming reader r=dt a=adityamaru

This change does a few things:
1. gcs_storage was not returning a resuming reader, as a result of which the Read method of the resuming reader, which contains logic to retry on certain kinds of errors, was not being invoked.
2. Changes the resuming reader to take a storage-specific function that can define which errors are retryable. All storage providers use the same deciding function at the moment, so behavior is unchanged.

Release note: None

66152: storage: Disable read sampling and read compactions r=sumeerbhola a=itsbilal

Read-triggered compactions are already disabled on 21.1. As the fixes to address known shortcomings with read-triggered compactions are a bit involved (see cockroachdb/pebble#1143), disable the feature on master until that issue is fixed. That prevents this known issue from getting in the way of performance experiments.

Release note: None.

66155: sql: drop "cluster" from EXPLAIN ANALYZE to improve readability r=maryliag a=maryliag

Remove the word "cluster" from "cluster nodes" and "cluster regions" on EXPLAIN ANALYZE to improve readability.

Release note: None

66157: sql: add time & contention time to EXPLAIN ANALYZE.
r=matthewtodd a=matthewtodd

The new fields are labeled `KV time` and `KV contention time`:

```
> EXPLAIN ANALYZE
-> UPDATE users SET name = 'Bob Loblaw'
-> WHERE id = '32a962b7-8440-4b81-97cd-a7d7757d6eac';
                                             info
--------------------------------------------------------------------------------------------
  planning time: 353µs
  execution time: 3ms
  distribution: local
  vectorized: true
  rows read from KV: 52 (5.8 KiB)
  cumulative time spent in KV: 2ms
  maximum memory usage: 60 KiB
  network usage: 0 B (0 messages)
  cluster regions: us-east1

  • update
  │ cluster nodes: n1
  │ cluster regions: us-east1
  │ actual row count: 1
  │ table: users
  │ set: name
  │ auto commit
  │
  └── • render
      │ cluster nodes: n1
      │ cluster regions: us-east1
      │ actual row count: 1
      │ estimated row count: 0
      │
      └── • filter
          │ cluster nodes: n1
          │ cluster regions: us-east1
          │ actual row count: 1
          │ estimated row count: 0
          │ filter: id = '32a962b7-8440-4b81-97cd-a7d7757d6eac'
          │
          └── • scan
                cluster nodes: n1
                cluster regions: us-east1
                actual row count: 52
                KV time: 2ms
                KV contention time: 0µs
                KV rows read: 52
                KV bytes read: 5.8 KiB
                estimated row count: 50 (100% of the table; stats collected 3 minutes ago)
                table: users@primary
                spans: FULL SCAN
(42 rows)

Time: 4ms total (execution 4ms / network 0ms)
```

Resolves #64200

Release note (sql change): EXPLAIN ANALYZE output now includes, for each plan step, the total time spent waiting for KV requests as well as the total time those KV requests spent contending with other transactions.

Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Bilal Akhtar <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
Co-authored-by: Matthew Todd <[email protected]>
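Referring back to PR 65559 above, here is a hedged Go sketch of the client abstraction it describes: a local (and, symmetrically, a remote) client satisfying a small interface for pulling in-flight span recordings for a trace ID, with the results sorted by start time before they are returned. The interface and types below are illustrative assumptions, not the actual tracingservice definitions.

```go
package main

import (
    "context"
    "fmt"
    "sort"
    "time"
)

type spanRecording struct {
    Operation string
    StartTime time.Time
}

// traceClient mirrors the GetSpanRecordings functionality mentioned in the PR.
type traceClient interface {
    GetSpanRecordings(ctx context.Context, traceID uint64) ([]spanRecording, error)
}

// localTraceClient serves recordings straight from an in-memory registry.
type localTraceClient struct {
    registry map[uint64][]spanRecording
}

func (c *localTraceClient) GetSpanRecordings(_ context.Context, traceID uint64) ([]spanRecording, error) {
    recs := append([]spanRecording(nil), c.registry[traceID]...)
    // Sort by StartTime before returning, as the PR describes.
    sort.Slice(recs, func(i, j int) bool { return recs[i].StartTime.Before(recs[j].StartTime) })
    return recs, nil
}

func main() {
    c := &localTraceClient{registry: map[uint64][]spanRecording{
        7: {{"backup.processor", time.Now().Add(time.Second)}, {"backup.resumer", time.Now()}},
    }}
    recs, _ := c.GetSpanRecordings(context.Background(), 7)
    fmt.Println(recs)
}
```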
66147: tracing: wrap structured recordings with time at which they were recorded r=abarganier a=adityamaru

This change wraps the existing Structured record in a proto that has a `time` field, indicating what time the record was recorded. This is similar to how we wrap free-form logs in LogRecord, and will help interleave child recordings between parent recordings at the time when they occurred, rather than at the span start time.

This change was motivated by the fact that jobs are going to be using structured recordings to indicate which phase of execution they are currently in. When consuming this information, it is more intuitive to see interleaved events, rather than all events for the parent span followed by all events for the child span.

Currently, we see:
```
ProcSentExportRequest       <- processor level
ProcReceivedExportResponse
ExportRequestEvalStarted    <- export.go level
ExportRequestEvalFinished
```

We would like to see:
```
ProcSentExportRequest       <- processor level
ExportRequestEvalStarted    <- export.go level
ExportRequestEvalFinished
ProcReceivedExportResponse
```

Informs: #64992

Co-authored-by: Aditya Maru <[email protected]>
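A minimal sketch of the idea in PR 66147: once each structured record carries the time it was recorded, parent and child records can be merged chronologically instead of being grouped by span. The `structuredRecord` type is an assumption standing in for the actual proto wrapper.

```go
package main

import (
    "fmt"
    "sort"
    "time"
)

type structuredRecord struct {
    Time  time.Time
    Event string
}

// interleave merges parent and child records and orders them by the time at
// which they were recorded.
func interleave(parent, child []structuredRecord) []structuredRecord {
    all := append(append([]structuredRecord(nil), parent...), child...)
    sort.Slice(all, func(i, j int) bool { return all[i].Time.Before(all[j].Time) })
    return all
}

func main() {
    t := time.Now()
    parent := []structuredRecord{{t, "ProcSentExportRequest"}, {t.Add(3 * time.Second), "ProcReceivedExportResponse"}}
    child := []structuredRecord{{t.Add(time.Second), "ExportRequestEvalStarted"}, {t.Add(2 * time.Second), "ExportRequestEvalFinished"}}
    for _, r := range interleave(parent, child) {
        fmt.Println(r.Event) // prints in the interleaved order shown above
    }
}
```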
63488: monitoring: renovate grafana dashboards r=kzh a=sai-roach

This PR adds renovated grafana dashboards that aim for feature parity with the DB Console metrics page.

Dashboards:
- Overview
- Hardware
- Runtime
- SQL
- Storage
- Replication
- Distributed
- Queues
- Slow Requests
- Changefeeds

These dashboards can be previewed by following the instructions in the monitoring [README.md](https://github.com/cockroachdb/cockroach/blob/master/monitoring/README.md) for spinning up a quick grafana instance.

Release note (ops change): The grafana dashboards have been updated to more closely resemble DB Console metrics.

66678: tracing: collector was incorrectly flattening recordings r=irfansharif,abarganier a=adityamaru

Previously, the trace collector was dialing up a node, visiting all the root spans in the node's inflight registry, and placing `tracingpb.RecordedSpans` into a flat slice. This caused loss of information about which spans belonged to a chain rooted at a fixed root span. Such a chain is referred to as a `tracing.Recording`. Every node can have multiple `tracing.Recording`s with the same `trace_id`, and each represents a traced remote operation.

This change maintains the `tracing.Recording` grouping of spans by getting the collector to return a `[]tracing.Recording` for each node. The collector's unit of iteration consequently becomes a `tracing.Recording`. This makes more sense when you think about how we want to consume these traces: every `tracing.Recording` is a new traced remote operation, and should be visualized as such in Jaeger, JSON, etc. (An illustrative sketch of this grouping follows this comment.)

This change also augments the collector iterator to return the nodeID of the node that the current `tracing.Recording` belongs to.

Informs: #64992
Release note: None

66715: workload: make rand workload aware of computed columns r=mgartner a=rafiss

fixes #66683

Release note: None

Co-authored-by: sai-roach <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
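An illustrative sketch of the grouping PR 66678 restores: rather than one flat slice of spans per node, spans are bucketed into recordings, one per chain rooted at a root span. The span IDs, parent links, and types here are assumptions, not the tracingpb definitions.

```go
package main

import "fmt"

type recordedSpan struct {
    SpanID       uint64
    ParentSpanID uint64 // 0 means this is a root span
    Operation    string
}

// groupIntoRecordings walks each span up its parent chain to find its root and
// buckets spans by that root, preserving the per-operation grouping.
func groupIntoRecordings(spans []recordedSpan) map[uint64][]recordedSpan {
    parent := make(map[uint64]uint64, len(spans))
    for _, s := range spans {
        parent[s.SpanID] = s.ParentSpanID
    }
    rootOf := func(id uint64) uint64 {
        for parent[id] != 0 {
            id = parent[id]
        }
        return id
    }
    recordings := make(map[uint64][]recordedSpan)
    for _, s := range spans {
        r := rootOf(s.SpanID)
        recordings[r] = append(recordings[r], s)
    }
    return recordings
}

func main() {
    spans := []recordedSpan{
        {1, 0, "backup.resumer"}, {2, 1, "flow"},
        {3, 0, "backup.processor"}, {4, 3, "export"},
    }
    // Two recordings, keyed by their root spans (1 and 3).
    fmt.Println(groupIntoRecordings(spans))
}
```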
66042: cloud: add trace recording to retry loops in external storage r=pbardea a=adityamaru

This change adds some tracing to the retry loops we have in external storage.

Informs: #64992
Release note: None

66678: tracing: collector was incorrectly flattening recordings r=irfansharif,abarganier a=adityamaru

Previously, the trace collector was dialing up a node, visiting all the root spans in the node's inflight registry, and placing `tracingpb.RecordedSpans` into a flat slice. This caused loss of information about which spans belonged to a chain rooted at a fixed root span. Such a chain is referred to as a `tracing.Recording`. Every node can have multiple `tracing.Recording`s with the same `trace_id`, and each represents a traced remote operation.

This change maintains the `tracing.Recording` grouping of spans by getting the collector to return a `[]tracing.Recording` for each node. The collector's unit of iteration consequently becomes a `tracing.Recording`. This makes more sense when you think about how we want to consume these traces: every `tracing.Recording` is a new traced remote operation, and should be visualized as such in Jaeger, JSON, etc.

This change also augments the collector iterator to return the nodeID of the node that the current `tracing.Recording` belongs to.

Informs: #64992
Release note: None

66691: cloud: bump orchestrator version to 21.1.2 r=celiala a=celiala

Release note: None

66715: workload: make rand workload aware of computed columns r=mgartner a=rafiss

fixes #66683

Release note: None

Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Celia La <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
66679: sql,tracing: introduce crdb_internal.cluster_inflight_traces r=abarganier a=adityamaru

This change adds a new indexed, virtual table `crdb_internal.cluster_inflight_traces`. This table surfaces cluster-wide inflight traces for the trace_id specified via an index constraint. Each row in the virtual table corresponds to a `tracing.Recording` on a particular node for the given trace ID. A `tracing.Recording` is the trace of a single operation rooted at a root span on that node. Under the hood, the virtual table contacts all "live" nodes in the cluster via the trace collector, which streams back one recording at a time.

The table has 3 additional columns that surface the raw JSON, string, and JaegerJSON representations of the recording. These formats are what we dump in a stmt bundle as well, and have been considered the best way to consume traces. This table is not meant to be consumed directly via the SQL shell, but will have a CLI wrapper built on top of it that will assimilate and write the traces to files, similar to how we dump a stmt bundle. (A hedged usage sketch follows this comment.)

This change also tweaks some of the recording->string methods to include StructuredRecords.

Informs: #64992

Release note (sql change): adds a virtual table `crdb_internal.cluster_inflight_traces` which surfaces cluster-wide inflight traces for the trace_id specified via an index constraint. The output of this table is not appropriate to consume over a SQL connection; follow-up changes will add CLI wrappers to make the interaction more user-friendly.

66837: sql/schemachanger: reorder args on Build, clone nodes, minor renaming r=fqazi a=ajwerner

Release note: None

Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
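As a usage illustration only, here is a small Go program (using the assumed `lib/pq` Postgres-wire driver) that queries the virtual table described in PR 66679 with a trace_id constraint, much as a CLI wrapper might. The connection URL and trace ID are placeholders, and the column set is deliberately not hard-coded because the PR text does not name the columns.

```go
package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/lib/pq" // assumed Postgres-wire driver
)

func main() {
    // Assumed local, insecure single-node cluster; adjust the URL as needed.
    db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // 1234567890 is a placeholder trace ID.
    rows, err := db.Query(
        "SELECT * FROM crdb_internal.cluster_inflight_traces WHERE trace_id = $1", 1234567890)
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    cols, _ := rows.Columns()
    fmt.Println("columns:", cols)
    for rows.Next() {
        vals := make([]interface{}, len(cols))
        ptrs := make([]interface{}, len(cols))
        for i := range vals {
            ptrs[i] = &vals[i]
        }
        if err := rows.Scan(ptrs...); err != nil {
            log.Fatal(err)
        }
        fmt.Println(vals) // one row per recording per node
    }
}
```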
65725: backupccl: add some span recordings to the backup resumer r=pbardea a=adityamaru

This change adds some initial span recordings to the backup resumer that might be interesting when inspecting an executing backup. The processor and KV level requests will be dealt with in subsequent commits.

Informs: #64992
Release note: None

Co-authored-by: Aditya Maru <[email protected]>
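A purely illustrative sketch of the kind of recording PR 65725 adds: the resumer records a structured event on its span at interesting phases of execution. The span type and its `RecordStructured` helper here are stand-ins, not the actual tracing package API, and the phase names are hypothetical.

```go
package main

import (
    "fmt"
    "time"
)

type structuredEvent struct {
    At    time.Time
    Event string
}

// span is a toy stand-in for a tracing span that accumulates structured events.
type span struct{ events []structuredEvent }

func (s *span) RecordStructured(event string) {
    s.events = append(s.events, structuredEvent{At: time.Now(), Event: event})
}

func runBackup(sp *span) {
    sp.RecordStructured("backup.before_flow") // hypothetical phase marker
    time.Sleep(10 * time.Millisecond)         // stand-in for planning/execution work
    sp.RecordStructured("backup.after_flow")
}

func main() {
    sp := &span{}
    runBackup(sp)
    for _, e := range sp.events {
        fmt.Println(e.At.Format(time.RFC3339Nano), e.Event)
    }
}
```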
This is now being tracked in https://cockroachlabs.atlassian.net/browse/CRDB-8964
This is a tracking issue for the investigation and work that is being done to improve observability into job execution. The initial brainstorming document can be found at https://docs.google.com/document/d/19VEvNNW8wdep5cYe4TJQY5_WvrmDqTgnh3V2Bq33gag/edit?usp=sharing.
Tracing related work
- Aggregating `Finish()`'ed and in-flight spans from all nodes in the cluster.
- Mutable/ephemeral payloads for more observability into operations that can take long or hang indefinitely.
- Filtering based on payload type to prevent job recordings from spamming traces included in reports.
- Memory footprint of long-lived spans (e.g. CDC jobs that could run for the lifetime of the cluster).
Jobs related work
- Persistent storage to deal with node shutdown and stopped SQL pods, via asynchronous flushing of tracing spans from RAM to a system table.
- Builtins, tables, and introspection tools built on top of the system table and inflight span aggregation infrastructure.
Epic CRDB-7076
Jira issue: CRDB-7375