jobs: Jobs Observability tracking issue #64992
Created this issue as an attempt to scope out the work stemming from the comments on the document. I would guess that some of these tasks will require their own design documents once they are better defined.
Summarizing a discussion the BulkIO team had regarding our next steps: the problem can be broken down into three categories, listed below in order of priority for the team.
The first category is debugging a job that is currently executing. Solving this problem is the team's first priority, and tracing can help us here. At a point in time, a user should be able to fetch all inflight spans for a job that is currently executing. This involves fetching inflight spans from the node on which the job was adopted, the nodes on which the job planned its processors, and the nodes serving the KV requests issued by those processors. Doing so will allow engineers to identify what each node is doing at a point in time, and understand whether the job is slow, hanging on a request, or happily executing.
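To make this fan-out concrete, here is a minimal, purely illustrative Go sketch (not CockroachDB code): ask each relevant node (the adopting node, the processor nodes, and the KV nodes) for the in-flight spans it holds for the job's trace ID, then order them by start time. The `inflightSpan` type and the `fetchNodeSpans` helper are assumptions made for the example.

```go
package main

import (
    "fmt"
    "sort"
    "time"
)

type inflightSpan struct {
    NodeID    int
    Operation string
    StartTime time.Time
}

// fetchNodeSpans stands in for the per-node RPC that would return the
// in-flight spans a node holds for the given trace ID.
func fetchNodeSpans(nodeID int, traceID uint64) []inflightSpan {
    // Canned data for illustration; a real implementation would dial the node.
    return []inflightSpan{{NodeID: nodeID, Operation: "backup.processor", StartTime: time.Now()}}
}

// collectJobSpans fans out to every node involved in the job and orders the
// results by start time so a reader can see what each node was doing when.
func collectJobSpans(traceID uint64, nodeIDs []int) []inflightSpan {
    var all []inflightSpan
    for _, n := range nodeIDs {
        all = append(all, fetchNodeSpans(n, traceID)...)
    }
    sort.Slice(all, func(i, j int) bool { return all[i].StartTime.Before(all[j].StartTime) })
    return all
}

func main() {
    for _, s := range collectJobSpans(42, []int{1, 2, 3}) {
        fmt.Printf("n%d %s started at %s\n", s.NodeID, s.Operation, s.StartTime.Format(time.RFC3339Nano))
    }
}
```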
The second category is exposing job-specific progress information to the end-user. This is where we might consider introducing a system table for persisting information that has been aggregated during job execution. We do not believe this feature will rely on tracing directly, because trace events are not meaningful to expose to the end-user. BulkIO will likely run custom aggregation and parsing of the information it receives during job execution, and identify what is notable to expose to the end-user.
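As a rough illustration of the kind of custom aggregation mentioned above, the sketch below folds hypothetical structured progress events into a user-facing summary. The `progressEvent` and `jobSummary` types are invented for the example and are not CockroachDB types.

```go
package main

import "fmt"

// progressEvent is an assumed shape for a structured event emitted by a
// processor during job execution.
type progressEvent struct {
    NodeID        int
    FilesExported int
    BytesExported int64
}

// jobSummary is the user-meaningful rollup a job might persist or display.
type jobSummary struct {
    Files int
    Bytes int64
}

func aggregate(events []progressEvent) jobSummary {
    var s jobSummary
    for _, e := range events {
        s.Files += e.FilesExported
        s.Bytes += e.BytesExported
    }
    return s
}

func main() {
    events := []progressEvent{{1, 10, 1 << 20}, {2, 4, 512 << 10}}
    fmt.Printf("%+v\n", aggregate(events))
}
```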
The third category is historical debugging of past job execution. This would be really nice to have, as it would persist historical information about job execution, but the team does not think that a system table is the answer. We have seen the system.jobs table grow to hundreds of gigabytes, and considering the magnitude of trace events that a job could emit, we might need to think about persisting this information elsewhere (logs?). BulkIO is unsure whether this is in scope for the team's milestones, or whether it would be a valuable feature for CRDB infrastructure in general.
64977: colexec: implement vectorized ntile window function r=DrewKimball a=DrewKimball

This patch implements the `ntile` window function in the vectorized engine. `ntile` takes in an integer argument `num_buckets` and then distributes all rows in a partition equally between the buckets, outputting the bucket number for each row (a standalone sketch of this bucket assignment follows this comment). In the vectorized implementation, batches are buffered until the end of a partition is reached, at which point the `ntile` bucket values can be calculated. The batches are emitted in a streaming fashion; as soon as a batch is fully processed, it is returned and work is paused until the next call to `Next()`. See #37035

Release note (sql change): the vectorized engine now supports the ntile window function.

65322: jobs: surface the trace_id linked to job execution r=adityamaru a=adityamaru

Please see individual commit messages.

Informs: #64992

65415: authors: add annezhu98 to authors r=annezhu98 a=annezhu98

Release note: None

65418: authors: add IsaDash to authors r=IsaDash a=IsaDash

Release note: None

Co-authored-by: Drew Kimball <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Anne Zhu <[email protected]>
Co-authored-by: Isa Dash <[email protected]>
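A standalone sketch of the bucket assignment that PR 64977 describes: rows in a partition are split as evenly as possible across `num_buckets` buckets, with the first (rows % num_buckets) buckets receiving one extra row. This mirrors the SQL `ntile` semantics only; it is not the vectorized operator itself.

```go
package main

import "fmt"

// ntile returns the bucket number (1-based) for each of numRows rows in a
// partition, distributing rows as evenly as possible across numBuckets.
func ntile(numRows, numBuckets int) []int {
    out := make([]int, numRows)
    base := numRows / numBuckets
    extra := numRows % numBuckets
    row := 0
    for b := 1; b <= numBuckets && row < numRows; b++ {
        size := base
        if b <= extra {
            size++ // the first `extra` buckets get one additional row
        }
        for i := 0; i < size; i++ {
            out[row] = b
            row++
        }
    }
    return out
}

func main() {
    fmt.Println(ntile(7, 3)) // [1 1 1 2 2 3 3]
}
```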
65324: cli: add cockroach debug job-trace r=pbardea,knz a=adityamaru

This change adds a new cli command - `cockroach debug job-trace` - that can be used to grab information from inflight trace spans associated with the execution of a particular job. The command takes a `jobID` and file destination where the information will be dumped. It also requires a `--url` of the node from which to pull the inflight spans. In the future we will have a means of pulling inflight spans from the cluster, but at the moment this command must be run against every node to get a complete picture of job execution.

Informs: #64992

Release note (cli change): Adds a `cockroach debug job-trace` command that takes 2 arguments: <jobID> and file destination, along with a `--url` pointing to the node on which to execute this command. The command pulls information about inflight trace spans associated with the job and dumps it to the file destination.

Co-authored-by: Aditya Maru <[email protected]>
66089: cli: order span payloads by start_time for readability r=knz a=adityamaru

During job execution, there can be several root spans on a node (e.g. the resumer and a processor). While the recordings within a span are sorted by start time, the root spans are stored in an unordered map, so the recordings across root spans might not be sorted by start_time. When viewing the output files for a job, this results in the processor span printing its payload before the job payload, which is not intuitive. We might change how we display recordings in the future, but for the time being this fix makes the "ordering" of events deterministic.

Informs: #64992

Release note: None

Co-authored-by: Aditya Maru <[email protected]>
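A minimal sketch of the deterministic ordering PR 66089 describes: recordings collected from an unordered registry are sorted by start time before being printed. The `recording` type here is an assumption for the example, not the tracing package's actual type.

```go
package main

import (
    "fmt"
    "sort"
    "time"
)

type recording struct {
    Operation string
    StartTime time.Time
}

func main() {
    now := time.Now()
    // Recordings arrive in whatever order the registry's map yields them.
    recs := []recording{
        {"export.processor", now.Add(2 * time.Second)},
        {"backup.resumer", now},
    }
    // Sorting by start time makes the printed output deterministic and intuitive.
    sort.Slice(recs, func(i, j int) bool { return recs[i].StartTime.Before(recs[j].StartTime) })
    for _, r := range recs {
        fmt.Println(r.StartTime.Format(time.RFC3339Nano), r.Operation)
    }
}
```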
65559: tracing,tracingservice: adds a trace service to pull clusterwide trace spans r=irfansharif,abarganier a=adityamaru

Previously, every node in the cluster had a local inflight span registry that was aware of all the spans rooted on that particular node. Child spans of a given traceID executing on a remote node would only become visible to the local registry once execution completed and the span pushed its recordings over gRPC to the "client" node.

This change introduces a `tracingservice` package. Package tracingservice contains a gRPC service to be used for remote inflight span access. It is used for pulling inflight spans from all CockroachDB nodes. Each node will run a trace service, which serves the inflight spans from the local span registry on that node. Each node will also have a trace client dialer, which uses the nodedialer to connect to another node's trace service and access its inflight spans. The trace client dialer is backed by a remote trace client or a local trace client, which serves as the point of entry to this service. Both clients support the `TraceClient` interface, which includes the following functionality:
- GetSpanRecordings

The spans for a traceID are sorted by `StartTime` before they are returned. The per-node trace dialer has yet to be hooked up to an appropriate location, depending on where we intend to use it. (A hedged sketch of such a client interface appears after this comment.)

Resolves: #60999
Informs: #64992
Release note: None

66149: cloud: fix gcs to resuming reader r=dt a=adityamaru

This change does a few things:
1. gcs_storage was not returning a resuming reader, as a result of which the Read method of the resuming reader, which contains logic to retry on certain kinds of errors, was not being invoked.
2. Changes the resuming reader to take a storage-specific function that can define which errors are retryable. All storage providers use the same deciding function at the moment, so behavior is unchanged.

Release note: None

66152: storage: Disable read sampling and read compactions r=sumeerbhola a=itsbilal

Read-triggered compactions are already disabled on 21.1. As the fixes to address known shortcomings with read-triggered compactions are a bit involved (see cockroachdb/pebble#1143), disable the feature on master until that issue is fixed. That prevents this known issue from getting in the way of performance experiments.

Release note: None.

66155: sql: drop "cluster" from EXPLAIN ANALYZE to improve readability r=maryliag a=maryliag

Remove the word "cluster" from "cluster nodes" and "cluster regions" on EXPLAIN ANALYZE to improve readability.

Release note: None

66157: sql: add time & contention time to EXPLAIN ANALYZE.
r=matthewtodd a=matthewtodd

The new fields are labeled `KV time` and `KV contention time`:

```
> EXPLAIN ANALYZE
-> UPDATE users SET name = 'Bob Loblaw'
-> WHERE id = '32a962b7-8440-4b81-97cd-a7d7757d6eac';
                                             info
--------------------------------------------------------------------------------------------
  planning time: 353µs
  execution time: 3ms
  distribution: local
  vectorized: true
  rows read from KV: 52 (5.8 KiB)
  cumulative time spent in KV: 2ms
  maximum memory usage: 60 KiB
  network usage: 0 B (0 messages)
  cluster regions: us-east1

  • update
  │ cluster nodes: n1
  │ cluster regions: us-east1
  │ actual row count: 1
  │ table: users
  │ set: name
  │ auto commit
  │
  └── • render
      │ cluster nodes: n1
      │ cluster regions: us-east1
      │ actual row count: 1
      │ estimated row count: 0
      │
      └── • filter
          │ cluster nodes: n1
          │ cluster regions: us-east1
          │ actual row count: 1
          │ estimated row count: 0
          │ filter: id = '32a962b7-8440-4b81-97cd-a7d7757d6eac'
          │
          └── • scan
                cluster nodes: n1
                cluster regions: us-east1
                actual row count: 52
                KV time: 2ms
                KV contention time: 0µs
                KV rows read: 52
                KV bytes read: 5.8 KiB
                estimated row count: 50 (100% of the table; stats collected 3 minutes ago)
                table: users@primary
                spans: FULL SCAN
(42 rows)

Time: 4ms total (execution 4ms / network 0ms)
```

Resolves #64200

Release note (sql change): EXPLAIN ANALYZE output now includes, for each plan step, the total time spent waiting for KV requests as well as the total time those KV requests spent contending with other transactions.

Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Bilal Akhtar <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
Co-authored-by: Matthew Todd <[email protected]>
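Referring back to PR 65559 above, here is a hedged Go sketch of the client abstraction it describes: a local (and, symmetrically, a remote) client satisfying a small interface for pulling in-flight span recordings for a trace ID, with the results sorted by start time before they are returned. The interface and types below are illustrative assumptions, not the actual tracingservice definitions.

```go
package main

import (
    "context"
    "fmt"
    "sort"
    "time"
)

type spanRecording struct {
    Operation string
    StartTime time.Time
}

// traceClient mirrors the GetSpanRecordings functionality mentioned in the PR.
type traceClient interface {
    GetSpanRecordings(ctx context.Context, traceID uint64) ([]spanRecording, error)
}

// localTraceClient serves recordings straight from an in-memory registry.
type localTraceClient struct {
    registry map[uint64][]spanRecording
}

func (c *localTraceClient) GetSpanRecordings(_ context.Context, traceID uint64) ([]spanRecording, error) {
    recs := append([]spanRecording(nil), c.registry[traceID]...)
    // Sort by StartTime before returning, as the PR describes.
    sort.Slice(recs, func(i, j int) bool { return recs[i].StartTime.Before(recs[j].StartTime) })
    return recs, nil
}

func main() {
    c := &localTraceClient{registry: map[uint64][]spanRecording{
        7: {{"backup.processor", time.Now().Add(time.Second)}, {"backup.resumer", time.Now()}},
    }}
    recs, _ := c.GetSpanRecordings(context.Background(), 7)
    fmt.Println(recs)
}
```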
66147: tracing: wrap structured recordings with time at which they were recorded r=abarganier a=adityamaru

This change wraps the existing Structured record in a proto that has a `time` field, indicating what time the record was recorded. This is similar to how we wrap free-form logs in LogRecord, and will help interleave child recordings between parent recordings at the time when they occurred, rather than at the span start time.

This change was motivated by the fact that jobs are going to be using structured recordings to indicate which phase of execution they are currently in. When consuming this information, it is more intuitive to see interleaved events, rather than all events for the parent span followed by all events for the child span.

Currently, we see:
```
ProcSentExportRequest       <- processor level
ProcReceivedExportResponse
ExportRequestEvalStarted    <- export.go level
ExportRequestEvalFinished
```

We would like to see:
```
ProcSentExportRequest       <- processor level
ExportRequestEvalStarted    <- export.go level
ExportRequestEvalFinished
ProcReceivedExportResponse
```

Informs: #64992

Co-authored-by: Aditya Maru <[email protected]>
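A minimal sketch of the idea in PR 66147: once each structured record carries the time it was recorded, parent and child records can be merged chronologically instead of being grouped by span. The `structuredRecord` type is an assumption standing in for the actual proto wrapper.

```go
package main

import (
    "fmt"
    "sort"
    "time"
)

type structuredRecord struct {
    Time  time.Time
    Event string
}

// interleave merges parent and child records and orders them by the time at
// which they were recorded.
func interleave(parent, child []structuredRecord) []structuredRecord {
    all := append(append([]structuredRecord(nil), parent...), child...)
    sort.Slice(all, func(i, j int) bool { return all[i].Time.Before(all[j].Time) })
    return all
}

func main() {
    t := time.Now()
    parent := []structuredRecord{{t, "ProcSentExportRequest"}, {t.Add(3 * time.Second), "ProcReceivedExportResponse"}}
    child := []structuredRecord{{t.Add(time.Second), "ExportRequestEvalStarted"}, {t.Add(2 * time.Second), "ExportRequestEvalFinished"}}
    for _, r := range interleave(parent, child) {
        fmt.Println(r.Event) // prints in the interleaved order shown above
    }
}
```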
63488: monitoring: renovate grafana dashboards r=kzh a=sai-roach

This PR adds renovated grafana dashboards that aim for feature parity with the DB Console metrics page.

Dashboards:
- Overview
- Hardware
- Runtime
- SQL
- Storage
- Replication
- Distributed
- Queues
- Slow Requests
- Changefeeds

These dashboards can be previewed by following the instructions in the monitoring [README.md](https://github.com/cockroachdb/cockroach/blob/master/monitoring/README.md) for spinning up a quick grafana instance.

Release note (ops change): The grafana dashboards have been updated to more closely resemble DB Console metrics.

66678: tracing: collector was incorrectly flattening recordings r=irfansharif,abarganier a=adityamaru

Previously, the trace collector was dialing up a node, visiting all the root spans in the node's inflight registry, and placing `tracingpb.RecordedSpans` into a flat slice. This caused loss of information about which spans belonged to a chain rooted at a fixed root span. Such a chain is referred to as a `tracing.Recording`. Every node can have multiple `tracing.Recording`s with the same `trace_id`, and each represents a traced remote operation.

This change maintains the `tracing.Recording` grouping of spans by getting the collector to return a `[]tracing.Recording` for each node. The collector's unit of iteration consequently becomes a `tracing.Recording`. This makes more sense when you think about how we want to consume these traces: every `tracing.Recording` is a new traced remote operation, and should be visualized as such in Jaeger, JSON, etc. (An illustrative sketch of this grouping follows this comment.)

This change also augments the collector iterator to return the nodeID of the node that the current `tracing.Recording` belongs to.

Informs: #64992
Release note: None

66715: workload: make rand workload aware of computed columns r=mgartner a=rafiss

fixes #66683

Release note: None

Co-authored-by: sai-roach <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
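An illustrative sketch of the grouping PR 66678 restores: rather than one flat slice of spans per node, spans are bucketed into recordings, one per chain rooted at a root span. The span IDs, parent links, and types here are assumptions, not the tracingpb definitions.

```go
package main

import "fmt"

type recordedSpan struct {
    SpanID       uint64
    ParentSpanID uint64 // 0 means this is a root span
    Operation    string
}

// groupIntoRecordings walks each span up its parent chain to find its root and
// buckets spans by that root, preserving the per-operation grouping.
func groupIntoRecordings(spans []recordedSpan) map[uint64][]recordedSpan {
    parent := make(map[uint64]uint64, len(spans))
    for _, s := range spans {
        parent[s.SpanID] = s.ParentSpanID
    }
    rootOf := func(id uint64) uint64 {
        for parent[id] != 0 {
            id = parent[id]
        }
        return id
    }
    recordings := make(map[uint64][]recordedSpan)
    for _, s := range spans {
        r := rootOf(s.SpanID)
        recordings[r] = append(recordings[r], s)
    }
    return recordings
}

func main() {
    spans := []recordedSpan{
        {1, 0, "backup.resumer"}, {2, 1, "flow"},
        {3, 0, "backup.processor"}, {4, 3, "export"},
    }
    // Two recordings, keyed by their root spans (1 and 3).
    fmt.Println(groupIntoRecordings(spans))
}
```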
66042: cloud: add trace recording to retry loops in external storage r=pbardea a=adityamaru

This change adds some tracing to the retry loops we have in external storage.

Informs: #64992
Release note: None

66678: tracing: collector was incorrectly flattening recordings r=irfansharif,abarganier a=adityamaru

Previously, the trace collector was dialing up a node, visiting all the root spans in the node's inflight registry, and placing `tracingpb.RecordedSpans` into a flat slice. This caused loss of information about which spans belonged to a chain rooted at a fixed root span. Such a chain is referred to as a `tracing.Recording`. Every node can have multiple `tracing.Recording`s with the same `trace_id`, and each represents a traced remote operation.

This change maintains the `tracing.Recording` grouping of spans by getting the collector to return a `[]tracing.Recording` for each node. The collector's unit of iteration consequently becomes a `tracing.Recording`. This makes more sense when you think about how we want to consume these traces: every `tracing.Recording` is a new traced remote operation, and should be visualized as such in Jaeger, JSON, etc.

This change also augments the collector iterator to return the nodeID of the node that the current `tracing.Recording` belongs to.

Informs: #64992
Release note: None

66691: cloud: bump orchestrator version to 21.1.2 r=celiala a=celiala

Release note: None

66715: workload: make rand workload aware of computed columns r=mgartner a=rafiss

fixes #66683

Release note: None

Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Celia La <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
66679: sql,tracing: introduce crdb_internal.cluster_inflight_traces r=abarganier a=adityamaru

This change adds a new indexed, virtual table `crdb_internal.cluster_inflight_traces`. This table surfaces cluster-wide inflight traces for the trace_id specified via an index constraint. Each row in the virtual table corresponds to a `tracing.Recording` on a particular node for the given trace ID. A `tracing.Recording` is the trace of a single operation rooted at a root span on that node. Under the hood, the virtual table contacts all "live" nodes in the cluster via the trace collector, which streams back one recording at a time.

The table has 3 additional columns that surface the raw JSON, string, and JaegerJSON representations of the recording. These formats are what we dump in a stmt bundle as well, and have been considered the best way to consume traces. This table is not meant to be consumed directly via the SQL shell, but will have a CLI wrapper built on top of it that will assimilate and write the traces to files, similar to how we dump a stmt bundle. (A hedged usage sketch follows this comment.)

This change also tweaks some of the recording->string methods to include StructuredRecords.

Informs: #64992

Release note (sql change): adds a virtual table `crdb_internal.cluster_inflight_traces` which surfaces cluster-wide inflight traces for the trace_id specified via an index constraint. The output of this table is not appropriate to consume over a SQL connection; follow-up changes will add CLI wrappers to make the interaction more user-friendly.

66837: sql/schemachanger: reorder args on Build, clone nodes, minor renaming r=fqazi a=ajwerner

Release note: None

Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
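As a usage illustration only, here is a small Go program (using the assumed `lib/pq` Postgres-wire driver) that queries the virtual table described in PR 66679 with a trace_id constraint, much as a CLI wrapper might. The connection URL and trace ID are placeholders, and the column set is deliberately not hard-coded because the PR text does not name the columns.

```go
package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/lib/pq" // assumed Postgres-wire driver
)

func main() {
    // Assumed local, insecure single-node cluster; adjust the URL as needed.
    db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // 1234567890 is a placeholder trace ID.
    rows, err := db.Query(
        "SELECT * FROM crdb_internal.cluster_inflight_traces WHERE trace_id = $1", 1234567890)
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    cols, _ := rows.Columns()
    fmt.Println("columns:", cols)
    for rows.Next() {
        vals := make([]interface{}, len(cols))
        ptrs := make([]interface{}, len(cols))
        for i := range vals {
            ptrs[i] = &vals[i]
        }
        if err := rows.Scan(ptrs...); err != nil {
            log.Fatal(err)
        }
        fmt.Println(vals) // one row per recording per node
    }
}
```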
65725: backupccl: add some span recordings to the backup resumer r=pbardea a=adityamaru

This change adds some initial span recordings to the backup resumer that might be interesting when inspecting an executing backup. The processor and KV level requests will be dealt with in subsequent commits.

Informs: #64992
Release note: None

Co-authored-by: Aditya Maru <[email protected]>
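A purely illustrative sketch of the kind of recording PR 65725 adds: the resumer records a structured event on its span at interesting phases of execution. The span type and its `RecordStructured` helper here are stand-ins, not the actual tracing package API, and the phase names are hypothetical.

```go
package main

import (
    "fmt"
    "time"
)

type structuredEvent struct {
    At    time.Time
    Event string
}

// span is a toy stand-in for a tracing span that accumulates structured events.
type span struct{ events []structuredEvent }

func (s *span) RecordStructured(event string) {
    s.events = append(s.events, structuredEvent{At: time.Now(), Event: event})
}

func runBackup(sp *span) {
    sp.RecordStructured("backup.before_flow") // hypothetical phase marker
    time.Sleep(10 * time.Millisecond)         // stand-in for planning/execution work
    sp.RecordStructured("backup.after_flow")
}

func main() {
    sp := &span{}
    runBackup(sp)
    for _, e := range sp.events {
        fmt.Println(e.At.Format(time.RFC3339Nano), e.Event)
    }
}
```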
This is now being tracked in https://cockroachlabs.atlassian.net/browse/CRDB-8964
This is a tracking issue for the investigation and work that is being done to improve observability into job execution. The initial brainstorming document can be found at https://docs.google.com/document/d/19VEvNNW8wdep5cYe4TJQY5_WvrmDqTgnh3V2Bq33gag/edit?usp=sharing.
Tracing related work
- Aggregating `Finish()`'ed and in-flight spans from all nodes in the cluster.
- Mutable/ephemeral payloads for more observability into operations that can take long or hang indefinitely.
- Filtering based on payload type to prevent job recordings from spamming traces included in reports.
- Memory footprint of long-lived spans (e.g. CDC jobs that could run for the lifetime of the cluster).
Jobs related work
- Persistent storage to deal with node shutdown and stopped SQL pods, via asynchronous flushing of tracing spans from RAM to a system table.
- Builtins, tables, and introspection tools built on top of the system table and inflight span aggregation infrastructure.
Epic CRDB-7076
Jira issue: CRDB-7375