
streaming: epoch-level distributed tracing #10000

Open

BugenZhao opened this issue May 25, 2023 · 4 comments
Assignees
Labels
component/dev Dev related issues, such as rise dev, ci. component/streaming Stream processing related issue. no-issue-activity type/feature

Comments

@BugenZhao
Member

BugenZhao commented May 25, 2023

See #9905 (comment) for background.

Recently we've been putting effort into improving the stability of RisingWave under high workloads. A common observation is that barrier latency increases abnormally after the system has run for a while, possibly due to a performance regression in storage or the executor cache as data grows. In such cases, we have to spend time investigating the cause of the latency increase and locating the problematic executor.

There's a common technique called "distributed tracing" that tracks an event as it flows through the components of a distributed system, allowing developers to troubleshoot issues along the way. Typically, this is designed for ad-hoc requests like batch queries or serving point-gets. However, since we're able to cut an infinite streaming job into the granularity of epochs, we can treat each epoch as a separate finite event and apply the same technique.

By tracing how the barrier flows through each executor, we can easily check which executor spends a long time handling the data in a given epoch.
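To make the idea concrete, below is a minimal sketch of "one span per executor per epoch" using the tracing crate. It is not RisingWave's actual executor code; `process_epoch_data`, the span name, and the field names are made up for illustration.

```rust
use tracing::{info_span, Instrument};

// Hypothetical stand-in for an executor handling one epoch's data:
// poll upstream, process data chunks, then yield the next barrier.
async fn process_epoch_data(epoch: u64) {
    let _ = epoch;
}

async fn run_one_epoch(executor_name: &str, epoch: u64) {
    // One span per executor per epoch: it opens when the executor is first polled
    // after the previous barrier and closes just before the next barrier is yielded,
    // which is exactly the interval we want the trace timeline to show.
    let span = info_span!("executor_epoch", executor = executor_name, epoch);
    process_epoch_data(epoch).instrument(span).await;
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();
    run_one_epoch("materialize", 65536).await;
}
```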

@github-actions github-actions bot added this to the release-0.20 milestone May 25, 2023
@BugenZhao BugenZhao added type/feature component/streaming Stream processing related issue. component/dev Dev related issues, such as rise dev, ci. labels May 25, 2023
@lmatz
Contributor

lmatz commented May 25, 2023

It could be the streaming version of explain analyze XXX_NAME_MV_OR_SINK.

@fuyufjh
Member

fuyufjh commented Jun 9, 2023

This would be really helpful, both for users and for us.

@BugenZhao
Member Author

BugenZhao commented Jun 21, 2023

After #10315 and #10417, this feature is generally available for developers in local development. 🎉 Updated guides:

Preview

(Screenshot: an example trace timeline of one epoch, viewed in Grafana)

How to read this timeline

  • Different colors represent different nodes in the cluster.
  • Click on the span to get more structural information.
  • The meta span starts when a barrier (carrying the current epoch x) is about to be injected, and ends when the next barrier (carrying x as its previous epoch) is fully collected and committed, so it covers the whole lifetime of epoch x in the system.
  • There'll be an "info event" (marked as a black | in the span) indicating when the next barrier is injected. Therefore, the time from this symbol to the end of the span is the barrier latency of the next barrier.
  • For each actor or executor, the span starts when it is first polled after the barrier is yielded, and ends just before it yields the next barrier.

How to enable distributed tracing

  • For local development, enable Tracing with risedev configure and add use: grafana and use: tempo to the RiseDev profile. After launching, navigate to the risingwave_traces dashboard in Grafana and click on the latest trace ID.

  • For manual deployments, start any OTLP-compatible tracing service and point the RW_TRACING_ENDPOINT environment variable at its OTLP gRPC endpoint (see the sketch after this list).
  • For cloud deployments, the managed service is a work in progress.
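As a tiny illustration of the manual-deployment path (the endpoint value and the commented-out exporter wiring are examples, not RisingWave's actual startup code), a process could decide whether to export traces based on that variable:

```rust
use std::env;

fn main() {
    // RW_TRACING_ENDPOINT is the variable mentioned above, e.g. "http://tempo:4317"
    // for a Tempo instance exposing the standard OTLP gRPC port.
    match env::var("RW_TRACING_ENDPOINT") {
        Ok(endpoint) => {
            println!("tracing enabled, exporting spans to {endpoint}");
            // ... install an OTLP exporter / tracing layer pointed at `endpoint` ...
        }
        Err(_) => println!("RW_TRACING_ENDPOINT not set, tracing stays disabled"),
    }
}
```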

How does it work

  • The core idea is to serialize the tracing span into a trace context and propagate it over the wire. A utility named TracingContext is introduced in these PRs.
  • For standard RPC calls, a middleware can be introduced to do this automatically via the request headers (not included in these PRs).
  • There's no intuitive client-server hierarchy for barrier propagation, so we put the trace context manually in the Barrier proto and other related request bodies (sketched below).
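The following is a hedged, self-contained sketch of that propagation idea. The type and method names (TracingContext, from_current_span, to_protobuf, from_protobuf) are illustrative and simplified to plain string maps; they are not the exact API introduced in the PRs, which hooks into the real tracing/OpenTelemetry machinery.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct TracingContext(HashMap<String, String>);

impl TracingContext {
    fn from_current_span() -> Self {
        // The real utility would capture the active tracing span here via an
        // OpenTelemetry propagator; this sketch just returns an empty context.
        Self::default()
    }

    fn to_protobuf(&self) -> HashMap<String, String> {
        // Serialize into the key/value map carried by the Barrier proto.
        self.0.clone()
    }

    fn from_protobuf(fields: &HashMap<String, String>) -> Self {
        // Rebuild the context on the receiving side.
        Self(fields.clone())
    }
}

/// Illustrative Barrier message: the trace context travels alongside the epoch.
struct Barrier {
    epoch: u64,
    tracing_context: HashMap<String, String>,
}

fn inject_barrier(epoch: u64) -> Barrier {
    // Sender side (e.g. meta injecting the barrier): capture and embed the context.
    Barrier {
        epoch,
        tracing_context: TracingContext::from_current_span().to_protobuf(),
    }
}

fn collect_barrier(barrier: &Barrier) {
    // Receiver side (e.g. a compute node): restore the context and open the local
    // actor/executor spans as its children, so spans from all nodes join one trace.
    let _ctx = TracingContext::from_protobuf(&barrier.tracing_context);
    let _epoch = barrier.epoch;
}

fn main() {
    let barrier = inject_barrier(65536);
    collect_barrier(&barrier);
}
```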

How to add more spans or events here

  • All existing logs (events) will be recorded in traces, following the same filtering configuration as stdout.
  • Add target: "rw_tracing" to an event! to record it only in traces without emitting it to the log (see the sketch below).
  • To add spans in the scope of the execution, follow the documentation here.
  • In the future, we may add tracing to meta RPCs and batch queries as well.
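A small sketch of those two points using the tracing crate directly (the span and field names are illustrative):

```rust
use tracing::{event, info_span, Level};

fn flush_cache(dirty_rows: usize) {
    // Enter a span for the duration of this scope so the work shows up as a
    // child span inside the epoch's trace.
    let _span = info_span!("flush_cache", dirty_rows).entered();

    // The "rw_tracing" target lets the log filter drop this event from stdout
    // while it still reaches the trace collector, as described above.
    event!(target: "rw_tracing", Level::INFO, dirty_rows, "flushing executor cache");
}

fn main() {
    tracing_subscriber::fmt::init();
    flush_cache(42);
}
```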

Integration with Grafana

Grafana supports "trace to metrics" and "trace to logs", which lets us navigate between the different forms of data and establish associations between them. We can adopt these to provide better observability in the future.


github-actions bot commented Jul 3, 2024

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean.
Don't worry if you think the issue is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄
