
streaming: epoch-level distributed tracing #10000

Open

BugenZhao opened this issue May 25, 2023 · 4 comments
Assignees
Labels
component/dev Dev related issues, such as rise dev, ci. component/streaming Stream processing related issue. no-issue-activity type/feature

Comments

@BugenZhao
Member

BugenZhao commented May 25, 2023

See #9905 (comment) for background.

Recently we've been putting effort into improving the stability of RisingWave under high workloads. A common observation is that barrier latency increases abnormally after the system has run for a while, possibly due to a performance regression in storage or the executor cache as data grows. In such cases, we have to spend time investigating the cause of the latency increase and locating the problematic executor.

There's a common technique called "distributed tracing" that tracks an event as it flows through the components of a distributed system, allowing developers to troubleshoot issues along the way. Typically, this is designed for ad-hoc requests like batch queries or serving point-gets. However, since we're able to cut an infinite streaming job into the granularity of epochs, we can treat each epoch as a separate finite event and apply the same technique.

By tracing how the barrier flows through each executor, we can easily check which executor spends a long time handling the data in a given epoch.
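To make the idea concrete, below is a minimal sketch of "one span per executor per epoch" using the tracing crate. It is not RisingWave's actual executor code; `process_epoch_data`, the span name, and the field names are made up for illustration.

```rust
use tracing::{info_span, Instrument};

// Hypothetical stand-in for an executor handling one epoch's data:
// poll upstream, process data chunks, then yield the next barrier.
async fn process_epoch_data(epoch: u64) {
    let _ = epoch;
}

async fn run_one_epoch(executor_name: &str, epoch: u64) {
    // One span per executor per epoch: it opens when the executor is first polled
    // after the previous barrier and closes just before the next barrier is yielded,
    // which is exactly the interval we want the trace timeline to show.
    let span = info_span!("executor_epoch", executor = executor_name, epoch);
    process_epoch_data(epoch).instrument(span).await;
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();
    run_one_epoch("materialize", 65536).await;
}
```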

@github-actions github-actions bot added this to the release-0.20 milestone May 25, 2023
@BugenZhao BugenZhao added type/feature component/streaming Stream processing related issue. component/dev Dev related issues, such as rise dev, ci. labels May 25, 2023
@lmatz
Contributor

lmatz commented May 25, 2023

It could be the streaming version of explain analyze XXX_NAME_MV_OR_SINK.

@fuyufjh
Member

fuyufjh commented Jun 9, 2023

This would be really helpful, both for users and for us.

@BugenZhao
Member Author

BugenZhao commented Jun 21, 2023

After #10315 and #10417, this feature is generally available for developers in local development. 🎉 Updated guides:

Preview

(Screenshot: an example trace timeline of one epoch, viewed in Grafana)

How to read this timeline

  • Different colors represent different nodes in the cluster.
  • Click on the span to get more structural information.
  • The meta span starts when a barrier (carrying the current epoch x) is about to be injected, and ends when the next barrier (carrying x as its previous epoch) is fully collected and committed, so it covers the whole lifetime of epoch x in the system.
  • There'll be an "info event" (marked as a black | in the span) indicating when the next barrier is injected. Therefore, the time from this symbol to the end of the span is the barrier latency of the next barrier.
  • For each actor or executor, the span starts when it is first polled after the barrier is yielded, and ends just before it yields the next barrier.

How to enable distributed tracing

  • For local development, enable Tracing with risedev configure and add use: grafana and use: tempo to the RiseDev profile. After launching, navigate to the risingwave_traces dashboard in Grafana and click on the latest trace ID.

  • For manual deployments, start any OTLP-compatible tracing service and point the RW_TRACING_ENDPOINT environment variable at its OTLP gRPC endpoint (see the sketch after this list).
  • For cloud deployments, the managed service is a work in progress.
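As a tiny illustration of the manual-deployment path (the endpoint value and the commented-out exporter wiring are examples, not RisingWave's actual startup code), a process could decide whether to export traces based on that variable:

```rust
use std::env;

fn main() {
    // RW_TRACING_ENDPOINT is the variable mentioned above, e.g. "http://tempo:4317"
    // for a Tempo instance exposing the standard OTLP gRPC port.
    match env::var("RW_TRACING_ENDPOINT") {
        Ok(endpoint) => {
            println!("tracing enabled, exporting spans to {endpoint}");
            // ... install an OTLP exporter / tracing layer pointed at `endpoint` ...
        }
        Err(_) => println!("RW_TRACING_ENDPOINT not set, tracing stays disabled"),
    }
}
```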

How does it work

  • The core idea is to serialize the tracing span into a trace context and propagate it over the wire. A utility named TracingContext is introduced in these PRs.
  • For standard RPC calls, a middleware can be introduced to do this automatically via the request headers (not included in these PRs).
  • There's no intuitive client-server hierarchy for barrier propagation, so we put the trace context manually in the Barrier proto and other related request bodies (sketched below).
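The following is a hedged, self-contained sketch of that propagation idea. The type and method names (TracingContext, from_current_span, to_protobuf, from_protobuf) are illustrative and simplified to plain string maps; they are not the exact API introduced in the PRs, which hooks into the real tracing/OpenTelemetry machinery.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct TracingContext(HashMap<String, String>);

impl TracingContext {
    fn from_current_span() -> Self {
        // The real utility would capture the active tracing span here via an
        // OpenTelemetry propagator; this sketch just returns an empty context.
        Self::default()
    }

    fn to_protobuf(&self) -> HashMap<String, String> {
        // Serialize into the key/value map carried by the Barrier proto.
        self.0.clone()
    }

    fn from_protobuf(fields: &HashMap<String, String>) -> Self {
        // Rebuild the context on the receiving side.
        Self(fields.clone())
    }
}

/// Illustrative Barrier message: the trace context travels alongside the epoch.
struct Barrier {
    epoch: u64,
    tracing_context: HashMap<String, String>,
}

fn inject_barrier(epoch: u64) -> Barrier {
    // Sender side (e.g. meta injecting the barrier): capture and embed the context.
    Barrier {
        epoch,
        tracing_context: TracingContext::from_current_span().to_protobuf(),
    }
}

fn collect_barrier(barrier: &Barrier) {
    // Receiver side (e.g. a compute node): restore the context and open the local
    // actor/executor spans as its children, so spans from all nodes join one trace.
    let _ctx = TracingContext::from_protobuf(&barrier.tracing_context);
    let _epoch = barrier.epoch;
}

fn main() {
    let barrier = inject_barrier(65536);
    collect_barrier(&barrier);
}
```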

How to add more spans or events here

  • All existing logs (events) will be recorded in traces, following the same filtering configuration as stdout.
  • Add target: "rw_tracing" to an event! to record it only in traces without emitting it to the log (see the sketch below).
  • To add spans in the scope of the execution, follow the documentation here.
  • In the future, we may add tracing to meta RPCs and batch queries as well.
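A small sketch of those two points using the tracing crate directly (the span and field names are illustrative):

```rust
use tracing::{event, info_span, Level};

fn flush_cache(dirty_rows: usize) {
    // Enter a span for the duration of this scope so the work shows up as a
    // child span inside the epoch's trace.
    let _span = info_span!("flush_cache", dirty_rows).entered();

    // The "rw_tracing" target lets the log filter drop this event from stdout
    // while it still reaches the trace collector, as described above.
    event!(target: "rw_tracing", Level::INFO, dirty_rows, "flushing executor cache");
}

fn main() {
    tracing_subscriber::fmt::init();
    flush_cache(42);
}
```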

Integration with Grafana

Grafana supports "trace to metrics" and "trace to logs", which lets us navigate between the different forms of data and establish associations between them. We can adopt these to provide better observability in the future.


github-actions bot commented Jul 3, 2024

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean.
Don't worry if you think the issue is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄
