streaming: epoch-level distributed tracing #10000
Comments
It can be the streaming version of …
This would be really helpful, both for users and for us.
After #10315 and #10417, this feature is generally available for developers in local development. 🎉 Updated guides:
[Preview image of the trace timeline]
How to read this timeline
How to enable distributed tracing
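As a rough illustration, enabling tracing in a Rust service usually means installing an OTLP exporter and bridging it into the `tracing` subscriber so spans end up in a backend like Jaeger or Grafana Tempo. The sketch below uses the `opentelemetry-otlp` and `tracing-opentelemetry` crates; the endpoint, feature flags, and exact API details vary across crate versions, and this is not necessarily how RisingWave itself wires things up.

```rust
use opentelemetry_otlp::WithExportConfig;
use tracing_subscriber::prelude::*;

/// Install a tracing subscriber that exports spans over OTLP (gRPC).
/// Requires the `rt-tokio` feature of the `opentelemetry` crate.
fn init_tracing(otlp_endpoint: &str) -> anyhow::Result<()> {
    // Build an OTLP exporter pipeline; spans are batched and sent asynchronously.
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint(otlp_endpoint),
        )
        .install_batch(opentelemetry::runtime::Tokio)?;

    // Bridge `tracing` spans into OpenTelemetry and install the subscriber,
    // keeping a plain fmt layer for local logs.
    let otel_layer = tracing_opentelemetry::layer().with_tracer(tracer);
    tracing_subscriber::registry()
        .with(tracing_subscriber::fmt::layer())
        .with(otel_layer)
        .init();
    Ok(())
}
```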
How does it work
How to add more spans or events here
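With the `tracing` crate, a new child span or structured event can be wrapped around any piece of work so that it shows up on the epoch's timeline. The function, span, and field names below are made up for illustration and do not correspond to RisingWave's actual internals.

```rust
use tracing::Instrument;

async fn flush_state_cache(epoch: u64, dirty_rows: u64) {
    // `info_span!` creates a span that becomes a child of whatever span is
    // currently entered (e.g. the span covering this epoch's barrier).
    let span = tracing::info_span!("flush_state_cache", epoch, dirty_rows);

    async {
        // Structured events are attached to the enclosing span as logs.
        tracing::info!(dirty_rows, "start flushing dirty state to storage");
        // ... actual flush work would happen here ...
        tracing::info!("flush finished");
    }
    // `.instrument(span)` keeps the whole async block inside the span,
    // even across await points.
    .instrument(span)
    .await;
}
```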
Integration with Grafana
Grafana supports "trace to metrics" and "trace to logs", which let us jump between traces, metrics, and logs and correlate them. We can adopt these features to provide better observability in the future.
This issue has been open for 60 days with no activity. If you think it is still relevant today and needs to be done in the near future, you can comment to update its status, or just manually remove the inactivity label. You can also confidently close this issue as not planned to keep our backlog clean.
See #9905 (comment) for background.
Recently we've been putting effort into improving the stability of RisingWave under high workloads. A common observation is that the barrier latency increases abnormally after some time, possibly due to performance regressions in storage or the executor cache as data grows. In such cases, we have to spend time investigating the cause of the latency increase and locating the problematic executor.
There's a common technique called "distributed tracing" that tracks an event as it flows through the components of a distributed system, allowing developers to troubleshoot issues along the way. Typically, it is designed for ad-hoc requests like batch queries or serving point-gets. However, since we can slice an infinite streaming job into epochs, we can also treat each epoch as a separate finite event and apply the same technique.
By tracing how the barrier flows through each executor, we can easily see which executor spends a long time handling the data in that epoch.
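A minimal sketch of this idea in Rust with the `tracing` crate: one root span per epoch, with a child span per executor, so a slow executor shows up as a long bar on the trace timeline. The `Barrier` type, the sequential loop, and the span names are simplifications for illustration; real executors run concurrently and RisingWave's actual interfaces differ.

```rust
use tracing::Instrument;

struct Barrier {
    epoch: u64,
}

async fn process_epoch(barrier: Barrier, executor_names: &[&str]) {
    // One root span per epoch: the finite "event" being traced.
    let epoch_span = tracing::info_span!("epoch", epoch = barrier.epoch);

    async {
        for &name in executor_names {
            // Each executor's work for this epoch becomes a child span, so a
            // slow executor is immediately visible in the trace.
            let exec_span = tracing::info_span!("executor_barrier", executor = name);
            async {
                // ... process the chunks belonging to this epoch, then the barrier ...
            }
            .instrument(exec_span)
            .await;
        }
    }
    .instrument(epoch_span)
    .await;
}
```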