Enhancement: tracing support in Prow jobs #30010

Closed
howardjohn opened this issue Jul 5, 2023 · 8 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/testing Categorizes an issue or PR as relevant to SIG Testing.

Comments

@howardjohn
Contributor

howardjohn commented Jul 5, 2023

What would you like to be added: Support for distributed tracing in Prow. More details on what this means are below.

Why is this needed: To give visibility into job execution, both in a single job and in aggregate.


The end result we are looking for is to be able to generate a trace roughly like the following:

[Image: 2023-07-03_11-31-44 (screenshot of an example job trace)]

This was produced by a POC; I think the real implementation can include more information.

Prior Art

https://gitlab.com/gitlab-org/gitlab/-/issues/338943
https://buildkite.com/docs/agent/v3/tracing
https://plugins.jenkins.io/opentelemetry/
https://github.com/tektoncd/community/blob/main/teps/0124-distributed-tracing-for-tasks-and-pipelines.md

Implementation

Prow job tracing primarily involves two parts: the infrastructure components, and the actual test logic. These should be formed into a single cohesive trace (see picture above, test logic is in yellow).

Test Logic

For the most part, how a test handles tracing is outside the scope of Prow; it is the job author's responsibility. However, one aspect that needs care is ensuring that spans reported by the test attach to the same trace as the infrastructure spans.

This is done by propagation. In distributed systems, propagation is typically handled with HTTP headers (traceparent), but that doesn't work here. While there is no ratified standard for doing this otherwise, there is a growing de facto standard (see the prior art) of using a TRACEPARENT environment variable (open-telemetry/opentelemetry-specification#740). This seems well suited. The environment variable will need to be passed to the Pod environment and respected when the job sends traces.
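As a rough illustration of what "respecting" the variable could mean on the job side, here is a minimal sketch that reads TRACEPARENT and validates it against the W3C Trace Context layout (version-traceid-parentid-flags). The function and class names are hypothetical and not part of Prow or any OpenTelemetry SDK.

```python
import os
import re
from typing import Mapping, NamedTuple, Optional

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str   # 16 bytes, 32 hex characters
    parent_id: str  # 8 bytes, 16 hex characters
    flags: str


def parse_traceparent(value: Optional[str]) -> Optional[TraceParent]:
    """Return the parsed fields, or None if the value is missing or malformed."""
    if not value:
        return None
    match = _TRACEPARENT_RE.fullmatch(value.strip())
    if match is None:
        return None
    parsed = TraceParent(*match.groups())
    # Version ff and all-zero IDs are invalid per the Trace Context spec.
    if parsed.version == "ff":
        return None
    if set(parsed.trace_id) == {"0"} or set(parsed.parent_id) == {"0"}:
        return None
    return parsed


def traceparent_from_env(env: Mapping[str, str] = os.environ) -> Optional[TraceParent]:
    """Look up TRACEPARENT in the given environment mapping."""
    return parse_traceparent(env.get("TRACEPARENT"))
```

A job that gets None back would simply start its own trace, so a missing or broken variable degrades to unlinked spans rather than a failure.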

Sending traces from the job is fairly straightforward from that point on. Job authors will need to configure the job to send to the same tracing backend, of course, but otherwise can just send traces as normal. One issue may be that many jobs are largely bash; https://github.com/equinix-labs/otel-cli seems well suited to handle these cases.
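To make the per-command idea concrete, here is a small sketch of a wrapper that runs one command as a child span and hands a fresh TRACEPARENT down to it. This is not how otel-cli is implemented; `run_traced` and the span record layout are hypothetical, and exporting the record to a backend is deliberately left out.

```python
import os
import secrets
import subprocess
import time


def run_traced(name, argv, env=None):
    """Run argv as a child span of TRACEPARENT; return (exit_code, span_record)."""
    env = dict(os.environ if env is None else env)
    parts = env.get("TRACEPARENT", "").split("-")
    if len(parts) == 4:
        version, trace_id, parent_span_id, flags = parts
    else:
        # No usable parent: start a fresh trace (assumption: mark it sampled).
        version, trace_id, parent_span_id, flags = "00", secrets.token_hex(16), None, "01"

    span_id = secrets.token_hex(8)
    # The child process sees this span as its parent.
    env["TRACEPARENT"] = f"{version}-{trace_id}-{span_id}-{flags}"

    start = time.time_ns()
    result = subprocess.run(argv, env=env)
    end = time.time_ns()

    span = {
        "name": name,
        "trace_id": trace_id,
        "span_id": span_id,
        "parent_span_id": parent_span_id,
        "start_unix_nano": start,
        "end_unix_nano": end,
        "ok": result.returncode == 0,
    }
    return result.returncode, span
```

Because each wrapped command receives its own span ID as the parent, nested wrappers naturally form a tree under the job's trace.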

Prow Infra

For the infra side, we will need to report spans about a variety of things. I think some interesting things to measure are:

  • End to end job execution, as the root span
  • Pod start - end
  • Pod scheduling, image pulling, etc
  • Containers running
  • Actions within these containers - for example, git operations within clonerefs

I think there are two main approaches to this:

  1. Making a tracing reporter. This can look at the ProwJob and maybe other artifacts (clone-records.json) and compute the spans after the fact (it's perfectly fine to send spans out of order and in the past).

This is POCed in https://github.com/howardjohn/prow-tracing (as a standalone binary that is pointed at a historic job).

This approach seems the least invasive to me, and is pretty effective I think.
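The retroactive computation can be sketched roughly as follows. This is not the code from the POC; the input dictionary is a simplified, hypothetical stand-in for the timestamps found on a ProwJob and its pod, and the span records are plain dicts rather than real OTLP spans.

```python
from datetime import datetime, timezone


def to_unix_nano(rfc3339: str) -> int:
    """Convert a Kubernetes-style timestamp such as 2023-07-03T11:31:44Z."""
    dt = datetime.strptime(rfc3339, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return int(dt.timestamp()) * 1_000_000_000


def spans_for_job(job: dict) -> list:
    """Build a root span for the job plus one child span per container."""
    root = {
        "name": job["name"],
        "parent": None,
        "start": to_unix_nano(job["startTime"]),
        "end": to_unix_nano(job["completionTime"]),
    }
    spans = [root]
    for container in job.get("containers", []):
        spans.append({
            "name": "container/" + container["name"],
            "parent": root["name"],
            "start": to_unix_nano(container["startedAt"]),
            "end": to_unix_nano(container["finishedAt"]),
        })
    return spans


# Hypothetical input resembling a finished job.
example_job = {
    "name": "example-job",
    "startTime": "2023-07-03T11:31:44Z",
    "completionTime": "2023-07-03T11:41:44Z",
    "containers": [
        {"name": "clonerefs", "startedAt": "2023-07-03T11:32:00Z", "finishedAt": "2023-07-03T11:32:30Z"},
    ],
}
example_spans = spans_for_job(example_job)
```

Since every span carries explicit start and end times, the reporter can run at any point after the job completes.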

One concern here is that since we are creating the spans after the job runs, we cannot set the TRACEPARENT environment variable on the job in this approach. There are a few options for this. Either we do a bit of the next option and add just the root span outside of the reporter, or we can abuse the fact that trace IDs are globally unique 16-byte values -- just like the prowjob build UID. Using this fact, we can always create the root span with the build's ID as its trace ID, and test execution can use PROW_BUILD_ID when TRACEPARENT is not set (or that var can be set automatically by Prow). This approach is taken in the POC above.

  2. Native integration

Rather than retroactive analysis, we can do 'proper' tracing and integrate it throughout Prow. This would allow us to generate extremely fine-grained traces about whatever we want. The risk is that it permeates the entire codebase, unlike the reporter mode, which is completely standalone.
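A minimal sketch of the UID-as-trace-ID trick, assuming the build UID is a standard UUID string. The helper names are hypothetical, and deriving the root span ID by hashing the UID is my own assumption for illustration; the POC may pick the span ID differently.

```python
import hashlib
import uuid
from typing import Mapping


def trace_id_from_uid(uid: str) -> str:
    """A UUID is 16 bytes, the same width as a trace ID: reuse it as 32 hex characters."""
    return uuid.UUID(uid).hex


def root_span_id_from_uid(uid: str) -> str:
    """Assumption: derive the 8-byte root span ID deterministically by hashing the UID."""
    return hashlib.sha256(uid.encode("utf-8")).hexdigest()[:16]


def traceparent_for_job(env: Mapping[str, str]) -> str:
    """Prefer an explicit TRACEPARENT; otherwise fall back to PROW_BUILD_ID."""
    existing = env.get("TRACEPARENT")
    if existing:
        return existing
    uid = env["PROW_BUILD_ID"]
    return f"00-{trace_id_from_uid(uid)}-{root_span_id_from_uid(uid)}-01"
```

The point of making both IDs deterministic is that the job and the after-the-fact reporter can each compute the same root span without talking to each other.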

Configuration

I propose this only supports OpenTelemetry, which is the only recommended option these days. Within otel, though, there are a variety of "exporters" allowed. The primary one is "OTLP". This is a common protocol implemented by many vendors. In addition, otel offers a collector which accepts OTLP and does a variety of things, including exporting to anywhere.

One notable vendor that does not support OTLP is GCP tracing. I think most Prow users are using GCP, so this is a natural backend to use.

We could support both OTLP and GCP, or support only OTLP and have GCP users deploy a collector.

So overall I think we will only need a couple of config items for the collector endpoint and maybe a few others.
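For a sense of scale, the configuration surface could be about this small. The sketch below is only an illustration: `TracingConfig` and `load_tracing_config` are hypothetical names, and the default service name "prow" is an assumption. The two environment variables are the standard OpenTelemetry ones.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TracingConfig:
    endpoint: Optional[str]  # OTLP collector endpoint; None disables tracing
    service_name: str

    @property
    def enabled(self) -> bool:
        return self.endpoint is not None


def load_tracing_config(env: Mapping[str, str] = os.environ) -> TracingConfig:
    """Read the collector endpoint and service name, treating empty values as unset."""
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or None
    service_name = env.get("OTEL_SERVICE_NAME") or "prow"
    return TracingConfig(endpoint=endpoint, service_name=service_name)
```

Leaving the endpoint unset keeps tracing off by default, so existing deployments would be unaffected.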

@howardjohn howardjohn added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 5, 2023
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 5, 2023
@howardjohn
Contributor Author

I wrote up more about shell tracing. Not directly related but likely to be used in parallel: https://blog.howardjohn.info/posts/shell-tracing/

@michelle192837
Contributor

/sig testing
/cc @cjwagner @petr-muller

@k8s-ci-robot k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 11, 2023
@michelle192837
Contributor

/assign @cjwagner @petr-muller

howardjohn added a commit to howardjohn/tools that referenced this issue Jul 14, 2023
This will be used in conjunction with
kubernetes/test-infra#30010 and
https://blog.howardjohn.info/posts/shell-tracing/ to give us tracing of
job execution to help understand and analyze job/test execution better.

The tool is 18mb so pretty low cost.
istio-testing pushed a commit to istio/tools that referenced this issue Jul 14, 2023
This will be used in conjunction with
kubernetes/test-infra#30010 and
https://blog.howardjohn.info/posts/shell-tracing/ to give us tracing of
job execution to help understand and analyze job/test execution better.

The tool is 18mb so pretty low cost.
@petr-muller
Member

Sorry for not getting to this sooner; things are hard to follow in summer between vacations and catching up on the backlog after coming back from vacation.

I am very fond of the proposed feature in general, but I will need to read the proposal more closely to discuss the details. I will get to that this week.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 26, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 25, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".


@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 26, 2024
6 participants