OTel Agent Feature Support #1559

Merged
mackjmr merged 33 commits into main from mackjmr/otel-agent-feature-support
Dec 20, 2024

Member

@mackjmr mackjmr commented Dec 11, 2024

What does this PR do?

  • Add support for OTel Agent as a feature
  • Remove support for OTel Agent as a flag
  • Maintain support for OTel Agent as an annotation temporarily

Motivation

OTEL-2290

Additional Notes

Anything else we should know when reviewing?

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: vX.Y.Z
  • Cluster Agent: vX.Y.Z

Describe your test plan

This PR adds a new way to deploy the otel-agent: as a feature. To confirm that these changes work as expected, we need to validate both that the otel-agent runs without errors and that it can receive data from OTel clients and forward it to Datadog.

The Docker image mackjmr/app-all:v2 contains an app that sends OTLP traces every 5 seconds; we can use it to send data to the otel-agent.

This app can be deployed using the following deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-all
  labels:
    app: app-all
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app-all
  template:
    metadata:
      labels:
        app: app-all
    spec:
      containers:
      - name: app-all
        image: mackjmr/app-all:v2
        env:
        - name: MY_NODE_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: http://$(MY_NODE_IP):4317
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: "k8s.pod.ip=$(POD_IP),service.name=test-operator-changes"

Noteworthy:

  • OTEL_EXPORTER_OTLP_ENDPOINT is set so that traces are sent to $(MY_NODE_IP):4317 (the otel-agent listens on 4317 and binds the container port to the host port).
  • The service name is set via env var OTEL_RESOURCE_ATTRIBUTES.
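To make the `$(MY_NODE_IP)` and `$(POD_IP)` references above concrete, here is a minimal Go sketch of how Kubernetes-style dependent environment variables resolve. It is an illustration only, not the kubelet's implementation: `expandEnv` is a hypothetical helper, the IP addresses are made up, and the `$$(NAME)` escape syntax is deliberately not handled.

```go
package main

import (
	"fmt"
	"regexp"
)

// varRef matches Kubernetes-style references such as $(MY_NODE_IP).
var varRef = regexp.MustCompile(`\$\(([A-Za-z_][A-Za-z0-9_]*)\)`)

// expandEnv replaces each $(NAME) in value with vars[NAME].
// References to unknown names are left untouched.
// Simplification: the $$(NAME) escape syntax is not handled.
func expandEnv(value string, vars map[string]string) string {
	return varRef.ReplaceAllStringFunc(value, func(ref string) string {
		name := varRef.FindStringSubmatch(ref)[1]
		if v, ok := vars[name]; ok {
			return v
		}
		return ref
	})
}

func main() {
	// Hypothetical values standing in for status.hostIP and status.podIP.
	vars := map[string]string{
		"MY_NODE_IP": "10.0.0.5",
		"POD_IP":     "10.244.1.7",
	}
	fmt.Println(expandEnv("http://$(MY_NODE_IP):4317", vars))
	fmt.Println(expandEnv("k8s.pod.ip=$(POD_IP),service.name=test-operator-changes", vars))
}
```

With these made-up addresses, the app would export to the node's host port 4317 and tag its telemetry with its own pod IP.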

Known issues when testing:
The Agent will fail to start up in some Kubernetes environments (e.g. a Kind cluster) if it cannot detect the hostname. You can prevent this by setting the hostname yourself:

spec:
  override:
    nodeAgent:
      containers: 
        agent:
          env:
            - name: DD_HOSTNAME
              value: "test.node.name"

In Kind or Minikube, you also need to disable kubelet TLS verification:

  global:
    kubelet:
      tlsVerify: false

Scenarios to test:

1. Otel Agent Feature with collector config in configData:

Test 1: Default ports

Deploy the agent using the following manifest: https://github.com/DataDog/datadog-operator/blob/1a69194e7c958641288a3f0489372acb2cecfa34/examples/datadogagent/datadog-agent-with-otel-agent.yaml.

  • Ensure that the otel-agent starts up without issues or errors.
  • Once the above has been validated, deploy the app-all image mentioned above, and ensure that traces are making it into Datadog.

Test 2: Non-default ports

Same steps as above, but add a ports section to the manifest:

  features:
    otelCollector:
      enabled: true
      ports:
        - containerPort: 3333
          name: otel-grpc

and change the port in the collector config:

    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:3333

and in app-all deployment change OTEL_EXPORTER_OTLP_ENDPOINT to http://$(MY_NODE_IP):3333.
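The non-default port test only works if the feature's containerPort and the collector's receiver endpoint agree. As a rough sketch of that consistency check (the helper names `endpointPort` and `portsAgree` are hypothetical and not part of the operator), one could compare the two values like this:

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// endpointPort extracts the numeric port from a host:port endpoint
// such as "0.0.0.0:3333".
func endpointPort(endpoint string) (int, error) {
	_, portStr, err := net.SplitHostPort(endpoint)
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(portStr)
}

// portsAgree reports whether the receiver endpoint listens on containerPort.
func portsAgree(endpoint string, containerPort int) (bool, error) {
	p, err := endpointPort(endpoint)
	if err != nil {
		return false, err
	}
	return p == containerPort, nil
}

func main() {
	// Matching pair from the test above.
	ok, err := portsAgree("0.0.0.0:3333", 3333)
	fmt.Println(ok, err)

	// Mismatch: collector still on the default port, feature port changed.
	ok, err = portsAgree("0.0.0.0:4317", 3333)
	fmt.Println(ok, err)
}
```

A mismatch like the second case would leave the app sending traces to a host port that nothing in the otel-agent container is listening on.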

2. Otel Agent Feature with collector config in configMap:

Deploy the agent using the following manifest: https://github.com/DataDog/datadog-operator/blob/1a69194e7c958641288a3f0489372acb2cecfa34/examples/datadogagent/datadog-agent-with-otel-agent-configmap.yaml.

  • Ensure that the otel-agent starts up without issues or errors.
  • Once the above has been validated, deploy the app-all image mentioned above, and ensure that traces are making it into Datadog.

3. Otel Agent Feature without passing collector config:

When we don't pass a collector config, a default config is used.

Test 1: Default ports

Deploy the agent using the following:

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  global:
    credentials:
      apiKey: <DATADOG_API_KEY>
  features:
    otelCollector:
      enabled: true

  • Ensure that the otel-agent starts up without issues or errors.
  • Once the above has been validated, deploy the app-all image mentioned above, and ensure that traces are making it into Datadog.

Test 2: Non-default ports

Deploy the agent using the following:

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  global:
    credentials:
      apiKey: <DATADOG_API_KEY>
  features:
    otelCollector:
      enabled: true
      ports:
        - containerPort: 3333
          name: otel-grpc

  • Ensure that the otel-agent starts up without issues or errors.
  • Once the above has been validated, deploy the app-all image mentioned above, changing OTEL_EXPORTER_OTLP_ENDPOINT to http://$(MY_NODE_IP):3333, and ensure that traces are making it into Datadog.
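The two tests in this scenario exercise a fallback: use the configured ports when present, otherwise use defaults. The sketch below illustrates that pattern only. `ContainerPort` is a pared-down stand-in for the Kubernetes type, `resolvePorts` is a hypothetical helper, and the default list (the standard OTLP ports 4317 for gRPC and 4318 for HTTP, with an assumed `otel-http` name) is an assumption, not something confirmed by this PR.

```go
package main

import "fmt"

// ContainerPort is a pared-down stand-in for corev1.ContainerPort.
type ContainerPort struct {
	Name          string
	ContainerPort int32
}

// defaultPorts returns the standard OTLP ports (4317 gRPC, 4318 HTTP).
// Assumption: the operator's real defaults may differ from this list.
func defaultPorts() []ContainerPort {
	return []ContainerPort{
		{Name: "otel-grpc", ContainerPort: 4317},
		{Name: "otel-http", ContainerPort: 4318},
	}
}

// resolvePorts returns the configured ports, or the defaults when
// none were configured.
func resolvePorts(configured []ContainerPort) []ContainerPort {
	if len(configured) == 0 {
		return defaultPorts()
	}
	return configured
}

func main() {
	// Test 1: nothing configured, so the defaults apply.
	fmt.Println(resolvePorts(nil))

	// Test 2: a single custom gRPC port replaces the defaults.
	fmt.Println(resolvePorts([]ContainerPort{{Name: "otel-grpc", ContainerPort: 3333}}))
}
```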

4. Otel Agent via Annotation:

We need to validate that deploying otel agent with annotations still works.

Deploy the agent using the following manifest: https://github.com/DataDog/datadog-operator/blob/1a69194e7c958641288a3f0489372acb2cecfa34/examples/datadogagent/datadog-agent-with-otel-agent-annotations.yaml.

  • Ensure that the otel-agent starts up without issues or errors.
  • Once the above has been validated, deploy the app-all image mentioned above, and ensure that traces are making it into Datadog.

5. Flare:

Test 1: Core config disabled

Disable the core config:

  features:
    otelCollector:
      coreConfig: 
        enabled: false

Trigger a flare (you can check the flare locally, or if you want to submit it, you can use ticket #1970459), and ensure that otel/otel-agent.log contains the following log line:

'otelcollector.enabled' is disabled in the configuration
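If you unpack the flare locally, the check can be scripted. The following is a small sketch, assuming you have already read otel/otel-agent.log into a string; `logMentions` is a hypothetical helper and the sample log lines are invented for illustration, with only the quoted message taken from above.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// disabledMsg is the message expected when coreConfig.enabled is false.
const disabledMsg = "'otelcollector.enabled' is disabled in the configuration"

// logMentions reports whether any line of log contains msg.
func logMentions(log, msg string) bool {
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		if strings.Contains(sc.Text(), msg) {
			return true
		}
	}
	return false
}

func main() {
	// Invented sample content; a real flare log will look different.
	sample := "starting otel-agent\n" +
		"'otelcollector.enabled' is disabled in the configuration\n"
	fmt.Println(logMentions(sample, disabledMsg))
}
```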

Test 2: Core config enabled

otelcollector.enabled should be enabled by default when the otel collector feature is enabled, so you can leave coreConfig empty (this config will do: https://github.com/DataDog/datadog-operator/blob/1a69194e7c958641288a3f0489372acb2cecfa34/examples/datadogagent/datadog-agent-with-otel-agent.yaml). Trigger a flare (you can check the flare locally, or if you want to submit it, you can use ticket #1970459), and ensure that the otel directory contains the data shown in the following screenshot:

[Screenshot: 2024-12-18_13-27-44]

Test 3: Core config other params

Add the following params:

  features:
    otelCollector:
      coreConfig: 
        enabled: true
        extension_timeout: 13
        extension_url: "https://localhost:7777"

Describe the pod (kubectl describe pod -n <namespace> <pod_name>), and ensure the following env vars are set:

      DD_OTELCOLLECTOR_ENABLED:                              true
      DD_OTELCOLLECTOR_EXTENSION_TIMEOUT:                    13
      DD_OTELCOLLECTOR_EXTENSION_URL:                        https://localhost:7777
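The relationship between the coreConfig fields and the env vars above can be sketched as a simple mapping. This is an illustration under assumptions, not the operator's code: `CoreConfig`, `EnvVar`, and `coreConfigEnv` are hypothetical names, and treating an unset `enabled` as true (and always emitting it) is inferred from the "enabled by default" note in Test 2.

```go
package main

import (
	"fmt"
	"strconv"
)

// CoreConfig is a hypothetical mirror of features.otelCollector.coreConfig.
type CoreConfig struct {
	Enabled          *bool
	ExtensionTimeout *int
	ExtensionURL     *string
}

// EnvVar is a pared-down stand-in for corev1.EnvVar.
type EnvVar struct {
	Name  string
	Value string
}

// coreConfigEnv maps coreConfig fields to DD_OTELCOLLECTOR_* env vars.
// Assumption: an unset Enabled is treated as true.
func coreConfigEnv(c CoreConfig) []EnvVar {
	enabled := true
	if c.Enabled != nil {
		enabled = *c.Enabled
	}
	env := []EnvVar{{Name: "DD_OTELCOLLECTOR_ENABLED", Value: strconv.FormatBool(enabled)}}
	if c.ExtensionTimeout != nil {
		env = append(env, EnvVar{Name: "DD_OTELCOLLECTOR_EXTENSION_TIMEOUT", Value: strconv.Itoa(*c.ExtensionTimeout)})
	}
	if c.ExtensionURL != nil {
		env = append(env, EnvVar{Name: "DD_OTELCOLLECTOR_EXTENSION_URL", Value: *c.ExtensionURL})
	}
	return env
}

func main() {
	enabled, timeout, url := true, 13, "https://localhost:7777"
	cfg := CoreConfig{Enabled: &enabled, ExtensionTimeout: &timeout, ExtensionURL: &url}
	for _, e := range coreConfigEnv(cfg) {
		fmt.Printf("%s: %s\n", e.Name, e.Value)
	}
}
```

Pointer fields are used so that "not set" can be told apart from a zero value, which matters for optional settings like the timeout.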

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label

@mackjmr mackjmr added the enhancement New feature or request label Dec 12, 2024
@mackjmr mackjmr added this to the v1.11.0 milestone Dec 12, 2024
func volumeMountsForInitConfig() []corev1.VolumeMount {
	return []corev1.VolumeMount{
		common.GetVolumeMountForLogs(),
Contributor

Why delete the initContainer configs? Is this even in the otel agent code path?

Member Author

This was a mistake; I meant to remove this for the otel agent volumes. Undid the change.

@@ -44,6 +44,8 @@ type DatadogAgentSpec struct {
type DatadogFeatures struct {
	// Application-level features

	// OTelAgent configuration.
	OTelAgent *OTelAgentFeatureConfig `json:"otelAgent,omitempty"`
Contributor

Is otelAgent really the best name for this feature? Should this be otelCollector or just otel? Other features don't mention 'Agent'.

Member Author

Renamed feature from otelAgent to otelCollector.

codecov-commenter commented Dec 13, 2024

Codecov Report

Attention: Patch coverage is 81.65138% with 40 lines in your changes missing coverage. Please review.

Project coverage is 48.96%. Comparing base (f517775) to head (137220d).

Files with missing lines                                 Patch %   Lines
...controller/datadogagent/component/agent/default.go    0.00%     17 Missing ⚠️
...ller/datadogagent/feature/otelcollector/feature.go    88.81%    13 Missing and 3 partials ⚠️
api/datadoghq/v2alpha1/test/builder.go                   88.23%    4 Missing and 2 partials ⚠️
...ller/datadogagent/feature/enabledefault/feature.go    0.00%     0 Missing and 1 partial ⚠️
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1559      +/-   ##
==========================================
+ Coverage   48.56%   48.96%   +0.39%     
==========================================
  Files         226      227       +1     
  Lines       20435    20626     +191     
==========================================
+ Hits         9925    10100     +175     
- Misses       9989    10001      +12     
- Partials      521      525       +4     
Flag Coverage Δ
unittests 48.96% <81.65%> (+0.39%) ⬆️

Files with missing lines                                 Coverage Δ
api/datadoghq/v2alpha1/datadogagent_default.go           91.10% <100.00%> (+0.14%) ⬆️
api/datadoghq/v2alpha1/datadogagent_types.go             100.00% <ø> (ø)
cmd/main.go                                              0.00% <ø> (ø)
internal/controller/datadogagent/controller.go           51.85% <ø> (-1.72%) ⬇️
internal/controller/datadogagent/feature/types.go        26.92% <ø> (ø)
internal/controller/setup.go                             51.49% <ø> (-0.36%) ⬇️
pkg/defaulting/images.go                                 100.00% <ø> (ø)
...ller/datadogagent/feature/enabledefault/feature.go    36.62% <0.00%> (+0.14%) ⬆️
api/datadoghq/v2alpha1/test/builder.go                   91.52% <88.23%> (-0.29%) ⬇️
...ller/datadogagent/feature/otelcollector/feature.go    88.81% <88.81%> (ø)
... and 1 more

@mackjmr mackjmr marked this pull request as ready for review December 13, 2024 14:47
@mackjmr mackjmr requested review from a team as code owners December 13, 2024 14:47
@levan-m levan-m modified the milestones: v1.11.0, v1.12.0 Dec 16, 2024
@@ -104,6 +104,11 @@ func agentImage() string {
	return fmt.Sprintf("%s/%s:%s", v2alpha1.DefaultImageRegistry, v2alpha1.DefaultAgentImageName, defaulting.AgentLatestVersion)
}

func otelAgentImage() string {
	// todo(mackjmr): make this dynamic once we have a non-dev image.
	return "datadog/agent-dev:nightly-ot-beta-main"
Contributor

What's the ETA of the non-dev image? Will this todo be resolved in this PR?

Member Author

@mackjmr mackjmr Dec 17, 2024

Will this todo be resolved in this PR?

No, and deferring to @dineshg13 for the ETA of the non-dev image.

Contributor

Can this be moved with the rest of the image constants?

	GCRContainerRegistry ContainerRegistry = "gcr.io/datadoghq"
	// DockerHubContainerRegistry corresponds to the datadoghq docker.io registry
	DockerHubContainerRegistry ContainerRegistry = "docker.io/datadog"
	// PublicECSContainerRegistry corresponds to the datadoghq PublicECSContainerRegistry registry
	PublicECSContainerRegistry ContainerRegistry = "public.ecr.aws/datadog"
	// DefaultImageRegistry corresponds to the datadoghq containers registry
	DefaultImageRegistry = GCRContainerRegistry // TODO: this is also defined elsewhere and not used; consolidate
	// JMXTagSuffix prefix tag for agent JMX images
	JMXTagSuffix = "-jmx"
	agentImageName = "agent"
	clusterAgentImageName = "cluster-agent"
)

Member Author

Still discussing with Dinesh what image to use, I'll keep this in mind when I update it.

Member Author

@levan-m moved to images.go in d1345c2

},
}
}
o.ports = dda.Spec.Features.OtelCollector.Ports
Contributor

Why not assign the configured ports' structure directly to o.ports? As a general practice, we don't modify the DDA within features; instead, we save the relevant config in the feature struct otelCollectorFeature.

Member Author

Makes sense, updated in: ff7868e.
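The practice described in this thread (keep feature state in the feature struct and leave the DDA untouched) can be sketched as follows. All types here are simplified stand-ins, not the operator's real definitions, and the host-port defaulting is an assumed example of a derived value, chosen because the test plan notes that the container port is bound to the host port.

```go
package main

import "fmt"

// ContainerPort is a pared-down stand-in for corev1.ContainerPort.
type ContainerPort struct {
	Name          string
	ContainerPort int32
	HostPort      int32
}

// ddaSpec is a hypothetical slice of the DatadogAgent spec.
type ddaSpec struct {
	Ports []ContainerPort
}

// otelCollectorFeature holds the feature's own view of the config.
type otelCollectorFeature struct {
	ports []ContainerPort
}

// configure copies the ports out of the spec and applies defaults to
// the copy, so the spec itself is never mutated.
func (o *otelCollectorFeature) configure(spec *ddaSpec) {
	o.ports = append([]ContainerPort(nil), spec.Ports...)
	for i := range o.ports {
		if o.ports[i].HostPort == 0 {
			// Assumed default: bind the host port to the container port.
			o.ports[i].HostPort = o.ports[i].ContainerPort
		}
	}
}

func main() {
	spec := &ddaSpec{Ports: []ContainerPort{{Name: "otel-grpc", ContainerPort: 3333}}}
	feature := &otelCollectorFeature{}
	feature.configure(spec)

	fmt.Println("feature host port:", feature.ports[0].HostPort)
	fmt.Println("spec host port:", spec.Ports[0].HostPort)
}
```

Copying the slice before defaulting is the important step: assigning `spec.Ports` directly and then editing elements would write through to the spec's backing array.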

Contributor

@levan-m levan-m left a comment

Reviewed latest commits and looks good to me. Please verify the test cluster state isn't caused by this change before merging.

@mackjmr mackjmr merged commit e2f08ba into main Dec 20, 2024
19 checks passed
@mackjmr mackjmr deleted the mackjmr/otel-agent-feature-support branch December 20, 2024 17:29

wspurgin commented Jan 3, 2025

How would one enable OTEL prior to this PR if using the operator? I see that there's a command line flag, but it's unclear from the docs how to use those in a datadog manifest.

Relatedly, is there any target release planned that will include this feature?

Contributor

levan-m commented Jan 3, 2025

Hi @wspurgin, the command line flag was added to enable internal testing using the override section of the DatadogAgent CRD. This PR delivers feature support and will be released with v1.12, ETA early February.


wspurgin commented Jan 3, 2025

Thanks for the response @levan-m - so am I right in my understanding that there is not any way a user of the DD operator helm chart could enable the "beta" OTeL flag themselves?

Is there any concern with allowing us to opt in for the time being (for users of the operator from versions v1.8 -> v1.11.1)? Otherwise I'll just have to drop the operator entirely and configure the agents in Helm directly (the operator is the recommended way, so I'd prefer obviously not to undo my last week of work to do the non-recommended approach 😓)

Contributor

levan-m commented Jan 3, 2025

@wspurgin, there are ways to enable it (render the Helm chart and edit the manifest, or use the annotation added in #1475); however, the main challenge will be correctly configuring the OTel agent.

I'd recommend waiting until next week for the 1.12 release candidate and using the feature added with this PR. For the RC setup you will need to install the CRDs manually, as the Helm chart isn't updated until the final release.


wspurgin commented Jan 3, 2025

Thanks @levan-m - I'll just wait for that RC. I tried the annotation (but the agent would crash with an error saying otel-agent didn't exist) - like you said I'm probably missing some config for that to work properly. I'll be on the lookout for that RC once it's out! Thanks again 👍

@swang392 swang392 mentioned this pull request Jan 7, 2025
2 tasks
mackjmr added a commit that referenced this pull request Jan 14, 2025
This PR removes support for enabling otel-agent via annotations in favor of enabling otel-agent via the otelCollector feature: #1559.

The annotation use has already been removed in staging: DataDog/k8s-datadog-agent-ops#4602.
@mackjmr mackjmr mentioned this pull request Jan 14, 2025
2 tasks
fanny-jiang pushed a commit that referenced this pull request Jan 14, 2025
This PR removes support for enabling otel-agent via annotations in favor of enabling otel-agent via the otelCollector feature: #1559.

The annotation use has already been removed in staging: DataDog/k8s-datadog-agent-ops#4602.