Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added agent lifecycle enhancements #1716

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

luizsoares-clickup
Copy link

What this PR does / why we need it:

This PR adds two new chart values.

agents.lifecycle, if defined, will provide us with a way to inject a Lifecycle block into the agents PodSpecs
agents.terminationGracePeriodSeconds, when set, alters the amount of time Kubernetes waits until forcibly terminating the agents pods.

Why?

When a Kubernetes node is being terminated, the kubelet signals everything non-critical at the same time. It is possible for the agent to finish the shutdown faster than the application pods, which leaves us with no metrics while app connections are drained.

Which issue this PR fixes

N/A

Special notes for your reviewer:

Checklist

[Place an '[x]' (no spaces) in all applicable fields. Please remove unrelated fields.]

  • Chart Version bumped
  • Documentation has been updated with helm-docs (run: .github/helm-docs.sh)
  • CHANGELOG.md has been updated
  • Variables are documented in the README.md
  • For Datadog Operator chart or value changes update the test baselines (run: make update-test-baselines)

This commit adds support for a couple of features which, if turned on,
apply to all agent containers generated by the chart.

1. it adds agents.lifecycle, which accepts a LifecycleHandler

ref: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.30/#lifecyclehandler-v1-core

2. it adds agents.terminationGracePeriodSeconds, which applies to all
   agent pods if defined.

What is the motivation for this commit?

When a Kubernetes node is being shut down, it will send the TERM signal
to all non-critical pods, including Datadog. Datadog might (and often
does) finish shutting down before the application pods, which gives us
a blind spot during a very important transition.

Adding these options allows us to add, for instance, a preStop hook
which will keep the agent alive for a little longer while connections
are drained and other pods are shut down.
@luizsoares-clickup luizsoares-clickup requested review from a team as code owners February 24, 2025 17:00
Copy link
Contributor

@fanny-jiang fanny-jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution! I left some comments, the main one being excluding agents.lifecycle for .Values.providers.gke.autopilot.

Comment on lines 5 to 8
{{- if .Values.agents.lifecycle }}
lifecycle:
{{ toYaml .Values.agents.lifecycle | indent 4 }}
{{- end }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should gate lifecycle.[postStart|preStop].exec.command for GKE Autopilot since those fields are restricted in Autopilot. All of the other lifecycle.[postStart|preStart].* fields are allowed though.

Let's just not support agents.lifecycle for autopilot for this PR and we can work on adding minimal support for it in a future PR.

Copy link
Contributor

@fanny-jiang fanny-jiang Mar 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{{- if .Values.agents.lifecycle }}
lifecycle:
{{ toYaml .Values.agents.lifecycle | indent 4 }}
{{- end }}
{{- if and (.Values.agents.lifecycle) (not .Values.providers.gke.autopilot) }}
lifecycle:
{{ toYaml .Values.agents.lifecycle | indent 4 }}
{{- end }}

This change should be added to all the container definitions except for security-agent and otel-agent

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fanny-jiang can you clarify why not for these two? In this PR I'm adding the lifecycle block for all of them. Are they prevented from being instantiated in Autopilot via a different mechanism?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have pushed changes aligned with what you said above, not touching security or otel agent. Please let me know if these will require adjustments as well

@FredMora
Copy link

Thank you, @luizsoares-clickup , for the excellent summary in the PR description. You are setting the bar for the rest of us.

@luizsoares-clickup
Copy link
Author

Thank you for your contribution! I left some comments, the main one being excluding agents.lifecycle for .Values.providers.gke.autopilot.

Thank you for reviewing! The autopilot thing makes a lot of sense, I totally wrote this for my own selfish case in EKS, I didn't think too much about all the other stuff you guys have to support.

I'll take a look at the suggested changes and will push fixes by tomorrow 🙏

Also renamed the tests in the ci/ folder to align on file naming patterns
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants