-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added agent lifecycle enhancements #1716
base: main
Are you sure you want to change the base?
Conversation
This commit adds support for a couple of features which, if turned on, apply to all agent containers generated by the chart. 1. it adds agents.lifecycle, which accepts a LifecycleHandler ref: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.30/#lifecyclehandler-v1-core 2. it adds agents.terminationGracePeriodSeconds, which applies to all agent pods if defined. What is the motivation for this commit? When a Kubernetes node is being shut down, it will send the TERM signal to all non-critical pods, including Datadog. Datadog might (and often does) finish shutting down before the application pods, which gives us a blind spot during a very important transition. Adding these options allows us to add, for instance, a preStop hook which will keep the agent alive for a little longer while connections are drained and other pods are shut down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution! I left some comments, the main one being excluding agents.lifecycle
for .Values.providers.gke.autopilot
.
{{- if .Values.agents.lifecycle }} | ||
lifecycle: | ||
{{ toYaml .Values.agents.lifecycle | indent 4 }} | ||
{{- end }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should gate lifecycle.[postStart|preStop].exec.command
for GKE Autopilot since those fields are restricted in Autopilot. All of the other lifecycle.[postStart|preStart].*
fields are allowed though.
Let's just not support agents.lifecycle
for autopilot for this PR and we can work on adding minimal support for it in a future PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{{- if .Values.agents.lifecycle }} | |
lifecycle: | |
{{ toYaml .Values.agents.lifecycle | indent 4 }} | |
{{- end }} | |
{{- if and (.Values.agents.lifecycle) (not .Values.providers.gke.autopilot) }} | |
lifecycle: | |
{{ toYaml .Values.agents.lifecycle | indent 4 }} | |
{{- end }} |
This change should be added to all the container definitions except for security-agent
and otel-agent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fanny-jiang can you clarify why not for these two? In this PR I'm adding the lifecycle block for all of them. Are they prevented from being instantiated in Autopilot via a different mechanism?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have pushed changes aligned with what you said above, not touching security or otel agent. Please let me know if these will require adjustments as well
Thank you, @luizsoares-clickup , for the excellent summary in the PR description. You are setting the bar for the rest of us. |
Thank you for reviewing! The autopilot thing makes a lot of sense, I totally wrote this for my own selfish case in EKS, I didn't think too much about all the other stuff you guys have to support. I'll take a look at the suggested changes and will push fixes by tomorrow 🙏 |
Also renamed the tests in the ci/ folder to align on file naming patterns
What this PR does / why we need it:
This PR adds two new chart values.
agents.lifecycle
, if defined, will provide us with a way to inject a Lifecycle block into the agents PodSpecsagents.terminationGracePeriodSeconds
, when set, alters the amount of time Kubernetes waits until forcibly terminating the agents pods.Why?
When a Kubernetes node is being terminated, the kubelet signals everything non-critical at the same time. It is possible for the agent to finish the shutdown faster than the application pods, which leaves us with no metrics while app connections are drained.
Which issue this PR fixes
N/A
Special notes for your reviewer:
Checklist
[Place an '[x]' (no spaces) in all applicable fields. Please remove unrelated fields.]
.github/helm-docs.sh
)CHANGELOG.md
has been updatedREADME.md
make update-test-baselines
)