feat: ability to emit Kubernetes events #411
Conversation
cmd/node-termination-handler.go
Outdated
@@ -320,11 +335,14 @@ func drainOrCordonIfNecessary(interruptionEventStore *interruptioneventstore.Sto
	log.Err(err).Msgf("node '%s' not found in the cluster", nodeName)
} else {
	log.Err(err).Msg("There was a problem while trying to cordon and drain the node")
	metrics.NodeActionsInc("cordon-and-drain", nodeName, err)
Added the `metrics` line here too so a metric is also created when an error occurs.
Looks like tests pass and fail randomly; they passed locally before pushing the PR. Will check tomorrow.
I put this PR through a real-world scenario at work and I immediately missed the emitting node's AZ and instance type in the events, which I plan to use to build reports like types of interruptions by AZ and types of interruptions by instance type. So I took a second pass here and added a default set of annotations to all events, basically all the data that was already being gathered from IMDS.
Thanks for the PR! I was using events to debug something the other day, so I can see how this would be useful.
Overall it looks pretty good to me, and I appreciate you taking the time to clean up some of the docs and the Prometheus test. Would you mind posting an example of how these get rendered with `kubectl get events ...`?
pkg/observability/k8s-events.go
Outdated
// Emit a Kubernetes event for the current node and with the given event type, reason and message
func (r K8sEventRecorder) Emit(eventType, eventReason, eventMsgFmt string, eventMsgArgs ...interface{}) {
	if r.enabled {
		r.AnnotatedEventf(r.node, r.annotations, eventType, eventReason, eventMsgFmt, eventMsgArgs...)
In queue processor mode, `node` might not be the node that's being cordoned and drained; is that intentional? Perhaps we should pass in `node`?
Good point, I totally missed that. Revamped it so that instead of getting the `Node` from the API when initialising the recorder, we just create an `ObjectReference` to a `Node` object with a parameterised name. That name is passed in every time `recorder.Emit()` is called and comes from the `nodeName` stored in each particular interruption event, making it valid for both the IMDS and SQS processors, if I'm not mistaken.
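For reference, here is a minimal sketch of what the revamped recorder can look like -- not the exact PR code; the struct fields and the `nodeName` parameter shown here are assumptions based on the description above:

```go
package observability

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// K8sEventRecorder wraps a client-go EventRecorder; the fields are illustrative.
type K8sEventRecorder struct {
	record.EventRecorder
	enabled     bool
	annotations map[string]string
}

// Emit emits an annotated event against the node identified by nodeName, which
// comes from the interruption event being processed, so the same recorder works
// for both the IMDS and SQS (queue processor) paths.
func (r K8sEventRecorder) Emit(nodeName, eventType, eventReason, eventMsgFmt string, eventMsgArgs ...interface{}) {
	if !r.enabled {
		return
	}
	// A name-only reference is enough for the event's involvedObject; no Node
	// lookup against the API is needed when initialising the recorder.
	nodeRef := &corev1.ObjectReference{
		Kind: "Node",
		Name: nodeName,
	}
	r.AnnotatedEventf(nodeRef, r.annotations, eventType, eventReason, eventMsgFmt, eventMsgArgs...)
}
```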
pkg/observability/k8s-events.go
Outdated
annotations["public-ipv4"] = nodeMetadata.PublicIP | ||
annotations["region"] = nodeMetadata.Region | ||
|
||
// Parse extra annotations |
nit - you can remove the comments from here through the end of the function, they're just stating what the next lines show
Done
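For context on the `// Parse extra annotations` step in the snippet above, here's a rough sketch of what such parsing could look like; the `key1=val1,key2=val2` input format and the function name are assumptions for illustration, not necessarily what the PR implements:

```go
package observability

import (
	"fmt"
	"strings"
)

// parseExtraAnnotations merges user-provided annotations (e.g. cluster name,
// owner, environment) into an existing annotation map.
func parseExtraAnnotations(annotations map[string]string, raw string) (map[string]string, error) {
	if raw == "" {
		return annotations, nil
	}
	for _, pair := range strings.Split(raw, ",") {
		parts := strings.SplitN(pair, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("invalid annotation %q, expected key=value", pair)
		}
		annotations[parts[0]] = parts[1]
	}
	return annotations, nil
}
```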
pkg/observability/k8s-events.go
Outdated
}

// Create default annotations
// Worth iterating over nodeMetadata fields using reflect? (trutx)
imo it's fine as you've done it
cmd/node-termination-handler.go
Outdated
}
metrics.NodeActionsInc("uncordon", nodeName, err)
recorder.Emit(observability.Normal, observability.UncordonReason, observability.UncordonMsg)
Should this be in an `else`?
Good catch, did it everywhere but here. Fixed.
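The resulting shape is roughly the following fragment-style sketch; the error-branch event constants (`Warning`, `UncordonErrReason`, `UncordonErrMsgFmt`) are hypothetical names used for illustration, and the `Emit` signature matches the outdated snippet above:

```go
err := node.Uncordon(nodeName)
if err != nil {
	log.Err(err).Msg("There was a problem while trying to uncordon the node")
	recorder.Emit(observability.Warning, observability.UncordonErrReason, observability.UncordonErrMsgFmt, err)
} else {
	// Only report success when the uncordon actually succeeded.
	recorder.Emit(observability.Normal, observability.UncordonReason, observability.UncordonMsg)
}
metrics.NodeActionsInc("uncordon", nodeName, err)
```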
verbs:
- create
- get
- list
- patch
- update
- watch
are all these necessary?
Probably not. Removed them all but `create`.
Thanks a lot for the contribution and enhancing the existing code!
🤔 I pushed changes to my fork but this PR doesn't sync.
Ok there we go.
docs/kubernetes_events.md
Outdated
## Caveats

### Default annotations in Queue Processor Mode
Added this to document that annotations in SQS mode will match those of the Node where NTH runs, not those of the Node where the interruption, drain or whatever event actually happens.
Not sure if that's the best approach or if we should just not create the default, IMDS-based annotations when running in SQS mode. Happy to code that if preferred.
Thanks for the helpful explanations 👍🏼
Similar to @haugenj, I'd like to see some examples of events before making that call. If the annotations introduce a lot of noise in Queue Processor mode, then maybe we omit them.
I'll post an example and a screenshot when I have something to show. I'm waiting for actual rebalance or interruption events to happen in my test clusters -- probably once the US wakes up, since the clusters live in US regions.
docs/kubernetes_events.md
Outdated
## Default annotations

If `emit-kubernetes-events` is enabled, AWS Node Termination Handler will automatically inject a set of annotations to each event it emits. Such annotations are gathered from the underlying host's IMDS endpoint and enrich each event with information about the host that emitted it.
could you elevate your note here about Queue Processor mode?
Ex: Note: In Queue Processor mode, these annotations will reflect the node running NTH, not the node receiving the events. See Caveats for more information.
Done.
Feel free to edit/correct any documentation wording at will. A native speaker's English will always be better than mine.
docs/kubernetes_events.md
Outdated
## Caveats

### Default annotations in Queue Processor Mode
I just realised how relevant it is to have the instance lifecycle available, so I added it. Still waiting for a real interruption or recommendation event to happen. The problem is that the events TTL in k8s is 1 hour by default unless modified, and it's too much effort for me to change it in a ~550-node cluster, so I guess we'll have to wait and be lucky enough to see one before it expires.
@@ -75,9 +75,6 @@ func (m SQSMonitor) ec2StateChangeToInterruptionEvent(event EventBridgeEvent, me
	InstanceID: ec2StateChangeDetail.InstanceID,
	Description: fmt.Sprintf("EC2 State Change event received. Instance went into %s at %s \n", ec2StateChangeDetail.State, event.getTime()),
}
if err != nil {
Adding this error handling here made the `ec2-state-change-sqs-test` test fail. Not sure if swallowing this error is the way to go, but that's out of the scope of this PR. I'm just replacing `err` with `_` for the sake of the linter, but IMO this should be revisited.
Also valid for `pkg/monitor/sqsevent/rebalance-recommendation-event.go` and `pkg/monitor/sqsevent/spot-itn-event.go`.
Here's an anonymized example of a rebalance recommendation that just happened:

$ k get ev --field-selector source=aws-node-termination-handler
LAST SEEN   TYPE     REASON                    OBJECT                                        MESSAGE
13m         Normal   RebalanceRecommendation   node/ip-10-1-2-3.us-east-2.compute.internal   Rebalance recommendation received. Instance will be cordoned at 2021-04-26T15:10:48Z
13m         Normal   PreDrain                  node/ip-10-1-2-3.us-east-2.compute.internal   Pre-drain task successfully executed
13m         Normal   CordonAndDrain            node/ip-10-1-2-3.us-east-2.compute.internal   Node successfully cordoned and drained

$ k get ev --field-selector source=aws-node-termination-handler -o wide
LAST SEEN   TYPE     REASON                    OBJECT                                        SUBOBJECT   SOURCE                                                                  MESSAGE                                                                                 FIRST SEEN   COUNT   NAME
13m         Normal   RebalanceRecommendation   node/ip-10-1-2-3.us-east-2.compute.internal               aws-node-termination-handler, ip-10-1-2-3.us-east-2.compute.internal   Rebalance recommendation received. Instance will be cordoned at 2021-04-26T15:10:48Z   13m          1       ip-10-1-2-3.us-east-2.compute.internal.1679710f4148ba6a
13m         Normal   PreDrain                  node/ip-10-1-2-3.us-east-2.compute.internal               aws-node-termination-handler, ip-10-1-2-3.us-east-2.compute.internal   Pre-drain task successfully executed                                                    13m          1       ip-10-1-2-3.us-east-2.compute.internal.1679710f48cde7e6
13m         Normal   CordonAndDrain            node/ip-10-1-2-3.us-east-2.compute.internal               aws-node-termination-handler, ip-10-1-2-3.us-east-2.compute.internal   Node successfully cordoned and drained                                                  13m          1       ip-10-1-2-3.us-east-2.compute.internal.167971124e6d316c

$ k get ev --field-selector source=aws-node-termination-handler,reason=RebalanceRecommendation -o json
{
"apiVersion": "v1",
"items": [
{
"apiVersion": "v1",
"count": 1,
"eventTime": null,
"firstTimestamp": "2021-04-26T15:10:51Z",
"involvedObject": {
"kind": "Node",
"name": "ip-10-1-2-3.us-east-2.compute.internal",
"namespace": "default"
},
"kind": "Event",
"lastTimestamp": "2021-04-26T15:10:51Z",
"message": "Rebalance recommendation received. Instance will be cordoned at 2021-04-26T15:10:48Z \n",
"metadata": {
"annotations": {
"account-id": "123456789012",
"availability-zone": "us-east-2a",
"instance-id": "i-abcdef12345678901",
"instance-life-cycle": "spot",
"instance-type": "m5.8xlarge",
"local-hostname": "ip-10-1-2-3.us-east-2.compute.internal",
"local-ipv4": "10.1.2.3",
"public-hostname": "",
"public-ipv4": "",
"region": "us-east-2"
},
"creationTimestamp": "2021-04-26T15:10:51Z",
"name": "ip-10-1-2-3.us-east-2.compute.internal.1679710f4148ba6a",
"namespace": "default",
"resourceVersion": "1488622620",
"selfLink": "/api/v1/namespaces/default/events/ip-10-1-2-3.us-east-2.compute.internal.1679710f4148ba6a",
"uid": "1adcd988-2e9b-4422-bc54-c89982d46d43"
},
"reason": "RebalanceRecommendation",
"reportingComponent": "",
"reportingInstance": "",
"source": {
"component": "aws-node-termination-handler",
"host": "ip-10-1-2-3.us-east-2.compute.internal"
},
"type": "Normal"
}
],
"kind": "List",
"metadata": {
"resourceVersion": "",
"selfLink": ""
}
}
And here's how a few days' worth of events in a couple of clusters looks in New Relic One, combined with cluster autoscaler and node events.
Thanks for the updates + sample. After seeing the sample, could you make 3 minor updates? After that, it'll be good from my side 👍
"message": "Rebalance recommendation received. Instance will be cordoned at 2021-04-26T15:10:48Z \n",
Could you update the `Description` for sqs-events to contain the instance ID to avoid confusion? This probably should've been done when first adding these events.
The code links are as follows:
- ex: fmt.Sprintf("Rebalance recommendation event received. Instance %s will be cordoned at %s \n", rebalanceRecDetail.InstanceID, event.getTime())
To close out on the other open question, I am good with leaving annotations for both imds+sqs mode.
Done. Still struggling with the flaky tests, though.
LGTM 👍🏼
Here's the output of two consecutive test runs.
Tried that several times and it looks totally random to me, and it's caused by third-party software anyway. Sometimes we get the failure and sometimes we don't.
Interesting -- thanks for looking into this. I'm good with merging the PR after @haugenj reviews, and we can look into the fix afterwards.
I spent some time trying to reproduce the error but couldn't reach a consistent outcome; on my laptop I can't get it to either always succeed or always fail. What does look consistent is the outcome of the GitHub CI, which always fails. There have been recent changes in https://github.com/gojp/goreportcard, which might be related, although I've tried pinning the module to an older commit and the results were still unpredictable.
@@ -4,6 +4,7 @@ go 1.15

require (
	github.com/aws/aws-sdk-go v1.33.1
	github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da // indirect
@brycahta do we need to add to the third party licenses for this?
Yeah, nice catch. It should be added here following a similar format.
Done.
pkg/config/config.go
Outdated
case "error": | ||
default: | ||
return config, fmt.Errorf("Invalid log-level passed: %s Should be one of: info, debug, error", config.LogLevel) | ||
if err := validateLogLevel(strings.ToLower(config.LogLevel)); err != nil { |
I don't think this adds value as a separate function; I'd prefer you revert this.
It doesn't add much value; I just did it to reduce the cyclomatic complexity, otherwise `go-report-card-test` would fail.
I say revert it and don't worry about the report card test; we'll take care of that afterwards.
Done.
docs/kubernetes_events.md
Outdated
Default annotations values are gathered from the IMDS endpoint local to the Node on which AWS Node Termination Handler runs. This is fine when running on IMDS Processor Mode since an AWS Node Termination Handler Pod will be deployed to all Nodes via a `DaemonSet` and each Node will emit all events related to itself with its own default annotations.

However, when running in Queue Processor Mode AWS Node Termination Handler is deployed to a number of Nodes (1 replica by default) via a `Deployment`. In that case the default annotations values will be gathered from the Node(s) running AWS Node Termination Handler, and so the values in the default annotations stamped to all events will match those of the Node from which the event was emitted, not those of the Node of which the event is about.
Are the annotations worth anything in this case? It seems both misleading and unhelpful to print values for a different node than the one the event actually affected. Perhaps we should just omit the default annotations when running in queue processor mode. Thoughts?
I'm trying to think of whether you would want to know the metadata of the node running NTH in queue processor mode. If not, it looks like the host instance running NTH is already included in the event here, so we could omit:
"source": {
"component": "aws-node-termination-handler",
"host": "ip-10-1-2-3.us-east-2.compute.internal"
}
It seems both misleading and unhelpful to print values for a different node than the event actually affected
Also agree, but the latest commit should address this. Now, the message shown in the event will include the instance ID being affected:
"message": "Rebalance recommendation received. Instance **<instance-id>** will be cordoned at 2021-04-26T15:10:48Z \n",
User-provided annotations will still be worthwhile (think of annotations like cluster name, cluster owner, environment, etc.), but data coming from IMDS might not make much sense and I agree it can be confusing, hence the notes in the doc. Probably the right move is to omit the annotations with data coming from IMDS when in SQS mode.
Last commit should do it. Also updated the doc to reflect these changes.
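Expressed as code, the mode-dependent annotation assembly could look roughly like this minimal sketch (function and parameter names are assumptions, not the exact commit):

```go
package observability

// buildAnnotations illustrates the behaviour agreed above: IMDS-derived default
// annotations are only attached in IMDS mode, because in Queue Processor (SQS)
// mode they would describe the node running NTH rather than the node the event
// is about. User-provided annotations are kept in both modes.
func buildAnnotations(imdsMode bool, imdsMetadata, userProvided map[string]string) map[string]string {
	annotations := map[string]string{}
	if imdsMode {
		for k, v := range imdsMetadata {
			annotations[k] = v // e.g. "availability-zone", "instance-type", "instance-life-cycle"
		}
	}
	for k, v := range userProvided {
		annotations[k] = v // e.g. cluster name, owner, environment
	}
	return annotations
}
```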
👍 LGTM, thanks!
FYI, this PR reverted some of the changes for #388
@trutx Nice work on this. I was just looking for this feature and am glad to see that you have made it happen!
Fixes #414
Description of changes:
This PR adds the ability to emit Kubernetes events when interruption signals are received and when actions like cordon or drain are taken on nodes.
Kubernetes events are very useful for monitoring. At New Relic we push all cluster events using an eventrouter up to New Relic One and then create alerts and dashboards based on those events. This PR makes NTH emit a Kubernetes event every time a signal is received from AWS and also every time an action is taken on the node NTH is running on, regardless of whether it ends in success or failure.
The `involvedObject` for these events is the Kubernetes node NTH runs on, so they are published in the `default` namespace. To get them, run `kubectl get events --field-selector source=aws-node-termination-handler`.

Additionally, annotations can be attached to each event. Such annotations are useful for filtering and discovery, e.g. each event can be stamped with an annotation stating the cluster name the event belongs to, the team owning the cluster, etc. These annotations can later be used in the New Relic UI to filter out events belonging to a specific subset via an NRQL query, or to filter the results of a `kubectl get events` query piped to a processor like `jq` or `yq`.