Do not report problems until kube-apiserver is ready #295

yguo0905 · 2019-06-18T00:22:17Z

In #288, we changed NPD to run custom plugins on startup. I hoped this would allow NPD to always report an event immediately after the cluster is created, no matter how large invoke_interval is.

However, this will not always work because of how NPD interacts with kube-apiserver. Here is what I observed during cluster creation:

1. NPD started and invoked the custom plugin immediately, and then sent an event to kube-apiserver.
2. The event failed to send because kube-apiserver was not running yet. The event library retries in this case:
   Unable to write event: 'Post https://x.x.x.x/api/v1/namespaces/default/events: dial tcp 34.68.6.201:443: connect: connection refused' (may retry after sleeping)
3. kube-apiserver started.
4. The event was re-sent to kube-apiserver, but this time it was rejected without further retry because of a permission error:
   events is forbidden: User "system:node-problem-detector" cannot create resource "events" in API group "" in the namespace "default"' (will not retry!)
5. https://github.com/kubernetes/kubernetes/blob/c8b45cd25c18e65798dde49fc7011495ea6021d5/cluster/gce/gci/configure-helper.sh#L568 was called to set up the permission.

There is a small window between (3) and (5). If the event is rejected during that window, it will never be resent.
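To make the failure mode concrete, here is a minimal sketch of a sender that retries on connection errors but gives up on a permission error. It is a simplified model of the behavior seen in the logs above, not the event library's actual code; `sendWithRetry`, `errConnRefused`, and `errForbidden` are hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error values standing in for the two failures in the logs.
var (
	errConnRefused = errors.New("connection refused")
	errForbidden   = errors.New("forbidden")
)

// sendWithRetry calls send up to maxAttempts times. It retries only when
// send fails with errConnRefused; any other error ends the loop at once.
// It returns the number of attempts made and the final error, if any.
func sendWithRetry(send func() error, maxAttempts int) (attempts int, err error) {
	for attempts = 1; attempts <= maxAttempts; attempts++ {
		err = send()
		if err == nil {
			return attempts, nil
		}
		if !errors.Is(err, errConnRefused) {
			return attempts, fmt.Errorf("will not retry: %w", err)
		}
	}
	return maxAttempts, err
}

func main() {
	// Simulated apiserver: down on attempt 1, up but without the
	// permission on attempt 2, fully configured from attempt 3 on.
	responses := []error{errConnRefused, errForbidden, nil}
	i := 0
	send := func() error {
		err := responses[i]
		if i < len(responses)-1 {
			i++
		}
		return err
	}

	attempts, err := sendWithRetry(send, 5)
	fmt.Printf("attempts=%d err=%v\n", attempts, err)
	// prints "attempts=2 err=will not retry: forbidden"
}
```

The third attempt would have succeeded, but the sender never makes it: the event is dropped in the gap between the apiserver starting and the permission being set up.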

Changing the event library to always retry on permission errors may or may not make sense. What we can do in NPD is introduce a configurable initial_delay for custom plugins. In this case, I can set it to 1m while invoke_interval stays at 6h, so the plugin first runs 1m after NPD starts.

/cc @wangzhen127 @Random-Liu


yguo0905 · 2019-07-08T21:59:14Z

We've discussed 3 ways to solve the problem.

1. Add a configurable timeout option to the K8s exporter. On startup, NPD will NOT run any plugins (it blocks in K8s exporter creation) until either apiserver is ready or the timeout expires.
   - The NPD metrics pipeline will be unnecessarily blocked for the timeout duration if NPD cannot connect to apiserver.
   - This doesn't solve the case where a plugin must run with an initial delay for reasons unrelated to apiserver's availability.
2. Similar to (1), but instead of blocking NPD, we accumulate the events in the K8s exporter.
   - Not easy to implement: we need to think about how to store the events and send them in a batch (considering QPS) when apiserver becomes ready.
   - This doesn't solve the case where a plugin must run with an initial delay for reasons unrelated to apiserver's availability.
3. Support a configurable initial delay in custom plugins. Instead of solving the problem in the exporter, we deal with it on the plugin side.
   - This doesn't work for built-in plugins, but we can extend it in the future if needed.
   - Solves both problems, is simple to implement, and will not affect the metrics pipeline.
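To illustrate why option (2) takes more design work, here is a minimal sketch of an in-memory buffer that holds events while apiserver is unavailable and later flushes them with a pause between sends. The `eventBuffer` type, its drop-oldest policy, and the fixed pause are all assumptions made for this example; a real exporter would need to decide each of these.

```go
package main

import (
	"fmt"
	"time"
)

// eventBuffer is a hypothetical bounded FIFO of pending events.
// A non-positive capacity means the buffer is unbounded.
type eventBuffer struct {
	capacity int
	events   []string
}

// add appends an event, dropping the oldest one if the buffer is full.
func (b *eventBuffer) add(e string) {
	if b.capacity > 0 && len(b.events) >= b.capacity {
		b.events = b.events[1:]
	}
	b.events = append(b.events, e)
}

// flush sends buffered events in order, sleeping for pause between sends
// as a crude QPS limit. It stops at the first error and keeps the unsent
// events in the buffer. It returns the number of events sent.
func (b *eventBuffer) flush(send func(string) error, pause time.Duration) (int, error) {
	sent := 0
	for len(b.events) > 0 {
		if err := send(b.events[0]); err != nil {
			return sent, err
		}
		b.events = b.events[1:]
		sent++
		if len(b.events) > 0 {
			time.Sleep(pause)
		}
	}
	return sent, nil
}

func main() {
	buf := &eventBuffer{capacity: 2}
	buf.add("event-a")
	buf.add("event-b")
	buf.add("event-c") // buffer is full, so event-a is dropped

	sent, err := buf.flush(func(e string) error {
		fmt.Println("sending", e)
		return nil
	}, 10*time.Millisecond)
	fmt.Println("sent:", sent, "err:", err)
}
```

Even this toy version has to pick a capacity, an eviction policy, and a send rate, which is the implementation cost the bullet under (2) refers to.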

I prefer (3): it is the simplest way to solve the problem and has no side effects on existing behavior.

yguo0905 · 2019-07-08T22:36:26Z

Posting the comments from @Random-Liu in our offline discussion.

I have no concern about option 3, and I think the initial delay is something we can support if it is needed in some use cases.

However, I feel like we should not use it to solve the apiserver initial connection problem, because the problem is not specific to any plugin. It is not conceptually correct to use a per-plugin config option to work around that problem, which is a hack to me. If possible, I prefer we solve the problem in the k8s exporter with either option 1 or 2.

As for the health monitor, if it needs the initial delay, we can add it as well, but that should not be used to solve the apiserver initial connection problem.

I will follow the advice and go with option (1).

yguo0905 changed the title ~~Support initial delay for custom plugins~~ Do not report problems until kube-apiserver is ready Jul 9, 2019

yguo0905 mentioned this issue Jul 9, 2019

Support waiting for kube-apiserver to be ready with timout during NPD startup #308

Merged

k8s-ci-robot closed this as completed in #308 Jul 9, 2019

yguo0905 mentioned this issue Jul 15, 2019

Cherry pick #308 to v0.6: Support waiting for kube-apiserver to be ready with timout during NPD startup #312

Merged
