Validate telemetry configurations + provide safety #235
I'm kinda torn on whether we should prevent creating objects if the ServiceMonitor CRD is not installed. It won't cause an invalid K8ssandraCluster. If the configuration would result in some sort of broken state, then I would definitely favor the validation.
If a user is requesting a state that cannot be delivered because prerequisites are not fulfilled, I think we should reject the request as a general principle. Imagine a situation where you've bootstrapped a 1000 node cluster only to discover that you configured something wrong and need to do a full rolling restart to rectify it. That might be a bit hypothetical and may not fully apply in this case, but I still think failing fast is the best user experience. I've certainly had trouble as a user when trying to configure things in CassandraDatacenters because unknown fields are allowed in that CR. It slows down your debugging loop when there's no webhook to reject invalid requests and you have to wait for Cassandra logging to come up to discover what you've done wrong (which can take 5-10 minutes).
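To make the fail-fast principle concrete, here is a minimal sketch of an admission-time check; the field path and names are illustrative assumptions, not the operator's actual API:

```go
package webhook

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// validateHeapSize rejects a value that can never become valid, so the
// user finds out at admission time rather than after waiting 5-10 minutes
// for Cassandra logging to come up. The field path is an illustrative
// assumption, not the operator's actual schema.
func validateHeapSize(clusterName string, heapSizeBytes int64) error {
	if heapSizeBytes <= 0 {
		return apierrors.NewInvalid(
			schema.GroupKind{Group: "k8ssandra.io", Kind: "K8ssandraCluster"},
			clusterName,
			field.ErrorList{field.Invalid(
				field.NewPath("spec", "cassandra", "jvmHeapSize"),
				heapSizeBytes,
				"JVM heap size must be positive",
			)},
		)
	}
	return nil
}
```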
I was briefly looking at some of the discussion in #482, and it had me wondering again whether this is a good use case for the validating webhook. After further consideration I do not think checking for the ServiceMonitor CRDs across remote clusters is a good use case for the validating webhook, for multiple reasons. First, latency is a big concern. Here is what the official Kubernetes docs have to say:
We will only be as fast as our slowest connection. Also keep in mind that there could be other webhooks that get invoked as part of the client request.

Secondly, Kubernetes is an eventually consistent system. Controllers continually try to make the desired state match the actual state, right? Is the scenario of the ServiceMonitor CRD missing something that could eventually be resolved? The answer of course is yes. Contrast that with something that could never be considered valid; for example, a negative JVM heap size will always be invalid.

Next, I question whether it is good practice to perform checks against objects other than the one the client is trying to create/update. Suppose my automation applies manifests in an arbitrary order, so that the objects being checked against may not exist yet at admission time. For reference, here are a couple of relevant threads on the Kubernetes Slack:
Lastly, I am now of the opinion that we should remove the k8sContexts checks in the webhook and consider how to handle that differently. Maybe a combo of events and status conditions. @burmanm wdyt?
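A minimal sketch of the events-plus-status-condition alternative (condition type and reason strings are illustrative assumptions, not the operator's real API):

```go
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// reportMissingServiceMonitorCRD records a warning event and sets a status
// condition instead of rejecting the object at admission time, leaving the
// K8ssandraCluster in place until the prerequisite is installed.
func reportMissingServiceMonitorCRD(rec record.EventRecorder, obj runtime.Object, conditions *[]metav1.Condition) {
	rec.Event(obj, corev1.EventTypeWarning, "TelemetryPrerequisiteMissing",
		"ServiceMonitor CRD not found; install prometheus-operator to enable telemetry")
	apimeta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "TelemetryReady",
		Status:  metav1.ConditionFalse,
		Reason:  "ServiceMonitorCRDMissing",
		Message: "ServiceMonitor CRD is not installed in the target cluster",
	})
}
```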
Are webhooks called serially? I assume validating webhooks are called in parallel. Either way, I don't really see the relevance: we don't control "other webhooks".
My understanding is that latency will only be added to the K8ssandraCluster type. I don't see this as a problem, as I do not envisage thousands of K8ssandraClusters running against a single control plane; this isn't like the Pod type, where there might be thousands running. I also don't envisage that the K8ssandraCluster is going to be mutated that frequently.
Only via user intervention to install the Prometheus operator. Given how critical monitoring is, it is important that users are notified that things aren't working so that they can undertake that intervention. It is also important to note that - if Prometheus is not installed - the reconciliation will fall into a loop waiting for it to be ready, which - as well as preventing deployment of Reaper and Stargate - also means that further changes cannot be made to the K8ssandraCluster object.
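As a minimal sketch of that blocking behaviour (simplified, not the operator's actual reconcile code; `sm` stands in for a built ServiceMonitor object):

```go
package controllers

import (
	"context"
	"time"

	apimeta "k8s.io/apimachinery/pkg/api/meta"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcileServiceMonitor sketches the loop described above: when the
// ServiceMonitor CRD is absent, the create fails with a "no matches for
// kind" error and the reconciler can only requeue and wait, blocking
// everything scheduled after it (Reaper, Stargate, further changes).
func reconcileServiceMonitor(ctx context.Context, c client.Client, sm client.Object) (ctrl.Result, error) {
	if err := c.Create(ctx, sm); err != nil {
		if apimeta.IsNoMatchError(err) {
			// CRD not installed: park and retry until someone installs
			// prometheus-operator. Nothing downstream of this point runs.
			return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
		}
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```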
Kustomize has explicitly dealt with this and applies manifests in the order in which they are referenced in the kustomization.yaml file. There shouldn't be any fuzziness regarding what order manifests are applied in, and it is typical that the order needs to be fixed (e.g. you can't install cert-manager after k8ssandra-operator, it must be installed beforehand). Given the criticality of monitoring and the ease with which users might miss deploying Prometheus on a cluster if they have many DCs, I'm convinced that webhooks should be used to validate Prometheus installation status if it is at all feasible to do so within reasonable latency budgets. I can't see that you've presented any evidence indicating that this is not possible.
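For illustration, a kustomization.yaml makes the application order explicit; the paths below are hypothetical, not the project's actual layout:

```yaml
# kustomization.yaml - resources are applied top to bottom, so hard
# ordering requirements (cert-manager before k8ssandra-operator, etc.)
# can be pinned deliberately.
resources:
  - cert-manager/
  - prometheus-operator/
  - k8ssandra-operator/
```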
Also, it is worth noting that ping times between e.g. Dublin -> Adelaide are 280ms, and this is one of the longest hops you're ever likely to see. I do not think it likely that (in a healthy cluster whose API server is responding quickly and is not overburdened) this functionality will take webhook latencies above 1s, which you've referenced as your criterion for acceptance. The reality is that we are a multi-cluster project, and doing anything useful is going to incur more latency than we'd normally expect in a single-cluster scenario. That said, I'm open to making the API server calls concurrent, although this obviously adds complexity.
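A minimal sketch of what making the calls concurrent could look like, with the per-cluster check supplied by the caller (hypothetical; e.g. "does the ServiceMonitor CRD exist in this cluster?"):

```go
package webhook

import (
	"context"

	"golang.org/x/sync/errgroup"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// checkAllClusters fans the per-cluster prerequisite checks out in
// parallel, so total webhook latency approaches the slowest single round
// trip instead of the sum of all of them.
func checkAllClusters(ctx context.Context, clients []client.Client,
	check func(context.Context, client.Client) error) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, c := range clients {
		c := c // capture the loop variable (pre-Go 1.22 semantics)
		g.Go(func() error {
			return check(ctx, c)
		})
	}
	// Wait returns the first non-nil error, which fails the admission request.
	return g.Wait()
}
```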
If the user doesn't notice that Prometheus (or similar) is missing through any means other than the K8ssandraCluster failing to be created, then monitoring is not actually important to them. In that case they aren't worried about whether the metrics objects are deployed at all, since nothing is going to scrape or use them. This isn't the Prometheus operator; the operational state of Prometheus should be handled by prometheus-operator, just as we don't verify that metrics-server or kube-state-metrics is installed, or that node monitoring is in place, even though those have operational significance for how Kubernetes is monitored and how functions like HPA work.
Kustomize doesn't maintain state, so comparing it to an operator sounds odd. Kubernetes' data storage and Kubernetes API server consistency are two different things. While etcd itself is consistent for serial writes (not necessarily parallel reads, as it has multiple linearization choices), the processing pipeline of Kubernetes is not guaranteed to be serially consistent. Thus a read request can return out-of-date data, as the read can be answered before a concurrent write has even been processed or acked. This is just the reality of parallel nodes / parallel processing, and it is visible in the Kubernetes API's consistency semantics as well. Our controller then adds another layer of caching to avoid even calling the API server, making reads even "more eventually" consistent.
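To make the two read paths concrete, here is a minimal controller-runtime sketch (illustrative only):

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// readBothWays contrasts the two read paths mentioned above.
// mgr.GetClient() reads from the informer cache, which can lag behind
// etcd; mgr.GetAPIReader() goes straight to the API server, which skips
// the controller's own cache but still sees the API server's own
// eventual consistency.
func readBothWays(ctx context.Context, mgr ctrl.Manager, key client.ObjectKey, obj client.Object) error {
	// Cached read: may not yet reflect a write that was just acknowledged.
	if err := mgr.GetClient().Get(ctx, key, obj); err != nil {
		return err
	}
	// Uncached read: a live round trip to the API server.
	return mgr.GetAPIReader().Get(ctx, key, obj)
}
```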
Depends how we view the "client". It's true that a user could be writing a ClientConfig in parallel, and the deployment of the K8ssandraCluster would then fail due to a missing K8sContext. On the other hand, we do not actually verify that ClientConfigs are real in terms of what's stored in Kubernetes, only how k8ssandra-operator sees them. I do agree that it is problematic if the ClientConfig must be connected before we allow creating the K8ssandraCluster (and this is sort of true at the moment, since a ClientConfig can't be read if it can't be connected to - but we only verify this when the ClientConfig is generated). There could be a ClientConfig webhook API that verifies this, but at the moment our handling is pretty crude and fragile when it comes to ClientConfigs in general.
Similar to how we treat Cassandra versions. One can't deploy Cassandra 4.1 or 3.12 unless one lies about the version and overrides with an image. The first version is probably something we should allow in our regexp, now that I think about it.
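For illustration, a version gate of this kind might look like the following sketch; the actual pattern in k8ssandra-operator differs:

```go
package webhook

import (
	"fmt"
	"regexp"
)

// allowedVersion gates the server version the way described above. This
// illustrative pattern accepts only 3.11.x and 4.x.y releases, so "4.1.0"
// passes but "3.12" and a bare "4.1" do not.
var allowedVersion = regexp.MustCompile(`^(3\.11\.\d+|4\.\d+\.\d+)$`)

func validateServerVersion(v string) error {
	if !allowedVersion.MatchString(v) {
		return fmt.Errorf("unsupported Cassandra version %q; override the image to deploy it anyway", v)
	}
	return nil
}
```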
If users have specified telemetry in the manifest then they want telemetry enabled. Your approach here isn't what I'd describe as user friendly. It would be easy to overlook missing metrics from one datacenter if you have 20+; in that case the cluster is in a very dangerous state and cannot be monitored or even reconciled properly.
I do not see the relevance. We are not managing Prometheus resources; we are checking that a dependency is installed. If operators can't even validate that prerequisites are installed on a cluster, I'm not sure what use they are.
The context for the comment related to applying manifests in an arbitrary order. The Kubernetes ecosystem does not allow for arbitrarily ordered manifest application in general. We aren't comparing Kustomize to an operator. Can you folks please take conversations about validation of ClientConfigs to a different ticket? This ticket is about validating telemetry configurations.
I've discussed this with John in a bit of depth over the past couple of days. One path we took was looking at which operators in the community do deep vs. shallow validation of incoming resources:
It seems that there are cases where webhooks call back to the API server to do deeper validations. Given that there is still some concern about whether this is good practice, we are going to consult internally and place this on hold pending those discussions.
It seems like we've hit a blocker, as we cannot reach a consensus on this PR.
I am still a 👎 on doing this validation in the webhook.
I believe the concern was that we were going to end up with a positive feedback loop, along these lines:
1. A K8ssandraCluster is created or updated, invoking the validating webhook, which reads from the API server(s).
2. The reconciler runs and writes a status update to the K8ssandraCluster.
3. The status update is itself an UPDATE request, so it invokes the webhook again, triggering more API server reads.
4. The webhook admits the update and the reconciler runs again, writing another status update.
Steps 3-4 then loop infinitely, and may in fact end up escalating into a retry storm. This might happen anyway, but if we are reading from the API server as part of each retry we get a nastier outcome w.r.t. load. (@burmanm can you confirm this was your concern? You expressed it best.) If we are able to filter out status updates from the webhook's view, then I think that breaks the retry loop outlined above. @jsanda not sure what you're referring to RE the reconcile method returning an error before retrying - are you just restating your opposition to validating via a webhook and your preferred solution?
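One way to get that filtering, sketched below, is to scope the ValidatingWebhookConfiguration rule to the main resource only, so that writes through the /status subresource never reach the webhook (group/version are the project's; the rest is illustrative):

```yaml
# Writes made through the /status subresource are the
# "k8ssandraclusters/status" resource, so they never match this rule,
# never invoke the webhook, and the retry loop above is broken.
rules:
  - apiGroups: ["k8ssandra.io"]
    apiVersions: ["v1alpha1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["k8ssandraclusters"]  # deliberately omits "k8ssandraclusters/status"
```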
When the user has requested ServiceMonitor deployment via the k8ssandra CRs, but Prometheus operator (and therefore the ServiceMonitor CRs) has not been installed, the requested resources should not be created and the user should be informed of the error. Ideally, the validation should be implemented in a validating webhook.
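A minimal sketch of the requested check (not the project's implementation): look up the ServiceMonitor CRD by its well-known name before admitting a K8ssandraCluster that requests telemetry.

```go
package webhook

import (
	"context"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// serviceMonitorCRDInstalled reports whether prometheus-operator's
// ServiceMonitor CRD exists in the cluster behind c. The client's scheme
// must have apiextensionsv1 registered.
func serviceMonitorCRDInstalled(ctx context.Context, c client.Client) (bool, error) {
	crd := &apiextensionsv1.CustomResourceDefinition{}
	err := c.Get(ctx, client.ObjectKey{Name: "servicemonitors.monitoring.coreos.com"}, crd)
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	return err == nil, err
}
```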
┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: K8OP-107