[patch] Extend custom Application healthcheck to detect Helm chart rendering failures #1201
By default, an Application's health status will not be affected if ArgoCD fails to render its Helm template (e.g. due to a bad secret reference causing AVP to error).
For example, in the following screenshot the application in syncwave 1 is showing as synced and healthy even though its Helm chart rendering failed. Wave 2 was allowed to proceed even though the resources in wave 1 were not deployed. This would cause problems if the application in wave 2 depended on a resource being configured in wave 1 first:

If we look inside the application in wave 1, we can see that it is in an error state and no resources were deployed:

ArgoCD is working as intended here; the Application in wave 1 is showing as synced, since this is the sync status of the Application CR itself (which synced just fine). It is showing as healthy since the health of a resource in ArgoCD is determined solely by the health of its direct children (and the application in wave 1 has no children since it failed to deploy any resources). The problem is that the intended behaviour means we cannot rely on syncwaves to control deployment ordering when Helm template rendering fails.
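
To make the reasoning above concrete, here is a minimal Python sketch of the behaviour as described. It is a simplified model, not ArgoCD's actual implementation: the function name `aggregate_health` and the severity ordering are assumptions chosen for illustration.

```python
# Simplified model of the behaviour described above; not ArgoCD's actual code.
# Assumption: a parent's health is the "worst" status among its direct
# children, using an illustrative severity ordering.
SEVERITY = ["Healthy", "Suspended", "Progressing", "Missing", "Degraded", "Unknown"]


def aggregate_health(child_statuses):
    """Return the worst child status, or "Healthy" when there are no children."""
    worst = "Healthy"
    for status in child_statuses:
        if SEVERITY.index(status) > SEVERITY.index(worst):
            worst = status
    return worst


# Wave 1 application whose chart failed to render: no resources were deployed,
# so there are no children to drag the health status down.
print(aggregate_health([]))                       # prints "Healthy"
print(aggregate_health(["Healthy", "Degraded"]))  # prints "Degraded"
```

With an empty list of children the result stays at "Healthy", which is why the failed application in wave 1 does not hold back wave 2.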
There are a number of issues already open against ArgoCD relating to this, but none have come to any sort of conclusion about what should be changed. The most relevant one I have found is this: argoproj/argo-cd#10088.
That issue includes the idea of adding logic to the custom Application healthcheck to check for the "ComparisonError" condition seen when the Helm chart fails to render. This at least stops ArgoCD from syncing sibling applications in later waves.
This PR extends the custom Application healthcheck established by gitops-bootstrap to set Application health to degraded when this ComparisonError condition is present. I've verified that this works as expected and (crucially) blocks sibling applications in later syncwaves from progressing: https://jsw.ibm.com/browse/MASCORE-3669
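
The logic of the extended check can be sketched roughly as follows. This is a hypothetical Python model for illustration only: the real healthcheck is a custom ArgoCD health script rather than Python, and the function name `application_health`, the "Progressing" default, and the sample error message are assumptions, not taken from this PR's diff. The field names (`status.conditions[].type`, `status.health.status`) follow the Application CR status layout.

```python
def application_health(app):
    """Hypothetical model of the extended Application healthcheck."""
    status = app.get("status") or {}

    # Extension: a ComparisonError condition (seen when the Helm chart fails
    # to render) marks the Application as Degraded, whatever its children say.
    for condition in status.get("conditions") or []:
        if condition.get("type") == "ComparisonError":
            return {"status": "Degraded", "message": condition.get("message", "")}

    # Otherwise pass through the health already reported on the Application.
    # Assumption: fall back to "Progressing" when no health is reported yet.
    health = status.get("health") or {}
    return {
        "status": health.get("status", "Progressing"),
        "message": health.get("message", ""),
    }


# Wave 1 application: reported as Healthy, but chart rendering failed.
failed_app = {
    "status": {
        "health": {"status": "Healthy"},
        "conditions": [
            # Illustrative message, not a real ArgoCD log line.
            {"type": "ComparisonError", "message": "chart rendering failed"},
        ],
    }
}
print(application_health(failed_app)["status"])  # prints "Degraded"
```

Checking the condition before the reported health matters here: the Application still reports itself as healthy, so the condition has to take precedence for later syncwaves to be blocked.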