tls-init-cleanup can run if pre-install fails #419
If the pre-install hooks failed, then no non-hook resources get created. This means that the tls-init-cleanup service account doesn't get created. Then if the user runs helm delete, the tls-init-cleanup job tries to run but never starts because it doesn't have its service account. helm delete then hangs forever.

The fix is to ensure that the resources needed by the cleanup job get created even if the pre-install hook failed. To do this, we mark them as part of the pre-delete hook and Helm will ensure they get created.

Another snag is that Helm creates the Job before the service account. To fix this, we set the weight of the job to 1 so that it is created after the service account. This is a known Helm issue: helm/helm#7447
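As a rough sketch of the approach described above, the two manifests might be annotated as follows. The resource names, image, command, and the `hook-succeeded` delete policy on the Job are assumptions for illustration, not the chart's actual templates; only the `helm.sh/hook`, `helm.sh/hook-weight`, and `helm.sh/hook-delete-policy` annotation keys are standard Helm.

```yaml
# Sketch only: resource names, image, and command are hypothetical.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: example-tls-init-cleanup
  annotations:
    # Making the ServiceAccount a pre-delete hook means Helm creates it
    # during `helm delete`, even if the install never got past pre-install.
    "helm.sh/hook": pre-delete
---
apiVersion: batch/v1
kind: Job
metadata:
  name: example-tls-init-cleanup
  annotations:
    "helm.sh/hook": pre-delete
    # Hooks run in ascending weight order and the default weight is 0,
    # so weight 1 places the Job after the ServiceAccount above.
    "helm.sh/hook-weight": "1"
    # Assumed policy: remove the Job once it has completed successfully.
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: example-tls-init-cleanup
      containers:
        - name: tls-init-cleanup
          image: example/cleanup:latest  # placeholder image
          command: ["/bin/sh", "-c", "echo cleanup"]  # placeholder command
```

Hook weights must be quoted strings in the annotation, which is why the weight is written as `"1"` rather than `1`.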
I think I need to add
Without this setting, if the tls init job failed, users wouldn't be able to re-run helm install without manually deleting the job.
Updated to add this.
@lkysow Could you give a bit more detail? Are you referring to the helm install? If helm install fails due to tls-init, then the job shouldn't be deleted because Helm will delete it only if the job succeeds.
It's easiest to show via reproduction:
The solution is to add
@lkysow this is awesome! Thanks so much for this fix. I originally didn't make the serviceaccount, clusterrolebinding, and clusterrole for the tls-init-cleanup job hooks because they would then get deleted immediately afterwards if I'm using `hook-succeeded`. But I didn't think of using weights to solve it!
This PR also adds `before-hook-creation` to the deletion policy for the `tls-init-job` so that if a user does run a helm delete, they can run a helm install and the job will get deleted instead of helm complaining that the job already exists.

Fixes #418
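For illustration, a deletion policy that includes `before-hook-creation` could be written on the init job roughly like this. The job name is hypothetical and the combination with `hook-succeeded` is an assumption based on the discussion in this thread; the spec is omitted.

```yaml
# Sketch only: metadata for a hypothetical tls-init Job (spec omitted).
apiVersion: batch/v1
kind: Job
metadata:
  name: example-tls-init
  annotations:
    "helm.sh/hook": pre-install
    # hook-succeeded: delete the Job after it completes successfully.
    # before-hook-creation: delete any leftover Job from a previous
    # (possibly failed) run before this hook is created again.
    "helm.sh/hook-delete-policy": hook-succeeded,before-hook-creation
```

With `before-hook-creation` in the list, a failed job left behind by an earlier attempt no longer blocks the next `helm install` with an "already exists" error.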
Reproduction
To reproduce, you need to cause a failure in the pre-install hooks. An easy way to do this is to set the `tls-init-job`'s service account to one that doesn't exist:

Run `helm install` with a values file that has `tls.enabled=true`. The helm install will hang forever and eventually time out. Ctrl-C it so you don't have to wait.

Then run `helm delete`. You'll see the `tls-init-cleanup` job is created but it never completes. If you describe the job it'll say its service account doesn't exist. The helm delete will hang forever. If you Ctrl-C it and then try to run `helm install` you'll get an error that the release already exists. You have to manually delete the helm secret to fix the issue (or manually create the cleanup job's service account).

Now check out this branch and perform the same steps (you'll need to make sure you delete the init job's service account and job first). The `helm delete` should succeed.
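One way to force the pre-install failure from the first step, as a sketch: point the init job's pod spec at a service account name that is never created. Everything below (resource name, image, and the `does-not-exist` account) is hypothetical and only shows the shape of such a change.

```yaml
# Sketch only: a hypothetical tls-init Job edited to fail on purpose.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-tls-init
  annotations:
    "helm.sh/hook": pre-install
spec:
  template:
    spec:
      restartPolicy: Never
      # No ServiceAccount with this name exists, so the Job's pod cannot
      # be created and the pre-install hook never completes.
      serviceAccountName: does-not-exist
      containers:
        - name: tls-init
          image: example/tls-init:latest  # placeholder image
```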