Clean up certificate leases #15359
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff            @@
##             main   #15359    +/-   ##
========================================
+ Coverage   84.59%   84.62%   +0.02%
========================================
  Files         219      219
  Lines       13584    13584
========================================
+ Hits        11492    11496       +4
+ Misses       1726     1724       -2
+ Partials      366      364       -2

☔ View full report in Codecov by Sentry.
/test istio-latest-no-mesh
/test all
/test istio-latest-no-mesh
@dprotaso could you merge this if there are no objections?
@@ -108,6 +108,7 @@ restart_pod "${SYSTEM_NAMESPACE}" "app=activator"
 # we need to restart the pod to stop the net-certmanager-controller
 if (( ! HTTPS )); then
   restart_pod "${SYSTEM_NAMESPACE}" "app=controller"
+  kubectl get leases -n "${SYSTEM_NAMESPACE}" -o json | jq -r '.items[] | select(.metadata.name | test("controller.knative.dev.serving.pkg.reconciler.certificate.reconciler")).metadata.name' | xargs kubectl delete lease -n "${SYSTEM_NAMESPACE}"
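The name filter in the added line can be tried without a cluster. Below is a minimal sketch, assuming lease names arrive one per line on stdin; `select_certificate_leases`, `CERT_LEASE_PREFIX`, the bucket suffix `.00-of-01`, and the second lease name are hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# Prefix taken from the jq filter in the diff above.
CERT_LEASE_PREFIX='controller.knative.dev.serving.pkg.reconciler.certificate.reconciler'

# Hypothetical helper: reads lease names on stdin, one per line, and prints
# only those that contain the certificate reconciler prefix.
select_certificate_leases() {
  # -F matches the prefix literally, so its dots are not regex wildcards.
  # '|| true' keeps the exit status zero when nothing matches.
  grep -F -- "${CERT_LEASE_PREFIX}" || true
}

# Example input; both names are made up for the demonstration.
printf '%s\n' \
  "${CERT_LEASE_PREFIX}.00-of-01" \
  'controller.knative.dev.serving.pkg.reconciler.route.reconciler.00-of-01' \
  | select_certificate_leases
```

In a real script the names would come from the `kubectl ... | jq -r ...` pipeline instead of `printf`. If the result is piped on to `xargs kubectl delete lease`, GNU xargs accepts `-r` to skip the command when no name matched.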
I'm surprised we need to delete the lease. Shouldn't the controller just become the owner of these leases after some duration?
Note that we also have wait_for_leader_controller.
@dprotaso Not sure we are on the same page here. I am not deleting all leases, only the lease for the certificate reconciler. Once we disable internal encryption, that lease should have no reconciler as its owner, so after the restart that follows there should be no owner, provided lease management works as expected. However, the K8s Go client sometimes fails to remove the previous owner.
As I state in the client-go bug I opened recently, when the controller shuts down (by the way, we have multiple restarts due to chaos) it does not always clean up. Please check #15321 (comment) for details on how this happened in a recent CI run. To reproduce it locally you need multiple buckets, and therefore many leases per reconciler; see https://github.com/skonto/test-k8s (the README there shows exactly this).
The tests in Serving fail because they compare the previous leader set with the new one and find the certificate lease's old owner when they intersect the two sets. As described above, this happens because the certificate lease is not updated with an empty holder identity when the controller shuts down.
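The set comparison described above can be illustrated with a small shell sketch. It is not the Serving test code: `common_holders` and the holder identities are hypothetical, and each set is assumed to be a file with one holder identity per line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints every holder identity listed in both files
# (one identity per line). An empty result means no old leader survived.
common_holders() {
  # -F: fixed strings, -x: whole-line match, -f: read patterns from file $1.
  grep -Fxf "$1" "$2" | sort -u || true
}

prev="$(mktemp)"
curr="$(mktemp)"

# Leaders recorded before the restart (made-up identities).
printf '%s\n' 'controller-old-1' 'controller-old-2' > "${prev}"

# Leaders recorded after the restart. controller-old-1 still shows up because
# the certificate lease was never updated with an empty holder identity.
printf '%s\n' 'controller-new-1' 'controller-old-1' > "${curr}"

stale="$(common_holders "${prev}" "${curr}")"
if [ -n "${stale}" ]; then
  echo "stale holders still present: ${stale}"
fi

rm -f "${prev}" "${curr}"
```

A non-empty intersection is the condition that makes the check fail. Deleting the certificate lease before the comparison removes the stale entry from the second set.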
> The tests in Serving fail because they compare the previous leader set with the new one and find the certificate lease's old owner when they intersect the two sets. As described above, this happens because the certificate lease is not updated with an empty holder identity when the controller shuts down.
Cool, got it. That's what I was missing. Can you add a comment explaining that?
> check https://github.com/skonto/test-k8s (README there shows exactly this).
FWIW, after playing with this, it seems like client-side throttling is the culprit: if you disable it, the updates go through.
In our shared main we increase the QPS and Burst based on the number of reconcilers we have.
In my repo you can see:
E0610 15:07:15.076863 1 leaderelection.go:308] Failed to release lock: Operation cannot be fulfilled on leases.coordination.k8s.io "lease-70": the object has been modified; please apply your changes to the latest version and try again
That is what I am referring to.
@dprotaso added the comment in the description.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dprotaso, skonto
@skonto do you think we need to revert any of your earlier work regarding this flake?
Fixes #15238
Proposed Changes