Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up certificate leases #15359

Merged
merged 1 commit into from
Jul 11, 2024
Merged

Clean up certificate leases #15359

merged 1 commit into from
Jul 11, 2024

Conversation

skonto
Copy link
Contributor

@skonto skonto commented Jun 26, 2024

Fixes #15238

Proposed Changes

@knative-prow knative-prow bot requested review from izabelacg and ReToCode June 26, 2024 11:41
@knative-prow knative-prow bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jun 26, 2024
@skonto skonto requested review from dprotaso and removed request for izabelacg and ReToCode June 26, 2024 11:42
@knative-prow knative-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 26, 2024
Copy link

codecov bot commented Jun 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.62%. Comparing base (1dff15d) to head (c3984c1).
Report is 27 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15359      +/-   ##
==========================================
+ Coverage   84.59%   84.62%   +0.02%     
==========================================
  Files         219      219              
  Lines       13584    13584              
==========================================
+ Hits        11492    11496       +4     
+ Misses       1726     1724       -2     
+ Partials      366      364       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@skonto
Copy link
Contributor Author

skonto commented Jun 26, 2024

/test istio-latest-no-mesh

@dprotaso
Copy link
Member

/test all

@skonto
Copy link
Contributor Author

skonto commented Jun 26, 2024

/test istio-latest-no-mesh

@skonto
Copy link
Contributor Author

skonto commented Jul 4, 2024

@dprotaso could you merge if no objection?

@@ -108,6 +108,7 @@ restart_pod "${SYSTEM_NAMESPACE}" "app=activator"
# we need to restart the pod to stop the net-certmanager-controller
if (( ! HTTPS )); then
restart_pod "${SYSTEM_NAMESPACE}" "app=controller"
kubectl get leases -n "${SYSTEM_NAMESPACE}" -o json | jq -r '.items[] | select(.metadata.name | test("controller.knative.dev.serving.pkg.reconciler.certificate.reconciler")).metadata.name' | xargs kubectl delete lease -n "${SYSTEM_NAMESPACE}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised we need to delete the lease. Shouldn't the controller just become the owner of these leases after some duration?

note we also have wait_for_leader_controller

Copy link
Contributor Author

@skonto skonto Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dprotaso Not sure if we are on the same page here. The specific lease for the certificate (I am not deleting all leases) should have no reconciler as owner when we disable internal encryption. So when we restart, after we disable that feature, there should be no owner. There should be no owner if implementation for lease management works as expected. However, sometimes K8s go client does not remove the previous owner succesfully.
As I state in the k8s client go bug, I opened recently, when the controller shuts down (btw we have multiple restarts due to chaos) it does not always clean up stuff. Pls check #15321 (comment) for the details on how this happens in a recent CI run. If you want to reproduce this locally (you need to have multiple buckets thus many leases per reconciller), check https://github.com/skonto/test-k8s (README there shows exactly this).

The tests in Serving fail because they compare the previous leader set with the new and they find the certificate lease's old owner when they do the intersection of the two sets. As described above this is due to the certificate lease not being updated with the empty holder identity when controller shuts down.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests in Serving fail because they compare the previous leader set with the new and they find the certificate lease's old owner when they do the intersection of the two sets. As described above this is due to the certificate lease not being updated with the empty holder identity when controller shuts down.

Cool got it - that's what I was missing - can you add that as a comment explaining that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check https://github.com/skonto/test-k8s (README there shows exactly this).

FWIW playing with this it seems like client-side throttling is the culprit - if you disable it then the updates go through.

In our shared main we increase the QPS and Burst based on the number of reconcilers we have

Copy link
Contributor Author

@skonto skonto Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the repo I have you can see:

E0610 15:07:15.076863       1 leaderelection.go:308] Failed to release lock: Operation cannot be fulfilled on leases.coordination.k8s.io "lease-70": the object has been modified; please apply your changes to the latest version and try again

I am refering to that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dprotaso added the comment in the description.

@dprotaso
Copy link
Member

/lgtm
/approve

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Jul 11, 2024
Copy link

knative-prow bot commented Jul 11, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, skonto

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot merged commit 63a741d into knative:main Jul 11, 2024
65 checks passed
@dprotaso
Copy link
Member

@skonto do you think we need to revert any of your earlier work regarding this flake?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TestControllerHA is flakey
2 participants