Fix panic in nova-controller if cell0 DB creation is late #356
Conversation
Skipping CI for Draft Pull Request.
78da53a to 7394e43
// We cannot merge this test as is because if the tests run
// in parallel then this test hangs forever (not even the global
// ginkgo --timeout stops it). If you run it with --procs 1 locally
// then it shows the panic and terminates
ginkgo 2.9.4 might have a fix for this hang: onsi/ginkgo#1192. And I'm trying to bump nova-operator to that version in #361.
hum interesting that kind of sucks but ok
@@ -758,4 +758,59 @@ var _ = Describe("Nova multi cell", func() {
	)
})
})
When("cell1 DB and MQ create finishes before cell0 DB create", func() { |
nice, so I would be tempted to put this in a regression test module, but we didn't actually have a bug filed for this.
For this instance I'm ok with leaving this here, but how would you feel, in general, about introducing the two-commit workflow and a regression module into the nova-operator like we have in nova?
7394e43 to 12ff016
We saw a panic during cell creation in the nova-controller intermittently:

goroutine 882 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119 +0x1fa
panic({0x1814fe0, 0x295e110})
	/usr/local/go/src/runtime/panic.go:884 +0x212
github.com/openstack-k8s-operators/nova-operator/controllers.(*NovaReconciler).Reconcile(0xc000386cd0, {0x1ccf298, 0xc001d5a120}, {{{0xc001cbb4e0?, 0x10?}, {0xc001cbb4f0?, 0x40da67?}}})
	/remote-source/controllers/nova_controller.go:390 +0x3162

After chasing this for a while I figured that the panic happens when both the cell1 DB and MQ creation finish before the cell0 DB creation. So this patch adds a test case that reproduces the panic in a stable way.
There was a bug in the nova-controller in the ensureCell loop. If cell0 was still waiting for its DB to be created but cell1 had all its prerequisites done (DB, MQ), then the code tried to look up cell0, which was not yet in the internal cell map, causing a panic.
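To make the failure mode concrete, here is a minimal, self-contained Go sketch of the pattern (not the actual nova-operator code; cellStatus and the map layout are illustrative assumptions): only cells whose DB is ready have been added to the map, so blindly looking up cell0 and dereferencing the result panics, while a guarded lookup lets the reconciler wait for cell0 instead.

package main

import "fmt"

// cellStatus is an illustrative stand-in for the per-cell state the
// controller tracks; only cells whose DB is ready get added to the map.
type cellStatus struct {
	dbReady bool
	mqReady bool
}

func main() {
	// cell1's DB and MQ finished first, so only cell1 is in the map yet.
	cells := map[string]*cellStatus{
		"cell1": {dbReady: true, mqReady: true},
	}

	// Buggy pattern: assume cell0 is present as soon as any cell is ready.
	//   cell0 := cells["cell0"] // missing key returns nil
	//   _ = cell0.dbReady       // nil pointer dereference -> panic

	// Guarded pattern: check the lookup and wait (requeue) until cell0 exists.
	cell0, ok := cells["cell0"]
	if !ok {
		fmt.Println("cell0 DB not created yet; waiting before processing cell1")
		return
	}
	fmt.Println("cell0 DB ready:", cell0.dbReady)
}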
12ff016 to 342b263
keystoneAPI := th.GetKeystoneAPI(keystoneAPIName)
keystoneAPI.Status.APIEndpoints["internal"] = "http://keystone-internal-openstack.testing"
Eventually(func(g Gomega) {
	g.Expect(k8sClient.Status().Update(ctx, keystoneAPI.DeepCopy())).Should(Succeed())
For some other places like this, we don't do DeepCopy(). Would be nice to align those getter interfaces.
We don't need the DeepCopy here. Probably a copy-paste error. We would need the DeepCopy though if there were other things inside the Eventually that depend on the keystoneAPI variable. As far as I understand, the Update() call will store the result of the update in the object passed to it.
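For reference, a minimal sketch of the quoted snippet without the DeepCopy, assuming the same test context (th.GetKeystoneAPI, keystoneAPIName, k8sClient, ctx) is in scope; since Update() writes the API server's response back into the object it receives, keystoneAPI itself stays in sync for any later assertions:

keystoneAPI := th.GetKeystoneAPI(keystoneAPIName)
keystoneAPI.Status.APIEndpoints["internal"] = "http://keystone-internal-openstack.testing"
Eventually(func(g Gomega) {
	// no DeepCopy: the updated object is written back into keystoneAPI
	g.Expect(k8sClient.Status().Update(ctx, keystoneAPI)).Should(Succeed())
}).Should(Succeed())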
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bogdando, gibizer, SeanMooney. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/test functional
@bogdando: The
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test functional
We saw a panic during cell creation in the nova-controller intermittently:
After chasing this for a while I figured that the panic happens when both the cell1 DB and MQ creation finish before the cell0 DB creation. So this patch adds a test case that reproduces the panic in a stable way.
The troubleshooting showed that there was a bug in the nova-controller in the ensureCell loop. If cell0 was still waiting for its DB to be created but cell1 had all its prerequisites done (DB, MQ), then the code tried to look up cell0, which was not in the internal cell map, causing a panic.