feat(kds): introduce zone health checks #7821
Signed-off-by: Mike Beaumont <[email protected]>
I think we could introduce tests at least for ZoneWatch. Any ideas for E2E test?
I think e2e will be somewhat difficult because of the nature of isolating failure to the app-level health check. The always-possible but ugly option is to introduce an option like "don't send health check pings" to the CP, but IMO we should avoid this artificial, public test interface. @slonka mentioned potentially sending SIGSTOP to make the process stop sending pings but keep TCP connections open. The problem that I see is that HTTP/2 PING ACKs also stop, which will cause the gRPC keepalive to fail.
I agree. We could put Kong in the middle (we have test infra to deploy KIC), which apparently does not proxy PING/GOAWAY, and then do SIGKILL on Zone CP. I wonder if this would work.
It should work 👍
OK, if we have some infrastructure for starting up Kong Gateway, it becomes significantly less work.
…8017) Revert "feat(kuma-cp): introduce zone health checks (#7821)" This reverts commit baa72b6. --- Signed-off-by: slonka <[email protected]>
This reverts commit 386ab53. Signed-off-by: Mike Beaumont <[email protected]>
* feat(kuma-cp): reintroduce zone health checks (#7821) This reverts commit 386ab53. * feat: don't error in zone if global CP doesn't support health check Signed-off-by: Mike Beaumont <[email protected]>
Explanation
Zone:
Global:
- ZoneWatch listens for ZoneOpenedStream events and marks the (tenantID, zone) for watching.
- For each watched (tenantID, zone), if the time of the last health check in ZoneInsight is too old, it sends ZoneWentOffline.
- Stream handlers listen for ZoneWentOffline events; a handler returns if said event is received, ending the stream.

We store the info in ZoneInsight because all instances potentially need to kill streams, but not every instance will receive a health check from connected zones.
Tests
The need for time.Sleep in the tests comes about because the following happen asynchronously:
- ZoneWatch subscribes to ZoneOpenedStream events in Start.
  - In reality, this is guaranteed to happen before ZoneOpenedStream events are sent, because we only send them in response to new gRPC streams being opened.
- ZoneOpenedStream is witnessed by ZoneWatch.
  - The test adds a time.Sleep because we want to update the health check time only once, specifically after the zone starts being watched, and then check that it's disconnected.
  - In reality, a zone continually sends its health check ping, so the time will eventually be updated after the initially seen last time.
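
To illustrate why the subscription has to be in place before the event is sent, here is a minimal sketch of an in-memory event bus. EventBus, Subscribe and Send are hypothetical names for this illustration and are not Kuma's events API; the buffer size and the drop-when-full behavior are likewise assumptions. It is a package-level sketch without a main function.

```go
package main

import "sync"

// EventBus is a hypothetical fan-out bus used only for this illustration.
type EventBus struct {
	mu   sync.Mutex
	subs []chan string
}

// Subscribe registers a new buffered subscriber channel.
func (b *EventBus) Subscribe() <-chan string {
	b.mu.Lock()
	defer b.mu.Unlock()
	ch := make(chan string, 8)
	b.subs = append(b.subs, ch)
	return ch
}

// Send delivers the event to the subscribers that exist right now.
// Anything sent before a subscriber registered is never seen by it.
func (b *EventBus) Send(event string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ch := range b.subs {
		select {
		case ch <- event:
		default: // drop the event if the subscriber's buffer is full
		}
	}
}
```

With a bus like this, an event sent before Subscribe returns is lost to that subscriber, which is the ordering the test has to wait out.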
Todo