-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Race condition adding/updating listener #11833
Comments
Hi @lkysow we came across the same issue with Consul 1.10.3 We use nomad consul integration, after one particular job deployment (replace one service with another service), the ingress gateway started to spamming the error message:
Had to restart all ingress allocations after the issue happened. |
@lkysow I was able to reliably reproduce:
It seems to me the all clusters (or endpoints, listeners..) updates info from cfg snap should make their way into the xds updates to envoy, since the cfg snap is already scoped to a particular sidecar/gateway. That would effectively mean using wildcard always for handlers. I might be missing something, was there any reason it was not in such way? |
@fredwangwang would you be able to provide a more detailed step by step reproduction with config files etc? I'm unable to reproduce this following your steps locally. |
Hi @lkysow func TestServer_DeltaAggregatedResources_v3_GetAllClusterAfterConsulRestarted(t *testing.T) {
// This illustrates a scenario related to https://github.com/hashicorp/consul/issues/11833
aclResolve := func(id string) (acl.Authorizer, error) {
// Allow all
return acl.RootAuthorizer("manage"), nil
}
scenario := newTestServerDeltaScenario(t, aclResolve, "web-sidecar-proxy", "", 0)
_, mgr, errCh, envoy := scenario.server, scenario.mgr, scenario.errCh, scenario.envoy
sid := structs.NewServiceID("web-sidecar-proxy", nil)
// Register the proxy to create state needed to Watch() on
mgr.RegisterProxy(t, sid)
var snap *proxycfg.ConfigSnapshot
runStep(t, "get into state after consul restarted", func(t *testing.T) {
snap = newTestSnapshot(t, nil, "")
// Send initial cluster discover.
// This is to simulate the discovery request call from envoy after disconnected from consul ads stream.
envoy.SendDeltaReq(t, ClusterType, &envoy_discovery_v3.DeltaDiscoveryRequest{
ResourceNamesSubscribe: []string{
"local_app",
"db.default.dc1.internal.11111111-2222-3333-4444-555555555555.consul",
},
InitialResourceVersions: map[string]string{
"local_app": "a948904f2f0f479b8f8197694b30184b0d2ed1c1cd2a1ec0fb85d299a192a447",
"db.default.dc1.internal.11111111-2222-3333-4444-555555555555.consul": "5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03",
},
})
// Check no response sent yet
assertDeltaChanBlocked(t, envoy.deltaStream.sendCh)
requireProtocolVersionGauge(t, scenario, "v3", 1)
// Deliver a new snapshot
// the config contains 3 clusters: local_app, db, geo-cache.
// this is to simulate the fact that there is one additional (geo-cache upstream) cluster gets added to the sidecar service
// during the time xds disconnected (consul restarted).
mgr.DeliverConfig(t, sid, snap)
assertDeltaResponseSent(t, envoy.deltaStream.sendCh, &envoy_discovery_v3.DeltaDiscoveryResponse{
TypeUrl: ClusterType,
Nonce: hexString(1),
Resources: makeTestResources(t,
makeTestCluster(t, snap, "tcp:local_app"),
makeTestCluster(t, snap, "tcp:db"),
makeTestCluster(t, snap, "tcp:geo-cache"),
),
})
})
envoy.Close()
select {
case err := <-errCh:
require.NoError(t, err)
case <-time.After(50 * time.Millisecond):
t.Fatalf("timed out waiting for handler to finish")
}
} Since we used nomad & consul integration, our setup involves deploying nomad jobs, which is too lengthy to provide in this issue. If you would like to see the issue in action, I think we can arrange some time with @YuZhao-Zhang 's help. Let me know. |
When a wildcard xDS type (LDS/CDS/SRDS) reconnects from a delta xDS stream, prior to envoy `1.19.0` it would populate the `ResourceNamesSubscribe` field with the full list of currently subscribed items, instead of simply omitting it to infer that it wanted everything (which is what wildcard mode means). This upstream issue was filed in envoyproxy/envoy#16063 and fixed in envoyproxy/envoy#16153 which went out in Envoy `1.19.0` and is fixed in later versions (later refactored in envoyproxy/envoy#16855). This PR conditionally forces LDS/CDS to be wildcard-only even when the connected Envoy requests a non-wildcard subscription, but only does so on versions prior to `1.19.0`, as we should not need to do this on later versions. This fixes the failure case as described here: #11833 (comment) Co-authored-by: Huan Wang <[email protected]>
When a wildcard xDS type (LDS/CDS/SRDS) reconnects from a delta xDS stream, prior to envoy `1.19.0` it would populate the `ResourceNamesSubscribe` field with the full list of currently subscribed items, instead of simply omitting it to infer that it wanted everything (which is what wildcard mode means). This upstream issue was filed in envoyproxy/envoy#16063 and fixed in envoyproxy/envoy#16153 which went out in Envoy `1.19.0` and is fixed in later versions (later refactored in envoyproxy/envoy#16855). This PR conditionally forces LDS/CDS to be wildcard-only even when the connected Envoy requests a non-wildcard subscription, but only does so on versions prior to `1.19.0`, as we should not need to do this on later versions. This fixes the failure case as described here: #11833 (comment) Co-authored-by: Huan Wang <[email protected]>
When a wildcard xDS type (LDS/CDS/SRDS) reconnects from a delta xDS stream, prior to envoy `1.19.0` it would populate the `ResourceNamesSubscribe` field with the full list of currently subscribed items, instead of simply omitting it to infer that it wanted everything (which is what wildcard mode means). This upstream issue was filed in envoyproxy/envoy#16063 and fixed in envoyproxy/envoy#16153 which went out in Envoy `1.19.0` and is fixed in later versions (later refactored in envoyproxy/envoy#16855). This PR conditionally forces LDS/CDS to be wildcard-only even when the connected Envoy requests a non-wildcard subscription, but only does so on versions prior to `1.19.0`, as we should not need to do this on later versions. This fixes the failure case as described here: #11833 (comment) Co-authored-by: Huan Wang <[email protected]>
When a wildcard xDS type (LDS/CDS/SRDS) reconnects from a delta xDS stream, prior to envoy `1.19.0` it would populate the `ResourceNamesSubscribe` field with the full list of currently subscribed items, instead of simply omitting it to infer that it wanted everything (which is what wildcard mode means). This upstream issue was filed in envoyproxy/envoy#16063 and fixed in envoyproxy/envoy#16153 which went out in Envoy `1.19.0` and is fixed in later versions (later refactored in envoyproxy/envoy#16855). This PR conditionally forces LDS/CDS to be wildcard-only even when the connected Envoy requests a non-wildcard subscription, but only does so on versions prior to `1.19.0`, as we should not need to do this on later versions. This fixes the failure case as described here: #11833 (comment) Co-authored-by: Huan Wang <[email protected]> Co-authored-by: Huan Wang <[email protected]>
A version of the PR by @fredwangwang was adapted and merged for release in the next set of patch releases of |
@rboyer thanks for getting the fix ready for the next release! |
Overview of the Issue
I can't reproduce this but creating a ticket to see if other people have run into the same issue.
One day I applied a proxy defaults config and my proxy went into this super tight loop where it was spamming errors and Consul was spamming these errors. Thousands of these messages every couple of seconds:
(newlines added for readability)
It looks like a race condition occurred where we were trying to XDS add a listener and expected a cluster to be there when it wasn't.
To fix it I had to restart the proxy.
Consul info for both Client and Server
Consul 1.10.4
The text was updated successfully, but these errors were encountered: