fix unknown cluster issue #12160

Closed


fredwangwang
Contributor

When the Envoy sidecar gets disconnected from the xDS stream, the reconnecting request will
contain `InitialResourceVersions` and `ResourceNamesSubscribe`. This is OK for the endpoint and route types,
as these two are driven by (are child types of) the cluster and listener types respectively.

However, registering the `cluster` type as a subscription instead of a wildcard causes Envoy
to not receive any new cluster updates for the rest of its life. The same goes for `listener`.

This PR always sets the cluster and listener types to wildcard, to ensure the Envoy sidecar gets
those updates after disconnecting from the xDS stream for whatever reason (network blip, Consul restart, etc.).
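
To make the idea concrete, here is a minimal sketch in Go of the decision described above. It is not the code from this PR: `deltaRequest` and `isWildcard` are hypothetical names, the struct only models the two request fields mentioned in the description, and treating an empty subscribe list as wildcard for the other types is an assumption about how the server interprets such requests.

```go
package main

// Type URLs from the Envoy v3 API. The route and endpoint URLs also appear
// in the Envoy log lines quoted in this PR.
const (
	ClusterType  = "type.googleapis.com/envoy.config.cluster.v3.Cluster"
	ListenerType = "type.googleapis.com/envoy.config.listener.v3.Listener"
	RouteType    = "type.googleapis.com/envoy.config.route.v3.RouteConfiguration"
	EndpointType = "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment"
)

// deltaRequest is a hypothetical, trimmed-down stand-in for a delta xDS
// discovery request. Only the fields discussed in the description are modelled.
type deltaRequest struct {
	TypeURL                 string
	ResourceNamesSubscribe  []string
	InitialResourceVersions map[string]string
}

// isWildcard decides whether the server should treat the subscription for
// this type as wildcard (send every resource of the type) or as a list of
// explicitly named resources.
func isWildcard(req deltaRequest) bool {
	switch req.TypeURL {
	case ClusterType, ListenerType:
		// Always wildcard, even when a reconnecting Envoy sends explicit
		// names, so that newly added clusters and listeners are delivered.
		return true
	default:
		// Assumption: for the remaining types an empty subscribe list
		// means wildcard, otherwise only the named resources are sent.
		return len(req.ResourceNamesSubscribe) == 0
	}
}
```

With this shape, a reconnect request for the cluster type that names the six clusters Envoy already knows is still handled as wildcard, so a seventh cluster added later would be pushed, while a route request naming specific routes stays a named subscription.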

@hashicorp-cla

hashicorp-cla commented Jan 21, 2022

CLA assistant check
All committers have signed the CLA.

@fredwangwang
Contributor Author

this solves the failure case as described here:
#11833 (comment)

ingress log before:

[2022-01-21 21:20:25.837][1][info][main] [source/server/server.cc:764] starting main dispatch loop
[2022-01-21 21:20:25.850][1][info][upstream] [source/common/upstream/cds_api_helper.cc:28] cds: add 6 cluster(s), remove 0 cluster(s)
[2022-01-21 21:20:27.087][1][info][upstream] [source/common/upstream/cds_api_helper.cc:65] cds: added/updated 6 cluster(s), skipped 0 unmodified cluster(s)
[2022-01-21 21:20:27.087][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:168] cm init: initializing secondary clusters
[2022-01-21 21:20:27.099][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:192] cm init: all clusters initialized
[2022-01-21 21:20:27.099][1][info][main] [source/server/server.cc:745] all clusters initialized. initializing init manager
[2022-01-21 21:20:27.108][1][info][tracing] [source/common/tracing/http_tracer_manager_impl.cc:41] instantiating a new tracer: envoy.tracers.zipkin
[2022-01-21 21:20:27.109][1][info][upstream] [source/server/lds_api.cc:78] lds: add/update listener 'http:0.0.0.0:8080'
[2022-01-21 21:20:27.113][1][info][config] [source/server/listener_manager_impl.cc:888] all dependencies initialized. starting workers
### consul restarted
[2022-01-21 21:20:38.555][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] DeltaAggregatedResources gRPC config stream closed: 13,
[2022-01-21 21:20:46.588][1][info][upstream] [source/common/upstream/cds_api_helper.cc:28] cds: add 6 cluster(s), remove 0 cluster(s)
[2022-01-21 21:20:47.782][1][info][upstream] [source/common/upstream/cds_api_helper.cc:65] cds: added/updated 6 cluster(s), skipped 0 unmodified cluster(s)
[2022-01-21 21:20:47.786][1][info][upstream] [source/server/lds_api.cc:78] lds: add/update listener 'http:0.0.0.0:8080'
### new service added
[2022-01-21 21:21:36.215][1][warning][config] [source/common/config/delta_subscription_state.cc:155] delta config for type.googleapis.com/envoy.config.route.v3.RouteConfiguration rejected: route: unknown cluster 'NEW-SERVICE.default.dc1.internal.66ab7f8f-f016-483b-e544-975167a3806a.consul'
[2022-01-21 21:21:36.215][1][warning][config] [source/common/config/grpc_subscription_impl.cc:127] gRPC config for type.googleapis.com/envoy.config.route.v3.RouteConfiguration rejected: route: unknown cluster 'NEW-SERVICE.default.dc1.internal.66ab7f8f-f016-483b-e544-975167a3806a.consul'
[2022-01-21 21:21:36.217][1][warning][config] [source/common/config/delta_subscription_state.cc:155] delta config for type.googleapis.com/envoy.config.route.v3.RouteConfiguration rejected: route: unknown cluster 'NEW-SERVICE.default.dc1.internal.66ab7f8f-f016-483b-e544-975167a3806a.consul'
[2022-01-21 21:21:36.217][1][warning][config] [source/common/config/grpc_subscription_impl.cc:127] gRPC config for type.googleapis.com/envoy.config.route.v3.RouteConfiguration rejected: route: unknown cluster 'NEW-SERVICE.default.dc1.internal.66ab7f8f-f016-483b-e544-975167a3806a.consul'
[2022-01-21 21:21:36.219][1][warning][config] [source/common/config/delta_subscription_state.cc:155] delta config for type.googleapis.com/envoy.config.route.v3.RouteConfiguration rejected: route: unknown cluster 'NEW-SERVICE.default.dc1.internal.66ab7f8f-f016-483b-e544-975167a3806a.consul'
[2022-01-21 21:21:36.219][1][warning][config] [source/common/config/grpc_subscription_impl.cc:127] gRPC config for type.googleapis.com/envoy.config.route.v3.RouteConfiguration rejected: route: unknown cluster 'NEW-SERVICE.default.dc1.internal.66ab7f8f-f016-483b-e544-975167a3806a.consul'
<keep spamming the same message>

after patching consul:

[2022-01-21 21:55:30.707][1][info][main] [source/server/server.cc:764] starting main dispatch loop
[2022-01-21 21:55:30.737][1][info][upstream] [source/common/upstream/cds_api_helper.cc:28] cds: add 6 cluster(s), remove 0 cluster(s)
[2022-01-21 21:55:32.100][1][info][upstream] [source/common/upstream/cds_api_helper.cc:65] cds: added/updated 6 cluster(s), skipped 0 unmodified cluster(s)
[2022-01-21 21:55:32.100][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:168] cm init: initializing secondary clusters
[2022-01-21 21:55:32.115][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:192] cm init: all clusters initialized
[2022-01-21 21:55:32.115][1][info][main] [source/server/server.cc:745] all clusters initialized. initializing init manager
[2022-01-21 21:55:32.129][1][info][tracing] [source/common/tracing/http_tracer_manager_impl.cc:41] instantiating a new tracer: envoy.tracers.zipkin
[2022-01-21 21:55:32.130][1][info][upstream] [source/server/lds_api.cc:78] lds: add/update listener 'http:0.0.0.0:8080'
[2022-01-21 21:55:32.140][1][info][config] [source/server/listener_manager_impl.cc:888] all dependencies initialized. starting workers
### consul restarted
[2022-01-21 21:56:09.496][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] DeltaAggregatedResources gRPC config stream closed: 13, 
[2022-01-21 21:56:10.000][1][info][upstream] [source/common/upstream/cds_api_helper.cc:28] cds: add 6 cluster(s), remove 0 cluster(s)
[2022-01-21 21:56:11.308][1][info][upstream] [source/common/upstream/cds_api_helper.cc:65] cds: added/updated 6 cluster(s), skipped 0 unmodified cluster(s)
[2022-01-21 21:56:11.321][1][info][upstream] [source/server/lds_api.cc:78] lds: add/update listener 'http:0.0.0.0:8080'
### new service added
[2022-01-21 21:56:42.260][1][info][upstream] [source/common/upstream/cds_api_helper.cc:28] cds: add 1 cluster(s), remove 0 cluster(s)
[2022-01-21 21:56:42.517][1][info][upstream] [source/common/upstream/cds_api_helper.cc:65] cds: added/updated 1 cluster(s), skipped 0 unmodified cluster(s)
[2022-01-21 21:56:57.263][1][warning][config] [source/common/config/grpc_subscription_impl.cc:119] gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment

@rboyer
Member

rboyer commented Jan 21, 2022

I found this reference to this strange behavior as being non-ideal in envoy: envoyproxy/envoy#16063

@rboyer
Member

rboyer commented Jan 21, 2022

And the fix landed here: envoyproxy/envoy#16153

@fredwangwang
Contributor Author

Thanks @rboyer! Do you know which Envoy version has it? AFAIK we are on the latest Envoy version (v1.18.4) that is compatible with Consul 1.10.x, and we are still seeing this issue.

@rboyer
Member

rboyer commented Jan 21, 2022

It doesn't show up in the changelog, but I scanned for a key line from the initial PR commit and it shows up in

  • v1.19.x
  • v1.20.x
  • v1.21.x

@rboyer
Member

rboyer commented Jan 21, 2022

So there's precedent in the Consul codebase for slightly altering xDS behavior based on the connected envoy versions. I'm going to look into bending your changes to that form soon so that we don't have to do something "off spec" for compliant envoy instances going forward.
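
A rough sketch of what such a version gate could look like, in Go. This is an illustration only, not the code from the follow-up PR: `needsWildcardWorkaround` is a hypothetical helper, the 1.19 cutoff is taken from the list of release lines above where the Envoy fix was found, and treating an unparseable version as needing the workaround is an assumption.

```go
package main

import (
	"strconv"
	"strings"
)

// needsWildcardWorkaround reports whether the connected Envoy is older than
// 1.19, the first release line in which the upstream fix was found, and
// therefore should have its cluster and listener subscriptions forced to
// wildcard. The version string is expected as "major.minor.patch", with an
// optional leading "v".
func needsWildcardWorkaround(envoyVersion string) bool {
	parts := strings.Split(strings.TrimPrefix(envoyVersion, "v"), ".")
	if len(parts) < 2 {
		// Assumption: when the version is unknown, keep the workaround on.
		return true
	}
	major, errMajor := strconv.Atoi(parts[0])
	minor, errMinor := strconv.Atoi(parts[1])
	if errMajor != nil || errMinor != nil {
		return true
	}
	return major < 1 || (major == 1 && minor < 19)
}
```

Under these assumptions `needsWildcardWorkaround("1.18.4")` is true and `needsWildcardWorkaround("1.19.0")` is false, so compliant Envoy instances keep the on-spec behavior.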

@fredwangwang
Contributor Author

awesome, thank you for looking into this @rboyer !

@rboyer
Member

rboyer commented Jan 21, 2022

Thank you for tracking down the shape of this proposed fix, which helped identify what was possibly up over on the envoy side.

@rboyer
Member

rboyer commented Jan 21, 2022

For my own notes, the envoy folks have slightly updated the surrounding code from this patch in this later patch: envoyproxy/envoy#16855

@rboyer
Member

rboyer commented Jan 24, 2022

@fredwangwang I've made a new PR that replaces this one (and avoids the typo in your branch name) with the conditional behavior. I've carried your test over basically unchanged: #12174

@fredwangwang
Contributor Author

Great, thanks @rboyer! Let me close this PR and keep an eye on yours instead :)
