
Consul default token not utilized for Nomad non-native connect jobs #11073

Closed
johnalotoski opened this issue Aug 23, 2021 · 9 comments


johnalotoski commented Aug 23, 2021

Nomad version

Nomad v1.1.3

Operating system and Environment details

NixOS 21.05, Docker driver jobs
Consul v1.10.1 (patched with Consul PR 9639 and 10714 to fix websockets and upstream listener issues)

Issue

Due to Nomad issue 9813, we deliberately avoid passing a Consul token in the Nomad client config and instead rely on the Consul default token, which is used when no Nomad client Consul token is supplied. This approach works for us without issue everywhere else at the moment (AFAIK) except for Connect jobs. When no Consul token is supplied to the Nomad client, the Connect Envoy sidecar creation code does not use the default Consul token to create a Connect service instance token, and the envoy_bootstrap.json file that is created passes an empty string for the x-consul-token instead of a Connect service instance token:

    <...snip...>
    "ads_config": {
      "api_type": "DELTA_GRPC",
      "transport_api_version": "V3",
      "grpc_services": {
        "initial_metadata": [
          {
            "key": "x-consul-token",
            "value": ""
          }
        ],
        "envoy_grpc": {
          "cluster_name": "local_agent"
        }
      }
    }
    <...snip...>

This results in the following errors in the Envoy container stderr logs and the Consul agent logs:

    # From consul agent on the host (log level is trace):
    agent.envoy.xds: Incremental xDS v3: xdsVersion=v3 direction=request protobuf="{ "typeUrl": "type.googleapis.com/envoy.config.cluster.v3.Cluster"
    agent.envoy.xds: subscribing to type: xdsVersion=v3 typeUrl=type.googleapis.com/envoy.config.cluster.v3.Cluster
    agent.envoy.xds: watching proxy, pending initial proxycfg snapshot for xDS: service_id=_nomad-task-6227f408-bee9-77fa-529f-924164f42b80-group-api-count-api-9001-sidecar-proxy xdsVersion=v3
    agent.envoy.xds: Got initial config snapshot: service_id=_nomad-task-6227f408-bee9-77fa-529f-924164f42b80-group-api-count-api-9001-sidecar-proxy xdsVersion=v3
    agent.envoy: Error handling ADS delta stream: xdsVersion=v3 error="rpc error: code = PermissionDenied desc = permission denied"

    # From envoy stderr in the envoy sidecar container (log level is trace):
    DeltaAggregatedResources gRPC config stream closed: 7, permission denied
    gRPC update for type.googleapis.com/envoy.config.cluster.v3.Cluster failed
    gRPC update for type.googleapis.com/envoy.config.listener.v3.Listener failed

However, if the default Consul token is provided to the Nomad client directly, then Nomad (assuming the default token has sufficient ACLs) properly creates a Connect service instance token, places it in the envoy_bootstrap.json file, and the Connect mesh job works as expected.
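For context, this is roughly what the relevant Nomad client agent consul stanza looks like in our setup; a minimal sketch only (the address value is an assumption), the point being that token is deliberately left unset so that requests fall through to the Consul agent's default token:

    # Nomad agent (client) configuration -- minimal sketch; the address is an assumption.
    # token is intentionally omitted so that Nomad's Consul requests carry no token and
    # the Consul agent applies its configured default token instead.
    consul {
      address = "127.0.0.1:8500"
      # token = "..."   # deliberately not set (see nomad issue 9813)
    }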

Reproduction steps

  • Ensure the default Consul token has sufficient ACLs to at least create a Connect service instance token, for example:

        {
          "node_prefix": {
            "": {
              "intentions": "deny",
              "policy": "read"
            }
          },
          "service_prefix": {
            "": {
              "intentions": "deny",
              "policy": "write"
            }
          }
        }

  • Ensure intentions are already allowed for the test, or otherwise modify the ACL example shown above as needed.
  • Don't supply the default token in the Nomad client config; the Consul agent's default token is where it should live (see the sketch after this list).
  • Try launching a Connect job; the log errors shown above should appear.
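For reference, a minimal sketch of where the Consul default token referenced above is configured on the Consul agent; the token value is a placeholder and the surrounding ACL settings are assumptions, not taken verbatim from this setup:

    # Consul agent configuration -- sketch only; the token UUID below is a placeholder.
    acl {
      enabled        = true
      default_policy = "deny"
      tokens {
        # The default token applied to requests that arrive at the agent without a token;
        # the ACL policy shown in the first step would be attached to this token.
        default = "00000000-0000-0000-0000-000000000000"
      }
    }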

Expected Result

  • For the Consul default token to be utilized when no Consul token is supplied in the Nomad client config, and for a proper Connect service identity token to be created from that default token and populated into the envoy_bootstrap.json file.

Actual Result

  • The permission denied errors shown above.

Job file (if appropriate)

  • An example job that injects a token with sufficient ACLs directly into the envoy_bootstrap.json file, so the Connect job works without supplying the Consul default token to the Nomad client directly (and without causing the problems mentioned in Nomad issue 9813), is seen here.
  • Note that supplying the Consul default token to the Connect job through the CONSUL_HTTP_TOKEN env var at the job, task, or sidecar_task env stanza level did not get an x-consul-token populated into the envoy_bootstrap.json file (a sketch of this attempt follows).
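For illustration, this is roughly the shape of the Connect group and sidecar_task env stanza referred to in the note above. It is a minimal sketch only: the service name and port are taken from the logs above, while the Docker image and token value are placeholders; as noted, this did not result in an x-consul-token being populated:

    # Minimal Connect group sketch -- image and token value are placeholders.
    group "api" {
      network {
        mode = "bridge"
      }

      service {
        name = "count-api"
        port = "9001"

        connect {
          sidecar_service {}

          sidecar_task {
            env {
              # Passing the Consul default token to the sidecar task this way did not
              # get it used for the Envoy bootstrap.
              CONSUL_HTTP_TOKEN = "<consul-default-token>"
            }
          }
        }
      }

      task "api" {
        driver = "docker"
        config {
          image = "hashicorpdev/counter-api:v3"
        }
      }
    }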

DerekStrickland commented Aug 30, 2021

Hi @johnalotoski ,

Thanks for using Nomad. We're looking into this and discussing internally. Thanks for submitting the issue, and stay tuned for updates.

In the meantime, do you have any relevant Nix config you can share?

@johnalotoski
Author

Thanks @DerekStrickland! Is there any Nix config in particular you are looking for? There isn't much relevant that I'm aware of at the moment. The Consul and Nomad configs for this testing are provided here and here respectively, which are likely more relevant.


johnalotoski commented Aug 30, 2021

Looks like this Consul bug, which has now been patched, may be related: hashicorp/consul#10824.
(Oh, I've already applied this patch, which was mentioned in Consul issue 10714, and it didn't seem to solve the issue.)

@DerekStrickland
Contributor

Thanks @johnalotoski. I know almost nothing about Nix, so you caught the key word... relevant. If you don't think there is any relevant Nix config, I'll have to trust you.

That said, I did a little digging yesterday, and I too quickly found myself in Consul code. I'll see if I can get some eyes from the folks on that team to help troubleshoot.

@DerekStrickland
Contributor

@johnalotoski Just to confirm: you're confident you're running Consul 1.10.2?

@johnalotoski
Author

Hi @DerekStrickland, no, I'm currently running Consul 1.10.1 with the patch from Consul PR 9639 and the one for issue 10714 mentioned above. I opened this ticket about 5 days before 1.10.2 was released. I'll get the cluster bumped to 1.10.2 in the next day or two and report back shortly.

@johnalotoski
Author

Hi again @DerekStrickland. I bumped the cluster to Consul 1.10.2 today, still at Nomad v1.1.3 -- will bump Nomad to v1.1.4 next week. But in any case, indeed, bumping to Consul 1.10.2 has resolved this issue. Connect jobs now get a connect service instance token from the default token properly. Thank you!

@DerekStrickland
Contributor

That's great to hear! I'll go ahead and close this issue then. Thanks again for using Nomad.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 16, 2022