Default workload identity always uses nomadproject.io audience #25079

Closed
shoeffner opened this issue Feb 10, 2025 · 7 comments

Comments

@shoeffner

shoeffner commented Feb 10, 2025

Nomad version

Nomad v1.8.2
BuildDate 2024-07-26T12:22:15Z
Revision 919bd4e7602ed1c6e26e865186be6a51f5dc33e1
(this is a custom build, hence the date and revision will not match the official release; our custom patches are unrelated, though)

Nomad v1.9.1
BuildDate 2024-10-21T09:00:50Z
Revision d9ec23f
(this is the official release)

Operating system and Environment details

Ubuntu 22 (1.8) and 24 (1.9)

Issue

We are currently transitioning away from the Vault token integration to workload identities.
It seems the default audience (vault > default_identity > aud in the Nomad server config) is not taken into account when the workload identity tokens are generated; the default nomadproject.io audience is used instead.
Specifying an audience for a specific identity works as expected.

Reproduction steps

Run a server with a config containing the following vault block:

vault {
  enabled = true

  # old token config -- for the transition, we support both
  create_from_role = "nomad-cluster"
  allow_unauthenticated = false
  address = "https://our-vault.local.example.com"
  
  # new identity config
  default_identity = {
    aud = ["our audience"]
    ttl = "1h"
  }
}

To test the token, we can run this job:

job "token-test" {
    datacenters = ["*"]
    type = "batch"

    task "spawn" {
        driver = "docker"

        config {
            image      = "busybox"
            entrypoint = [""]
            command    = "/bin/sh"
            args       = ["-c", "echo $NOMAD_TOKEN | cut -d. -f 2 | base64 -d; sleep 5"]
        }

        identity {
            env = true
        }

        restart {
            attempts = 0
        }
    }

    reschedule {
        attempts = 0
    }
}

To see the job-override in action, modify the job by extending the identity and updating the variable name:

job "token-test" {
    datacenters = ["*"]
    type = "batch"

    task "spawn" {
        driver = "docker"

        config {
            image      = "busybox"
            entrypoint = [""]
            command    = "/bin/sh"
            args       = ["-c", "echo $NOMAD_TOKEN_test | cut -d. -f 2 | base64 -d; sleep 5"]
        }

        identity {
            name = "test"
            aud = ["our audience"]
            env = true
        }

        restart {
            attempts = 0
        }
    }

    reschedule {
        attempts = 0
    }
}

Expected Result

Both jobs should have this audience claim at the beginning of the output:

{"aud":"our audience",...

Actual Result

The first job (using the default-identity) has the default audience:

{"aud":"nomadproject.io",...

The second job (using the test-identity) has the correct audience:

{"aud":"our audience",...

Job file (if appropriate)

See above.

It could be that we are configuring something wrong, but we don't know what else could have gone wrong other than the server config not being used.
If you need any more information about our setup or if I can assist in reproducing the issue, please let me know!

@tgross
Member

tgross commented Feb 10, 2025

Hi @shoeffner! The vault.default_identity documentation says:

Specifies the default workload identity configuration to use when a task with a vault block does not specify an identity block named vault_<name>

(Emphasis added.) Nomad doesn't mint workload identities for Vault at all unless you have a vault block in the jobspec. Using the word "default" here is probably steering you in the wrong direction. What it's really saying is "this is what you get, if you get one".

So therefore:

  • In your first jobspec, you're exposing the Nomad default workload identity to the task environment via identity { env = true } but never minting a workload identity suitable for Vault, much less getting a Vault token with it.
  • In the second jobspec you're getting 2 WI's: a default one that's unused and not exposed to the workload, and a second one named "test" that's not related to Vault at all.

Change your jobspec to the following (note the empty vault {} block):

job "token-test" {
  datacenters = ["*"]
  type        = "batch"

  task "spawn" {
    driver = "docker"

    config {
      image      = "busybox"
      entrypoint = [""]
      command    = "/bin/sh"
      args       = ["-c", "echo $NOMAD_TOKEN | cut -d. -f 2 | base64 -d; sleep 5"]
    }

    vault {}

    identity {
      env = true
    }

    restart {
      attempts = 0
    }
  }

  reschedule {
    attempts = 0
  }
}

That'll allow Nomad to mint a Vault token for you using the audience you've specified. It'll be exposed to the workload as the VAULT_TOKEN environment variable. If instead you want to expose the workload identity token itself (so you can log in to Vault manually? I don't recommend this unless you're ready to deal with token refreshes), you will end up needing an env = true or file = true in that server configuration.

(Oh also, your 1.8.2 hosts won't have the vault.default_identity config at all! That was added in 1.8.4.)

@shoeffner
Author

Thanks for your thorough answer, @tgross !
I was indeed confused by the naming/wording, I guess, and I am sorry I didn't notice that default_identity did not work at all on 1.8.2. My two examples "to test it" behaved the same on both clusters, but as I see now, they were not good examples because I missed the vault block.

Now my current understanding how it works with identity {}, vault {} and the server configuration vault > default_identity is (assuming the "default" names and only one specification):

  • neither identity nor vault specified in the job: uses the Nomad default identity; no tokens can be exposed to the job
  • identity but not vault specified in the job: uses the Nomad default identity with overrides; NOMAD_TOKEN (= JWT) can be exposed to the job
  • vault but not identity specified in the job: uses the default identity from the server's vault configuration; NOMAD_TOKEN (= JWT) and VAULT_TOKEN can be exposed to the job
  • vault and identity specified in the job: uses the default identity from the server's vault configuration with overrides; NOMAD_TOKEN (= JWT) and VAULT_TOKEN can be exposed to the job

So this clears up quite a bit of confusion, thanks!
Out of curiosity, is there a way to still change the aud on the nomad default identity then, other than overriding it via the identity block? I think at least showing aud = "our nomad cluster" would then also make it clear that this is a token for our nomad cluster, not for nomadproject.io.

Thanks again for your answer, it helped me a lot! Feel free to close the issue again, as this then works as intended.

Maybe a bit on the background (as you asked if we would plan to log in manually):

As I initially stated, we are migrating away from Vault tokens to workload identities.

We don't plan on manually authenticating with Vault, even though it would be a workaround to indirectly provide namespace-level workload policies (see also hashicorp/terraform-provider-nomad#500) – we decided that it's not worth the effort for a handful of jobs which would benefit from that.

In a web service to schedule jobs from predefined templates (think of nomad-pack in Web UI form with some job progress monitoring), the jobs it schedules (see also #24663) report back via an API when certain events happen. We create the API credentials and pass them via Vault (with a template block); I am looking into simplifying this by allowing the service to validate workload identities instead, but here overriding the aud on the job level seems to be a good choice. While having a look into this to understand how the tokens work, I realized the issue I reported here.

We offered access to user-specified secrets in Vault so users wouldn't have to share their personal GitLab tokens etc. This is no longer possible with workload identities (and was only possible because we built around Vault's inflexibility in the first place, see hashicorp/vault#16183 – OIDC mostly resolved this, though) because we cannot know all job names users come up with in advance. One way around this is to create a namespace for each user, but for now we will simply remove that feature.

Apart from those three things – which are arguably a little bit astray from Nomad's ideal use-cases (we have user-specified, arbitrary workloads vs. Nomad's preferred well-defined, user-unaware workloads) – so far the workload identities seem to make everything smoother and our pilot users love the ease of use to get access to CSIs etc., it's just a little rough for the non-default things we do :)

@tgross
Member

tgross commented Feb 11, 2025

Now my current understanding how it works with identity {}, vault {} and the server configuration vault > default_identity is (assuming the "default" names and only one specification):

Yeah that's all right. To be precise, Nomad always creates identity { name = "default" } (for use with the Task API), and if you have a vault block it also creates an implicit identity { name = "vault_default" }, and uses that WI to mint a Vault token from Vault.
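The rule described here can be written down as a small Python sketch. It is a deliberately simplified model, not Nomad's implementation: it covers only the implicit "default" and "vault_default" identities and the env-based exposure mentioned in this thread, and it ignores named identities, file = true, and any env or file settings in the server's default_identity. The function name and the shape of its return value are invented for illustration.

```python
from typing import Dict, List


def task_identities(has_vault_block: bool, identity_env: bool) -> Dict[str, List[str]]:
    """Model which identities exist and which env vars the task sees.

    has_vault_block: the task has a vault block in the jobspec.
    identity_env:    the task has identity { env = true }.
    """
    identities = ["default"]  # always created, for use with the Task API
    env_vars: List[str] = []

    if identity_env:
        # The default workload identity JWT is exposed to the task.
        env_vars.append("NOMAD_TOKEN")

    if has_vault_block:
        # Implicit identity that Nomad uses to obtain a Vault token.
        identities.append("vault_default")
        env_vars.append("VAULT_TOKEN")

    return {"identities": identities, "env": env_vars}


# First jobspec in this thread: identity { env = true }, no vault block.
print(task_identities(has_vault_block=False, identity_env=True))
# Corrected jobspec: vault {} plus identity { env = true }.
print(task_identities(has_vault_block=True, identity_env=True))
```

The first call models why the original test job only ever saw the default identity: without a vault block, no "vault_default" identity exists for the server's default_identity settings to apply to.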

Out of curiosity, is there a way to still change the aud on the nomad default identity then, other than overriding it via the identity block? I think at least showing aud = "our nomad cluster" would then also make it clear that this is a token for our nomad cluster, not for nomadproject.io.

There is no way to do that currently. Here's some of the rationale for how that works from an internal design document (NMD-174 for fellow Hashi folks):

However if the audience claim is set, Nomad will only accept the “nomadproject.io” audience. This ensures that if a workload identity with audience=”foo” is sent to a relying party, that relying party cannot use it to access Nomad’s API and impersonate the originating workload.

That is, each identity should have a single use, and the default identity's intended use is accessing the Nomad API. We could make the value configurable, I suppose, but there doesn't seem to be a strong use case of that and it introduces configuration management problems (servers could have different values configured and then not be able to validate WI's signed by the leader).

We offered access to user-specified secrets in Vault so users wouldn't have to share their personal GitLab tokens etc. This is no longer possible with workload identities (and was only possible because we built around Vault's inflexibility in the first place, see hashicorp/vault#16183 – OIDC mostly resolved this, though) because we cannot know all job names users come up with in advance. One way around this is to create a namespace for each user, but for now we will simply remove that feature.

Oof! Yeah, namespaces really are the smallest tenancy primitive in Nomad in terms of access to Vault. Even without workload identity, I suspect your users could read each other's secrets by grabbing them from each other's jobs via nomad alloc exec (unless you disable that), but that may be fine for your security model. But that sort of thing is part of why we moved away from job-author-supplied Vault/Consul tokens.

I'm going to close this issue as resolved. But if you've got more thoughts on this and how we can try to improve your intended workflow, we're definitely open to hearing them!

@tgross closed this as not planned Feb 11, 2025
@shoeffner
Author

Thanks again for your reply! I have a couple of thoughts on the aud below.

unless you disable [nomad alloc exec]

Yes, which is precisely what we do in "public" namespaces because of that.

it introduces configuration management problems

Isn't this already the case for other things, e.g., the gossip key? Maybe it's not a big deal as that would cause failures early at startup time, rather than unexpectedly at runtime.

This ensures that if a workload identity with audience=”foo” is sent to a relying party, that relying party cannot use it to access Nomad’s API and impersonate the originating workload.

I don't really understand this line of reasoning. While I get the idea that this should prevent abuse (and it is worded with "if ... with audience="foo""), it seems very easy to forget to explicitly specify the audience.

If I used identity { env = true } and then passed my NOMAD_TOKEN to some other "relying party", e.g., copied it to the command line, I could impersonate the workload just the same with aud = ["nomadproject.io"].

Proof of concept (with a job running echo $NOMAD_TOKEN; sleep 1000):

nomad agent -dev -acl-enabled > /dev/null &
NOMAD_PID=$!
sleep 2  # wait for nomad to start up
export NOMAD_TOKEN=$(nomad acl bootstrap -json | jq -r .SecretID)
sleep 1 # wait until keyring is ready
nomad var put nomad/jobs/token-test a=b
ALLOC=$(nomad job run token-test.hcl | grep Allocation | cut -d'"' -f 2)
sleep 5 # wait until the job runs and produces the output
export NOMAD_TOKEN=$(nomad alloc logs $ALLOC)
nomad var get nomad/jobs/token-test
kill "$NOMAD_PID"  # stop the dev agent by PID (pkill matches process names, not PIDs)

So another party which gets the token is able to use it (while the job was running) to retrieve a secret for that job, which is ... kind of by design for such tokens?

Wouldn't this [the argument in NMD-174] actually be an argument for configuring an expected audience which is only set automatically if no identity block is specified, but explicitly not set if an identity block is specified, precisely to prevent this kind of unintended credential forwarding?

For example, if my Nomad cluster only accepts tokens for the audience "my-nomad-cluster.io", it could silently use that to set up the job, write secrets from variables, etc., if a user does not request access to the identity token via file or env – the risk of it being distributed by the job is relatively low this way (unless I am missing something).
But as soon as a job requests access to an identity token, I would not include that audience by default, but a user would have to explicitly specify "my-nomad-cluster.io" as the audience (or at least something like identity { nomad_audience = true }, so that a user does not necessarily have to know my configured audience). Otherwise we never know what the user intends to do with the token. I guess the audience might even be conditionally required (required iff nomad_audience is false, which should be the default).
For example, GitLab's CI/CD requires an aud parameter for their id_tokens, and I think it makes sense after reading through your quote from NMD-174; again, I might miss a detail or two here.

But for us, this is not a big issue right now, my initial confusion is clear. Thanks again :)

@tgross
Member

tgross commented Feb 11, 2025

Isn't this already the case for other things, e.g., the gossip key? Maybe it's not a big deal as that would cause failures early at startup time, rather than unexpectedly at runtime.

Yes! And as you note, without the right gossip key the server won't join the cluster properly. For other configuration values that aren't so fatal, we still have some of them in config files but... honestly we kinda hate that. 😀 Because it makes it unfortunately easy for cluster admins to break things in subtle ways that they don't notice until ex. a leader election. Over time you'll find we're trying to make more cluster configuration something we distribute via Raft instead. (But there are always bits that'll be impossible to do that with.)

Wouldn't this [the argument in NMD-174] actually be an argument for configuring an expected audience which is only set automatically if no identity block is specified, but explicitly not set if an identity block is specified, precisely to prevent this kind of unintended credential forwarding?

I think the argument is that it prevents a token that's being passed intentionally to a third party for non-Nomad API uses from masquerading as the workload itself if the third party leaks it. A better example than your use case might be using WI as an authorizer for AWS IAM (ex. I want my workload to be able to login to AWS to upload stuff to S3). Here the expected audience is set by the third party and it'll never accept the default WI w/ aud=nomadproject.io that the workload can use for Nomad's API. That becomes a guardrail against handing out the default WI to third parties accidentally -- the JWT you use for AWS IAM won't allow AWS to access the Nomad API. So it really doesn't matter what the default aud value is, so long as it's not expected to be used by anything other than Nomad.
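The guardrail can be illustrated with a short Python sketch of the two audience checks involved. It is an illustration under stated assumptions rather than Nomad's or AWS's actual validation logic: signature and expiry checks are omitted, treating "the aud list contains nomadproject.io" as sufficient for Nomad is an assumption, and the "aws-example" audience and all function names are hypothetical.

```python
NOMAD_AUDIENCE = "nomadproject.io"


def _audiences(claims: dict) -> list:
    """Normalize the aud claim, which may be absent, a string, or a list."""
    aud = claims.get("aud")
    if aud is None:
        return []
    return [aud] if isinstance(aud, str) else list(aud)


def nomad_accepts(claims: dict) -> bool:
    """If an audience is set, only the Nomad audience is accepted.

    Assumption: a list that contains the Nomad audience counts as a match.
    """
    auds = _audiences(claims)
    return not auds or NOMAD_AUDIENCE in auds


def relying_party_accepts(claims: dict, expected_audience: str) -> bool:
    """A third party that insists on its own expected audience."""
    return expected_audience in _audiences(claims)


default_wi = {"aud": "nomadproject.io"}  # default workload identity
aws_wi = {"aud": ["aws-example"]}        # identity minted for the third party

print(nomad_accepts(default_wi), relying_party_accepts(default_wi, "aws-example"))
# True False
print(nomad_accepts(aws_wi), relying_party_accepts(aws_wi, "aws-example"))
# False True
```

Each token passes exactly one of the two checks, which is the single-use property described above. The guardrail only holds if the relying party really does validate the audience.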

@shoeffner
Author

Here the expected audience is set by the third party and it'll never accept the default WI w/ aud=nomadproject.io that the workload can use for Nomad's API.

And yet, this is where that token could still be leaked, as Nomad shifts the responsibility to reject the token to the third party; the simplest way being in logging some error message which accidentally contains "Invalid token: ". Oops.
Or your external party does not check the aud and just lets it through. (Because there's a wildcard, or at least the signature is fine, or whatnot.)

I think even for the intentional case, at least the token which gets exposed to Nomad should not include that audience unless explicitly requested, because you could forget to specify the audience.

So I am currently thinking about two "mostly unrelated" things:

The first, not important thing: making nomadproject.io a configurable option -- if only to distinguish, e.g., different Nomad clusters, or even ensuring that different nomad clusters which have bugs do not accept tokens from others, which is... mostly hypothetical.

The second is to not expose the default audience to the job unless explicitly requested, to prevent intentional or unintentional passing to a third party. If I want AWS to control Nomad for me, fine, then I can put the nomadproject.io audience in there -- but I will not accidentally send that token there, even though I intended to send some token there.
Again, I don't think it's super necessary, it's just what comes to my mind with the current behavior. Right now the burden not to pass a token with nomadproject.io in the audience is on the user. I think "inverting" it, making it a burden to pass a token with nomadproject.io explicitly, would be slightly more secure -- or at least less error-prone and less surprising behavior.

Still, thanks for your point of view, much appreciated that you took the time!

@tgross
Member

tgross commented Feb 11, 2025

The second is to not expose the default audience to the job unless explicitly requested,

That's already the case, unless you have identity { env = true } or identity { file = true }.
