This repository was archived by the owner on Jan 8, 2024. It is now read-only.

Feat/nomad-jobspec canary promote releaser #2938

Merged: 57 commits merged into hashicorp:main from feat/nomad-promote-releaser on Mar 2, 2022

Conversation

@paladin-devops (Contributor) commented on Jan 25, 2022

This PR introduces a releaser for Nomad, which promotes a canary deployment. Supporting canaries in Nomad releases brings canary/blue-green deployment strategies into Waypoint's lifecycle (#2817).

This is still very much a work in progress. Although I've successfully promoted a Nomad canary deployment with this plugin, I still intend to fully implement the release manager for the releaser, as well as the destroyer (though this will likely be redundant with the Nomad platform destroyer), and there are other configuration options/changes I would like to include:

  • Specify particular Nomad task groups to promote (by default this promotes all)
  • A Consul URL, or IP/port combo if Consul isn't used
  • Add canary option to the Nomad platform plugin
  • Support generation ID changes in releaser with nomad-jobspec platform deployment
  • Evaluate if scaling should be supported
  • Evaluate if failing a canary deployment should be supported
  • Evaluate if reverting to a previous deployment should be supported

This is my first attempt at a plugin beyond the tutorial, and I look forward to any feedback while I continue with this draft!

@hashicorp-cla commented on Jan 25, 2022

CLA assistant check
All committers have signed the CLA.

@briancain requested a review from a team on January 26, 2022 16:37
@paladin-devops (Contributor, Author) commented on Jan 31, 2022

Since I started this draft, I worked through implementing the resource manager for the Nomad platform releaser. I also added the ability to supply configuration to the Nomad platform deployer for a Nomad job's update stanza (now commented out). I removed that change because only afterwards did it occur to me that updating an existing deployment conflicts with Waypoint's opinionated workflow for Nomad, which, in part, is to create a new job upon deployment, not update an existing one. I may therefore revert all of the changes made to the Nomad platform releaser, unless the opinionated workflow for Nomad changes in the direction of supporting a canary workflow using Nomad's built-in features.

Since then, I pivoted to implementing a separate releaser, still for Nomad canaries, but specifically for the nomad-jobspec platform deployer; much of the work I did for the platform releaser was portable. The implementation differs here in that nomad-jobspec is a plugin for mutable deployments and implements the Generation interface. While doing this I also implemented the Resource Manager for nomad-jobspec:
[screenshot]

To permit canaries to work, the Generation implementation had to be tweaked. If the Nomad jobspec being deployed is a canary deployment, it gets a UUID for its generation ID; if it is not, the generation ID is the Nomad job's ID, which is what it was previously always set to. To allow a release to occur with a mutable platform deployer, the generation ID cannot be the same as that of the last deployment. I have been wondering whether a canary flag within the Generation type could be useful for supporting canaries, but that would require deeper core changes. I'm also looking into implementing the Generation interface for this releaser.
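
Roughly, the generation logic looks like this (an illustrative Go sketch only, not the exact code in this PR; the function name and the use of hashicorp/go-uuid are assumptions):

package main

import (
	"fmt"

	uuid "github.com/hashicorp/go-uuid"
)

// generationID sketches the rule described above: a canary deployment gets a
// fresh UUID so the release is treated as a new generation, while a
// non-canary deployment keeps the Nomad job ID, as the plugin always did
// before this change.
func generationID(nomadJobID string, isCanary bool) (string, error) {
	if isCanary {
		return uuid.GenerateUUID()
	}
	return nomadJobID, nil
}

func main() {
	id, err := generationID("web", true)
	if err != nil {
		panic(err)
	}
	fmt.Println("canary generation ID:", id)
}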

Open issues:

  • waypoint deploy has the -release flag set to true by default; however, this almost always kicks off a release before the canaries become healthy, which is a prerequisite to promotion in Nomad. This must somehow be overridden (if possible), or a delay before initiating the release should be considered, possibly by checking the healthy deadline of the update (see the sketch after this list).
  • waypoint release prunes deployments by default. Even though the Generation changes above allow the release to occur, the release operation itself will, by default, delete the job from which it just promoted the canaries, because it prunes all but one deployment. By the end of the release operation, the status retrieval that follows hits an EOF, since the job it would read a status from no longer exists.
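
As a rough sketch of the health-check idea from the first item, Nomad's API exposes per-task-group deployment state, so a check like the one below could gate the release until the canaries are healthy (illustrative only; the helper name is made up and this is not the code in this PR):

package main

import (
	"fmt"
	"log"

	nomad "github.com/hashicorp/nomad/api"
)

// canariesHealthy reports whether every task group in the job's latest
// deployment has all of its requested canaries healthy.
// Hypothetical helper for illustration; not the code in this PR.
func canariesHealthy(client *nomad.Client, jobID string) (bool, error) {
	dep, _, err := client.Jobs().LatestDeployment(jobID, nil)
	if err != nil {
		return false, err
	}
	if dep == nil {
		return false, fmt.Errorf("job %q has no deployments", jobID)
	}
	for group, state := range dep.TaskGroups {
		if state.DesiredCanaries == 0 {
			continue // no canaries configured for this group
		}
		if state.HealthyAllocs < state.DesiredCanaries {
			log.Printf("group %q: %d/%d canaries healthy", group, state.HealthyAllocs, state.DesiredCanaries)
			return false, nil
		}
	}
	return true, nil
}

func main() {
	client, err := nomad.NewClient(nomad.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	healthy, err := canariesHealthy(client, "web")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("canaries healthy:", healthy)
}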

Regarding the other open topics from my first comment, I still intend to determine a URL based on a Consul service in the job (specifics TBD), and add support for failing or reverting the canary deployment to the releaser.

The happy path at this point is to have a jobspec with canaries configured, run waypoint deploy -release=false (run it a second time if it's your first time deploying the job; the update stanza does nothing on the very first deployment of a job), and then run waypoint release -prune=false.

CLI:
[screenshot]

UI:
[screenshot]

@paladin-devops changed the title from "(WIP) Feat/nomad canary promote releaser" to "(WIP) Feat/nomad-jobspec canary promote releaser" on Feb 1, 2022
@paladin-devops (Contributor, Author) commented:

At this point, I'm going to leave scaling and revert functionality out of this PR. Promoting or failing a Nomad deployment are the operations most relevant to canaries. I considered adding revert support more so than scaling, but fully supporting Nomad's job revert API requires supplying a Consul token (though that part isn't yet implemented on the Nomad API side) and a Vault token, which would add greater complexity to the release operation performed by Waypoint.

Failing a canary deployment is supported with the fail_deployment option, and promoting a specific group is supported with the groups config option.

Example usage:

  release {
    use "nomad-jobspec-canary" {
      groups = [
        "test" // only promotes the task group "test"; any others are ignored (even if they have canaries)
      ]
      fail_deployment = false // if true, fails the active canary deployment
    }
  }
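
Under the hood these options map onto Nomad's deployment API; here is a rough Go sketch of the calls involved (illustrative only; the helper name is made up and this is not the plugin's actual code):

package main

import (
	"fmt"
	"log"

	nomad "github.com/hashicorp/nomad/api"
)

// releaseCanary sketches how groups and fail_deployment could drive Nomad's
// deployment API. Hypothetical helper for illustration; not the plugin code.
func releaseCanary(client *nomad.Client, jobID string, groups []string, failDeployment bool) error {
	dep, _, err := client.Jobs().LatestDeployment(jobID, nil)
	if err != nil {
		return err
	}
	if dep == nil {
		return fmt.Errorf("job %q has no deployments", jobID)
	}

	switch {
	case failDeployment:
		// fail_deployment = true: mark the active deployment as failed.
		_, _, err = client.Deployments().Fail(dep.ID, nil)
	case len(groups) > 0:
		// Promote canaries only for the listed task groups.
		_, _, err = client.Deployments().PromoteGroups(dep.ID, groups, nil)
	default:
		// Default behavior: promote canaries in every task group.
		_, _, err = client.Deployments().PromoteAll(dep.ID, nil)
	}
	return err
}

func main() {
	client, err := nomad.NewClient(nomad.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	if err := releaseCanary(client, "web", []string{"test"}, false); err != nil {
		log.Fatal(err)
	}
}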

To hopefully wrap this up, I'm now going to focus on the release URL, deployment pruning, and canary-promotion-before-healthy issues mentioned in my previous two comments. I foresee the Consul URL as the release URL presenting more issues, because if the app uses dynamic ports, a URL built from the Consul service name with a port appended wouldn't work. In the meantime, I may lean on the idea of selecting a random alloc's IP/port combo from the deployed job until the path to integrating with Consul here (or with a load balancer that integrates with Consul, like Traefik or Fabio) becomes clear.
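
A rough sketch of that interim idea, using the Nomad Go API to pick a running allocation and build a URL from its IP and first dynamic port (illustrative only; the AllocatedResources network fields are an assumption and depend on how the job declares its ports, and this is not the code in this PR):

package main

import (
	"fmt"
	"log"
	"math/rand"

	nomad "github.com/hashicorp/nomad/api"
)

// randomAllocAddr picks one running allocation of the job and returns an
// address built from its host IP and first dynamic port. Hypothetical helper
// for illustration; field names may differ across Nomad versions.
func randomAllocAddr(client *nomad.Client, jobID string) (string, error) {
	stubs, _, err := client.Jobs().Allocations(jobID, false, nil)
	if err != nil {
		return "", err
	}
	var running []*nomad.AllocationListStub
	for _, s := range stubs {
		if s.ClientStatus == "running" {
			running = append(running, s)
		}
	}
	if len(running) == 0 {
		return "", fmt.Errorf("no running allocations for job %q", jobID)
	}

	stub := running[rand.Intn(len(running))]
	alloc, _, err := client.Allocations().Info(stub.ID, nil)
	if err != nil {
		return "", err
	}
	for _, network := range alloc.AllocatedResources.Shared.Networks {
		for _, port := range network.DynamicPorts {
			return fmt.Sprintf("http://%s:%d", network.IP, port.Value), nil
		}
	}
	return "", fmt.Errorf("allocation %q exposes no dynamic ports", stub.ID)
}

func main() {
	client, err := nomad.NewClient(nomad.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	addr, err := randomAllocAddr(client, "web")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("release URL:", addr)
}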

@briancain (Member) commented:

Thanks for all the details and keeping us in the loop @paladin-devops - Let us know when you're ready for a team review! For now, we'll give you the space to continue to work on this PR while it's still in draft form :) but wanted to say feel free to ping us when you are feeling like this is ready.

@paladin-devops (Contributor, Author) commented:

Thank you for all the help with the Nomad details @tgross, and for taking time to review this while squashing the CSI bugs over there! 😄

& thanks for the 1st pass through @briancain! 😄 I had only one question about your feedback, but I think I got the rest of it covered!

@briancain (Member) commented:

Looks great @paladin-devops ! You might need to regenerate the website mdx, otherwise you can ignore the Vercel failure right now! I will re-request review for myself so I can give this a proper test 🙏🏻

@briancain self-requested a review on February 23, 2022 22:53
@briancain (Member) left a review comment:

OK! I was able to give this a test run. I haven't got a successful Canary promotion yet, but I got through enough of it to give a better review of the pull request.

And major props for adding status to jobspec; this is quite a large pull request with a lot of major features, so kudos! ✨

@briancain (Member) left a review comment:

Yoo, thanks for all the updates @paladin-devops - I still can't get my Canary to successfully work, but loving all the updates. I found a bug in the failure scenario for releases, otherwise this looks great to me.

I've asked some others on the team to help test this PR too and see if they can get it to work. If you have any other advice for getting the Canary successful I'm all ears 😄


rm := r.resourceManager(log, dcr)
if err := rm.CreateAll(
ctx, log, st, &result, target,
@briancain (Member) commented on this diff:

This is a pretty subtle error, but we need to include a StepGroup parameter here for CreateAll. The reason is that when a Release encounters a failure, Waypoint automatically "rolls back" to a previous version. We do this through the plugin SDK, and end up calling DestroyAll directly rather than the top-level Destroy here. DestroyAll expects a StepGroup, and without it, we get a bad error:

! 2 errors occurred:
        * rpc error: code = Aborted desc = Context cancelled from timeout checking
  health of task group "app": context deadline exceeded
        * Error during rollback: 1 error occurred:
        * argument cannot be satisfied: type: terminal.StepGroup. This is a bug in the
  go-argmapper library since this shouldn't happen at this point.

This is basically saying that when our internal SDK attempted to call DestroyAll during a Release failure, the arguments it was given could not satisfy any defined Destroy functions that expected a StepGroup.

The fix is pretty simple:

diff --git a/builtin/nomad/jobspec/releaser.go b/builtin/nomad/jobspec/releaser.go
index 237218fe8..7cb514b10 100644
--- a/builtin/nomad/jobspec/releaser.go
+++ b/builtin/nomad/jobspec/releaser.go
@@ -319,11 +319,12 @@ func (r *Releaser) Release(
        // TODO: Replace ui.Status with StepGroups once this bug
        // has been fixed: https://github.com/hashicorp/waypoint/issues/1536
        st := ui.Status()
+       sg := ui.StepGroup()
        defer st.Close()
+       defer sg.Wait()

        rm := r.resourceManager(log, dcr)
        if err := rm.CreateAll(
-               ctx, log, st, &result, target,
+               ctx, log, st, sg, &result, target,
        ); err != nil {
                return nil, err
        }

@paladin-devops (Contributor, Author) replied:

Thanks Brian! Just pushed the update, good catch!

@briancain (Member) commented on Feb 28, 2022

I'm pretty good with the code changes here 👍🏻 I'll give the team a few days to see if others can get it to work on their systems.

I assume a Canary release works on yours @paladin-devops ? Is there something specific from the jobspec that I might need? I really only used the update stanza:

// Nomad jobspec file app.nomad.tpl
job "web" {
  datacenters = ["dc1"]
  group "app" {
    update {
      max_parallel = 1
      canary       = 1
      auto_revert  = true
      auto_promote = false
      health_check = "task_states"
    }

    task "app" {
      driver = "docker"
      config {
        image = "${artifact.image}:${artifact.tag}"

        // For local Nomad, you prob don't need this on a real deploy
        network_mode = "host"
      }

      env {
        %{ for k,v in entrypoint.env ~}
        ${k} = "${v}"
        %{ endfor ~}

        // For URL service
        PORT = "3000"
      }
    }
  }
}
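
// waypoint.hcl (separate file from the jobspec template above)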
project = "example-nodejs"

config {
  env = {
    WP_TEST = "test-string"
  }
}

app "example-nodejs" {
  build {
    use "pack" {}
    registry {
      use "docker" {
        image = "nodejs-example"
        tag   = "1"
        local = true
      }
    }
  }

  deploy {
    use "nomad-jobspec" {
      // Templated to perhaps bring in the artifact from a previous
      // build/registry, entrypoint env vars, etc.
      jobspec = templatefile("${path.app}/app.nomad.tpl")
    }
  }

  release {
    use "nomad-jobspec-canary" {
      groups = [
        "app"
      ]
    }
  }
}

edit: Also, sorry for all of the extra debug work here to get your PR over the finish line. If you do not have time for this at the moment I understand.

@paladin-devops (Contributor, Author) commented:

I assume a Canary release works on yours @paladin-devops ? Is there something specific from the jobspec that I might need?

@briancain I used your waypoint.hcl file and Nomad jobspec template file on my system (Linux amd64, Debian based), along with the code from the waypoint-examples repo (I assumed that's what your pack build was using), and I was also unable to deploy it! The Docker containers were unhealthy. Here's what I ran into:

From Nomad alloc logs:

events.js:291
      throw er; // Unhandled 'error' event
      ^

Error: listen EADDRINUSE: address already in use :::3000
    at Server.setupListenHandle [as _listen2] (net.js:1316:16)
    at listenInCluster (net.js:1364:12)
    at Server.listen (net.js:1450:7)
    at Function.listen (/workspace/node_modules/express/lib/application.js:618:24)
    at Object.<anonymous> (/workspace/index.js:10:4)
    at Module._compile (internal/modules/cjs/loader.js:999:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1027:10)
    at Module.load (internal/modules/cjs/loader.js:863:32)
    at Function.Module._load (internal/modules/cjs/loader.js:708:14)
    at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:60:12)
Emitted 'error' event on Server instance at:
    at emitErrorNT (net.js:1343:8)
    at processTicksAndRejections (internal/process/task_queues.js:84:21) {
  code: 'EADDRINUSE',
  errno: 'EADDRINUSE',
  syscall: 'listen',
  address: '::',
  port: 3000
}

So what I'm thinking here is that network_mode = "host" is causing this port conflict: with host networking, the canary allocation tries to bind port 3000 on the host while the original allocation is still listening on it. I removed that from my template, and it started up! I'd recommend removing it and then trying it out.

Happy result (from Nomad alloc logs):

> [email protected] start /workspace
> node index.js

Listening on 3000

edit: Also, sorry for all of the extra debug work here to get your PR over the finish line. If you do not have time for this at the moment I understand.

No problem! I know all the enhancements will help my fellow Nomads 😄

@briancain (Member) commented:

@paladin-devops - YESSS, that was totally it! Frustrating I could not find that stack trace from the app in the Nomad alloc logs! Thank you, it works now 🎉

@briancain (Member) left a review comment:

Thanks again for this! Looks great, appreciate all of the time spent on feedback and review from us.

@briancain requested a review from tgross on March 2, 2022 18:36
@tgross (Member) left a review comment:

LGTM 👍

@briancain (Member) commented:

Let's merge this thing! 🎉

@briancain merged commit 16d072d into hashicorp:main on Mar 2, 2022
@paladin-devops deleted the feat/nomad-promote-releaser branch on March 2, 2022 22:54