Feat/nomad-jobspec canary promote releaser #2938
Conversation
…stroyer and Status.
…or this as well as TODO releaser changes.
…n. Update Generation in Nomad jobspec platform to create a new gen ID if the job has canaries, so it is release-able.
Since I started this draft, I worked through implementing the resource manager for the Nomad platform releaser. I also added the ability to supply configurations to the Nomad platform deployer. Since then, I pivoted to focus on implementing a separate releaser, still for Nomad canaries, but specifically for the nomad-jobspec platform.

In order to permit canaries to work, the Generation implementation had to be tweaked. If a Nomad jobspec being deployed is a canary deployment, it gets a UUID for its generation ID. If it is not, the job's generation ID will be the Nomad job's ID, which is what it was previously always set to. In order to allow a release to occur with a mutable platform deployer, the generation ID cannot be the same as the last deployment's. I have been wondering whether, to support canaries, a canary flag within the Generation type could be useful, but that would require some deeper core changes. I'm also looking into implementing the Generation interface for this releaser. A few open issues remain.
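To make the Generation tweak above concrete, here is a minimal sketch of the described logic. This is a hedged illustration, not the PR's actual code: the function names are hypothetical, and it assumes the nomad/api client types and HashiCorp's go-uuid package.

```go
package jobspec

import (
	"github.com/hashicorp/go-uuid"
	"github.com/hashicorp/nomad/api"
)

// jobHasCanaries reports whether any task group in the job configures
// canaries in its update stanza.
func jobHasCanaries(job *api.Job) bool {
	for _, tg := range job.TaskGroups {
		if tg.Update != nil && tg.Update.Canary != nil && *tg.Update.Canary > 0 {
			return true
		}
	}
	return false
}

// generationID returns a fresh UUID for canary deployments, so the new
// deployment's generation differs from the last one and is release-able.
// Non-canary jobs keep the old behavior: generation ID == Nomad job ID.
func generationID(job *api.Job) (string, error) {
	if jobHasCanaries(job) {
		return uuid.GenerateUUID()
	}
	return *job.ID, nil
}
```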
Regarding the other open topics from my first comment, I still intend to determine a URL based on a Consul service in the job (specifics TBD), and to add support for failing or reverting the canary deployment to the releaser. The happy path at this point is to have a jobspec with canaries configured and run a release.
At this point, I'm going to leave scaling and revert functionality out of this PR. Promoting or failing a Nomad deployment are the operations most relevant to canaries. I considered adding revert support more so than scaling, but to fully support it, the job revert Nomad API requires supplying a Consul token (though that is not yet implemented on the Nomad API side) and a Vault token, which would add greater complexity to the release operation being performed by Waypoint. Failing a canary deployment is supported with the fail_deployment option; a sketch of the underlying API calls follows after the example. Example usage:

```hcl
release {
use "nomad-jobspec-canary" {
groups = [
"test" //only promotes the job taskgroup "test", any others will be ignored (even if they have canaries)
]
fail_deployment = false //if true, fails the active canary deployment
}
}
```

To hopefully wrap this up, I'm going to now focus on the release URL, deployment pruning, and canary-promotion-before-healthy issues mentioned in my previous two comments. I foresee the Consul URL as the release URL presenting more issues, because if the app is using dynamic ports, then a URL with the Consul service name and a port appended to the end wouldn't be possible. In the meantime I may lean on the idea of selecting a random alloc's IP/port combo from the deployed job, until the path to integration with Consul here (or with a load balancer that integrates with Consul, like Traefik or Fabio) becomes clear.
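For reference, here is a hedged sketch of the Nomad API calls the promote/fail release operation described above boils down to. The promoteOrFail helper is illustrative rather than the PR's actual code, but Jobs().LatestDeployment, Deployments().PromoteGroups/PromoteAll, and Deployments().Fail are real nomad/api methods:

```go
package jobspec

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

// promoteOrFail acts on the job's most recent deployment: fail it,
// promote only the configured task groups, or promote everything.
func promoteOrFail(client *api.Client, jobID string, groups []string, failDeployment bool) error {
	dep, _, err := client.Jobs().LatestDeployment(jobID, nil)
	if err != nil {
		return err
	}
	if dep == nil {
		return fmt.Errorf("no deployment found for job %q", jobID)
	}

	if failDeployment {
		// Mirrors fail_deployment = true: fail the active canary deployment.
		_, _, err = client.Deployments().Fail(dep.ID, nil)
		return err
	}
	if len(groups) > 0 {
		// Only the listed task groups are promoted; other groups with
		// canaries are left alone, as described above.
		_, _, err = client.Deployments().PromoteGroups(dep.ID, groups, nil)
		return err
	}
	_, _, err = client.Deployments().PromoteAll(dep.ID, nil)
	return err
}
```

(For comparison, the revert complexity mentioned above comes from nomad/api's Jobs().Revert, which takes consulToken and vaultToken parameters in addition to the target job version.)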
Thanks for all the details and keeping us in the loop @paladin-devops - let us know when you're ready for a team review! For now, we'll give you the space to continue to work on this PR while it's still in draft form :) but wanted to say feel free to ping us when you feel like this is ready.
Thank you for all the help with the Nomad details @tgross, and for taking the time to review this while squashing the CSI bugs over there! 😄 And thanks for the first pass through, @briancain! 😄 I had only one question about your feedback, but I think I've got the rest of it covered!
Looks great @paladin-devops! You might need to regenerate the website MDX; otherwise you can ignore the Vercel failure right now. I will re-request review for myself so I can give this a proper test 🙏🏻
OK! I was able to give this a test run. I haven't gotten a successful Canary promotion yet, but I got through enough of it to give a better review of the pull request.
And major props for adding Status to the jobspec plugin; this is quite a large pull request with a lot of major features, so kudos! ✨
…ditional processing.
Yoo, thanks for all the updates @paladin-devops - I still can't get my Canary to successfully work, but loving all the updates. I found a bug in the failure scenario for releases, otherwise this looks great to me.
I've asked some others on the team to help test this PR too and see if they can get it to work. If you have any other advice for getting the Canary to succeed, I'm all ears 😄
builtin/nomad/jobspec/releaser.go
```go
rm := r.resourceManager(log, dcr)
if err := rm.CreateAll(
	ctx, log, st, &result, target,
```
This is a pretty subtle error, but we need to include a StepGroup parameter here for CreateAll. The reason is that when a Release encounters a failure, Waypoint automatically "rolls back" to a previous version. We do this through the plugin SDK, and it ends up calling DestroyAll directly rather than the top-level Destroy here. DestroyAll expects a StepGroup, and without it, we get a bad error:
```
! 2 errors occurred:
	* rpc error: code = Aborted desc = Context cancelled from timeout checking health of task group "app": context deadline exceeded
	* Error during rollback: 1 error occurred:
	* argument cannot be satisfied: type: terminal.StepGroup. This is a bug in the go-argmapper library since this shouldn't happen at this point.
```
This is basically saying that when our internal SDK attempted to call DestroyAll during a Release failure, the arguments it was given could not satisfy any defined Destroy functions that expected a StepGroup.
The fix is pretty simple:
```diff
diff --git a/builtin/nomad/jobspec/releaser.go b/builtin/nomad/jobspec/releaser.go
index 237218fe8..7cb514b10 100644
--- a/builtin/nomad/jobspec/releaser.go
+++ b/builtin/nomad/jobspec/releaser.go
@@ -319,11 +319,13 @@ func (r *Releaser) Release(
 	// TODO: Replace ui.Status with StepGroups once this bug
 	// has been fixed: https://github.com/hashicorp/waypoint/issues/1536
 	st := ui.Status()
+	sg := ui.StepGroup()
 	defer st.Close()
+	defer sg.Wait()

 	rm := r.resourceManager(log, dcr)
 	if err := rm.CreateAll(
-		ctx, log, st, &result, target,
+		ctx, log, st, sg, &result, target,
 	); err != nil {
 		return nil, err
 	}
```
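For context on why the rollback path cares: resource lifecycle functions are resolved by go-argmapper, which injects each parameter by type from the arguments supplied to CreateAll/DestroyAll. A destroy function shaped roughly like the following sketch (the function and state type are hypothetical, not the PR's actual code) is unsatisfiable unless a terminal.StepGroup is in that argument set:

```go
package jobspec

import (
	"context"

	"github.com/hashicorp/waypoint-plugin-sdk/terminal"
)

// releaseState is a hypothetical stand-in for the releaser's stored
// resource state (the real plugin uses generated protobuf types).
type releaseState struct {
	DeploymentID string
}

// resourceReleaseDestroy is illustrative: because go-argmapper matches
// parameters by type, omitting terminal.StepGroup from the arguments
// passed to DestroyAll makes this function unsatisfiable, producing the
// "argument cannot be satisfied" error shown above.
func resourceReleaseDestroy(
	ctx context.Context,
	sg terminal.StepGroup,
	state *releaseState,
) error {
	step := sg.Add("Rolling back Nomad canary release...")
	defer step.Abort()

	// ... Nomad API calls to undo the release would go here ...

	step.Done()
	return nil
}
```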
Thanks Brian! Just pushed the update, good catch!
I'm pretty good with the code changes here 👍🏻 I'll give the team a few days to see if others can get it to work on their systems. I assume a Canary release works on yours, @paladin-devops? Is there something specific from the jobspec that I might need? Here's what I used:

```hcl
// Nomad jobspec file app.nomad.tpl
job "web" {
datacenters = ["dc1"]
group "app" {
update {
max_parallel = 1
canary = 1
auto_revert = true
auto_promote = false
health_check = "task_states"
}
task "app" {
driver = "docker"
config {
image = "${artifact.image}:${artifact.tag}"
// For local Nomad, you prob don't need this on a real deploy
network_mode = "host"
}
env {
%{ for k,v in entrypoint.env ~}
${k} = "${v}"
%{ endfor ~}
// For URL service
PORT = "3000"
}
}
}
}
```

```hcl
// waypoint.hcl
project = "example-nodejs"
config {
env = {
WP_TEST = "test-string"
}
}
app "example-nodejs" {
build {
use "pack" {}
registry {
use "docker" {
image = "nodejs-example"
tag = "1"
local = true
}
}
}
deploy {
use "nomad-jobspec" {
// Templated to perhaps bring in the artifact from a previous
// build/registry, entrypoint env vars, etc.
jobspec = templatefile("${path.app}/app.nomad.tpl")
}
}
release {
use "nomad-jobspec-canary" {
groups = [
"app"
]
}
}
}
```

edit: Also, sorry for all of the extra debug work here to get your PR over the finish line. If you do not have time for this at the moment, I understand.
@briancain I used your config, and the app failed to start; the Nomad alloc logs showed a port conflict.
So what I'm thinking here is that possibly network_mode = "host" is causing this port conflict. I removed that from my template, and it started up! I'd recommend removing that and then trying it out. Happy result confirmed in the Nomad alloc logs.
No problem! I know all the enhancements will help my fellow Nomads 😄
@paladin-devops - YESSS, that was totally it! Frustrating that I could not find that stack trace from the app in the Nomad alloc logs! Thank you, it works now 🎉
Thanks again for this! Looks great; I appreciate all of the time you spent working through feedback and review from us.
LGTM 👍
Let's merge this thing! 🎉
This PR introduces a releaser for Nomad which promotes a canary deployment. Supporting canaries in releases for Nomad brings canary/blue-green deployment strategies into Waypoint's lifecycle (#2817).
This is still very much a work in progress. Although I've successfully promoted a Nomad canary deployment with this plugin, I still intend to fully implement the resource manager for the releaser and the destroyer (though this will likely be redundant with the Nomad platform destroyer), and there are other configuration options/changes I would like to include.
This is my first attempt at a plugin beyond the tutorial, and I look forward to any feedback while I continue with this draft!