
UI does not support ability to Start/Restart failed Allocation and tasks #9881

Closed
scyd-cb opened this issue Jan 25, 2021 · 3 comments

@scyd-cb

scyd-cb commented Jan 25, 2021

Nomad version

Output from nomad version
Nomad v1.0.1

Operating system and Environment details

CentOS 8

Issue

While an allocation and its tasks are in the running state, there is a way to stop/restart the allocation or restart a task from the UI. However, when an allocation is in the failed state (max restart attempts have been reached), there is no way to start/restart the failed allocation or its tasks from the UI. The only workarounds are to mark the Nomad client as ineligible and toggle it back to eligible to restart the allocation, or to stop/start the whole job.

Reproduction steps

Run a job with a failing task and let the allocation fail after the maximum number of restart attempts has been reached. The allocation will be in the failed state, with no start/restart options in the UI.

Our intention is to use Nomad to manage our services with the raw_exec driver, as a replacement for systemd.

Job file (if appropriate)

Here is an example of the job file we are using, with some generic names:
```hcl
job "myJob" {
  datacenters = ["dc1"]
  type        = "system"

  group "myGroup" {
    constraint {
      attribute = "${meta.nodeId}"
      value     = "node1"
    }

    restart {
      attempts = 3
      delay    = "15s"
      interval = "24h"
      mode     = "fail"
    }

    task "service1" {
      user   = "user1"
      driver = "raw_exec"

      config {
        command = "/bin/service1"
      }
    }

    task "service2" {
      user   = "user1"
      driver = "raw_exec"

      config {
        command = "/bin/service2"
      }
    }
  }
}
```

Screenshots of a running and a failed allocation (nomad_success, nomad_failed).

Screenshots of a running and a failed task (nomad_task_success, nomad_task_failed).

@DingoEatingFuzz
Contributor

DingoEatingFuzz commented Jan 27, 2021

I am definitely seeing what you're seeing, but I think this is by design. If you attempt this same workflow from the CLI, you'll get the error message `Unexpected response code: 500 (Task not running)`.

I suspect the original design is that if a task isn't running, it shouldn't be resurrected like this. Once a task is terminal, it is always terminal, and the scheduler is free to use those resources elsewhere. Side-stepping this axiom has implications for rescheduling behavior, preemption behavior, and scheduling in general.

To avoid this, you have two options:

  1. Set reschedule rules which will tell Nomad to attempt running an alloc on a different node after restart attempts are exhausted.
  2. Stop/start the whole job from the UI which will go through the full scheduler process of finding capacity in your cluster and creating new allocations.
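For option 1, a group-level `reschedule` block might look like the sketch below. The values are illustrative, not a recommendation, and note that rescheduling does not apply to `system` jobs (which run on every eligible node), so the job above would also need to become a `service` job for this to take effect:

```hcl
group "myGroup" {
  # Illustrative values only; tune attempts/interval/delay for your environment.
  reschedule {
    attempts       = 5
    interval       = "1h"
    delay          = "30s"
    delay_function = "exponential"
    max_delay      = "10m"
    unlimited      = false
  }
}
```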

This isn't exactly my area of expertise, so I want to verify that this is indeed the intended design before closing.

@DingoEatingFuzz
Contributor

Alright, after chatting with @cgbaker, this is indeed not a supported use case. I hope these two alternative options help you out. If it still feels like something is missing, please feel free to describe the workflow you're looking for here.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 24, 2022