disconnected clients: Reconnecting allocation can cause a replacement that is immediately stopped #12680

Closed
DerekStrickland opened this issue Apr 19, 2022 · 1 comment · Fixed by #15068


Nomad version

Nomad 1.3.0-beta1

Operating system and Environment details

Nomad Vagrantfile
1 Server
2 Clients

Issue

Creating a job that is constrained to a single node can result in a second allocation for that client being scheduled, started, and then immediately stopped when the disconnected client reconnects.

Reproduction steps

  • nomad run the job file below, which constrains placement by node.unique.name
  • In a new session, start a watch on the job: watch nomad status <jobname>
  • Wait for the deployment to finish; the watch should show a single running allocation
  • Simulate a network outage on the targeted client (e.g. sudo iptables -I INPUT -s 192.168.56.11 -j DROP); a scripted version of these partition steps appears after the allocation listing below
  • Wait for the allocation to transition to the unknown status
  • Simulate a network reconnect for the targeted client (e.g. sudo iptables -D INPUT -s 192.168.56.11 -j DROP)
  • When the allocation transitions back to running, notice that a second allocation was started and immediately stopped, similar to the listing below
  • Notice the original allocation did not stop
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
2f8d72f2  94027db7  cache       0        stop     complete  2s ago     1s ago
043e25bd  94027db7  cache       0        run      running   1m44s ago  2s ago
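
The partition steps above can be scripted; a minimal sketch, meant to run on the targeted client (nomad-client02) while the watch on the job runs in a separate session, assuming 192.168.56.11 is the server address from the steps above:

#!/usr/bin/env bash
# Run on the targeted client while "watch nomad status <jobname>"
# runs elsewhere. 192.168.56.11 is the server address from the
# reproduction steps; the 90-second delay is an assumption, long
# enough for the server to miss heartbeats and mark the allocation
# "unknown".
set -euo pipefail

# Cut traffic from the server to simulate a network outage
sudo iptables -I INPUT -s 192.168.56.11 -j DROP

# Wait for the allocation to transition to "unknown"
sleep 90

# Restore connectivity; on reconnect, the watch shows a replacement
# allocation start and then immediately stop
sudo iptables -D INPUT -s 192.168.56.11 -j DROP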

Expected Result

The original allocation reconnects and no additional allocations are scheduled.

Actual Result

An extra allocation was scheduled and then immediately stopped.

Job file (if appropriate)

job "no-replace" {
  datacenters = ["dc1"]

  group "cache" {
    count = 1

    # Keep allocations in the "unknown" state for up to 2 minutes after
    # the client disconnects, instead of replacing them immediately
    max_client_disconnect = "2m"

    constraint {
      attribute = "${node.unique.name}"
      value     = "nomad-client02"
      operator  = "="
    }

    network {
      port "db" {
        to = 6379
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"

        ports = ["db"]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
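
As a sanity check before submitting the job, the node name used by the constraint can be confirmed with the standard CLI:

nomad node status

The Name column should include nomad-client02 (the name is an assumption from the Vagrant setup above); if the environment uses different names, adjust the constraint value accordingly.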