disconnected clients: Reconnecting allocation can cause a replacement that is immediately stopped #12680

Closed
DerekStrickland opened this issue Apr 19, 2022 · 1 comment · Fixed by #15068


Nomad version

Nomad 1.3.0-beta1

Operating system and Environment details

Nomad Vagrantfile
1 Server
2 Clients

Issue

Creating a job that is constrained to a single node can result in a second allocation for that client being scheduled, started, and then immediately stopped when the disconnected client reconnects.

Reproduction steps

  • nomad run the job file below, which constrains placement by node.unique.name
  • In a new session, start a watch on the job: watch nomad status <jobname>
  • Wait for the deployment to finish; the watch should show a single running allocation
  • Simulate a network outage on the targeted client (e.g. sudo iptables -I INPUT -s 192.168.56.11 -j DROP); a scripted version of these partition steps appears after the allocation listing below
  • Wait for the allocation to transition to the unknown status
  • Simulate a network reconnect for the targeted client (e.g. sudo iptables -D INPUT -s 192.168.56.11 -j DROP)
  • When the allocation transitions back to running, notice that a second allocation was started and immediately stopped, similar to the listing below
  • Notice the original allocation did not stop
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
2f8d72f2  94027db7  cache       0        stop     complete  2s ago     1s ago
043e25bd  94027db7  cache       0        run      running   1m44s ago  2s ago
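
The partition steps above can be scripted; a minimal sketch, meant to run on the targeted client (nomad-client02) while the watch on the job runs in a separate session, assuming 192.168.56.11 is the server address from the steps above:

#!/usr/bin/env bash
# Run on the targeted client while "watch nomad status <jobname>"
# runs elsewhere. 192.168.56.11 is the server address from the
# reproduction steps; the 90-second delay is an assumption, long
# enough for the server to miss heartbeats and mark the allocation
# "unknown".
set -euo pipefail

# Cut traffic from the server to simulate a network outage
sudo iptables -I INPUT -s 192.168.56.11 -j DROP

# Wait for the allocation to transition to "unknown"
sleep 90

# Restore connectivity; on reconnect, the watch shows a replacement
# allocation start and then immediately stop
sudo iptables -D INPUT -s 192.168.56.11 -j DROP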

Expected Result

The original allocation reconnects and no additional allocations are scheduled.

Actual Result

An extra allocation was scheduled and then immediately stopped.

Job file (if appropriate)

job "no-replace" {
  datacenters = ["dc1"]

  group "cache" {
    count = 1

    # Keep allocations in the "unknown" state for up to 2 minutes after
    # the client disconnects, instead of replacing them immediately
    max_client_disconnect = "2m"

    constraint {
      attribute = "${node.unique.name}"
      value     = "nomad-client02"
      operator  = "="
    }

    network {
      port "db" {
        to = 6379
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"

        ports = ["db"]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
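
As a sanity check before submitting the job, the node name used by the constraint can be confirmed with the standard CLI:

nomad node status

The Name column should include nomad-client02 (the name is an assumption from the Vagrant setup above); if the environment uses different names, adjust the constraint value accordingly.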