
Backport of driver/docker: Fix container CPU stats collection into release/1.9.x #24793

Merged

Conversation

hc-github-team-nomad-core (Contributor)

Backport

This PR is auto-generated from #24768 to be assessed for backporting due to the inclusion of the label backport/1.9.x.

The text below is copied from the body of the original PR.


Description

The recent change to collection via a "one-shot" Docker API call did not update the stream boolean argument. As a result the PreCPUStats values are zero, which breaks the CPU calculations that rely on this data. The base fix is to update the passed boolean parameter to match the desired non-streaming behaviour. The non-streaming API call correctly returns the PreCPUStats data, as shown by the added unit test and the soak-testing details below.
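
For illustration only (this is not the Nomad driver code), a minimal Go sketch of a non-streaming stats request using the official Docker Go client. The container name is a placeholder, and the percentage formula is the conventional Docker CPU calculation, included to show why zeroed PreCPUStats breaks it; depending on the client version these types may live under api/types/container instead. The key detail is passing stream=false so a single snapshot is returned with PreCPUStats populated.

package main

import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

// cpuPercent mirrors the usual Docker CPU calculation, which needs both the
// current sample (CPUStats) and the previous one (PreCPUStats) to form deltas.
func cpuPercent(s types.StatsJSON) float64 {
	cpuDelta := float64(s.CPUStats.CPUUsage.TotalUsage - s.PreCPUStats.CPUUsage.TotalUsage)
	sysDelta := float64(s.CPUStats.SystemUsage - s.PreCPUStats.SystemUsage)
	if cpuDelta <= 0 || sysDelta <= 0 {
		return 0 // zeroed PreCPUStats collapses the deltas, which was the bug's symptom
	}
	online := float64(s.CPUStats.OnlineCPUs)
	if online == 0 {
		online = float64(len(s.CPUStats.CPUUsage.PercpuUsage))
	}
	return cpuDelta / sysDelta * online * 100.0
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// stream=false requests a single, non-streaming stats snapshot; in this
	// mode the daemon still fills in the previous sample as PreCPUStats.
	resp, err := cli.ContainerStats(context.Background(), "my-container", false)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var stats types.StatsJSON
	if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
		panic(err)
	}
	fmt.Printf("cpu%%: %.2f\n", cpuPercent(stats))
}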

The most recent change also modified the behaviour of the collectStats goroutine so that any error encountered causes the routine to exit. If the error was transient, the container keeps running, but no stats are collected until the task is stopped and replaced. This PR reverts that behaviour: an error encountered during a stats collection run is logged, and collection continues after a backoff.
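
A minimal sketch of the reverted pattern (again illustrative, not the actual collectStats implementation): a failed collection is logged and retried with a capped exponential backoff, and the goroutine only exits when the task's context is cancelled. The interval and cap values here are assumptions.

package main

import (
	"context"
	"errors"
	"log"
	"time"
)

const (
	statsInterval = 1 * time.Second
	maxBackoff    = 30 * time.Second
)

// collectOnce stands in for a single stats collection attempt.
func collectOnce(ctx context.Context) error {
	return errors.New("transient docker api error") // placeholder failure
}

func collectStats(ctx context.Context) {
	backoff := statsInterval
	for {
		select {
		case <-ctx.Done():
			return // task stopped; exit cleanly
		case <-time.After(backoff):
		}

		if err := collectOnce(ctx); err != nil {
			// Log and keep going; double the backoff up to a cap so a flapping
			// Docker daemon is not hammered with retries.
			log.Printf("failed to collect stats, retrying: %v", err)
			if backoff < maxBackoff {
				backoff *= 2
				if backoff > maxBackoff {
					backoff = maxBackoff
				}
			}
			continue
		}
		backoff = statsInterval // a successful run resets the backoff
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	collectStats(ctx)
}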

Testing & Reproduction steps

I used this lab to run a 1-server, 1-client cluster, with Nomad running the modified code from this PR. I then ran a Prometheus/Grafana job and the example Redis job, with Prometheus scraping the local Nomad client every 1 second.

promana.nomad.hcl
job "promana" {
  group "promana" {
    network {
      mode = "bridge"
      port "prometheus" {
        to = 9090
      }
      port "grafana" {
        to = 3000
      }
    }

    service {
      name     = "prometheus-server"
      port     = "prometheus"
      provider = "nomad"
    }
    service {
      name     = "grafana-server"
      port     = "grafana"
      provider = "nomad"
    }

    task "prometheus" {
      driver = "docker"
      config {
        image = "prom/prometheus:v3.0.1"
        ports = ["prometheus"]
        args  = [
          "--config.file=${NOMAD_TASK_DIR}/config/prometheus.yml",
          "--storage.tsdb.path=/prometheus",
          "--web.listen-address=0.0.0.0:9090",
          "--web.console.libraries=/usr/share/prometheus/console_libraries",
          "--web.console.templates=/usr/share/prometheus/consoles",
        ]

        volumes = [
          "local/config:/etc/prometheus/config",
        ]
      }

      template {
        data = <<EOH
---
global:
  scrape_interval:     1s
  evaluation_interval: 1s

scrape_configs:
  - job_name: "nomad_server"
    metrics_path: "/v1/metrics"
    scheme: "http"
    params:
      format:
        - "prometheus"
    static_configs:
      - targets:
        - {{ env "attr.unique.network.ip-address" }}:4646
EOH
        change_mode   = "signal"
        change_signal = "SIGHUP"
        destination   = "local/config/prometheus.yml"
      }

      resources {
        cpu    = 500
        memory = 512
      }
    }

    task "grafana" {
      driver = "docker"

      config {
        image   = "grafana/grafana:11.4.0"
        volumes = [
          "local/datasources:/etc/grafana/provisioning/datasources",
        ]
      }

      template {
        data = <<EOH
apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  access: proxy
  url: http://0.0.0.0:9090
  isDefault: true
  version: 1
  editable: false
EOH

        destination = "local/datasources/datasources.yaml"
      }

      resources {
        cpu    = 200
        memory = 256
      }
    }
  }
}
example.nomad.hcl
job "example" {

  group "cache" {
    network {
      port "db" {
        to = 6379
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image          = "redis:7"
        ports          = ["db"]
        auth_soft_fail = true
      }

      identity {
        env  = true
        file = true
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

The cluster and jobs were left running for 6 hours before reviewing the available metrics, including the previously affected CPU percentage and the client goroutine count.

Screenshots

(three screenshots omitted)

Links

Closes: #24740
Internal: https://hashicorp.atlassian.net/browse/NET-11922
Historical:

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.

Overview of commits

jrasell merged commit ecd558f into release/1.9.x on Jan 7, 2025
24 checks passed
jrasell deleted the backport/b-NET-11922/adequately-sweeping-terrier branch on January 7, 2025 at 08:01