
Backport of driver/docker: Fix container CPU stats collection into release/1.9.x #24793

Merged

Conversation

hc-github-team-nomad-core (Contributor)

Backport

This PR is auto-generated from #24768 to be assessed for backporting due to the inclusion of the label backport/1.9.x.

The text below is copied from the body of the original PR.


Description

The recent change to collection via a "one-shot" Docker API call did not update the stream boolean argument. As a result the PreCPUStats values are zero, which breaks the CPU calculations that rely on this data. The base fix is to update the passed boolean parameter to match the desired non-streaming behaviour. The non-streaming API call correctly returns the PreCPUStats data, as shown by the added unit test and the soak-testing details below.
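
For illustration only (this is not the Nomad driver code), a minimal Go sketch of a non-streaming stats request using the official Docker Go client. The container name is a placeholder, and the percentage formula is the conventional Docker CPU calculation, included to show why zeroed PreCPUStats breaks it; depending on the client version these types may live under api/types/container instead. The key detail is passing stream=false so a single snapshot is returned with PreCPUStats populated.

package main

import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

// cpuPercent mirrors the usual Docker CPU calculation, which needs both the
// current sample (CPUStats) and the previous one (PreCPUStats) to form deltas.
func cpuPercent(s types.StatsJSON) float64 {
	cpuDelta := float64(s.CPUStats.CPUUsage.TotalUsage - s.PreCPUStats.CPUUsage.TotalUsage)
	sysDelta := float64(s.CPUStats.SystemUsage - s.PreCPUStats.SystemUsage)
	if cpuDelta <= 0 || sysDelta <= 0 {
		return 0 // zeroed PreCPUStats collapses the deltas, which was the bug's symptom
	}
	online := float64(s.CPUStats.OnlineCPUs)
	if online == 0 {
		online = float64(len(s.CPUStats.CPUUsage.PercpuUsage))
	}
	return cpuDelta / sysDelta * online * 100.0
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// stream=false requests a single, non-streaming stats snapshot; in this
	// mode the daemon still fills in the previous sample as PreCPUStats.
	resp, err := cli.ContainerStats(context.Background(), "my-container", false)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var stats types.StatsJSON
	if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
		panic(err)
	}
	fmt.Printf("cpu%%: %.2f\n", cpuPercent(stats))
}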

The most recent change also modified the behaviour of the collectStats goroutine so that any error encountered causes the routine to exit. If the error was transient, the container keeps running, but no stats are collected until the task is stopped and replaced. This PR reverts that behaviour: an error encountered during a stats collection run is logged, and collection continues after a backoff.
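
A minimal sketch of the reverted pattern (again illustrative, not the actual collectStats implementation): a failed collection is logged and retried with a capped exponential backoff, and the goroutine only exits when the task's context is cancelled. The interval and cap values here are assumptions.

package main

import (
	"context"
	"errors"
	"log"
	"time"
)

const (
	statsInterval = 1 * time.Second
	maxBackoff    = 30 * time.Second
)

// collectOnce stands in for a single stats collection attempt.
func collectOnce(ctx context.Context) error {
	return errors.New("transient docker api error") // placeholder failure
}

func collectStats(ctx context.Context) {
	backoff := statsInterval
	for {
		select {
		case <-ctx.Done():
			return // task stopped; exit cleanly
		case <-time.After(backoff):
		}

		if err := collectOnce(ctx); err != nil {
			// Log and keep going; double the backoff up to a cap so a flapping
			// Docker daemon is not hammered with retries.
			log.Printf("failed to collect stats, retrying: %v", err)
			if backoff < maxBackoff {
				backoff *= 2
				if backoff > maxBackoff {
					backoff = maxBackoff
				}
			}
			continue
		}
		backoff = statsInterval // a successful run resets the backoff
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	collectStats(ctx)
}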

Testing & Reproduction steps

I used this lab to run a 1-server, 1-client cluster, with Nomad running the modified code from this PR. I then ran a Prometheus/Grafana job and the example Redis job, with Prometheus scraping the local Nomad client every 1 second.

promana.nomad.hcl
job "promana" {
  group "promana" {
    network {
      mode = "bridge"
      port "prometheus" {
        to = 9090
      }
      port "grafana" {
        to = 3000
      }
    }

    service {
      name     = "prometheus-server"
      port     = "prometheus"
      provider = "nomad"
    }
    service {
      name     = "grafana-server"
      port     = "grafana"
      provider = "nomad"
    }

    task "prometheus" {
      driver = "docker"
      config {
        image = "prom/prometheus:v3.0.1"
        ports = ["prometheus"]
        args  = [
          "--config.file=${NOMAD_TASK_DIR}/config/prometheus.yml",
          "--storage.tsdb.path=/prometheus",
          "--web.listen-address=0.0.0.0:9090",
          "--web.console.libraries=/usr/share/prometheus/console_libraries",
          "--web.console.templates=/usr/share/prometheus/consoles",
        ]

        volumes = [
          "local/config:/etc/prometheus/config",
        ]
      }

      template {
        data = <<EOH
---
global:
  scrape_interval:     1s
  evaluation_interval: 1s

scrape_configs:
  - job_name: "nomad_server"
    metrics_path: "/v1/metrics"
    scheme: "http"
    params:
      format:
        - "prometheus"
    static_configs:
      - targets:
        - {{ env "attr.unique.network.ip-address" }}:4646
EOH
        change_mode   = "signal"
        change_signal = "SIGHUP"
        destination   = "local/config/prometheus.yml"
      }

      resources {
        cpu    = 500
        memory = 512
      }
    }

    task "grafana" {
      driver = "docker"

      config {
        image   = "grafana/grafana:11.4.0"
        volumes = [
          "local/datasources:/etc/grafana/provisioning/datasources",
        ]
      }

      template {
        data = <<EOH
apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  access: proxy
  url: http://0.0.0.0:9090
  isDefault: true
  version: 1
  editable: false
EOH

        destination = "local/datasources/datasources.yaml"
      }

      resources {
        cpu    = 200
        memory = 256
      }
    }
  }
}
example.nomad.hcl
job "example" {

  group "cache" {
    network {
      port "db" {
        to = 6379
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image          = "redis:7"
        ports          = ["db"]
        auth_soft_fail = true
      }

      identity {
        env  = true
        file = true
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

The cluster and jobs were left running for 6 hours before reviewing the available metrics, including the previously affected CPU percentage and the client goroutine count.

Screenshots

(three screenshots omitted)

Links

Closes: #24740
Internal: https://hashicorp.atlassian.net/browse/NET-11922
Historical:

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.

Overview of commits

jrasell merged commit ecd558f into release/1.9.x on Jan 7, 2025
24 checks passed
jrasell deleted the backport/b-NET-11922/adequately-sweeping-terrier branch on January 7, 2025 at 08:01