metrics-server reporting inconsistent numbers of control plane nodes #803

Closed
techstep opened this issue Jul 27, 2021 · 15 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@techstep

What happened:

When I run kubectl top nodes, or kubectl get nodemetrics on a k8s cluster with metrics-server, I almost always have at least one control-plane node unaccounted for. The missing control plane node(s) change every minute with every run. All three control plane nodes are up and healthy, and the worker nodes show up all the time.

What you expected to happen:

I expected to see all three worker nodes, and all three control plane nodes.

Anything else we need to know?:

  • I have looked through the metrics-server logs, and found that the requests to the nodes, control plane and worker, received 200 responses; moreover, manually making those requests returned metrics I was expecting to see.

  • While the control planes flicker in and out of existence on the aforementioned commands, the actual number and type of pods remains consistent, and the metrics for the pods look completely fine.

  • The problem persists whether I am running on one or two replicas.

  • We are running metrics-server on the control plane, because we could not get metrics for pods running on the control plane otherwise.

Environment:

  • Kubernetes distribution (GKE, EKS, Kubeadm, the hard way, etc.): kubeadm on top of OpenStack using ClusterAPI

  • Container Network Setup (flannel, calico, etc.): calico

  • Kubernetes version (use kubectl version): 1.21 (client), 1.20 (server)

  • Metrics Server manifest

spoiler for Metrics Server manifest:

apiVersion: v1
items:
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "26"
      meta.helm.sh/release-name: metrics-server
      meta.helm.sh/release-namespace: metrics-server
    creationTimestamp: "2021-07-13T18:41:53Z"
    generation: 26
    labels:
      app.kubernetes.io/instance: metrics-server
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: metrics-server
      helm.sh/chart: metrics-server-5.8.14
    name: metrics-server
    namespace: metrics-server
    resourceVersion: "11957101"
    uid: `[redacted]`
  spec:
    progressDeadlineSeconds: 600
    replicas: 2
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        app.kubernetes.io/instance: metrics-server
        app.kubernetes.io/name: metrics-server
    strategy:
      rollingUpdate:
        maxSurge: 25%
        maxUnavailable: 25%
      type: RollingUpdate
    template:
      metadata:
        annotations:
          ad.datadoghq.com/nginx-ingress-controller.check_names: '["kube_metrics_server"]'
          ad.datadoghq.com/nginx-ingress-controller.init_configs: '[{}]'
          ad.datadoghq.com/nginx-ingress-controller.instances: |
            [
              {
                "prometheus_url": "https://%%host%%:443/metrics"
              }
            ]
          enable.version-checker.io/metrics-server: "true"
          override-url.version-checker.io/metrics-server: bitnami/metrics-server
        creationTimestamp: null
        labels:
          app.kubernetes.io/instance: metrics-server
          app.kubernetes.io/managed-by: Helm
          app.kubernetes.io/name: metrics-server
          helm.sh/chart: metrics-server-5.8.14
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: node-role.kubernetes.io/master
                  operator: Exists
        containers:
        - command:
          - /pod_nanny
          - --config-dir=/etc/config
          - --cpu=100m
          - --extra-cpu=7m
          - --memory=300Mi
          - --extra-memory=3Mi
          - --threshold=10
          - --deployment=metrics-server
          - --container=metrics-server
          env:
          - name: MY_POD_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.name
          - name: MY_POD_NAMESPACE
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
          - name: ADDON_NAME
            value: metrics
          image: [image_mirror]/k8s.gcr.io/addon-resizer:1.8.11
          imagePullPolicy: IfNotPresent
          name: pod-nanny
          resources:
            limits:
              cpu: 100m
              memory: 20Mi
            requests:
              cpu: 100m
              memory: 20Mi
          securityContext:
            runAsGroup: 65534
            runAsUser: 65534
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /etc/config
            name: nanny-config-volume
        - args:
          - --secure-port=8443
          - --cert-dir=/tmp
          - --kubelet-insecure-tls=true
          - --kubelet-preferred-address-types=\[InternalDNS,InternalIP,ExternalDNS,ExternalIP\]
          - --profiling=true
          command:
          - metrics-server
          image: [image_mirror]/bitnami/metrics-server:0.5.0
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /livez
              port: https
              scheme: HTTPS
            initialDelaySeconds: 40
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: metrics-server
          ports:
          - containerPort: 8443
            hostPort: 8443
            name: https
            protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /readyz
              port: https
              scheme: HTTPS
            initialDelaySeconds: 40
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: 142m
              memory: 318Mi
            requests:
              cpu: 142m
              memory: 318Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
              - ALL
            readOnlyRootFilesystem: true
            runAsGroup: 10001
            runAsNonRoot: true
            runAsUser: 10001
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /etc/config
            name: nanny-config-volume
          - mountPath: /tmp
            name: tmpdir
        dnsPolicy: ClusterFirst
        hostNetwork: true
        imagePullSecrets:
        - name: regcred-pseudo
        priorityClassName: highest-platform
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: metrics-server
        serviceAccountName: metrics-server
        terminationGracePeriodSeconds: 30
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
          operator: Exists
        volumes:
        - configMap:
            defaultMode: 420
            name: nanny-config-metrics-server
          name: nanny-config-volume
        - emptyDir: {}
          name: tmpdir
  status:
    availableReplicas: 2
    conditions:
    - lastTransitionTime: "2021-07-27T20:09:32Z"
      lastUpdateTime: "2021-07-27T20:09:32Z"
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
    - lastTransitionTime: "2021-07-13T18:41:54Z"
      lastUpdateTime: "2021-07-27T20:10:01Z"
      message: ReplicaSet "metrics-server-[redacted]" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
    observedGeneration: 26
    readyReplicas: 2
    replicas: 2
    updatedReplicas: 2
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

  • Kubelet config:

spoiler for Kubelet config:
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: [redacted]
    server: https://[redacted]:6443
  name: default-cluster
contexts:
- context:
    cluster: default-cluster
    namespace: default
    user: default-auth
  name: default-context
current-context: default-context
kind: Config
preferences: {}
users:
- name: default-auth
  user:
    client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
    client-key: /var/lib/kubelet/pki/kubelet-client-current.pem
  • Status of Metrics API:

spoiler for Status of Metrics API:
kubectl describe apiservice v1beta1.metrics.k8s.io
Name:         v1beta1.metrics.k8s.io
Namespace:
Labels:       app.kubernetes.io/instance=metrics-server
            app.kubernetes.io/managed-by=Helm
            app.kubernetes.io/name=metrics-server
            helm.sh/chart=metrics-server-5.8.14
Annotations:  meta.helm.sh/release-name: metrics-server
            meta.helm.sh/release-namespace: metrics-server
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
Creation Timestamp:  2021-07-13T18:57:27Z
Resource Version:    11462943
UID:                 86dd3191-802e-4695-996a-017984296eff
Spec:
Group:                     metrics.k8s.io
Group Priority Minimum:    100
Insecure Skip TLS Verify:  true
Service:
  Name:            metrics-server
  Namespace:       metrics-server
  Port:            443
Version:           v1beta1
Version Priority:  100
Status:
Conditions:
  Last Transition Time:  2021-07-25T06:47:48Z
  Message:               all checks passed
  Reason:                Passed
  Status:                True
  Type:                  Available
Events:                    <none>

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 27, 2021
@yangjunmyfm192085
Contributor

Could you provide the raw API results (as in issue #792) and the metrics-server logs?
Let's also analyse the data from the kubelet.

@techstep
Author

techstep commented Aug 3, 2021

Here are the logs from metrics-server when running the specific query (running metrics-server with -v=8). The logs for the request imply that everything is fine and that every request returns a 200, but kubectl top nodes in this case returns one control plane node, not all three.

Request logs
metrics-server I0803 18:58:17.923289       1 server.go:136] "Scraping metrics"
metrics-server I0803 18:58:17.923369       1 scraper.go:114] "Scraping metrics from nodes" nodeCount=6
metrics-server I0803 18:58:17.927587       1 scraper.go:136] "Scraping node" node="test-us-west-1-md-0-dmqmp"
metrics-server I0803 18:58:17.927811       1 round_trippers.go:432] GET https://100.113.136.117:10250/stats/summary?only_cpu_and_memory=true
metrics-server I0803 18:58:17.927825       1 round_trippers.go:438] Request Headers:
metrics-server I0803 18:58:17.927836       1 round_trippers.go:442]     Authorization: Bearer <masked>
metrics-server I0803 18:58:17.927842       1 round_trippers.go:442]     User-Agent: metrics-server/v0.5.0 (linux/amd64) kubernetes/d766094
metrics-server I0803 18:58:17.941558       1 scraper.go:136] "Scraping node" node="test-us-west-1-control-plane-577lp"
metrics-server I0803 18:58:17.941628       1 round_trippers.go:432] GET https://100.113.137.187:10250/stats/summary?only_cpu_and_memory=true
metrics-server I0803 18:58:17.941636       1 round_trippers.go:438] Request Headers:
metrics-server I0803 18:58:17.941643       1 round_trippers.go:442]     User-Agent: metrics-server/v0.5.0 (linux/amd64) kubernetes/d766094
metrics-server I0803 18:58:17.941651       1 round_trippers.go:442]     Authorization: Bearer <masked>
metrics-server I0803 18:58:17.949539       1 scraper.go:136] "Scraping node" node="test-us-west-1-control-plane-7hkxd"
metrics-server I0803 18:58:17.949590       1 round_trippers.go:432] GET https://100.113.137.251:10250/stats/summary?only_cpu_and_memory=true
metrics-server I0803 18:58:17.949612       1 round_trippers.go:438] Request Headers:
metrics-server I0803 18:58:17.949620       1 round_trippers.go:442]     User-Agent: metrics-server/v0.5.0 (linux/amd64) kubernetes/d766094
metrics-server I0803 18:58:17.949659       1 round_trippers.go:442]     Authorization: Bearer <masked>
metrics-server I0803 18:58:17.954348       1 round_trippers.go:457] Response Status: 200 OK in 26 milliseconds
metrics-server I0803 18:58:17.954364       1 round_trippers.go:460] Response Headers:
metrics-server I0803 18:58:17.954373       1 round_trippers.go:463]     Content-Type: application/json
metrics-server I0803 18:58:17.954378       1 round_trippers.go:463]     Date: Tue, 03 Aug 2021 18:58:17 GMT
metrics-server I0803 18:58:17.954475       1 scraper.go:136] "Scraping node" node="test-us-west-1-md-0-zm24b"
metrics-server I0803 18:58:17.954550       1 round_trippers.go:432] GET https://100.113.136.132:10250/stats/summary?only_cpu_and_memory=true
metrics-server I0803 18:58:17.954562       1 round_trippers.go:438] Request Headers:
metrics-server I0803 18:58:17.954569       1 round_trippers.go:442]     User-Agent: metrics-server/v0.5.0 (linux/amd64) kubernetes/d766094
metrics-server I0803 18:58:17.954596       1 round_trippers.go:442]     Authorization: Bearer <masked>
metrics-server I0803 18:58:17.961507       1 scraper.go:136] "Scraping node" node="test-us-west-1-control-plane-qkcnm"
metrics-server I0803 18:58:17.961577       1 round_trippers.go:432] GET https://100.113.137.63:10250/stats/summary?only_cpu_and_memory=true
metrics-server I0803 18:58:17.961590       1 round_trippers.go:438] Request Headers:
metrics-server I0803 18:58:17.961596       1 round_trippers.go:442]     User-Agent: metrics-server/v0.5.0 (linux/amd64) kubernetes/d766094
metrics-server I0803 18:58:17.961603       1 round_trippers.go:442]     Authorization: Bearer <masked>
metrics-server I0803 18:58:17.966518       1 scraper.go:136] "Scraping node" node="test-us-west-1-md-0-9shcw"
metrics-server I0803 18:58:17.966592       1 round_trippers.go:432] GET https://100.113.136.191:10250/stats/summary?only_cpu_and_memory=true
metrics-server I0803 18:58:17.966603       1 round_trippers.go:438] Request Headers:
metrics-server I0803 18:58:17.966609       1 round_trippers.go:442]     User-Agent: metrics-server/v0.5.0 (linux/amd64) kubernetes/d766094
metrics-server I0803 18:58:17.966616       1 round_trippers.go:442]     Authorization: Bearer <masked>
metrics-server I0803 18:58:17.981479       1 round_trippers.go:457] Response Status: 200 OK in 14 milliseconds
metrics-server I0803 18:58:17.981498       1 round_trippers.go:460] Response Headers:
metrics-server I0803 18:58:17.981507       1 round_trippers.go:463]     Date: Tue, 03 Aug 2021 18:58:17 GMT
metrics-server I0803 18:58:17.981512       1 round_trippers.go:463]     Content-Type: application/json
metrics-server I0803 18:58:17.990917       1 round_trippers.go:457] Response Status: 200 OK in 36 milliseconds
metrics-server I0803 18:58:17.990934       1 round_trippers.go:460] Response Headers:
metrics-server I0803 18:58:17.990940       1 round_trippers.go:463]     Content-Type: application/json
metrics-server I0803 18:58:17.990945       1 round_trippers.go:463]     Date: Tue, 03 Aug 2021 18:58:17 GMT
metrics-server I0803 18:58:18.008613       1 round_trippers.go:457] Response Status: 200 OK in 66 milliseconds
metrics-server I0803 18:58:18.008626       1 round_trippers.go:460] Response Headers:
metrics-server I0803 18:58:18.008631       1 round_trippers.go:463]     Content-Type: application/json
metrics-server I0803 18:58:18.008663       1 round_trippers.go:463]     Date: Tue, 03 Aug 2021 18:58:18 GMT
metrics-server I0803 18:58:18.042276       1 round_trippers.go:457] Response Status: 200 OK in 80 milliseconds
metrics-server I0803 18:58:18.042293       1 round_trippers.go:460] Response Headers:
metrics-server I0803 18:58:18.042301       1 round_trippers.go:463]     Content-Type: application/json
metrics-server I0803 18:58:18.042306       1 round_trippers.go:463]     Date: Tue, 03 Aug 2021 18:58:18 GMT
metrics-server I0803 18:58:18.052463       1 round_trippers.go:457] Response Status: 200 OK in 102 milliseconds
metrics-server I0803 18:58:18.052490       1 round_trippers.go:460] Response Headers:
metrics-server I0803 18:58:18.052502       1 round_trippers.go:463]     Content-Type: application/json
metrics-server I0803 18:58:18.052511       1 round_trippers.go:463]     Date: Tue, 03 Aug 2021 18:58:18 GMT
metrics-server I0803 18:58:18.052921       1 scraper.go:157] "Scrape finished" duration="129.533693ms" nodeCount=6 podCount=81
metrics-server I0803 18:58:18.052930       1 server.go:139] "Storing metrics"
metrics-server I0803 18:58:18.053128       1 server.go:144] "Scraping cycle complete"
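
The request log above can also be checked mechanically instead of by eye. As a minimal sketch (the function name `summarize_scrape_log` is hypothetical, and the regular expressions assume the klog line shapes shown in this thread), the following Python counts the nodes that were scraped and the HTTP statuses that came back:

```python
import re


def summarize_scrape_log(log_text):
    """Return (scraped node names, number of 200 responses, number of other responses)."""
    # Lines such as: ... scraper.go:136] "Scraping node" node="<name>"
    nodes = re.findall(r'"Scraping node" node="([^"]+)"', log_text)
    # Lines such as: ... round_trippers.go:457] Response Status: 200 OK in 26 milliseconds
    statuses = re.findall(r"Response Status: (\d{3})", log_text)
    ok = sum(1 for status in statuses if status == "200")
    return nodes, ok, len(statuses) - ok


# A shortened excerpt of the log above, used only as sample input.
sample = '''\
metrics-server I0803 18:58:17.927587       1 scraper.go:136] "Scraping node" node="test-us-west-1-md-0-dmqmp"
metrics-server I0803 18:58:17.941558       1 scraper.go:136] "Scraping node" node="test-us-west-1-control-plane-577lp"
metrics-server I0803 18:58:17.954348       1 round_trippers.go:457] Response Status: 200 OK in 26 milliseconds
metrics-server I0803 18:58:17.981479       1 round_trippers.go:457] Response Status: 200 OK in 14 milliseconds
'''

nodes, ok, failed = summarize_scrape_log(sample)
print(len(nodes), ok, failed)  # 2 2 0
```

If the number of scraped nodes equals the number of 200 responses and nothing else came back, the scrape itself is unlikely to be where the nodes go missing.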

And the output of kubectl top nodes --use-protocol-buffers:

command output
NAME                                 CPU(cores)   CPU%        MEMORY(bytes)   MEMORY%
test-us-west-1-control-plane-7hkxd   380m         9%          2824Mi          73%
test-us-west-1-md-0-9shcw            418m         10%         2798Mi          47%
test-us-west-1-md-0-dmqmp            312m         7%          2723Mi          46%
test-us-west-1-md-0-zm24b            301m         7%          2749Mi          47%
test-us-west-1-control-plane-577lp   <unknown>    <unknown>   <unknown>       <unknown>
test-us-west-1-control-plane-qkcnm   <unknown>    <unknown>   <unknown>       <unknown>
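
To track which nodes drop out from run to run, the table can be parsed rather than compared by hand. This is a rough sketch, assuming the whitespace-separated layout of `kubectl top nodes` shown above; the helper name `nodes_without_metrics` is made up for illustration:

```python
def nodes_without_metrics(top_output):
    """Return the names of nodes whose metric columns contain "<unknown>"."""
    missing = []
    for line in top_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields and "<unknown>" in fields[1:]:
            missing.append(fields[0])
    return missing


# Sample input shortened from the command output above.
sample = """\
NAME                                 CPU(cores)   CPU%        MEMORY(bytes)   MEMORY%
test-us-west-1-control-plane-7hkxd   380m         9%          2824Mi          73%
test-us-west-1-md-0-9shcw            418m         10%         2798Mi          47%
test-us-west-1-control-plane-577lp   <unknown>    <unknown>   <unknown>       <unknown>
test-us-west-1-control-plane-qkcnm   <unknown>    <unknown>   <unknown>       <unknown>
"""

print(nodes_without_metrics(sample))
# ['test-us-west-1-control-plane-577lp', 'test-us-west-1-control-plane-qkcnm']
```

Running this against each iteration of a watch loop would give a per-run list of missing nodes that can be lined up against the scrape timestamps in the metrics-server log.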

@techstep
Author

techstep commented Aug 3, 2021

I ran the following code:

# Poll the kubelet summary endpoint on each control plane node every 10 seconds.
while true; do
  for i in 187 251 63; do
    curl -ik -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
      "https://100.113.137.$i:10250/stats/summary?only_cpu_and_memory=true" \
      -o "control-plane-$i-$(date +"%Y%m%d-%H%M%S").json"
  done
  sleep 10
done

on a metrics-server node to run these stats/summary?only_cpu_and_memory=true queries. In particular, I ran these queries before, during, and after the time I was getting missing nodes from the output of kubectl top nodes, which was running on a 10-second watch loop.

output from control plane node
{
 "node": {
  "nodeName": "test-us-west-1-control-plane-577lp",
  "systemContainers": [
   {
    "name": "pods",
    "startTime": "2021-08-03T19:33:46Z",
    "cpu": {
     "time": "2021-08-03T19:34:36Z",
     "usageNanoCores": 270639291,
     "usageCoreNanoSeconds": 1171995323473710
    },
    "memory": {
     "time": "2021-08-03T19:34:36Z",
     "availableBytes": 2518097920,
     "usageBytes": 1873137664,
     "workingSetBytes": 1608773632,
     "rssBytes": 1586286592,
     "pageFaults": 0,
     "majorPageFaults": 0
    }
   },
   {
    "name": "kubelet",
    "startTime": "2021-05-25T05:37:19Z",
    "cpu": {
     "time": "2021-08-03T19:34:26Z",
     "usageNanoCores": 37437406,
     "usageCoreNanoSeconds": 172566180358587
    },
    "memory": {
     "time": "2021-08-03T19:34:26Z",
     "usageBytes": 102686720,
     "workingSetBytes": 80879616,
     "rssBytes": 64901120,
     "pageFaults": 2426534022,
     "majorPageFaults": 34419
    }
   }
  ],
  "startTime": "2021-08-03T19:33:50Z",
  "cpu": {
   "time": "2021-08-03T19:34:36Z",
   "usageNanoCores": 364450073,
   "usageCoreNanoSeconds": 1547019621854448
  },
  "memory": {
   "time": "2021-08-03T19:34:36Z",
   "availableBytes": 1233833984,
   "usageBytes": 3665260544,
   "workingSetBytes": 2893037568,
   "rssBytes": 1818714112,
   "pageFaults": 48411,
   "majorPageFaults": 99
  }
 },
 "pods": [
  {
   "podRef": {
    "name": "metrics-server-695c48797c-tptnf",
    "namespace": "metrics-server",
    "uid": "bf93b656-1582-49cf-bc10-b9620a3555a0"
   },
   "startTime": "2021-07-28T17:50:27Z",
   "containers": [
    {
     "name": "pod-nanny",
     "startTime": "2021-07-28T17:50:27Z",
     "cpu": {
      "time": "2021-08-03T19:34:32Z",
      "usageNanoCores": 180233,
      "usageCoreNanoSeconds": 100820965334
     },
     "memory": {
      "time": "2021-08-03T19:34:32Z",
      "availableBytes": 10649600,
      "usageBytes": 13307904,
      "workingSetBytes": 10321920,
      "rssBytes": 7180288,
      "pageFaults": 11616,
      "majorPageFaults": 1881
     }
    },
    {
     "name": "metrics-server",
     "startTime": "2021-07-28T17:50:28Z",
     "cpu": {
      "time": "2021-08-03T19:34:38Z",
      "usageNanoCores": 3390945,
      "usageCoreNanoSeconds": 2049563278770
     },
     "memory": {
      "time": "2021-08-03T19:34:38Z",
      "availableBytes": 301834240,
      "usageBytes": 44208128,
      "workingSetBytes": 31612928,
      "rssBytes": 30502912,
      "pageFaults": 56298,
      "majorPageFaults": 4257
     }
    }
   ],
   "cpu": {
    "time": "2021-08-03T19:34:37Z",
    "usageNanoCores": 4143885,
    "usageCoreNanoSeconds": 2150410328900
   },
   "memory": {
    "time": "2021-08-03T19:34:37Z",
    "availableBytes": 311881728,
    "usageBytes": 58118144,
    "workingSetBytes": 42536960,
    "rssBytes": 37556224,
    "pageFaults": 0,
    "majorPageFaults": 0
   }
  },
  {
   "podRef": {
    "name": "calico-node-pmg6h",
    "namespace": "kube-system",
    "uid": "986e7980-868f-45c6-8073-4a3416125e08"
   },
   "startTime": "2021-06-08T12:07:49Z",
   "containers": [
    {
     "name": "calico-node",
     "startTime": "2021-06-08T12:07:54Z",
     "cpu": {
      "time": "2021-08-03T19:34:33Z",
      "usageNanoCores": 19360202,
      "usageCoreNanoSeconds": 107480343458651
     },
     "memory": {
      "time": "2021-08-03T19:34:33Z",
      "usageBytes": 113590272,
      "workingSetBytes": 108130304,
      "rssBytes": 61505536,
      "pageFaults": 4223765865,
      "majorPageFaults": 23133
     }
    }
   ],
   "cpu": {
    "time": "2021-08-03T19:34:39Z",
    "usageNanoCores": 22898353,
    "usageCoreNanoSeconds": 107480929830994
   },
   "memory": {
    "time": "2021-08-03T19:34:39Z",
    "usageBytes": 114692096,
    "workingSetBytes": 109232128,
    "rssBytes": 61288448,
    "pageFaults": 0,
    "majorPageFaults": 0
   }
  },
  {
   "podRef": {
    "name": "test-g8psk",
    "namespace": "default",
    "uid": "15e3279a-b266-4e36-a461-80eca8ef0a7b"
   },
   "startTime": "2021-06-03T21:41:57Z",
   "containers": [
    {
     "name": "shell",
     "startTime": "2021-06-17T00:09:42Z",
     "cpu": {
      "time": "2021-08-03T19:34:39Z",
      "usageNanoCores": 0,
      "usageCoreNanoSeconds": 220740717
     },
     "memory": {
      "time": "2021-08-03T19:34:39Z",
      "usageBytes": 2985984,
      "workingSetBytes": 2732032,
      "rssBytes": 24576,
      "pageFaults": 10923,
      "majorPageFaults": 132
     }
    }
   ],
   "cpu": {
    "time": "2021-08-03T19:34:40Z",
    "usageNanoCores": 0,
    "usageCoreNanoSeconds": 256198188
   },
   "memory": {
    "time": "2021-08-03T19:34:40Z",
    "usageBytes": 3596288,
    "workingSetBytes": 3342336,
    "rssBytes": 0,
    "pageFaults": 0,
    "majorPageFaults": 0
   }
  },
  {
   "podRef": {
    "name": "falco-frdvw",
    "namespace": "falco",
    "uid": "67a68e22-a866-42c4-be68-7a028f81835f"
   },
   "startTime": "2021-07-12T19:17:55Z",
   "containers": [
    {
     "name": "falco",
     "startTime": "2021-07-12T19:18:00Z",
     "cpu": {
      "time": "2021-08-03T19:34:26Z",
      "usageNanoCores": 18779041,
      "usageCoreNanoSeconds": 44824479061195
     },
     "memory": {
      "time": "2021-08-03T19:34:26Z",
      "availableBytes": 1012006912,
      "usageBytes": 61906944,
      "workingSetBytes": 61734912,
      "rssBytes": 58675200,
      "pageFaults": 310299,
      "majorPageFaults": 297
     }
    }
   ],
   "cpu": {
    "time": "2021-08-03T19:34:36Z",
    "usageNanoCores": 24610354,
    "usageCoreNanoSeconds": 44824825915501
   },
   "memory": {
    "time": "2021-08-03T19:34:36Z",
    "availableBytes": 1011412992,
    "usageBytes": 62636032,
    "workingSetBytes": 62328832,
    "rssBytes": 58556416,
    "pageFaults": 0,
    "majorPageFaults": 0
   }
  },
  {
   "podRef": {
    "name": "goldpinger-9h28f",
    "namespace": "goldpinger",
    "uid": "de9ed254-5677-43c2-9dbc-bd43e6e0bacf"
   },
   "startTime": "2021-06-17T22:16:13Z",
   "containers": [
    {
     "name": "goldpinger",
     "startTime": "2021-06-17T22:16:18Z",
     "cpu": {
      "time": "2021-08-03T19:34:38Z",
      "usageNanoCores": 656481,
      "usageCoreNanoSeconds": 3607231811312
     },
     "memory": {
      "time": "2021-08-03T19:34:38Z",
      "availableBytes": 63569920,
      "usageBytes": 27750400,
      "workingSetBytes": 20316160,
      "rssBytes": 20336640,
      "pageFaults": 243903,
      "majorPageFaults": 11715
     }
    }
   ],
   "cpu": {
    "time": "2021-08-03T19:34:34Z",
    "usageNanoCores": 719180,
    "usageCoreNanoSeconds": 3607272866898
   },
   "memory": {
    "time": "2021-08-03T19:34:34Z",
    "availableBytes": 62742528,
    "usageBytes": 28577792,
    "workingSetBytes": 21143552,
    "rssBytes": 20054016,
    "pageFaults": 0,
    "majorPageFaults": 0
   }
  },
  {
   "podRef": {
    "name": "kube-proxy-qthj7",
    "namespace": "kube-system",
    "uid": "b97a8af3-a80b-4926-bf32-8e444a43825c"
   },
   "startTime": "2021-05-25T05:37:23Z",
   "containers": [
    {
     "name": "kube-proxy",
     "startTime": "2021-05-25T05:37:23Z",
     "cpu": {
      "time": "2021-08-03T19:34:34Z",
      "usageNanoCores": 7087456,
      "usageCoreNanoSeconds": 17138928259050
     },
     "memory": {
      "time": "2021-08-03T19:34:34Z",
      "usageBytes": 34574336,
      "workingSetBytes": 27643904,
      "rssBytes": 19759104,
      "pageFaults": 1027524432,
      "majorPageFaults": 17985
     }
    }
   ],
   "cpu": {
    "time": "2021-08-03T19:34:40Z",
    "usageNanoCores": 4594782,
    "usageCoreNanoSeconds": 17138941972237
   },
   "memory": {
    "time": "2021-08-03T19:34:40Z",
    "usageBytes": 35254272,
    "workingSetBytes": 28323840,
    "rssBytes": 19701760,
    "pageFaults": 0,
    "majorPageFaults": 0
   }
  },
  {
   "podRef": {
    "name": "kube-scheduler-test-us-west-1-control-plane-577lp",
    "namespace": "kube-system",
    "uid": "9be8cb4627e7e5ad4c3f8acabd4b49b3"
   },
   "startTime": "2021-05-25T05:37:24Z",
   "containers": [
    {
     "name": "kube-scheduler",
     "startTime": "2021-05-25T05:37:25Z",
     "cpu": {
      "time": "2021-08-03T19:34:42Z",
      "usageNanoCores": 3353050,
      "usageCoreNanoSeconds": 13818935012985
     },
     "memory": {
      "time": "2021-08-03T19:34:42Z",
      "usageBytes": 45481984,
      "workingSetBytes": 40767488,
      "rssBytes": 33591296,
      "pageFaults": 62337,
      "majorPageFaults": 13332
     }
    }
   ],
   "cpu": {
    "time": "2021-08-03T19:34:27Z",
    "usageNanoCores": 2089896,
    "usageCoreNanoSeconds": 13818925016594
   },
   "memory": {
    "time": "2021-08-03T19:34:27Z",
    "usageBytes": 46161920,
    "workingSetBytes": 41447424,
    "rssBytes": 33570816,
    "pageFaults": 0,
    "majorPageFaults": 0
   }
  },
  {
   "podRef": {
    "name": "node-problem-detector-t6vms",
    "namespace": "node-problem-detector",
    "uid": "a60918ff-1b57-46f3-863e-c9a9e6efd363"
   },
   "startTime": "2021-06-21T20:59:47Z",
   "containers": [
    {
     "name": "node-problem-detector",
     "startTime": "2021-06-21T20:59:51Z",
     "cpu": {
      "time": "2021-08-03T19:34:33Z",
      "usageNanoCores": 326959,
      "usageCoreNanoSeconds": 1567543118508
     },
     "memory": {
      "time": "2021-08-03T19:34:33Z",
      "availableBytes": 52670464,
      "usageBytes": 19095552,
      "workingSetBytes": 14438400,
      "rssBytes": 13729792,
      "pageFaults": 259380,
      "majorPageFaults": 4620
     }
    }
   ],
   "cpu": {
    "time": "2021-08-03T19:34:33Z",
    "usageNanoCores": 318479,
    "usageCoreNanoSeconds": 1567558263836
   },
   "memory": {
    "time": "2021-08-03T19:34:33Z",
    "availableBytes": 52084736,
    "usageBytes": 19681280,
    "workingSetBytes": 15024128,
    "rssBytes": 13643776,
    "pageFaults": 0,
    "majorPageFaults": 0
   }
  },
  {
   "podRef": {
    "name": "datadog-v8x66",
    "namespace": "datadog",
    "uid": "e0047b06-4ed1-4cca-97d8-11edf720d102"
   },
   "startTime": "2021-07-13T18:52:57Z",
   "containers": [
    {
     "name": "process-agent",
     "startTime": "2021-07-13T18:53:02Z",
     "cpu": {
      "time": "2021-08-03T19:34:36Z",
      "usageNanoCores": 3032691,
      "usageCoreNanoSeconds": 7315957210925
     },
     "memory": {
      "time": "2021-08-03T19:34:36Z",
      "availableBytes": 370860032,
      "usageBytes": 53485568,
      "workingSetBytes": 48570368,
      "rssBytes": 43458560,
      "pageFaults": 272184,
      "majorPageFaults": 5973
     }
    },
    {
     "name": "agent",
     "startTime": "2021-07-13T18:53:02Z",
     "cpu": {
      "time": "2021-08-03T19:34:39Z",
      "usageNanoCores": 125317680,
      "usageCoreNanoSeconds": 226387905934640
     },
     "memory": {
      "time": "2021-08-03T19:34:39Z",
      "availableBytes": 75259904,
      "usageBytes": 367538176,
      "workingSetBytes": 327393280,
      "rssBytes": 343683072,
      "pageFaults": 71382069,
      "majorPageFaults": 11121
     }
    }
   ],
   "cpu": {
    "time": "2021-08-03T19:34:33Z",
    "usageNanoCores": 129694728,
    "usageCoreNanoSeconds": 233703275027230
   },
   "memory": {
    "time": "2021-08-03T19:34:33Z",
    "availableBytes": 443330560,
    "usageBytes": 423813120,
    "workingSetBytes": 378753024,
    "rssBytes": 389124096,
    "pageFaults": 0,
    "majorPageFaults": 0
   }
  },
  {
   "podRef": {
    "name": "kube-controller-manager-test-us-west-1-control-plane-577lp",
    "namespace": "kube-system",
    "uid": "a40fb931ece5fcc5db1085981df97fea"
   },
   "startTime": "2021-05-25T05:37:24Z",
   "containers": [
    {
     "name": "kube-controller-manager",
     "startTime": "2021-05-25T05:37:25Z",
     "cpu": {
      "time": "2021-08-03T19:34:29Z",
      "usageNanoCores": 15165979,
      "usageCoreNanoSeconds": 106956688334361
     },
     "memory": {
      "time": "2021-08-03T19:34:29Z",
      "usageBytes": 122839040,
      "workingSetBytes": 105414656,
      "rssBytes": 98041856,
      "pageFaults": 119559,
      "majorPageFaults": 28347
     }
    }
   ],
   "cpu": {
    "time": "2021-08-03T19:34:31Z",
    "usageNanoCores": 13252067,
    "usageCoreNanoSeconds": 106956756198148
   },
   "memory": {
    "time": "2021-08-03T19:34:31Z",
    "usageBytes": 123641856,
    "workingSetBytes": 106217472,
    "rssBytes": 97947648,
    "pageFaults": 0,
    "majorPageFaults": 0
   }
  },
  {
   "podRef": {
    "name": "etcd-test-us-west-1-control-plane-577lp",
    "namespace": "kube-system",
    "uid": "d6ac6e5189a596324d657fb2283dc044"
   },
   "startTime": "2021-05-25T05:37:24Z",
   "containers": [
    {
     "name": "etcd",
     "startTime": "2021-05-25T05:37:26Z",
     "cpu": {
      "time": "2021-08-03T19:34:26Z",
      "usageNanoCores": 25744216,
      "usageCoreNanoSeconds": 146208956956459
     },
     "memory": {
      "time": "2021-08-03T19:34:26Z",
      "usageBytes": 142880768,
      "workingSetBytes": 108838912,
      "rssBytes": 107782144,
      "pageFaults": 1905585,
      "majorPageFaults": 100749
     }
    }
   ],
   "cpu": {
    "time": "2021-08-03T19:34:35Z",
    "usageNanoCores": 21348447,
    "usageCoreNanoSeconds": 146209178208128
   },
   "memory": {
    "time": "2021-08-03T19:34:35Z",
    "usageBytes": 143663104,
    "workingSetBytes": 109621248,
    "rssBytes": 107687936,
    "pageFaults": 0,
    "majorPageFaults": 0
   }
  },
  {
   "podRef": {
    "name": "kube-apiserver-test-us-west-1-control-plane-577lp",
    "namespace": "kube-system",
    "uid": "adfe87522ebbb0293cc2814e0806dc5f"
   },
   "startTime": "2021-05-25T05:37:24Z",
   "containers": [
    {
     "name": "kube-apiserver",
     "startTime": "2021-05-25T05:37:25Z",
     "cpu": {
      "time": "2021-08-03T19:34:37Z",
      "usageNanoCores": 52532418,
      "usageCoreNanoSeconds": 349209320675957
     },
     "memory": {
      "time": "2021-08-03T19:34:37Z",
      "usageBytes": 785354752,
      "workingSetBytes": 663539712,
      "rssBytes": 747573248,
      "pageFaults": 4107840,
      "majorPageFaults": 35904
     }
    }
   ],
   "cpu": {
    "time": "2021-08-03T19:34:42Z",
    "usageNanoCores": 48563154,
    "usageCoreNanoSeconds": 349209562111403
   },
   "memory": {
    "time": "2021-08-03T19:34:42Z",
    "usageBytes": 785985536,
    "workingSetBytes": 664170496,
    "rssBytes": 747425792,
    "pageFaults": 0,
    "majorPageFaults": 0
   }
  }
 ]
}

@techstep
Author

Just poking in to see if anything's going on. I'm a bit flummoxed with this issue.

@yangjunmyfm192085
Contributor

I don't see any problems in the metrics above, but I noticed that the start time of the node test-us-west-1-control-plane-577lp is "startTime": "2021-08-03T19:33:50Z", while the timestamp reported by the metrics is "2021-08-03T19:34:36Z". Is the node running normally?
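
The gap between the two timestamps can be computed directly from a saved stats/summary response. The following is a small sketch, assuming the second-resolution UTC timestamps seen in the output above; `parse_rfc3339` and `node_age_at_sample` are illustrative helper names, and the inline JSON is a trimmed copy of the node section:

```python
import json
from datetime import datetime, timezone


def parse_rfc3339(ts):
    # Assumes the "YYYY-MM-DDTHH:MM:SSZ" form used in the kubelet output above.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def node_age_at_sample(summary):
    """Seconds between the node's reported startTime and its CPU sample time."""
    node = summary["node"]
    delta = parse_rfc3339(node["cpu"]["time"]) - parse_rfc3339(node["startTime"])
    return delta.total_seconds()


summary = json.loads('''
{"node": {"nodeName": "test-us-west-1-control-plane-577lp",
          "startTime": "2021-08-03T19:33:50Z",
          "cpu": {"time": "2021-08-03T19:34:36Z"}}}
''')

print(node_age_at_sample(summary))  # 46.0
```

For comparison, the kubelet system container in the same response reports a startTime of 2021-05-25T05:37:19Z, so a node-level start time only 46 seconds before the sample looks inconsistent with the rest of the data.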

@techstep
Author

techstep commented Aug 11, 2021

The three nodes have been running normally, as far as I can tell. I'm not sure why there was a 46-second difference between those two timestamps.

Moreover, I'm not sure why this is only an issue with the control-plane nodes. I have looked at this dozens, if not hundreds, of times in the past several weeks, and not once have I seen any of the three worker nodes fail to show up. And again, metrics-server is always getting 200s when pulling the data from the nodes, whether control plane or worker.

Is there a reason why a node wouldn't show up in the metrics-server in-memory store even after metrics-server got the data?

@yangjunmyfm192085
Contributor

Yeah, I can't find the reason, but we did find a 46-second difference between those two timestamps.
We need at least two cycles of data after the node has started before exposing nodeMetrics.
So could we try to get help from the node team?
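
To illustrate why a wrong start time could hide a node even when every scrape succeeds, here is a simplified sketch of a two-sample store. It is not the metrics-server implementation: the `Point` and `NodeStore` names, the restart rule, and the 60-second scrape interval are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Point:
    start_time: datetime  # node start time as reported by the kubelet
    timestamp: datetime   # time of the CPU sample
    cpu_ns: int           # cumulative CPU usage (usageCoreNanoSeconds)


class NodeStore:
    """Hypothetical store that needs two samples from the same boot to report CPU."""

    def __init__(self):
        self.last = {}
        self.prev = {}

    def store(self, name, new):
        last = self.last.get(name)
        if last is not None and new.start_time < last.timestamp:
            # Same boot as the stored sample: keep it as the previous point.
            if new.timestamp > last.timestamp:
                self.prev[name] = last
        else:
            # First sample, or the node appears to have restarted: drop history.
            self.prev.pop(name, None)
        self.last[name] = new

    def cpu_rate(self, name):
        """CPU usage in nanocores, or None when fewer than two usable samples exist."""
        last, prev = self.last.get(name), self.prev.get(name)
        if last is None or prev is None:
            return None
        window = (last.timestamp - prev.timestamp).total_seconds()
        return (last.cpu_ns - prev.cpu_ns) / window


t0 = datetime(2021, 8, 3, 19, 34, 36)
t1 = t0 + timedelta(seconds=60)  # assumed scrape interval
boot = datetime(2021, 5, 25, 5, 37, 19)

# A node with a stable start time reports a rate after two scrapes.
stable = NodeStore()
stable.store("worker", Point(boot, t0, 1_000_000_000))
stable.store("worker", Point(boot, t1, 7_000_000_000))
print(stable.cpu_rate("worker"))  # 100000000.0

# A node whose reported start time is always 46 s before the sample looks
# like it restarted between scrapes, so no rate is ever available.
drifting = NodeStore()
drifting.store("cp", Point(t0 - timedelta(seconds=46), t0, 1_000_000_000))
drifting.store("cp", Point(t1 - timedelta(seconds=46), t1, 7_000_000_000))
print(drifting.cpu_rate("cp"))  # None
```

Under these assumptions, a kubelet that keeps reporting a recent node start time would make the node appear freshly started on many scrapes, which matches the `<unknown>` rows even though every request returned 200.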

@yangjunmyfm192085
Contributor

@techstep, thanks for your feedback. I opened issue 104445 with SIG Node to track it.
If you have any other information, please add it there.

@serathius
Contributor

FYI, we don't support bitnami images, as we don't even know which MS version they use or whether they make any code changes.

Please confirm whether I understood the problem: the kubelet reports an invalid node start time for control plane nodes, which results in MS sometimes not reporting node metrics for those nodes?

@yangjunmyfm192085
Contributor

I agree with @serathius. The reason is that the kubelet reports an invalid node start time for control plane nodes, which results in MS sometimes not reporting node metrics for those nodes.

@serathius
Contributor

This means that v0.5.0 should not use Kubelet start time. I think we should fix this and release v0.5.1. @yangjunmyfm192085 what do you think?
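
As an illustration of what not relying on the kubelet start time could look like, the sketch below derives a CPU rate from two cumulative samples alone and treats a decreasing counter as the restart signal. This is only one possible approach under stated assumptions, not necessarily what v0.5.1 does; the function name `cpu_rate_nanocores` and the tuple layout are hypothetical.

```python
def cpu_rate_nanocores(prev, last):
    """Compute CPU usage from two (timestamp_seconds, cumulative_cpu_ns) samples.

    Returns None when the samples cannot be compared.
    """
    (t_prev, cpu_prev), (t_last, cpu_last) = prev, last
    if t_last <= t_prev:
        return None  # no positive time window between the samples
    if cpu_last < cpu_prev:
        return None  # counter went backwards: treat as a restart
    return (cpu_last - cpu_prev) / (t_last - t_prev)


print(cpu_rate_nanocores((0, 1_000_000_000), (60, 7_000_000_000)))  # 100000000.0
print(cpu_rate_nanocores((0, 7_000_000_000), (60, 1_000_000_000)))  # None
```

The trade-off in this sketch is that a restart is only noticed when the cumulative counter actually drops, but a bogus start time can no longer discard otherwise valid samples.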

@serathius
Contributor

ping @yangjunmyfm192085

@yangjunmyfm192085
Contributor

ping @yangjunmyfm192085

OK, let me prepare it.

@serathius
Contributor

The fix was implemented and released in v0.5.1.

@serathius
Contributor

@techstep Please confirm if that fixes the issue for you.
