Currently the options to monitor Cluster Autoscaler are limited to logs, the status configmap and events. While useful for debugging, none of these options is particularly practical for monitoring Cluster Autoscaler behaviour over time. This document describes a set of metrics (in Prometheus format) that will be added in Cluster Autoscaler 0.6 to allow better monitoring of its operations.
Cluster Autoscaler already has a metrics endpoint providing some basic metrics. This includes default process metrics (number of goroutines, GC duration, CPU and memory details, etc.) as well as some custom metrics related to time taken by various parts of the Cluster Autoscaler main loop. Existing metrics will be renamed and refactored, and new metrics will be added.
Cluster Autoscaler needs to monitor some aspects of cluster state as part of its normal operations. It can therefore provide metrics describing its own operations as well as some metrics related to the general state of the cluster.
All the metrics are prefixed with `cluster_autoscaler_`.
Metric name | Metric type | Labels | Description |
---|---|---|---|
`cluster_safe_to_autoscale` | Gauge | | Whether or not cluster is healthy enough for autoscaling. 1 if it is, 0 otherwise. |
`nodes_count` | Gauge | `state=<node-state>` | Number of nodes in cluster. |
`unschedulable_pods_count` | Gauge | | Number of unschedulable ("Pending") pods in the cluster. |
`node_groups_count` | Gauge | `node_group_type=<node-group-type>` | Number of node groups managed by CA. |
`max_nodes_count` | Gauge | | Maximum number of nodes in all node groups. |
`cluster_cpu_current_cores` | Gauge | | Current number of cores in the cluster, minus deleting nodes. |
`cpu_limits_cores` | Gauge | `direction=<minimum or maximum>` | Minimum and maximum number of cores in the cluster. |
`cluster_memory_current_bytes` | Gauge | | Current number of bytes of memory in the cluster, minus deleting nodes. |
`memory_limits_bytes` | Gauge | `direction=<minimum or maximum>` | Minimum and maximum number of bytes of memory in cluster. |
- `cluster_safe_to_autoscale` indicates whether cluster is healthy enough for autoscaling. CA stops all operations if a significant number of nodes are unready (by default 33% as of CA 0.5.4).
- `nodes_count` records the total number of nodes, labeled by node state. Possible states are `ready`, `unready`, `notStarted`.
- `node_groups_count` records the number of currently managed node groups. It's useful when using dynamic configuration or Node Autoprovisioning. Types of node group are `autoscaled` (managed by CA but not created by NAP) and `autoprovisioned` (created by NAP and managed by CA).
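The metric names and prefix above can be illustrated with a small sketch of the Prometheus text exposition format that a scrape of the metrics endpoint would return. The sample values below are hypothetical; only the names and labels come from the tables in this document.

```python
# Sketch: rendering the cluster state gauges in Prometheus text
# exposition format. Sample values are hypothetical.

PREFIX = "cluster_autoscaler_"

def render_gauge(name, value, labels=None):
    """Render one gauge sample in Prometheus text exposition format."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return f"{PREFIX}{name}{label_str} {value}"

lines = [
    render_gauge("cluster_safe_to_autoscale", 1),
    render_gauge("nodes_count", 9, {"state": "ready"}),
    render_gauge("nodes_count", 1, {"state": "unready"}),
    render_gauge("node_groups_count", 2, {"node_group_type": "autoscaled"}),
]
# e.g. lines[1] == 'cluster_autoscaler_nodes_count{state="ready"} 9'
```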
These metrics are refactored from the currently existing metrics and track execution of various parts of the Cluster Autoscaler loop.
Metric name | Metric type | Labels | Description |
---|---|---|---|
`last_activity` | Gauge | `activity=<autoscaler-activity>` | Last time certain part of CA logic executed. |
`function_duration_seconds` | Histogram | `function=<autoscaler-function>` | Time taken by various parts of CA main loop. |
- `last_activity` records the last time a certain part of Cluster Autoscaler logic executed, represented as a Unix timestamp. autoscaler-activity values are:
  - `main` - main loop iteration started.
  - `autoscaling` - current state of the cluster has been updated, autoscaling logic started.
  - `scaleUp` - autoscaler will check if scale up is necessary.
  - `scaleDown` - autoscaler will try to scale down some nodes.

  Not all of the above have to happen in every loop. For example, if CA adds a new node it will completely skip scale down logic in that loop.
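Since `last_activity` is a Unix timestamp, a monitoring system can compare it against the current time to detect a stalled autoscaler. A minimal sketch of that check follows; the 15-minute threshold is an illustrative choice, not something this document prescribes.

```python
import time

# Sketch: flag a CA activity as stale if its last_activity timestamp
# is older than a chosen threshold. The threshold is an assumption
# made for illustration only.
STALE_AFTER_SECONDS = 15 * 60

def is_stale(last_activity_ts, now=None):
    """Return True if the activity has not run within the threshold."""
    now = time.time() if now is None else now
    return (now - last_activity_ts) > STALE_AFTER_SECONDS
```

In practice this comparison would typically be expressed as an alerting query against the scraped metric rather than in application code.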
- `function_duration_seconds` summarizes time taken by different functions executed in the main CA goroutine. Uses the following set of values for autoscaler-function:
  - `main` - duration of the whole iteration of the main loop.
  - `updateClusterState` - time used by CA to get node status from the API server and update internal data structures.
  - `scaleUp` - time used to check if new nodes are necessary and add them.
  - `findUnneeded` - time required to find nodes that are candidates for removal.
  - `scaleDown` - time required to verify that unneeded nodes are really unnecessary and remove them.
New labels may be added to both `last_activity` and `function_duration_seconds` if we add more features or additional logic to Cluster Autoscaler.
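Because `function_duration_seconds` is a Prometheus Histogram, each observation lands in cumulative buckets. The sketch below shows the standard cumulative-bucket semantics; the bucket boundaries here are illustrative assumptions, not the ones Cluster Autoscaler actually configures.

```python
# Sketch: Prometheus Histogram semantics for a duration metric.
# Bucket upper bounds are illustrative; +Inf is the final bucket.
BUCKETS = [0.1, 0.5, 1.0, 5.0, 10.0]

def observe(counts, duration):
    """Increment every cumulative bucket whose upper bound covers
    the observed duration; +Inf counts every observation."""
    for i, le in enumerate(BUCKETS):
        if duration <= le:
            counts[i] += 1
    counts[-1] += 1  # +Inf bucket
    return counts

counts = [0] * (len(BUCKETS) + 1)
for d in [0.05, 0.7, 3.2]:
    observe(counts, d)
# counts == [1, 1, 2, 3, 3, 3]: each bucket includes all smaller ones
```

This cumulative layout is what lets Prometheus estimate latency quantiles from the buckets at query time.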
These metrics describe internal state and actions taken by Cluster Autoscaler.
Metric name | Metric type | Labels | Description |
---|---|---|---|
`errors_total` | Counter | `type=<error-type>` | The number of CA loops failed due to an error. |
`scaled_up_nodes_total` | Counter | | Number of nodes added by CA. |
`scaled_down_nodes_total` | Counter | `reason=<scale-down-reason>` | Number of nodes removed by CA. |
`scaled_up_gpu_nodes_total` | Counter | `gpu_name=<gpu-name>` | Number of GPU-enabled nodes added by CA. |
`scaled_down_gpu_nodes_total` | Counter | `reason=<scale-down-reason>`, `gpu_name=<gpu-name>` | Number of GPU-enabled nodes removed by CA. |
`failed_scale_ups_total` | Counter | `reason=<failure-reason>` | Number of times scale-up operation has failed. |
`evicted_pods_total` | Counter | | Number of pods evicted by CA. |
`unneeded_nodes_count` | Gauge | | Number of nodes currently considered unneeded by CA. |
`old_unregistered_nodes_removed_count` | Counter | | Number of unregistered nodes removed by CA. |
`skipped_scale_events_count` | Counter | `direction=<scaling-direction>`, `reason=<skipped-scale-reason>` | Number of times scaling has been skipped due to a resource limit being reached, or similar event. |
- `errors_total` counter increases every time the main CA loop encounters an error. A growing `errors_total` count signifies an internal error in CA or a problem with the underlying infrastructure preventing normal CA operation. Example errors include:
  - failed to get the list of nodes or pods from the API server,
  - failed to retrieve node group size from the cloud provider,
  - failed to update node group size,
  - error in CA simulations,
  - error updating internal data structures.

  Not every condition causing CA to skip part of the main loop is an error in this context. For example, having pending pods after reaching maximum cluster size causes the ScaleUp function to return an error, resulting in CA skipping the rest of the main loop. This is expected behaviour of CA and should not be counted as an error.

  Possible error types are:
  - `cloudProviderError` - failed to get node group info, set node group size, etc.
  - `apiCallError` - error related to a call to the k8s API server (e.g. get nodes, get pods).
  - `internalError` - error in any other part of CA logic.
- `scaled_up_nodes_total` counts the number of nodes successfully added by CA. In this context we consider a node successfully added after updating the node group size (without waiting for the actual VM to spin up, run a kubelet, etc.).
- `failed_scale_ups_total` counts the number of unsuccessful scale-up operations performed by CA. This includes both getting an error from the cloud provider and new nodes failing to boot up and register within the timeout. It does not include reaching maximum cluster size (as CA doesn't attempt scale-up at all in that case).
- `scaled_down_nodes_total` counts the number of nodes removed by CA. Possible scale down reasons are `empty`, `underutilized`, `unready`.
- `scaled_up_gpu_nodes_total` counts the number of GPU-enabled nodes successfully added by CA, similar to `scaled_up_nodes_total`. Additionally, `gpu_name` specifies the name of the GPU (e.g. nvidia-tesla-k80).
- `scaled_down_gpu_nodes_total` counts the number of GPU-enabled nodes removed by CA. Scale down reasons are identical to `scaled_down_nodes_total`, and `gpu_name` to `scaled_up_gpu_nodes_total`.
- `skipped_scale_events_count` counts the number of times that the autoscaler has declined to scale a node group because of a resource limit being reached or a similar internal event. Scale direction can be either `up` or `down`, and the reason explains why the scaling was skipped (e.g. `CPULimitReached`, `MemoryLimitReached`). This differs from failed scaling events in that the autoscaler is choosing not to perform a scaling action.
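Counters like `errors_total` or `scaled_up_nodes_total` only ever increase, so monitoring usually looks at their increase over a time window (what PromQL's `rate()` and `increase()` compute). A minimal reset-aware delta between two samples can be sketched as follows; the reset handling mirrors how Prometheus treats a counter that drops after a process restart.

```python
# Sketch: increase between two samples of a monotonic counter.
# If the current value is lower than the previous one, the process
# restarted and the counter reset, so we count from zero again.

def counter_increase(prev, curr):
    """Return the increase between two counter samples, treating a
    drop in value as a counter reset."""
    return curr - prev if curr >= prev else curr
```

For example, `counter_increase(10, 14)` yields 4, while `counter_increase(14, 3)` yields 3, attributing the post-reset value entirely to the new process.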
These metrics describe operations and state related to the Node Autoprovisioning feature.
Metric name | Metric type | Labels | Description |
---|---|---|---|
`nap_enabled` | Gauge | | Whether or not Node Autoprovisioning is enabled. 1 if it is, 0 otherwise. |
`created_node_groups_total` | Counter | | Number of node groups created by Node Autoprovisioning. |
`deleted_node_groups_total` | Counter | | Number of node groups deleted by Node Autoprovisioning. |