Add Automated Dashboard for Kubernetes Node metrics #64

Merged: 1 commit, Apr 8, 2024
25 changes: 25 additions & 0 deletions dashboards/kubernetes/README.md
@@ -0,0 +1,25 @@
# Kubernetes automated dashboard

There is currently one automated dashboard for AppSignal for Kubernetes:

- [Nodes](#nodes)

## Nodes

The Nodes dashboard uses Kubernetes node metrics extracted from the `/api/v1/nodes/<NODE>/proxy/stats/summary` API endpoint via [AppSignal for Kubernetes](https://github.com/appsignal/appsignal-kubernetes).
The following metrics are reported through this automated dashboard:

- node_cpu_usage_nano_cores
- node_memory_usage_bytes
- node_memory_available_bytes
- node_swap_available_bytes
- node_swap_usage_bytes
- node_fs_available_bytes
- node_fs_used_bytes
- node_network_rx_bytes
- node_network_tx_bytes
- node_fs_inodes
- node_fs_inodes_free
- node_fs_inodes_used
- node_rlimit_maxpid
- node_rlimit_curproc
300 changes: 300 additions & 0 deletions dashboards/kubernetes/node.json
@@ -0,0 +1,300 @@
{
"metric_keys": [
"node_cpu_usage_nano_cores"
],
"dashboard": {
"title": "Kubernetes Nodes",
"description": "",
"visuals": [
{
"title": "Node CPU Usage",
"description": "node_cpu_usage_nano_cores",
Member

These descriptions don't explain anything to me. Let's remove them if it's just for testing.

Contributor

Or if we can do human-readable descriptions, let's do those instead!

Member Author

These are the internal metric names, so they'll probably make sense to Kubernetes users.

Member @tombruijn, Apr 10, 2024

I can't find the names of these metrics anywhere, so I wouldn't be so sure that it's clear. In the same way, I can't confirm that you're using the Kubernetes metric names, as you say here. In the library I see them being mapped from a JSON struct that doesn't use the same naming: https://github.com/appsignal/appsignal-kubernetes/blob/0b3f39d65ba99622ab3e647e8e4c012ee944baca/src/main.rs#L97-L167

Do you have any links to docs or source code that mentions these metric names?
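
For context on the naming question, the sketch below shows one way names like these could be derived from a kubelet summary response: flatten the `node` object and convert camelCase field names to snake_case. This is an assumption for illustration, not the mapping used in `appsignal-kubernetes`. The helpers `to_snake` and `flatten_node` and the sample values are hypothetical, and a plain flattening may not reproduce every name in the README list.

```python
import re


def to_snake(name: str) -> str:
    """Convert a camelCase field name such as 'usageNanoCores' to snake_case."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def flatten_node(stats: dict, prefix: str = "node") -> dict:
    """Flatten nested numeric fields into {metric_name: value} pairs."""
    metrics = {}
    for key, value in stats.items():
        name = f"{prefix}_{to_snake(key)}"
        if isinstance(value, dict):
            # Recurse into nested objects such as "cpu" or "memory".
            metrics.update(flatten_node(value, name))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            metrics[name] = value
        # Non-numeric fields (for example the node name) are skipped.
    return metrics


# Hypothetical, heavily trimmed sample of the "node" object in a summary response.
sample_node = {
    "nodeName": "node-1",
    "cpu": {"usageNanoCores": 250000000},
    "memory": {"usageBytes": 1024, "availableBytes": 2048},
    "fs": {"inodes": 100, "inodesFree": 60, "inodesUsed": 40},
}

metrics = flatten_node(sample_node)
```

Under this assumed scheme, `cpu.usageNanoCores` becomes `node_cpu_usage_nano_cores`, which matches the first metric key in `node.json`.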

"line_label": "%name% %node%",
Member

We shouldn't include the metric name unless there are multiple metrics in a graph, and for those graphs we're better off using tags on one metric. See my other comment.

Suggested change
- "line_label": "%name% %node%",
+ "line_label": "%node%",

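To illustrate the effect of the suggested change, here is a minimal sketch of how a `line_label` template might expand, assuming `%name%` is replaced by the metric name and any other `%key%` placeholder by the matching tag value. The `render_label` helper is hypothetical and is not AppSignal's rendering code.

```python
import re


def render_label(template: str, metric_name: str, tags: dict) -> str:
    """Replace %name% with the metric name and %<tag>% with the tag value."""
    values = {"name": metric_name, **tags}
    # Placeholders without a known value are left untouched.
    return re.sub(
        r"%(\w+)%",
        lambda m: str(values.get(m.group(1), m.group(0))),
        template,
    )


before = render_label("%name% %node%", "node_cpu_usage_nano_cores", {"node": "node-1"})
after = render_label("%node%", "node_cpu_usage_nano_cores", {"node": "node-1"})
```

With a single metric per graph, the shorter template yields only the node name, so the metric name is not repeated on every line of the hover box.
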
"display": "LINE",
"format": "number",
"draw_null_as_zero": true,
"metrics": [
{
"name": "node_cpu_usage_nano_cores",
"fields": [
{
"field": "GAUGE"
}
],
"tags": [
{
"key": "node",
"value": "*"
}
]
}
],
"type": "timeseries"
},
{
"title": "Node Memory Usage",
"description": "node_memory_usage_bytes vs node_memory_available_bytes",
"line_label": "%name% %node%",
"display": "LINE",
"format": "size",
"format_input": "byte",
"draw_null_as_zero": true,
"metrics": [
{
"name": "node_memory_usage_bytes",
"fields": [
{
"field": "GAUGE"
}
],
"tags": [
{
"key": "node",
"value": "*"
}
]
},
{
"name": "node_memory_available_bytes",
"fields": [
{
"field": "GAUGE"
}
],
"tags": [
{
"key": "node",
"value": "*"
}
]
}
],
"type": "timeseries"
},
{
"title": "Node Swap",
"description": "node_swap_usage_bytes vs. node_swap_available_bytes",
"line_label": "%name% %node%",
"display": "LINE",
"format": "size",
"format_input": "byte",
"draw_null_as_zero": true,
"metrics": [
{
"name": "node_swap_available_bytes",
"fields": [
{
"field": "GAUGE"
}
],
"tags": [
{
"key": "node",
"value": "*"
}
]
},
{
"name": "node_swap_usage_bytes",
"fields": [
{
"field": "GAUGE"
}
],
"tags": [
{
"key": "node",
"value": "*"
}
]
}
],
"type": "timeseries"
},
{
"title": "Node File System Usage",
"description": "node_fs_used_bytes vs node_fs_available_bytes",
"line_label": "%name% %node%",
"display": "LINE",
"format": "size",
"format_input": "byte",
"draw_null_as_zero": true,
"metrics": [
{
"name": "node_fs_available_bytes",
"fields": [
{
"field": "GAUGE"
}
],
"tags": [
{
"key": "node",
"value": "*"
}
]
},
{
"name": "node_fs_used_bytes",
"fields": [
{
"field": "GAUGE"
}
],
"tags": [
{
"key": "node",
"value": "*"
}
]
}
],
"type": "timeseries"
},
{
"title": "Node Network Traffic Received",
"description": "node_network_rx_bytes",
"line_label": "%name% %node%",
"display": "LINE",
"format": "size",
"format_input": "byte",
"draw_null_as_zero": true,
"metrics": [
{
"name": "node_network_rx_bytes",
"fields": [
{
"field": "GAUGE"
}
],
"tags": [
{
"key": "node",
"value": "*"
}
]
}
],
"type": "timeseries"
},
{
"title": "Node Network Traffic Transmitted",
"description": "node_network_tx_bytes",
"line_label": "%name% %node%",
"display": "LINE",
"format": "size",
"format_input": "byte",
"draw_null_as_zero": true,
"metrics": [
{
"name": "node_network_tx_bytes",
"fields": [
{
"field": "GAUGE"
}
],
"tags": [
{
"key": "node",
"value": "*"
}
]
}
],
"type": "timeseries"
},
{
"title": "Node Inodes Usage",
"description": "node_fs_inodes_free & node_fs_inodes_used vs node_fs_inodes",
"line_label": "%name% %node%",
"display": "LINE",
"format": "number",
"draw_null_as_zero": true,
"metrics": [
{
"name": "node_fs_inodes",
"fields": [
{
"field": "GAUGE"
}
],
"tags": [
{
"key": "node",
"value": "*"
}
]
},
{
"name": "node_fs_inodes_free",
"fields": [
{
"field": "GAUGE"
}
],
"tags": [
{
"key": "node",
"value": "*"
}
]
},
{
"name": "node_fs_inodes_used",
Comment on lines +215 to +243
Member

Let's use tags for different states like free and used instead of reporting them as different metrics. We do this for other (host) metrics too. That way we don't have to show the full metric name for every line in the graph, which frees up valuable space in the hover box.

For example:

  • Metric name: node_fs_inodes
  • Tags:
    • state
      • Values:
        • free
        • used

I also see this in some other graphs in this dashboard. We should update those as well.
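
As a sketch of what this suggestion could look like in the dashboard definition, the snippet below collapses the per-state metrics of one visual into a single metric with a `state` tag. The `state` tag and its wildcard `"*"` value are assumptions modeled on the existing `node` tag, and the helper `merge_states` is hypothetical. The integration would also need to report the metric with that tag for this to work.

```python
import json


def merge_states(visual: dict, base_name: str) -> dict:
    """Return a copy of the visual with one metric tagged by node and state."""
    merged = dict(visual)  # shallow copy; the original visual is not modified
    merged["line_label"] = "%node% %state%"
    merged["metrics"] = [
        {
            "name": base_name,
            "fields": [{"field": "GAUGE"}],
            "tags": [
                {"key": "node", "value": "*"},
                # Assumed tag; values would be states such as "free" and "used".
                {"key": "state", "value": "*"},
            ],
        }
    ]
    return merged


# Trimmed version of the "Node Inodes Usage" visual from node.json.
inodes_visual = {
    "title": "Node Inodes Usage",
    "line_label": "%name% %node%",
    "metrics": [
        {"name": "node_fs_inodes_free"},
        {"name": "node_fs_inodes_used"},
    ],
}

merged = merge_states(inodes_visual, "node_fs_inodes")
print(json.dumps(merged, indent=2))
```

The same transformation would apply to the memory, swap, and file system visuals, which also plot one metric per state.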

Member Author

I'm trying to keep this dashboard as close to what's reported from Kubernetes as I can. I think this is a good idea for the future, when we know what we'd like to report exactly, but let's get this out of the door and get users to try it first.

"fields": [
{
"field": "GAUGE"
}
],
"tags": [
{
"key": "node",
"value": "*"
}
]
}
],
"type": "timeseries"
},
{
"title": "Node Resource Limits",
"description": "node_rlimit_curproc vs node_rlimit_maxpid",
"line_label": "%name% %node%",
"display": "LINE",
"format": "number",
"draw_null_as_zero": true,
"metrics": [
{
"name": "node_rlimit_maxpid",
"fields": [
{
"field": "GAUGE"
}
],
"tags": [
{
"key": "node",
"value": "*"
}
]
},
{
"name": "node_rlimit_curproc",
"fields": [
{
"field": "GAUGE"
}
],
"tags": [
{
"key": "node",
"value": "*"
}
]
}
],
"type": "timeseries"
}
]
}
}