docs: add documentation on metrics/dashboards for apps (#1221)
## Description

Adds/modifies a docs page detailing how metrics can be collected for an
additional application (outside core) and how to add additional
dashboards.

This doc is partially being added in response to a support thread
requesting info on how to group dashboards within uds core when adding
new ones. Related to that grouping setup, I also added support for
annotations on the loki dashboard configmap.

I also moved some of the information about our monitoring setup, and why
we made the choices we did, into a new dev doc. While it still has
valuable information, it's not as important for the end user to
know/understand unless they want to dig into why we chose our approach
(in which case the dev docs are a good place to look).

## Related Issue

N/A

## Type of change

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Other (security config, docs update, etc)

## Steps to Validate

If you want to test the grouping, that can be done by making a new
bundle (or modifying the standard bundle locally) and confirming that
all the core dashboards end up grouped together.

## Checklist before merging

- [x] Tests, docs, ADR added or updated as needed
- [x] [Contributor Guide](https://github.com/defenseunicorns/uds-template-capability/blob/main/CONTRIBUTING.md) followed
mjnagel authored Jan 23, 2025
1 parent 004e8b4 commit d9062da
Showing 4 changed files with 120 additions and 26 deletions.
32 changes: 32 additions & 0 deletions docs/dev/monitoring-setup.md
@@ -0,0 +1,32 @@
# UDS Core Metrics Scraping Setup

UDS Core leverages Pepr to handle setup of Prometheus metrics scraping, with the particular configuration necessary to work in a STRICT mTLS (Istio) environment. We handle this via a default scrapeClass in Prometheus that adds the Istio certs. When a monitor needs to be exempt from that tlsConfig, a mutation is performed to switch it to a plain scrape class without the Istio certs.

> [!NOTE]
> The setup described below is the current design, built to handle the complexities of Istio sidecars with metrics. With the ongoing work to move to Istio ambient mode, this setup should be significantly simplified.

## TLS Configuration Setup

Generally it is beneficial to use service and pod monitor resources from existing helm charts where possible, as these may have more advanced configuration and options. The UDS monitoring setup ensures that all monitoring resources use a default [`scrapeClass`](https://github.com/prometheus-operator/prometheus-operator/blob/v0.75.1/Documentation/api.md#monitoring.coreos.com/v1.ScrapeClass) configured in Prometheus to handle the necessary `tlsConfig` setup for metrics to work in STRICT Istio mTLS environments (the `scheme` is also mutated to `https` on individual monitor endpoints; see [this doc](https://istio.io/latest/docs/ops/integrations/prometheus/#tls-settings) for details). This setup is the default configuration, but individual monitors can opt out of it in three different ways (sketched in the example after this list):

1. If the service or pod monitor targets namespaces that are not Istio injected (ex: `kube-system`), Pepr will detect this and mutate these monitors to use an `exempt` scrape class that does not have the Istio certs. Assumptions are made about STRICT mTLS here for simplicity, based on the `istio-injection` namespace label. Without making these assumptions we would need to query `PeerAuthentication` resources or another resource to determine the exact workload mTLS posture.
1. Individual monitors can explicitly set the `exempt` scrape class to opt out of the Istio certificate configuration. This should typically only be done if your service exposes metrics on a PERMISSIVE mTLS port.
1. If setting a `scrapeClass` is not an option due to lack of configuration in a helm chart, or for other reasons, monitors can use the `uds/skip-mutate` annotation (with any value) to have Pepr mutate the `exempt` scrape class onto the monitor.
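
As an illustration, below is a minimal sketch of this setup under stated assumptions: the scrape class names (`istio-certs` as the default, `exempt` for opt-outs) and the `/etc/prom-certs` paths follow the Istio Prometheus TLS guidance linked above, and the monitors shown are hypothetical examples of options 2 and 3 rather than UDS Core's literal configuration:

```yaml
# Sketch of a default scrape class on the Prometheus CR, injecting the Istio
# sidecar certs (cert paths per the Istio Prometheus TLS guidance linked above)
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  scrapeClasses:
    - name: istio-certs # assumed name for the default class
      default: true
      tlsConfig:
        caFile: /etc/prom-certs/root-cert.pem
        certFile: /etc/prom-certs/cert-chain.pem
        keyFile: /etc/prom-certs/key.pem
        insecureSkipVerify: true # Istio handles its own certificate validation
    - name: exempt # plain class with no Istio certs
---
# Option 2: a hypothetical monitor explicitly opting out via the exempt class
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: example
spec:
  scrapeClass: exempt
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics
---
# Option 3: the annotation asks Pepr to mutate the exempt class onto the monitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app-skip
  namespace: example
  annotations:
    uds/skip-mutate: "true"
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics
```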

> [!NOTE]
> There is deprecated functionality in Pepr that mutates `tlsConfig` onto individual service monitors rather than using the scrape class approach. This has been kept in the current code temporarily to prevent any metrics downtime during the switch to `scrapeClass`. In a future release this behavior will be removed to reduce the complexity of the setup and required mutations.

## Notes on Alternative Approaches

While designing this feature for the `ServiceMonitor` use case, a few alternative approaches were considered but not chosen due to issues with each one. The current spec provides the best balance of a simplified interface (compared to the full `ServiceMonitor` spec) and a faster/easier reconciliation loop.

### Generation based on service lookup

An alternative spec option would use the service name instead of selectors/port name. The service name could then be used to look up the corresponding service and get the necessary selectors/port name (based on the numerical port). There are, however, two issues with this route:

1. There is a timing issue if the `Package` CR is applied to the cluster before the app chart itself (which is the norm with our UDS Packages): the service would not exist at the time the `Package` is reconciled. We could mitigate this by leaning into eventual consistency and implementing a retry mechanism for the `Package`.
2. We would need an "alert" mechanism (watch) to notify us when the service(s) are updated, to roll the corresponding updates to network policies and service monitors. While this is doable it feels like unnecessary complexity compared to other options.

### Generation of service + monitor

Another alternative approach would be to use a pod selector and port only. We would then generate both a service and a service monitor, giving us full control of the port names and selectors. This seems like a viable path, but it does add an extra resource for us to generate and manage, and generated services could have unknown side effects by clashing with other services (particularly with Istio endpoints). This would otherwise be a relatively straightforward approach and is worth evaluating again if we want to simplify the spec later on.
109 changes: 84 additions & 25 deletions docs/reference/configuration/uds-monitoring-metrics.md
@@ -2,23 +2,11 @@
title: Monitoring and Metrics
---

UDS Core deploys Prometheus and Grafana to provide metrics collection and dashboarding. Out of the box all applications in UDS Core will have their metrics collected by Prometheus, with some default dashboards present in Grafana for viewing this data. This document primarily focuses on the integrations and options provided for extending this to monitor any additional applications you would like to deploy.

## Capturing Metrics

There are a few options within UDS Core to collect metrics from your application. Since the Prometheus Operator is deployed, we recommend using the `ServiceMonitor` and/or `PodMonitor` custom resources to capture metrics. These resources are commonly supported in application helm charts and should be used if available. UDS Core also supports generating these resources from the `monitor` list in the `Package` spec, since charts do not always support monitors. This also provides a simplified way for other users to create monitors, similar to the way `VirtualServices` are generated with the `Package` CR. A full example of this can be seen below:

```yaml
...
type: "Bearer"
```
This config is used to generate service or pod monitors and corresponding network policies to set up scraping for your applications.
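
Since the diff above collapses most of the example, here is an illustrative sketch of a `Package` with a `monitor` entry. The structure follows the `monitor` list described in this doc, but the application name, port values, and secret are assumptions for illustration:

```yaml
apiVersion: uds.dev/v1alpha1
kind: Package
metadata:
  name: example-app # hypothetical application
  namespace: example
spec:
  monitor:
    # Generates a ServiceMonitor (and the network policies needed for scraping)
    - selector: # labels on the Service that exposes metrics
        app: example-app
      portName: metrics # name of the metrics port on that Service
      targetPort: 3000 # numerical port the pod listens on
      path: "/metrics"
      kind: ServiceMonitor
      # Optional: authorization if the endpoint requires a bearer token
      authorization:
        type: "Bearer"
        credentials:
          name: "example-metrics-secret" # hypothetical secret
          key: "token"
          optional: false
```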
Due to UDS Core using STRICT Istio mTLS across the cluster, Prometheus is also configured by default to properly scrape metrics over STRICT mTLS. This is done primarily by leveraging a default [`scrapeClass`](https://github.com/prometheus-operator/prometheus-operator/blob/v0.75.1/Documentation/api.md#monitoring.coreos.com/v1.ScrapeClass) which provides the correct TLS configuration and certificates to make mTLS connections. The default configuration works in most scenarios since the operator will attempt to auto-detect needs based on the istio-injection status of each namespace. If this configuration does not work (the main case where this may be an issue is metrics exposed on a PERMISSIVE mTLS port), there are two options for manually opting out of the Istio TLS configuration:

1. Individual monitors can explicitly set the `exempt` scrape class to opt out of the Istio certificate configuration.
1. If setting a `scrapeClass` is not an option due to lack of configuration in a helm chart, or for other reasons, monitors can set the `uds/skip-mutate` annotation (with any value) to have Pepr mutate the `exempt` scrape class onto the monitor.

This spec intentionally does not support all options available with a `PodMonitor` or `ServiceMonitor`. While we may add additional fields in the future, we do not want to simply rebuild these specs since we are handling the complexities of Istio mTLS metrics. The current subset of spec options is based on the common needs seen in most environments.

## Adding Dashboards

Grafana within UDS Core is configured with [a sidecar](https://github.com/grafana/helm-charts/blob/6eecb003569dc41a494d21893b8ecb3e8a9741a0/charts/grafana/values.yaml#L926-L928) that watches for new dashboards added via configmaps or secrets and loads them into Grafana dynamically. In order to have your dashboard added, the configmap or secret must be labelled with `grafana_dashboard: "1"`, which the sidecar uses to watch for and collect new dashboards.

Your configmap/secret must have a data key named `<dashboard_file_name>.json`, with a multi-line string of the dashboard JSON as the value. See the below example for app dashboards created this way:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-app-dashboards
  namespace: my-app
  labels:
    grafana_dashboard: "1"
data:
  # The value for this key should be your full JSON dashboard
  my-dashboard.json: |
    {
      "annotations": {
        "list": [
          {
            "builtIn": 1,
    ...
  # Helm's Files functions can also be useful if deploying in a helm chart: https://helm.sh/docs/chart_template_guide/accessing_files/
  my-dashboard-from-file.json: |
    {{ .Files.Get "dashboards/my-dashboard-from-file.json" | nindent 4 }}
```

Grafana provides helpful documentation on [how to build dashboards](https://grafana.com/docs/grafana/latest/getting-started/build-first-dashboard/) via the UI, which can then be [exported as JSON](https://grafana.com/docs/grafana/latest/dashboards/share-dashboards-panels/#export-a-dashboard-as-json) so that they can be captured in code and loaded as shown above.

### Grouping Dashboards

Grafana supports creating folders to better organize dashboards. UDS Core does not utilize folders by default, but the sidecar supports simple values configuration to dynamically create and populate folders. The example overrides below show how to set this up and place the UDS Core default dashboards into a `uds-core` folder:

```yaml
- name: core
  repository: ghcr.io/defenseunicorns/packages/uds/core
  ref: x.x.x
  overrides:
    grafana:
      grafana:
        values:
          # This value allows us to specify a grafana_folder annotation to indicate the folder to place a given dashboard into
          - path: sidecar.dashboards.folderAnnotation
            value: grafana_folder
          # This value configures the sidecar to build out folders based upon where dashboard files are
          - path: sidecar.dashboards.provider.foldersFromFilesStructure
            value: true
    kube-prometheus-stack:
      kube-prometheus-stack:
        values:
          # This value adds an annotation to the default dashboards to specify that they should be grouped under a `uds-core` folder
          - path: grafana.sidecar.dashboards.annotations
            value:
              grafana_folder: "uds-core"
    loki:
      uds-loki-config:
        values:
          # This value adds an annotation to the loki dashboards to specify that they should be grouped under a `uds-core` folder
          - path: dashboardAnnotations
            value:
              grafana_folder: "uds-core"
```
Dashboards deployed outside of core can then be grouped separately by adding the `grafana_folder` annotation to your configmap or secret, with the desired folder name as the value. For example:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-app-dashboards
  namespace: my-app
  labels:
    grafana_dashboard: "1"
  annotations:
    # The value of this annotation determines the group that your dashboard will be under
    grafana_folder: "my-app"
data:
  # Your dashboard data here
```

:::note
If using this configuration, any dashboards without a `grafana_folder` annotation will still be loaded into Grafana, but will not be grouped (they will appear at the top level, outside of any folders). Also note that dashboards added to UDS Core in the future may need similar overrides to add the folder annotation; this example covers the current set of dashboards deployed by default.
:::
3 changes: 2 additions & 1 deletion src/loki/chart/templates/loki-dashboards.yaml
@@ -8,7 +8,8 @@ metadata:
  namespace: grafana
  labels:
    grafana_dashboard: "1"
  annotations:
    {{- toYaml .Values.dashboardAnnotations | nindent 4 }}
data:
  grafana-loki-general.json: |
    {{ .Files.Get "dashboards/loki-dashboard-quick-search.json" | nindent 4 }}
2 changes: 2 additions & 0 deletions src/loki/chart/values.yaml
@@ -7,3 +7,5 @@ storage:
remoteSelector: {}
remoteNamespace: ""
egressCidr: ""

dashboardAnnotations: {}
