Add a rake task to push database metrics to prometheus #2828

richardTowers · 2024-08-02T11:10:40Z

I often find myself wanting to know "how many published editions are there for this document type?", or "how many published editions do we have for each publishing app?", or things of that nature.

Currently these questions can be answered at a point in time by querying a database, but there's no easy way to see how the values change over time.

It's also not easy to compare numbers from different databases - for example, is the number of editions in publishing-api's database with publishing_app=whitehall the same of the number of editions in the same state in whitehall?

The idea here is to have each publishing app and publishing-api report a metric of "editions_in_database_total". Each publishing app should set the "database" label to be the app name, and set the other labels as consistently as possible with publishing-api.

We'll run this rake task as a cron job, and it will push metrics to prometheus via the pushgateway. The rake task also renders the metrics as text, to make debugging and testing easier.

The database query in publishing-api is fairly expensive (it has to do a large heap scan of editions, which takes around 30 seconds), so we shouldn't run the cron job more than around once an hour.

Explain analyse output for the query

Currently the output of the task looks like this:

Found 2103 combinations of labels
# TYPE editions_in_database_total gauge
# HELP editions_in_database_total Count of editions in various databases labeled by state, document_type etc.
editions_in_database_total{database="publishing-api",state="draft",content_store="draft",document_type="aaib_report",publishing_app="specialist-publisher",locale="en"} 34.0
editions_in_database_total{database="publishing-api",state="draft",content_store="draft",document_type="about",publishing_app="whitehall",locale="cy"} 1.0
...

2103 combinations of labels sounds like a lot, but Prometheus is fine with much larger cardinalities than this, at least according to this answer

This will allow us to produce a Grafana dashboard that does things like this:

⚠️ This repo is Continuously Deployed: make sure you follow the guidance ⚠️

I often find myself wanting to know "how many published editions are there for this document type?", or "how many published editions do we have for each publishing app?", or things of that nature. Currently these questions can be answered at a point in time by querying a database, but there's no easy way to see how the values change over time. It's also not easy to compare numbers from different databases - for example, is the number of editions in publishing-api's database with publishing_app=whitehall the same of the number of editions in the same state in whitehall? The idea here is to have each publishing app and publishing-api report a metric of "editions_in_database_total". Each publishing app should set the "database" label to be the app name, and set the other labels as consistently as possible with publishing-api. We'll run this rake task as a cron job, and it will push metrics to prometheus via the pushgateway. The rake task also renders the metrics as text, to make debugging and testing easier. The database query in publishing-api is fairly expensive (it has to do a large heap scan of editions, which takes around 30 seconds), so we shouldn't run the cron job more than around once an hour. [Explain analyse output for the query](https://www.pgexplain.dev/plan/466c33ba-3358-4b56-aaaa-8f18a396ec3d) Currently the output of the task looks like this: Found 2103 combinations of labels # TYPE editions_in_database_total gauge # HELP editions_in_database_total Count of editions in various databases labeled by state, document_type etc. editions_in_database_total{database="publishing-api",state="draft",content_store="draft",document_type="aaib_report",publishing_app="specialist-publisher",locale="en"} 34.0 editions_in_database_total{database="publishing-api",state="draft",content_store="draft",document_type="about",publishing_app="whitehall",locale="cy"} 1.0 ... 2103 combinations of labels sounds like a lot, but Prometheus is fine with much larger cardinalities than this, at least according to [this answer](https://stackoverflow.com/a/69167162)

To fix the error pushing metrics to the pushgateway

richardTowers · 2024-08-02T13:43:28Z

I deployed this to integration and ran the rake task - sure enough, the numbers have appeared in prometheus, and I can use them in Grafana ✨

ryanb-gds

Looks good - not sure if the error catch and throw was just in for debugging or not

ryanb-gds · 2024-08-02T15:24:57Z

lib/tasks/metrics.rake

+      ).add(prometheus_registry)
+    end
+  rescue StandardError => e
+    warn e.inspect


Do we need this? Or is there a concern we don't log errors from rake tasks?

First time I ran the rake task, it errored because I'd missed a require statement, and it didn't give me any clue what the problem was in the console logs. So I added this for debugging, saw what the error was, and then thought "we should probably log any errors we see anyway, so I'll leave the rescue in".

I was surprised rake doesn't do this by default though. Maybe there's some command line flag you need to give it for it to tell you about errors?

theseanything · 2024-08-05T08:55:44Z

What was the reasoning behind use push gateway rather than implementing a gauge that can be scraped at /metrics?

In alphagov/publishing-api#2828 we added a new editions_in_database_total metric, pushed to prometheus hourly from publishing-api (ideally eventually other databases too) This dashboard shows the current counts of editions in various states, document types and locales. We show the data on both linear and logarithmic bar charts. The counts are usually exponentially decreasing, so it can be hard to visualise them on a linear scale. On the other hand, people don't always understand logarithms, so it feels like we should have both. This is currently deployed as an experiment here: https://grafana.eks.production.govuk.digital/d/adtm5yljwh534f/richard-towersat-publishing-api-database-metrics

richardTowers · 2024-08-05T13:46:10Z

What was the reasoning behind use push gateway rather than implementing a gauge that can be scraped at /metrics?

Because it's coming from the database, it's not specific to an instance / pod, so we'd end up with multiple series, which we'd then have to merge using a reporting rule (or otherwise be very careful about things like sum()).

The other thing is that the SQL query is pretty slow (about 30 seconds), so we can't put it on the rendering path of a scrape response. We'd have to have some kind of background loop running the query on an interval and writing to the gauge, and it's not immediately obvious how best to do this.

The prometheus best practices guide on pushing metrics does its best to warn people off pushing metrics, but I think we're okay by this definition:

Usually, the only valid use case for the Pushgateway is for capturing the outcome of a service-level batch job. A "service-level" batch job is one which is not semantically related to a specific machine or job instance (for example, a batch job that deletes a number of users for an entire service). Such a job's metrics should not include a machine or instance label to decouple the lifecycle of specific machines or instances from the pushed metrics.

All that said, I'd be happy to re-implement it as a scrapable metric if we can think of a good pattern for that though.

richardTowers mentioned this pull request Aug 2, 2024

Report publishing-api metrics to prometheus hourly alphagov/govuk-helm-charts#2236

Merged

richardTowers added 2 commits August 2, 2024 13:30

Sort labels to ensure predictable output for tests

2ca1473

richardTowers force-pushed the push-database-metrics-to-prometheus branch from 00c7494 to 2ca1473 Compare August 2, 2024 12:30

richardTowers added 2 commits August 2, 2024 14:15

Log any errors encountered during the rake task

a7a3a2e

Require prometheus/client/push

3037457

To fix the error pushing metrics to the pushgateway

ryanb-gds approved these changes Aug 2, 2024

View reviewed changes

richardTowers merged commit bb8c9d8 into main Aug 2, 2024
12 checks passed

richardTowers deleted the push-database-metrics-to-prometheus branch August 2, 2024 15:38

richardTowers mentioned this pull request Aug 5, 2024

Add a dashboard showing publishing-api editions alphagov/govuk-helm-charts#2247

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a rake task to push database metrics to prometheus #2828

Add a rake task to push database metrics to prometheus #2828

richardTowers commented Aug 2, 2024 •

edited

Loading

richardTowers commented Aug 2, 2024

ryanb-gds left a comment

ryanb-gds Aug 2, 2024

richardTowers Aug 2, 2024

theseanything commented Aug 5, 2024

richardTowers commented Aug 5, 2024

Add a rake task to push database metrics to prometheus #2828

Add a rake task to push database metrics to prometheus #2828

Conversation

richardTowers commented Aug 2, 2024 • edited Loading

richardTowers commented Aug 2, 2024

ryanb-gds left a comment

Choose a reason for hiding this comment

ryanb-gds Aug 2, 2024

Choose a reason for hiding this comment

richardTowers Aug 2, 2024

Choose a reason for hiding this comment

theseanything commented Aug 5, 2024

richardTowers commented Aug 5, 2024

richardTowers commented Aug 2, 2024 •

edited

Loading