Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a rake task to push database metrics to prometheus #2828

Merged
merged 4 commits into from
Aug 2, 2024

Conversation

richardTowers
Copy link
Contributor

@richardTowers richardTowers commented Aug 2, 2024

I often find myself wanting to know "how many published editions are there for this document type?", or "how many published editions do we have for each publishing app?", or things of that nature.

Currently these questions can be answered at a point in time by querying a database, but there's no easy way to see how the values change over time.

It's also not easy to compare numbers from different databases - for example, is the number of editions in publishing-api's database with publishing_app=whitehall the same of the number of editions in the same state in whitehall?

The idea here is to have each publishing app and publishing-api report a metric of "editions_in_database_total". Each publishing app should set the "database" label to be the app name, and set the other labels as consistently as possible with publishing-api.

We'll run this rake task as a cron job, and it will push metrics to prometheus via the pushgateway. The rake task also renders the metrics as text, to make debugging and testing easier.

The database query in publishing-api is fairly expensive (it has to do a large heap scan of editions, which takes around 30 seconds), so we shouldn't run the cron job more than around once an hour.

Explain analyse output for the query

Currently the output of the task looks like this:

Found 2103 combinations of labels
# TYPE editions_in_database_total gauge
# HELP editions_in_database_total Count of editions in various databases labeled by state, document_type etc.
editions_in_database_total{database="publishing-api",state="draft",content_store="draft",document_type="aaib_report",publishing_app="specialist-publisher",locale="en"} 34.0
editions_in_database_total{database="publishing-api",state="draft",content_store="draft",document_type="about",publishing_app="whitehall",locale="cy"} 1.0
...

2103 combinations of labels sounds like a lot, but Prometheus is fine with much larger cardinalities than this, at least according to this answer

This will allow us to produce a Grafana dashboard that does things like this:

image

⚠️ This repo is Continuously Deployed: make sure you follow the guidance ⚠️

I often find myself wanting to know "how many published editions are
there for this document type?", or "how many published editions do we
have for each publishing app?", or things of that nature.

Currently these questions can be answered at a point in time by querying
a database, but there's no easy way to see how the values change over
time.

It's also not easy to compare numbers from different databases - for
example, is the number of editions in publishing-api's database with
publishing_app=whitehall the same of the number of editions in the same
state in whitehall?

The idea here is to have each publishing app and publishing-api report a
metric of "editions_in_database_total". Each publishing app should set
the "database" label to be the app name, and set the other labels as
consistently as possible with publishing-api.

We'll run this rake task as a cron job, and it will push metrics to
prometheus via the pushgateway. The rake task also renders the metrics
as text, to make debugging and testing easier.

The database query in publishing-api is fairly expensive (it has to do a
large heap scan of editions, which takes around 30 seconds), so we
shouldn't run the cron job more than around once an hour.

[Explain analyse output for the query](https://www.pgexplain.dev/plan/466c33ba-3358-4b56-aaaa-8f18a396ec3d)

Currently the output of the task looks like this:

    Found 2103 combinations of labels
    # TYPE editions_in_database_total gauge
    # HELP editions_in_database_total Count of editions in various databases labeled by state, document_type etc.
    editions_in_database_total{database="publishing-api",state="draft",content_store="draft",document_type="aaib_report",publishing_app="specialist-publisher",locale="en"} 34.0
    editions_in_database_total{database="publishing-api",state="draft",content_store="draft",document_type="about",publishing_app="whitehall",locale="cy"} 1.0
    ...

2103 combinations of labels sounds like a lot, but Prometheus is fine
with much larger cardinalities than this, at least according to
[this answer](https://stackoverflow.com/a/69167162)
@richardTowers richardTowers force-pushed the push-database-metrics-to-prometheus branch from 00c7494 to 2ca1473 Compare August 2, 2024 12:30
@richardTowers
Copy link
Contributor Author

I deployed this to integration and ran the rake task - sure enough, the numbers have appeared in prometheus, and I can use them in Grafana ✨

Copy link
Contributor

@ryanb-gds ryanb-gds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - not sure if the error catch and throw was just in for debugging or not

).add(prometheus_registry)
end
rescue StandardError => e
warn e.inspect
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? Or is there a concern we don't log errors from rake tasks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First time I ran the rake task, it errored because I'd missed a require statement, and it didn't give me any clue what the problem was in the console logs. So I added this for debugging, saw what the error was, and then thought "we should probably log any errors we see anyway, so I'll leave the rescue in".

I was surprised rake doesn't do this by default though. Maybe there's some command line flag you need to give it for it to tell you about errors?

@richardTowers richardTowers merged commit bb8c9d8 into main Aug 2, 2024
12 checks passed
@richardTowers richardTowers deleted the push-database-metrics-to-prometheus branch August 2, 2024 15:38
@theseanything
Copy link
Contributor

What was the reasoning behind use push gateway rather than implementing a gauge that can be scraped at /metrics?

richardTowers added a commit to alphagov/govuk-helm-charts that referenced this pull request Aug 5, 2024
In alphagov/publishing-api#2828 we added a new
editions_in_database_total metric, pushed to prometheus hourly from
publishing-api (ideally eventually other databases too)

This dashboard shows the current counts of editions in various states,
document types and locales.

We show the data on both linear and logarithmic bar charts. The counts
are usually exponentially decreasing, so it can be hard to visualise
them on a linear scale. On the other hand, people don't always
understand logarithms, so it feels like we should have both.

This is currently deployed as an experiment here: https://grafana.eks.production.govuk.digital/d/adtm5yljwh534f/richard-towersat-publishing-api-database-metrics
@richardTowers
Copy link
Contributor Author

What was the reasoning behind use push gateway rather than implementing a gauge that can be scraped at /metrics?

Because it's coming from the database, it's not specific to an instance / pod, so we'd end up with multiple series, which we'd then have to merge using a reporting rule (or otherwise be very careful about things like sum()).

The other thing is that the SQL query is pretty slow (about 30 seconds), so we can't put it on the rendering path of a scrape response. We'd have to have some kind of background loop running the query on an interval and writing to the gauge, and it's not immediately obvious how best to do this.

The prometheus best practices guide on pushing metrics does its best to warn people off pushing metrics, but I think we're okay by this definition:

Usually, the only valid use case for the Pushgateway is for capturing the outcome of a service-level batch job. A "service-level" batch job is one which is not semantically related to a specific machine or job instance (for example, a batch job that deletes a number of users for an entire service). Such a job's metrics should not include a machine or instance label to decouple the lifecycle of specific machines or instances from the pushed metrics.

All that said, I'd be happy to re-implement it as a scrapable metric if we can think of a good pattern for that though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants