-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a rake task to push database metrics to prometheus #2828
Conversation
I often find myself wanting to know "how many published editions are there for this document type?", or "how many published editions do we have for each publishing app?", or things of that nature. Currently these questions can be answered at a point in time by querying a database, but there's no easy way to see how the values change over time. It's also not easy to compare numbers from different databases - for example, is the number of editions in publishing-api's database with publishing_app=whitehall the same of the number of editions in the same state in whitehall? The idea here is to have each publishing app and publishing-api report a metric of "editions_in_database_total". Each publishing app should set the "database" label to be the app name, and set the other labels as consistently as possible with publishing-api. We'll run this rake task as a cron job, and it will push metrics to prometheus via the pushgateway. The rake task also renders the metrics as text, to make debugging and testing easier. The database query in publishing-api is fairly expensive (it has to do a large heap scan of editions, which takes around 30 seconds), so we shouldn't run the cron job more than around once an hour. [Explain analyse output for the query](https://www.pgexplain.dev/plan/466c33ba-3358-4b56-aaaa-8f18a396ec3d) Currently the output of the task looks like this: Found 2103 combinations of labels # TYPE editions_in_database_total gauge # HELP editions_in_database_total Count of editions in various databases labeled by state, document_type etc. editions_in_database_total{database="publishing-api",state="draft",content_store="draft",document_type="aaib_report",publishing_app="specialist-publisher",locale="en"} 34.0 editions_in_database_total{database="publishing-api",state="draft",content_store="draft",document_type="about",publishing_app="whitehall",locale="cy"} 1.0 ... 2103 combinations of labels sounds like a lot, but Prometheus is fine with much larger cardinalities than this, at least according to [this answer](https://stackoverflow.com/a/69167162)
00c7494
to
2ca1473
Compare
To fix the error pushing metrics to the pushgateway
I deployed this to integration and ran the rake task - sure enough, the numbers have appeared in prometheus, and I can use them in Grafana ✨ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good - not sure if the error catch and throw was just in for debugging or not
).add(prometheus_registry) | ||
end | ||
rescue StandardError => e | ||
warn e.inspect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this? Or is there a concern we don't log errors from rake tasks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First time I ran the rake task, it errored because I'd missed a require
statement, and it didn't give me any clue what the problem was in the console logs. So I added this for debugging, saw what the error was, and then thought "we should probably log any errors we see anyway, so I'll leave the rescue in".
I was surprised rake doesn't do this by default though. Maybe there's some command line flag you need to give it for it to tell you about errors?
What was the reasoning behind use push gateway rather than implementing a gauge that can be scraped at /metrics? |
In alphagov/publishing-api#2828 we added a new editions_in_database_total metric, pushed to prometheus hourly from publishing-api (ideally eventually other databases too) This dashboard shows the current counts of editions in various states, document types and locales. We show the data on both linear and logarithmic bar charts. The counts are usually exponentially decreasing, so it can be hard to visualise them on a linear scale. On the other hand, people don't always understand logarithms, so it feels like we should have both. This is currently deployed as an experiment here: https://grafana.eks.production.govuk.digital/d/adtm5yljwh534f/richard-towersat-publishing-api-database-metrics
Because it's coming from the database, it's not specific to an instance / pod, so we'd end up with multiple series, which we'd then have to merge using a reporting rule (or otherwise be very careful about things like The other thing is that the SQL query is pretty slow (about 30 seconds), so we can't put it on the rendering path of a scrape response. We'd have to have some kind of background loop running the query on an interval and writing to the gauge, and it's not immediately obvious how best to do this. The prometheus best practices guide on pushing metrics does its best to warn people off pushing metrics, but I think we're okay by this definition:
All that said, I'd be happy to re-implement it as a scrapable metric if we can think of a good pattern for that though. |
I often find myself wanting to know "how many published editions are there for this document type?", or "how many published editions do we have for each publishing app?", or things of that nature.
Currently these questions can be answered at a point in time by querying a database, but there's no easy way to see how the values change over time.
It's also not easy to compare numbers from different databases - for example, is the number of editions in publishing-api's database with publishing_app=whitehall the same of the number of editions in the same state in whitehall?
The idea here is to have each publishing app and publishing-api report a metric of "editions_in_database_total". Each publishing app should set the "database" label to be the app name, and set the other labels as consistently as possible with publishing-api.
We'll run this rake task as a cron job, and it will push metrics to prometheus via the pushgateway. The rake task also renders the metrics as text, to make debugging and testing easier.
The database query in publishing-api is fairly expensive (it has to do a large heap scan of editions, which takes around 30 seconds), so we shouldn't run the cron job more than around once an hour.
Explain analyse output for the query
Currently the output of the task looks like this:
2103 combinations of labels sounds like a lot, but Prometheus is fine with much larger cardinalities than this, at least according to this answer
This will allow us to produce a Grafana dashboard that does things like this: