[BUG] Cached metrics makes us crazy #234
Hello Luca! Unfortunately I don't have enough information to help you with this one. Could you produce a repo that minimally reproduces the issue you are seeing? Do you think it is a problem with the plugin that you created, as opposed to PromEx the library? A shot-in-the-dark guess would be that you want to use a gauge (via last_value https://hexdocs.pm/telemetry_metrics/Telemetry.Metrics.html#last_value/2) in your plugin if you just want the last value of some measurement. Happy to help, but I need some more information :).
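For context, a minimal sketch of what that suggestion can look like inside a custom PromEx plugin's polling group. The module, event, and metric names below (as well as the `MyApp.Cursors.pending_counts/0` helper) are illustrative placeholders rather than the plugin discussed in this thread, and the sketch assumes PromEx's documented custom-plugin conventions (`use PromEx.Plugin`, `Polling.build/4`):

```elixir
defmodule MyApp.PromEx.CursorsPlugin do
  # Hypothetical plugin module; all names below are placeholders.
  use PromEx.Plugin

  @impl true
  def polling_metrics(opts) do
    poll_rate = Keyword.get(opts, :poll_rate, 10_000)

    [
      Polling.build(
        :cursor_pending_polling_events,
        poll_rate,
        {__MODULE__, :emit_pending_counts, []},
        [
          # A gauge: Prometheus keeps only the most recently reported value
          last_value(
            [:my_app, :cursor, :pending, :count],
            event_name: [:my_app, :cursor, :pending],
            measurement: :count,
            description: "Events still to be processed per cursor",
            tags: [:cursor]
          )
        ]
      )
    ]
  end

  def emit_pending_counts do
    # Placeholder helper: query the DB and emit one telemetry event per cursor
    for {cursor_name, pending} <- MyApp.Cursors.pending_counts() do
      :telemetry.execute([:my_app, :cursor, :pending], %{count: pending}, %{cursor: cursor_name})
    end

    :ok
  end
end
```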
We have an event store with an events table and a cursors table. Every cursor has a set of event types that it reads in order to process them, plus a last_offset_read. With our plugin we are checking how many events each cursor still has to read before it reaches the end: we query the DB to count how many events are still available after the last_offset_read.

Sometimes the metric comes back with the same value for a while and then changes. The problem is that we have alerts on those metrics, so we sometimes get false-positive alerts. Do you have any idea about it? Locally it does not happen, but in staging, with all the cursors running (~50 cursors), the metrics are sometimes stuck or totally absent. I'm polling every 10 seconds, and I am already using last_value.
Meanwhile, this is the function that calculates the metrics:
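A hedged illustration of what a measurement function of this shape might look like. This is not the reporter's actual code: it assumes Ecto, and the `MyApp.Cursor` / `MyApp.Event` schemas with `name`, `event_types`, `last_offset_read`, `offset`, and `type` fields are hypothetical. It matches the `pending_counts/0` placeholder used in the earlier sketch:

```elixir
defmodule MyApp.Cursors do
  # Hypothetical module: for every cursor, counts the events written after the
  # cursor's last_offset_read, restricted to the event types it consumes.
  import Ecto.Query

  alias MyApp.Repo

  # Assumed schemas: MyApp.Cursor (name, event_types, last_offset_read)
  #                  MyApp.Event  (offset, type)

  @doc "Returns a list of {cursor_name, pending_event_count} tuples."
  def pending_counts do
    for cursor <- Repo.all(MyApp.Cursor) do
      pending =
        MyApp.Event
        |> where([e], e.type in ^cursor.event_types)
        |> where([e], e.offset > ^cursor.last_offset_read)
        |> select([e], count(e.id))
        |> Repo.one()

      {cursor.name, pending}
    end
  end
end
```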
Sometimes there are also only a few metrics, with the cursor ones missing. I am thinking: we have 3 pods on k8s; is it possible that the PromEx process is starting on just one of those 3, and the call only succeeds on 1 of the 3?
@fedme it could be. The DB query was sometimes timing out, so it was raising an error and getting stuck for the same reason. It fits, technically.
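One defensive option on the application side, until a library-level fix lands, is to make sure the polled measurement function never raises, since an exception during a poll (such as a DB timeout) can leave the measurement dropped and the metric stale. A hedged sketch reusing the hypothetical names from above; the wrapper module and log message are illustrative, not the library's fix:

```elixir
defmodule MyApp.Cursors.SafePoller do
  # Hypothetical wrapper: if the DB query times out or raises, log and skip
  # this poll instead of letting the exception propagate out of the poller.
  require Logger

  def emit_pending_counts do
    for {cursor_name, pending} <- MyApp.Cursors.pending_counts() do
      :telemetry.execute([:my_app, :cursor, :pending], %{count: pending}, %{cursor: cursor_name})
    end

    :ok
  rescue
    error ->
      Logger.warning("cursor pending-count poll failed: #{Exception.message(error)}")
      :ok
  end
end
```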
Closing this ticket for now as a release will be cut soon with the ability to not detach the polling job when an error is encountered (example in #236). |
Hi, we have a system with some cursors, and we are calculating metrics on how much the cursors are delayed relative to the last events emitted. We have built a custom plugin for this, but the system sometimes remains stuck with these values in cache and does not move forward until we restart the Kubernetes pods.
What can we do about that?