Outage because of "JsonRpcProvider failed to detect network and cannot start up; retry in 1s" #312
Comments
Related: That other issue should reduce the probability that the JsonRpc problem brings down Spark evaluations. However, I think it's crucial to improve our monitoring to catch these problems, even if we make changes to make them less likely to happen.
We have this Influx data: https://github.com/filecoin-station/spark-evaluate/blob/18ce3d90a1988a0b92aae7e8f1e669d7100b8892/lib/evaluate.js#L182-L194. I'm going to create an alert for when it doesn't produce any data for more than 30 minutes.
I created an alert that triggers
The 2nd condition wasn't necessary for this issue, but it was a good idea to add as I was already in that UI.
While investigating CheckerNetwork/node#569, I noticed that spark-evaluate logs are full of the following error message: "JsonRpcProvider failed to detect network and cannot start up; retry in 1s"
I think that this brought down the spark-evaluate service.
How can we detect this problem and send an alert to Slack?
What higher-level metric is affected? A bunch of charts in the Internal Spark Dashboard don't show any data points after 2024-08-08 14:08.
Can we create a new metric similar to "unpublished measurements max age" but for round evaluations and trigger an alert when there is no round evaluation posted in >30 minutes?
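To make the proposed condition concrete, here is a minimal sketch of the staleness check such an alert would encode. The function name, the constant, and the idea of passing in the timestamp of the last posted round evaluation are all assumptions for illustration; none of this exists in spark-evaluate, and a real alert would more likely be configured in the monitoring tool than written in code.

```javascript
// Hypothetical helper, not part of spark-evaluate.
// Threshold taken from the ">30 minutes" proposal above.
const MAX_ROUND_EVALUATION_AGE_MS = 30 * 60 * 1000

/**
 * Returns true when the alert should fire.
 * @param {Date|null} lastEvaluationAt When the last round evaluation was posted, or null if none is known
 * @param {Date} now Current time (injectable to keep the check testable)
 * @param {number} maxAgeMs Maximum tolerated age in milliseconds
 */
function isRoundEvaluationStale (lastEvaluationAt, now = new Date(), maxAgeMs = MAX_ROUND_EVALUATION_AGE_MS) {
  // No evaluation on record at all is treated the same as a stale one.
  if (lastEvaluationAt === null) return true
  return now.getTime() - lastEvaluationAt.getTime() > maxAgeMs
}
```

Treating a missing timestamp as stale is a deliberate choice here: the outage described above shows up as an absence of data, so the check should fire on "no data" rather than stay silent.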
If that's not possible, then a last-resort option is to create a Papertrail filter to detect these error messages and trigger an alert. This can be too noisy, though.
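One way to keep the log-based fallback from being too noisy is to alert only when the message appears many times within a window of log lines, rather than on every occurrence. The sketch below illustrates that idea in plain JavaScript; it does not use any Papertrail API or search syntax, and the function names and the default threshold of 10 are assumptions picked for illustration.

```javascript
// Matches the error message quoted in the issue title.
// The "retry in 1s" suffix is left out in case the retry delay varies.
const ERROR_PATTERN = /JsonRpcProvider failed to detect network and cannot start up/

// Counts how many log lines in the given window contain the error message.
function countMatchingLines (lines, pattern = ERROR_PATTERN) {
  return lines.filter(line => pattern.test(line)).length
}

// Hypothetical alert rule: fire only when the error repeats at least
// `minMatches` times in the window, so a single transient failure stays quiet.
function shouldAlert (lines, minMatches = 10) {
  return countMatchingLines(lines) >= minMatches
}
```

For example, `shouldAlert(recentLines, 50)` would stay quiet for a short burst of retries and fire only when the provider keeps failing.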