Outage because of "JsonRpcProvider failed to detect network and cannot start up; retry in 1s" #312

Closed
bajtos opened this issue Aug 9, 2024 · 5 comments
@bajtos
Member

bajtos commented Aug 9, 2024

While investigating CheckerNetwork/node#569, I noticed that spark-evaluate logs are full of the following error messages:

2024-08-09T07:19:31Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:21:42Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:23:53Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:26:04Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:28:15Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:30:26Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:32:37Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:34:48Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:36:59Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:39:10Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:41:21Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:43:32Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)

I think that this brought down the spark-evaluate service.

How can we detect this problem and send an alert to Slack?

What higher-level metric is affected? A bunch of charts in the Internal Spark Dashboard don't show any data points after 2024-08-08 14:08.

[Images: Screenshot 2024-08-09 at 09 53 33, Screenshot 2024-08-09 at 09 53 45]

Can we create a new metric similar to "unpublished measurements max age" but for round evaluations and trigger an alert when there is no round evaluation posted in >30 minutes?
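To make the idea concrete, here is a minimal sketch of what such a "round evaluation max age" metric could compute. The function names and the way the last evaluation timestamp is obtained are assumptions for illustration, not existing spark-evaluate code; only the 30-minute threshold comes from the proposal above.

```javascript
// Hypothetical helpers, not part of spark-evaluate.
// Threshold taken from the proposal: alert when no round evaluation
// was posted in more than 30 minutes.
const MAX_EVALUATION_AGE_MS = 30 * 60 * 1000

// Returns the age (in milliseconds) of the most recent round evaluation,
// or Infinity when no evaluation has been recorded at all.
function roundEvaluationAgeMs (lastEvaluatedAt, now = new Date()) {
  if (!lastEvaluatedAt) return Infinity
  return now.getTime() - lastEvaluatedAt.getTime()
}

// True when the last evaluation is older than the threshold.
function isEvaluationStale (lastEvaluatedAt, now = new Date()) {
  return roundEvaluationAgeMs(lastEvaluatedAt, now) > MAX_EVALUATION_AGE_MS
}
```

Treating "no evaluation recorded" as infinitely old means a service that never started evaluating would also trip the alert.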

If that's not possible, then a last-resort option is to create a Papertrail filter to detect these error messages and trigger an alert. This can be too noisy, though.
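One way to keep a log-based alert less noisy would be to fire only when the message repeats several times within a short window. The sketch below illustrates that logic in plain JavaScript; it is not Papertrail configuration, and the window size, the match count, and the function names are all assumptions. It relies only on the log line format shown above (an ISO timestamp as the first space-separated token).

```javascript
// Matches the error message from the logs quoted in this issue.
const ERROR_PATTERN = /JsonRpcProvider failed to detect network and cannot start up/

// Collects the timestamps (ms since epoch) of log lines containing the error.
// Lines without a parseable leading timestamp are ignored.
function matchingTimestamps (lines) {
  const timestamps = []
  for (const line of lines) {
    if (!ERROR_PATTERN.test(line)) continue
    const ts = Date.parse(line.split(' ', 1)[0])
    if (!Number.isNaN(ts)) timestamps.push(ts)
  }
  return timestamps
}

// Hypothetical rule: alert when at least `minMatches` error lines were
// logged during the `windowMs` milliseconds ending at `now`.
function shouldAlert (lines, { now, windowMs = 10 * 60 * 1000, minMatches = 3 }) {
  const recent = matchingTimestamps(lines)
    .filter(ts => ts <= now && now - ts <= windowMs)
  return recent.length >= minMatches
}
```

With the roughly two-minute spacing seen in the logs above, a 10-minute window and a minimum of three matches would ignore a single transient failure but still catch a sustained outage.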

@bajtos
Member Author

bajtos commented Aug 9, 2024

Related:

That other issue should reduce the probability that the JsonRpc problem brings down Spark evaluations. However, I think it's crucial to improve our monitoring to catch these problems, even if we make changes to make them less likely to happen.

@juliangruber
Member

[Image: Screenshot 2024-08-30 at 13 21 08]

We're tracking scheduled rewards of core-fly in Grafana. I'm going to try creating an alert for when the value doesn't increase.

This isn't the full story, but it's a start.
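For illustration, the "value doesn't increase" condition could be expressed roughly as follows. This is a sketch in plain JavaScript rather than the actual Grafana alert rule; the sample shape (`{ time, value }`) and the function name are assumptions.

```javascript
// Hypothetical check for a counter-like series that should keep growing.
// `samples` is an array of { time: <ms since epoch>, value: <number> },
// sorted by time in ascending order.
function hasStalled (samples, windowMs, now) {
  const inWindow = samples.filter(s => s.time <= now && now - s.time <= windowMs)
  // With fewer than two points we cannot compare values. Missing data
  // would need a separate "no data" alert.
  if (inWindow.length < 2) return false
  const first = inWindow[0].value
  const last = inWindow[inWindow.length - 1].value
  return last <= first
}
```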

@juliangruber
Member

[Image: Screenshot 2024-08-30 at 14 05 52]

@juliangruber
Member

> Can we create a new metric similar to "unpublished measurements max age" but for round evaluations and trigger an alert when there is no round evaluation posted in >30 minutes?

We have this Influx data: https://github.com/filecoin-station/spark-evaluate/blob/18ce3d90a1988a0b92aae7e8f1e669d7100b8892/lib/evaluate.js#L182-L194. I'm going to create an alert that fires when it doesn't produce any data for more than 30 minutes.

@juliangruber
Member

I created an alert that triggers

  • when there's no evaluate Influx data for 30+ minutes
  • or if the total_nodes count is 0

The 2nd condition wasn't necessary for this issue, but it seemed like a good idea to add while I was already in that UI.
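Expressed as code, the two conditions combine roughly like this. It is a sketch only: the real rule lives in the alerting UI, and the function name and input shape are assumptions. The `total_nodes` field name and the 30-minute limit come from the description above.

```javascript
const MAX_SILENCE_MS = 30 * 60 * 1000

// Hypothetical evaluation of the alert rule.
// `evaluateDataAgeMs`: time since the last `evaluate` data point (Infinity if none).
// `totalNodes`: the latest `total_nodes` value, or undefined when unknown.
function evaluateAlert ({ evaluateDataAgeMs, totalNodes }) {
  const reasons = []
  if (evaluateDataAgeMs >= MAX_SILENCE_MS) {
    reasons.push('no evaluate data for 30+ minutes')
  }
  if (totalNodes === 0) {
    reasons.push('total_nodes is 0')
  }
  return { firing: reasons.length > 0, reasons }
}
```

Returning the list of reasons alongside the boolean makes it possible to tell from the notification which of the two conditions fired.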

Status: ✅ done