-
Notifications
You must be signed in to change notification settings - Fork 28
Prometheus: log public network availability #301
Prometheus: log public network availability #301
Conversation
https://github.com/shiftdevices/bitbox-base-internal/issues/370 Prometheus should log the availability of public internet without leaking privacy information. This pull requests queries an external host with a lot of general traffic, Cloudflare, over Tor: ``` curl --socks5-hostname localhost:9050 1.1.1.1 ``` If Tor is not active, it's impossible to ping an external host without revealing the own ip address, but the solution can just "mingle in the crowd", e.g. by `ping`ing Cloudflare, which should not be suspicious. ``` ping -c 1 1.1.1.1 ``` These queries are run regularly from `prometheus-base.py`, so that the result is collected by Prometheus and stored in its time-series database. This helps with analyzing/debugging incidents, as the public network availability can also be queried after the fact. This commit: * provides the new Prometheus metric 'base_internet_connectivity' that is 0 when OK, or an error code when NOT OK
|
||
except subprocess.CalledProcessError as e: | ||
print("getInternetConnectivity(): subprocess.CalledProcessError (", e.returncode, "); torEnabled", torEnabled, e.output) | ||
return e.returncode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
provides the new Prometheus metric 'base_internet_connectivity' that is 0 when OK, or an error code when NOT OK
I think that's a bit unintuitive to use 0 as OK and everything else as NOK. Either rename base_internet_connectivity
to base_no_internet_connectivity
or even better: return 1 if internet and 0 if no internet connectivity. Does curl or ping even return detailed info in the exit code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it's not very intuitive and will change it.
I followed the logic of getSystemdStatus
, which returns 0
if service is running. Maybe change this as well? Not sure, where that code is used elsewhere, in Supervisor or Middleware.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit f64a68b changes logged values for both "network connectivity" and "systemd units" and updates the Grafana dashboard to reflect these changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Post-merge nACK from my side.
I really disagree that it's unintuitive to use 0 as OK and everything else as NOK
, afaik that is the default convention in most systems. Definitely in all UNIX/POSIX systems and in C running on these systems. (https://en.wikipedia.org/wiki/Exit_status)
I'd say it's very unintuitive and confusing to flip these around and use 0 for not OK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline, I agree if we look at the metric as a exit-code
, this is how it was originally implemented. We switched to view the value more as true
/false
or up
/down
, where this logic makes more sense.
I'm happy to revisit that once we extend this, but it's not a critical part of the system just now.
related to: BitBoxSwiss#301 (comment) Because: * The Base Prometheus scraper used the `systemctl` error code to indicate if a service is active: `0` == OK; otherwise NOT OK * This is not intuitive, especially when extending the monitoring to other status, like the network connectivity in this pull request. * It's better to use `1` as OK and `0` as NOT OK, which also makes visualization in Grafana easier. This commit: * changes logged values for network connectivity, `1` meaning OK * changes logged values for systemd units, `1` meaning OK * adjusts Grafana dashboard to use new values * extends Grafana dashboard with diskspace usage over time (long overdue)
fixes https://github.com/shiftdevices/bitbox-base-internal/issues/370
Prometheus should log the availability of public internet without
leaking privacy information.
This pull requests queries an external host with a lot of general
traffic, Cloudflare, over Tor:
If Tor is not active, it's impossible to ping an external host without
revealing the own ip address, but the solution can just "mingle in the
crowd", e.g. by
ping
ing Cloudflare, which should not be suspicious.These queries are run regularly from
prometheus-base.py
, so that theresult is collected by Prometheus and stored in its time-series database.
This helps with analyzing/debugging incidents, as the public network
availability can also be queried after the fact.
This commit:
is 0 when OK, or an error code when NOT OK