Prometheus: log public network availability #301

Stadicus · 2019-12-05T12:59:03Z

fixes https://github.com/shiftdevices/bitbox-base-internal/issues/370

Prometheus should log the availability of public internet without
leaking privacy information.

This pull requests queries an external host with a lot of general
traffic, Cloudflare, over Tor:

curl --socks5-hostname localhost:9050 1.1.1.1

If Tor is not active, it's impossible to ping an external host without
revealing the own ip address, but the solution can just "mingle in the
crowd", e.g. by pinging Cloudflare, which should not be suspicious.

ping -c 1 1.1.1.1

These queries are run regularly from prometheus-base.py, so that the
result is collected by Prometheus and stored in its time-series database.
This helps with analyzing/debugging incidents, as the public network
availability can also be queried after the fact.

This commit:

provides the new Prometheus metric 'base_internet_connectivity' that
is 0 when OK, or an error code when NOT OK

https://github.com/shiftdevices/bitbox-base-internal/issues/370 Prometheus should log the availability of public internet without leaking privacy information. This pull requests queries an external host with a lot of general traffic, Cloudflare, over Tor: ``` curl --socks5-hostname localhost:9050 1.1.1.1 ``` If Tor is not active, it's impossible to ping an external host without revealing the own ip address, but the solution can just "mingle in the crowd", e.g. by `ping`ing Cloudflare, which should not be suspicious. ``` ping -c 1 1.1.1.1 ``` These queries are run regularly from `prometheus-base.py`, so that the result is collected by Prometheus and stored in its time-series database. This helps with analyzing/debugging incidents, as the public network availability can also be queried after the fact. This commit: * provides the new Prometheus metric 'base_internet_connectivity' that is 0 when OK, or an error code when NOT OK

armbian/base/scripts/prometheus-base.py

0xB10C · 2019-12-06T11:08:30Z

armbian/base/scripts/prometheus-base.py

+
+    except subprocess.CalledProcessError as e:
+        print("getInternetConnectivity(): subprocess.CalledProcessError (", e.returncode, "); torEnabled", torEnabled, e.output)
+        return e.returncode


provides the new Prometheus metric 'base_internet_connectivity' that is 0 when OK, or an error code when NOT OK

I think that's a bit unintuitive to use 0 as OK and everything else as NOK. Either rename base_internet_connectivity to base_no_internet_connectivity or even better: return 1 if internet and 0 if no internet connectivity. Does curl or ping even return detailed info in the exit code?

I agree that it's not very intuitive and will change it.

I followed the logic of getSystemdStatus, which returns 0 if service is running. Maybe change this as well? Not sure, where that code is used elsewhere, in Supervisor or Middleware.

Commit f64a68b changes logged values for both "network connectivity" and "systemd units" and updates the Grafana dashboard to reflect these changes.

Post-merge nACK from my side.
I really disagree that it's unintuitive to use 0 as OK and everything else as NOK, afaik that is the default convention in most systems. Definitely in all UNIX/POSIX systems and in C running on these systems. (https://en.wikipedia.org/wiki/Exit_status)
I'd say it's very unintuitive and confusing to flip these around and use 0 for not OK

As discussed offline, I agree if we look at the metric as a exit-code, this is how it was originally implemented. We switched to view the value more as true/false or up/down, where this logic makes more sense.

I'm happy to revisit that once we extend this, but it's not a critical part of the system just now.

related to: BitBoxSwiss#301 (comment) Because: * The Base Prometheus scraper used the `systemctl` error code to indicate if a service is active: `0` == OK; otherwise NOT OK * This is not intuitive, especially when extending the monitoring to other status, like the network connectivity in this pull request. * It's better to use `1` as OK and `0` as NOT OK, which also makes visualization in Grafana easier. This commit: * changes logged values for network connectivity, `1` meaning OK * changes logged values for systemd units, `1` meaning OK * adjusts Grafana dashboard to use new values * extends Grafana dashboard with diskspace usage over time (long overdue)

Stadicus requested a review from 0xB10C December 5, 2019 12:59

Stadicus self-assigned this Dec 5, 2019

0xB10C mentioned this pull request Dec 6, 2019

Add state and description codes: Part 2 #305

Open

0xB10C suggested changes Dec 6, 2019

View reviewed changes

Stadicus requested a review from 0xB10C December 7, 2019 14:51

Stadicus merged commit 5ce26b0 into BitBoxSwiss:master Dec 9, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prometheus: log public network availability #301

Prometheus: log public network availability #301

Stadicus commented Dec 5, 2019

0xB10C Dec 6, 2019

Stadicus Dec 6, 2019

Stadicus Dec 7, 2019

Tomasvrba Dec 11, 2019

Stadicus Dec 11, 2019

Prometheus: log public network availability #301

Prometheus: log public network availability #301

Conversation

Stadicus commented Dec 5, 2019

0xB10C Dec 6, 2019

Choose a reason for hiding this comment

Stadicus Dec 6, 2019

Choose a reason for hiding this comment

Stadicus Dec 7, 2019

Choose a reason for hiding this comment

Tomasvrba Dec 11, 2019

Choose a reason for hiding this comment

Stadicus Dec 11, 2019

Choose a reason for hiding this comment