Skip to content
This repository has been archived by the owner on Feb 25, 2023. It is now read-only.

Prometheus: log public network availability #301

Merged

Conversation

Stadicus
Copy link
Collaborator

@Stadicus Stadicus commented Dec 5, 2019

fixes https://github.com/shiftdevices/bitbox-base-internal/issues/370

Prometheus should log the availability of public internet without
leaking privacy information.

This pull requests queries an external host with a lot of general
traffic, Cloudflare, over Tor:

curl --socks5-hostname localhost:9050 1.1.1.1

If Tor is not active, it's impossible to ping an external host without
revealing the own ip address, but the solution can just "mingle in the
crowd", e.g. by pinging Cloudflare, which should not be suspicious.

ping -c 1 1.1.1.1

These queries are run regularly from prometheus-base.py, so that the
result is collected by Prometheus and stored in its time-series database.
This helps with analyzing/debugging incidents, as the public network
availability can also be queried after the fact.

This commit:

  • provides the new Prometheus metric 'base_internet_connectivity' that
    is 0 when OK, or an error code when NOT OK

https://github.com/shiftdevices/bitbox-base-internal/issues/370

Prometheus should log the availability of public internet without
leaking privacy information.

This pull requests queries an external host with a lot of general
traffic, Cloudflare, over Tor:

```
curl --socks5-hostname localhost:9050 1.1.1.1
```

If Tor is not active, it's impossible to ping an external host without
revealing the own ip address, but the solution can just "mingle in the
crowd", e.g. by `ping`ing Cloudflare, which should not be suspicious.

```
ping -c 1 1.1.1.1
```

These queries are run regularly from `prometheus-base.py`, so that the
result is collected by Prometheus and stored in its time-series database.
This helps with analyzing/debugging incidents, as the public network
availability can also be queried after the fact.

This commit:
* provides the new Prometheus metric 'base_internet_connectivity' that
  is 0 when OK, or an error code when NOT OK
@Stadicus Stadicus requested a review from 0xB10C December 5, 2019 12:59
@Stadicus Stadicus self-assigned this Dec 5, 2019

except subprocess.CalledProcessError as e:
print("getInternetConnectivity(): subprocess.CalledProcessError (", e.returncode, "); torEnabled", torEnabled, e.output)
return e.returncode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

provides the new Prometheus metric 'base_internet_connectivity' that is 0 when OK, or an error code when NOT OK

I think that's a bit unintuitive to use 0 as OK and everything else as NOK. Either rename base_internet_connectivity to base_no_internet_connectivity or even better: return 1 if internet and 0 if no internet connectivity. Does curl or ping even return detailed info in the exit code?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's not very intuitive and will change it.

I followed the logic of getSystemdStatus, which returns 0 if service is running. Maybe change this as well? Not sure, where that code is used elsewhere, in Supervisor or Middleware.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit f64a68b changes logged values for both "network connectivity" and "systemd units" and updates the Grafana dashboard to reflect these changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Post-merge nACK from my side.
I really disagree that it's unintuitive to use 0 as OK and everything else as NOK, afaik that is the default convention in most systems. Definitely in all UNIX/POSIX systems and in C running on these systems. (https://en.wikipedia.org/wiki/Exit_status)
I'd say it's very unintuitive and confusing to flip these around and use 0 for not OK

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, I agree if we look at the metric as a exit-code, this is how it was originally implemented. We switched to view the value more as true/false or up/down, where this logic makes more sense.

I'm happy to revisit that once we extend this, but it's not a critical part of the system just now.

related to: BitBoxSwiss#301 (comment)

Because:
* The Base Prometheus scraper used the `systemctl` error code to
  indicate if a service is active: `0` == OK; otherwise NOT OK
* This is not intuitive, especially when extending the monitoring to
  other status, like the network connectivity in this pull request.
* It's better to use `1` as OK and `0` as NOT OK, which also makes
  visualization in Grafana easier.

This commit:
* changes logged values for network connectivity, `1` meaning OK
* changes logged values for systemd units, `1` meaning OK
* adjusts Grafana dashboard to use new values
* extends Grafana dashboard with diskspace usage over time
  (long overdue)
@Stadicus Stadicus requested a review from 0xB10C December 7, 2019 14:51
@Stadicus Stadicus merged commit 5ce26b0 into BitBoxSwiss:master Dec 9, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants