-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add docker storage check #3787
Add docker storage check #3787
Conversation
I'll be honest, I don't know the difference between overlay and overlay2. Were you able to confirm we intend to support both of them? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the conceptual difference between this and #3518? There seems to be some overlap.
return lv[0] | ||
|
||
def exec_cmd(self, cmd_str, task_vars): | ||
return self.module_executor("command", { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For new checks you can use self.execute_module
instead.
365593b
to
66c072b
Compare
I reached out to @stevekuznetsov and he was able to confirm that we will only be supporting |
return False, "" | ||
|
||
def get_lvs_data(self, task_vars): | ||
lvs_cmd = "/sbin/lvs --select vg_name=docker --select lv_name=docker-pool --report-format json" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I attempted to look for a core module to use in place of this command, however the closest one I found was http://docs.ansible.com/ansible/lvol_module.html, which does not seem to return the information needed for this check
b1fc67b
to
3225aa0
Compare
562f23a
to
e45d630
Compare
@rhcarvalho @sosiouxme PTAL |
if failed: | ||
return failed, no_data_msg | ||
|
||
if data_percent > float(self.max_thinpool_data_usage_percent): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we read max_thinpool_data_usage_percent
from task_vars
, as the tests show, it might be a string... and we can't always convert a string to float (not tested).
We could be a bit more careful and handle TypeError
.
I'd suggest doing that early on when we read from task_vars
, not here.
return False, "" | ||
|
||
def get_lvs_data(self, task_vars): | ||
lvs_cmd = "/sbin/lvs --select vg_name=docker --select lv_name=docker-pool --report-format json" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rely on the names here, docker
and docker-pool
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, upon looking at other installs, vg_name
and lv_name
seemed to vary slightly.
A point @sdodson brought up was that we should not be concerned about the total size of a logical volume, as it is automatically expanded once it reaches a certain usage threshold. Instead, we should be looking at the total size of the volume group containing the docker LV.
An idea that comes to mind would be to use vgdisplay
(or the more convenient vgs
), iterate through all of the volume groups in the output (in my case, I only had one -- docker
), until we come across a VG with docker
in its name. We can then alter this check to only warn when the VG is nearing capacity rather than a specific logical volume under it. WDYT?
EDIT: At least in my case, I am also able to get the specific VG name for docker LV by checking the contents of /etc/sysconfig/docker-storage-setup
cc @sosiouxme
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somehow I did not notice this message. Err... yeah, technically the VG can be named anything and that's probably what we should be looking at. Might want to look in docker config for what it's using. I'll poke around this tomorrow.
def get_lvs_data(self, task_vars): | ||
lvs_cmd = "/sbin/lvs --select vg_name=docker --select lv_name=docker-pool --report-format json" | ||
result = self.exec_cmd(lvs_cmd, task_vars) | ||
failed = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this? Looks like it is only used in the last line of the method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, will remove. was leftover after I updated this method
failed = False | ||
|
||
if result.get("failed", False): | ||
return True, {}, result.get("msg", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method returns 3 non-obvious values and they do undocumented...
I'm guessing "did it fail?", "the lvs data I asked for" and "some error message if any". It's a weird API to have failed as the first argument, and might even be unnecessary to have it.
Alternatively, these methods could return what they are meant to return, and if not raise an exception, that would remove a lot of the conditionals.
"""Check Docker storage driver compatibility. | ||
|
||
This check ensures that Docker is using a supported storage driver, | ||
and that Loopback is not being used (if using devicemapper). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Written version of a mental note: this is another check (after memory and disk checks) that will annoy people creating toy clusters...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're setting up "toy clusters" with multiple nodes and openshift-ansible, my opinion is that you're out of the "toy cluster" space -- that's more for oc cluster up
.
def is_active(cls, task_vars): | ||
"""Skip non-containerized installations.""" | ||
is_containerized = get_var(task_vars, "openshift", "common", "is_containerized") | ||
return super(DockerStorageDriver, cls).is_active(task_vars) and is_containerized |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like both checks ask for a
class ...(Containerized, OpenShiftCheck):
...
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what you mean by this, should we create a new mixin rather than read the is_containerized
value from task_vars?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I meant, for symmetry with NonContainerized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dunno about a Containerized
mixin but I think this check should be active under the same conditions as the other one.
|
||
if not self.is_supported_storage_driver(info): | ||
msg = "Unsupported Docker storage driver detected. Supported storage drivers: {drivers}" | ||
return {"failed": True, "msg": msg.format(drivers=self.storage_drivers)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would print something like ...Supported storage drivers: ['foo', 'bar']
.
We can make it nicer with ', '.join(self.storage_drivers)
.
msg = "Use of loopback devices is discouraged. Try running Docker with `--storage-opt dm.thinpooldev`" | ||
return {"failed": True, "msg": msg} | ||
|
||
return {"changed": False} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't focus on the changed
behavior (we not always return it), maybe it would be more consistent to simply don't ever mention it?
|
||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lots of empty lines?
"""Check Docker storage sanity. | ||
|
||
This check ensures that thinpool usage is not | ||
close to 100% for a containerized installation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The percentage threshold is arbitrary, this comment is too specific.
437ce63
to
c876e8d
Compare
c876e8d
to
b89bb5d
Compare
aos-ci-test |
b89bb5d
to
6859392
Compare
1d2dc64
to
c4c75b8
Compare
|
||
@classmethod | ||
def is_active(cls, task_vars): | ||
"""Skip non-containerized installations.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we not do this for non-containerized installations too? Everything that's a node needs supportable Docker storage.
c4c75b8
to
aa19712
Compare
@sosiouxme thanks for the feedback. Review comment addressed |
pyunit test is failing right now. |
@rhcarvalho any concerns with this or are we clear to merge? |
Taking a look now. |
if get_var(task_vars, "openshift", "common", "is_atomic"): | ||
return "", False, False | ||
|
||
# NOTE: we would like to use "package" module but that fails in a way we can't debug just yet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it's because "package" is not a module but yet another action plugin (or both?). I've run into this before, looks like the method we invoke to call a module can only resolve to actual modules and not any action plugin, which is a pity.
What we could try is to import and run the action plugin directly.
"Unable to install required packages on this host:\n" | ||
" {deps}\n{msg}" | ||
).format(deps=',\n '.join(self.dependencies), msg=msg) | ||
return msg, result.get("failed") or result.get("rc", 0) != 0, result.get("changed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding one of two variables here would make this more readable.
# ...
failed = result.get("failed") or result.get("rc", 0) != 0
changed = result.get("changed")
return msg, failed, changed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
@juanvallejo @sosiouxme please update the PR description to include the "bonus" that is coming additional to the new health check ;) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the new check going to bother people on "non-production clusters"? E.g. oc cluster up
style. Do we have a way to disable it?
info = info["info"] | ||
|
||
# check if the storage driver we saw is valid | ||
driver = info.get("Driver", "[NONE]") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why "[NONE]"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the docker info came back without a driver (which seems unlikely, but you know, code defensively) then we're going to say to the user "Docker is using the X driver which isn't supported". What should X be? I don't really care, just thought it would be good to report something distinctly bogus in this unlikely event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
) | ||
return {"failed": True, "changed": changed, "msg": msg} | ||
result = self._check_dm_usage(driver_status, task_vars) | ||
result["changed"] = changed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be something like:
result['changed'] = result.get('changed', False) or changed
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only if _check_dm_usage
ever made changes. it doesn't so... keep it simple?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get we (don't / might not) even use the value of 'changed' for checks right now, but we "code defensively" in lots of places, why wouldn't we do it here? If that assumption changes (that _check_dm_usage
never sets the 'changed'
key), it is unlikely that we'll go look at every caller to stop overriding the value.
metadata_total=driver_status.get("Metadata Space Total"), | ||
) | ||
|
||
for key, value in vals.copy().items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why .copy()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because we're modifying vals
in the loop and I heard it was a bad idea to iterate while changing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct...
On a second look, I wonder why we mutate vals
, and why do we add a key with the _bytes
prefix? I couldn't find anywhere that actually reads back that key?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see I see... we later read "data_used_bytes" and so on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, and vals is merged into the result, so that the details are all in the ansible spew if you want to debug what you got
an LVM2 LV as a backing storage pool. The LV is "thin" because it does not use | ||
all available storage from its VG, instead expanding as needed; so in considering | ||
available space, it is necessary to include space remaining in the VG (which is | ||
unfortunately not reported by the storage driver in docker info). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds like a rant, and refers to things that can easily become false statements or be hard to follow.
E.g.:
- "What is 'docker info'?" -- I think this is about our custom Ansible module, but a reader of this code may have no idea what that refers too. In particular, it is NOT about the
docker info
CLI command. - "unfortunately not reported..." -- might be not reported now, but when people will read this, we can't guarantee that is still true. Instead of talking about what something does or doesn't today, focus on what is required. In the future, someone might understand the requirements and find a way how to implemented it differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just trying to give background on how this works and why we're looking for the values we are, which I didn't know myself until recently. So, sounds like it could be improved.
except ValueError as err: | ||
return { | ||
"failed": True, | ||
"msg": "Unable to convert thinpool data usage limit to float: {}".format(str(err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, with this we cannot tell if the problem was within max_thinpool_data_usage_percent
or max_thinpool_metadata_usage_percent
.
Also, users may not care about "float". Perhaps a better message would explain that the value in the configuration is invalid (and include the bad value in the message).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just lazy, didn't want to duplicate the entire try/except for two almost identical things. I will rework this.
"is higher than threshold {:.1f}." | ||
).format(vals["metadata_pct_used"], vals["metadata_threshold"]) | ||
|
||
vals["msg"] = msg or "Thinpool usage is below thresholds." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please include the configured threshold and offending value in the message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the success message. Which BTW the user doesn't see unless they run with -vvv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, didn't realize that... I'm used to see a 'msg'
key accompanying a 'failed': True
, didn't think of the other case ;-)
match = re.match(r'((?:[^-]|--)+)-(?!-)', pool) # matches up to the first single hyphen | ||
if not match: | ||
raise OpenShiftCheckException( | ||
"Host's docker says it is using LVM storage pool with invalid name '{}'".format(pool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What makes it valid / invalid?!
"docker says" sounds strange to me ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough :)
This "should not happen" and it would be hard to see how it would be the user's fault if it does, so I blamed it on docker...
if not match: | ||
raise ValueError("Cannot convert to a byte size: " + string) | ||
number, unit = match.groups() | ||
return float(number) * units.get(str(unit).lower(), 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the unit is anything that is not in the units
dict we may incorrectly assume it is bytes.
I'd rather check explicitly. Empty = bytes, one of the units in the dicts = multiplication factor, otherwise error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
@@ -0,0 +1,214 @@ | |||
import pytest | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too many blank lines, should be at most one separating groups of imports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have to pylint test files :)
assert DockerStorage.is_active(task_vars=task_vars) == is_active | ||
|
||
|
||
non_atomic_tv = {"openshift": {"common": {"is_atomic": False}}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does tv
stand for?! Ah, task_vars!
Please let's use the full name. It's enough cognitive load to have to know what is "task_vars", let alone figure out _tv
-> _task_vars
on something that is used only once...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
Yes this will annoy people on dev clusters where they likely will have devicemapper backed by a loopback device since it's the default setup. Once we include it in the install-time checks they can disable it the same as any check. |
d63d245
to
bace709
Compare
Incorporated docker_storage_driver into docker_storage as both need driver info. Corrected storage calculation to include VG free space, not just the current amount in the LV pool. Now makes no assumptions about pool name. Improved user messaging. Factored out some methods that can be shared with docker_image_availability.
bace709
to
a092dff
Compare
@rhcarvalho review issues hopefully addressed |
aos-ci-test |
This avoids unintentionally overriding the value from `True` to `False`.
return msg, result.get("failed") or result.get("rc", 0) != 0, result.get("changed") | ||
failed = result.get("failed", False) or result.get("rc", 0) != 0 | ||
changed = result.get("changed", False) | ||
return msg, failed, changed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, i see i missed doing this one. thanks.
aos-ci-test |
|
||
def _check_dm_usage(self, driver_status, task_vars): | ||
""" | ||
Backing assumptions: We expect devicemapper to be backed by an auto-expanding thin pool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Describing the assumptions is great, but what I'm missing here is some more standard info we normally have in a docstring:
https://www.python.org/dev/peps/pep-0257/#one-line-docstrings
- The docstring is a phrase ending in a period. It prescribes the function or method's effect as a command ("Do this", "Return that"), not as a description; e.g. don't write "Returns the pathname ...".
...
The docstring for a function or method should summarize its behavior and document its arguments, return value(s), side effects, exceptions raised, and restrictions on when it can be called (all if applicable). Optional arguments should be indicated. It should be documented whether keyword arguments are part of the interface.
Looking at the code I see we construct a dict
, modify it and return it... maybe we're just missing a line in the beginning, something like:
"""Check whether data and metadata volume usages are below a given threshold and return ...
[Continue with the assumptions...]
"""
[merge] |
Evaluated for openshift ansible merge up to 3912fa8 |
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/537/) (Base Commit: bbeaa20) |
Adds docker storage check to ensure that a docker host is using a supported storage driver,
that its thinpool usage is not close to a specified limit, and that Loopback device is not being used (if devicemapper).
Also updates
docker_image_availability
to reuse code for running only on docker hosts and installingdocker-py
if neededcc @brenton @rhcarvalho @sosiouxme