
add ram and non-containerized storage preflight check #3518

Merged

Conversation

memoryfree = get_var(task_vars, "facter_memoryfree")

recommended_memory = get_var(task_vars, "min_memory_gb")
recommended_storage = get_var(task_vars, "min_diskspace_gb")
Contributor Author:

@brenton I'm not sure where these values should be coming from.

Contributor:

You can get the supported values here:
https://docs.openshift.com/container-platform/latest/install_config/install/prerequisites.html#hardware

Even though there may be different requirements for older versions of the product, in general I think we should always encourage customers to meet the requirements of the latest version so they don't hit problems when they upgrade.

@juanvallejo force-pushed the jvallejo/ram_disk_space_checker branch from a366cd2 to 0391ebc (February 27, 2017 22:54)
# any images are missing, or error occurs during pull
def run(self, tmp, task_vars):
    ansible_mounts = get_var(task_vars, "ansible_mounts")
    diskfree = self.to_gigabytes(self.get_disk_space(ansible_mounts))
Contributor Author:

@brenton or @rhcarvalho This is the total value of the / filesystem; I'm not sure if I should be checking that or openshift-xfs-vol-dir. I am getting this information from ansible_mounts.

Contributor:

@detiber, do you have any idea where openshift-xfs-vol-dir came from?

Contributor:

Is it a disk label or an LV name? In any case, I'm not sure we can rely on specific values.

I think the best we can do is match a device with specific directories we care about.

For containerized installs, docker info would work for devicemapper with a thin pool; I'm not sure about other backends.

Contributor Author:

For containerized installs, docker info would work for devicemapper with a thin pool; I'm not sure about other backends.

Thanks. For now, at least, I have added support for checking containerized installs that use devicemapper as the storage driver.

@juanvallejo force-pushed the jvallejo/ram_disk_space_checker branch 4 times, most recently from 2ee2091 to 0deaaa8 (February 27, 2017 23:01)
from openshift_checks.mixins import NotContainerizedMixin


class RamDiskspaceAvailability(NotContainerizedMixin, OpenShiftCheck):
Contributor:

Should the memory and diskspace checks be separated?

I would think that Memory checks would apply equally to containerized/non-containerized.

Are there plans to also add a disk space check for containerized installs as well?

Contributor:

👍 to splitting into 2 checks.

@juanvallejo force-pushed the jvallejo/ram_disk_space_checker branch from 2420583 to d93cc16 (February 28, 2017 23:28)

def run(self, tmp, task_vars):
    group_names = get_var(task_vars, "group_names", default=[])
    memoryfree = get_var(task_vars, "facter_memoryfree")
Member:

I am thinking free memory is not the right measure here, as I believe cache use is excluded, so it's kind of an arbitrarily changing value. Maybe "memory.system.available"; however, I'm not sure that's in the task_vars, and I wonder if we really want to base it on memory usage at the time of operation. I think it would be least surprising to use memorysize (total system memory).

group_names = get_var(task_vars, "group_names", default=[])
memoryfree = get_var(task_vars, "facter_memoryfree")

recommended_memory_gb = self.recommended_master_memory_gb
Member:

To me this begs for a ternary:

    recommended_memory_gb = self.recommended_node_memory_gb if "nodes" in group_names else self.recommended_master_memory_gb

Pythonistas? :)

Also shouldn't this be the other way around? Because masters can be nodes...

Contributor:

Masters should be nodes, but in general they are unschedulable nodes, so they do not require additional disk space.

Contributor:

Pythonistas? :)

I'd go with a dict and max:

class MemoryAvailability(OpenShiftCheck):
    recommended_memory = {
        "nodes": X,
        "masters": Y,
        "etcd": Z,
        "nfs": M,
        "lb": N,
    }

    def run(...):
        # ...

        recommended_memory = max(self.recommended_memory.get(group, 0) for group in group_names)
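As a runnable sketch of that dict-and-max idea (the values and group names here are illustrative placeholders, not the documented requirements; the `or [0]` keeps max() happy on Python 2 when group_names is empty):

recommended_memory_gb = {
    "nodes": 8.0,
    "masters": 16.0,
    "etcd": 20.0,
}

def required_memory_gb(group_names):
    # A host in several groups must satisfy the largest requirement;
    # groups without a recommendation contribute 0.
    return max([recommended_memory_gb.get(group, 0) for group in group_names] or [0])

print(required_memory_gb(["masters", "nodes"]))  # 16.0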


@staticmethod
def mem_to_float(mem):
    return float(mem.rstrip(" GB"))
Member:

What if it's GiB... or MB? I think it's the "display size", which would be MB if < 1 GB.
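Worth noting: str.rstrip(" GB") strips a set of characters, not a literal suffix, so "1.5 MB".rstrip(" GB") yields "1.5 M" and float() then raises ValueError. A hedged sketch of a unit-aware alternative, assuming facter-style display strings such as "16.23 GB" or "512.00 MB":

units_in_gb = {"mb": 1.0 / 1000, "gb": 1.0, "tb": 1000.0}

def display_size_to_gb(size):
    # "512.00 MB" -> 0.512; unknown units fail loudly instead of silently.
    value, unit = size.split()
    if unit.lower() not in units_in_gb:
        raise ValueError("unrecognized unit in {!r}".format(size))
    return float(value) * units_in_gb[unit.lower()]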

@@ -0,0 +1,31 @@
# pylint: disable=missing-docstring
from openshift_checks import OpenShiftCheck, get_var
from openshift_checks.mixins import NotContainerizedMixin
Member:

Why not run this in containerized installs too?

@@ -0,0 +1,44 @@
# pylint: disable=missing-docstring
from openshift_checks import OpenShiftCheck, get_var
from openshift_checks.mixins import NotContainerizedMixin
Member:

Maybe I'm misunderstanding but doesn't this exclude containerized installs?

Contributor:

This seems to be a leftover, unused import.
Unlike Go, the Python interpreter doesn't complain, but I'm sure this is failing our CI.

def get_disk_space(ansible_mounts, is_containerized):
    if len(ansible_mounts):
        for mnt in ansible_mounts:
            if mnt.get("mount") == "/":
Member:

Technically the requirements are for the filesystem containing /var (https://docs.openshift.com/container-platform/3.4/install_config/install/prerequisites.html#hardware), so we should check whether /var is mounted separately before falling back to /.
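A minimal sketch of that precedence, assuming ansible_mounts entries shaped like {"mount": "/var", "size_available": ...}:

def volume_bytes_available(ansible_mounts):
    # Prefer the filesystem holding /var; fall back to / only when
    # /var is not a separate mount point.
    by_mount = {mnt.get("mount"): mnt for mnt in ansible_mounts}
    for path in ("/var", "/"):
        if path in by_mount:
            return by_mount[path].get("size_available", 0)
    return 0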


@staticmethod
def to_gigabytes(total_bytes):
    return total_bytes / 1073741824
Contributor:

Just a nit, but wouldn't this be gibibytes instead of gigabytes?

Contributor Author:

Thanks, fixed!


def run(self, tmp, task_vars):
    group_names = get_var(task_vars, "group_names", default=[])
    memoryfree = get_var(task_vars, "facter_memoryfree")
Contributor:

I'm not sure we can guarantee that facter will be available on the remote hosts. ansible_memfree_mb would be the equivalent. That said, I don't think you can trust either facter_memoryfree or ansible_memfree_mb here, since they both would include cached data.

Instead of trying to get fancy, I think it might be best to stick with comparing against ansible_memtotal_mb. There is also ansible_memory_mb that provides access to swap, physical memory info, and memory values adjusted for cached data.

Member:

👍 ansible_memtotal_mb
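A sketch of what that comparison could look like (assuming the get_var helper from openshift_checks; the required value is illustrative):

def check_total_memory(task_vars, required_mb=16 * 1000):
    # ansible_memtotal_mb is total physical memory gathered by Ansible's
    # setup module, so it does not fluctuate with cache usage.
    total_mb = get_var(task_vars, "ansible_memtotal_mb")
    if total_mb < required_mb:
        msg = "{} MB memory found, {} MB required".format(total_mb, required_mb)
        return {"failed": True, "msg": msg}
    return {"changed": False}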

if "nodes" in group_names:
recommended_memory_gb = self.recommended_node_memory_gb

if self.mem_to_float(memoryfree) < recommended_memory_gb:
Contributor:

Would it be better to convert the retrieved value to bytes and do an int comparison instead?


from openshift_checks import OpenShiftCheck, get_var
from openshift_checks.mixins import NotContainerizedMixin

import json
Contributor:

Another case of imported and not used.

tags = ["preflight"]

recommended_node_diskspace_gb = 15.0
recommended_master_diskspace_gb = 40.0
Contributor:

What about things like etcd?


@juanvallejo force-pushed the jvallejo/ram_disk_space_checker branch 2 times, most recently from 1fe382c to d2e2d58 (March 2, 2017 19:57)
@brenton changed the title from "[WIP] add ram and storage preflight check" to "add ram and storage preflight check" (Mar 9, 2017)
@juanvallejo (Contributor Author) commented Mar 9, 2017

@sdodson, @rhcarvalho, or @sosiouxme wondering if you could give this patch one last pass? Thanks



if __name__ == '__main__':
    main()
Contributor:

Would it make sense to make this available to other roles? I could see this being of value in the docker-related roles as well. It might also be nice to return at least some of the values as 'facts' instead of just a single info dict.

Contributor Author:

Sure, will update this to return its values under ansible_facts

Contributor Author:

hm, maybe it would be better to rename the module from docker_info to simply docker? That way, it makes sense for it to return not just info, but docker client info.

Contributor:

docker_facts would probably be better; docker would shadow the older (and deprecated) docker module upstream.

def main():
    client = AnsibleDockerClient(
        argument_spec=dict(
            name=dict(type='list'),
Contributor:

name doesn't appear to be used anywhere.

I would add supports_check_mode=True to params for AnsibleDockerClient.

@juanvallejo force-pushed the jvallejo/ram_disk_space_checker branch from e7bc690 to 43680e3 (March 10, 2017 23:40)

recommended_node_space_gb = 15.0
recommended_master_space_gb = 40.0
minimum_docker_space_percent = 5.0
Contributor:

These should be configurable at run time.

They may be defined as variables in the preflight/check.yml playbook, and their names should have some prefix to avoid name collisions...

diskfree = self.to_gibibytes(self.get_disk_space(ansible_mounts))

# if running containerized installation check
# that available space is not below 5% instead
Contributor:

Comments should not refer to specific values like "5%" that are defined in a variable elsewhere.

(multiple occurrences, check all)

info = docker_facts.get("docker_facts")
status = dict(info.get("DriverStatus"))

if info.get("Driver", None) == self.supported_docker_dev_driver:
Contributor:

The behavior of this check doesn't seem to be clearly defined when the driver is not what we consider to be "supported".
We need to clarify the intent.

status = dict(info.get("DriverStatus"))

if info.get("Driver", None) == self.supported_docker_dev_driver:
disktotal = status.get("Data Space Total", None)
Contributor:

What's the guarantee we have on these key names "Data Space Total", etc? I'm afraid those are not guaranteed APIs to rely on.

Contributor Author:

hm, they do not seem to be guaranteed as far as I have seen.

Another idea to retrieve these values would be to use the docker_facts module to get the value of DockerRootDir (/var/lib/docker by default), and perform df on it:

$ df --output=avail -B G  /var/lib/docker | grep "G" -m 1
 20G

Simpler yet, we could just use df to get the mountpoint for /var/lib/docker, then look for that in ansible_mounts under task_vars and get available and total values from there.
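The ansible_mounts variant could skip df entirely by matching /var/lib/docker against mount points by longest prefix; a hedged sketch:

def mount_for_path(ansible_mounts, path="/var/lib/docker"):
    # The mount whose mount point is the longest prefix of `path` is the
    # filesystem that actually holds it ("/var" beats "/" when both exist).
    best = None
    for mnt in ansible_mounts:
        mount = mnt.get("mount", "")
        if path == mount or path.startswith(mount.rstrip("/") + "/"):
            if best is None or len(mount) > len(best.get("mount", "")):
                best = mnt
    return best  # read size_available / size_total from the result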

Contributor Author:

@rhcarvalho
Upon looking at the documentation for devicemapper (https://github.com/docker/docker/tree/master/daemon/graphdriver/devmapper#information-on-docker-info), the values Data Space Total and Data Space Available appear in the linked readme. Although the values underneath the storage driver vary from driver to driver, since we first check that the host machine uses devicemapper, I would prefer to rely on these values.


if info.get("Driver", None) == self.supported_docker_dev_driver:
disktotal = status.get("Data Space Total", None)
diskavail = status.get("Data Space Available", None)
Contributor:

Unless in a context that would otherwise require None for clarity, dict.get in Python already defaults to None, so it is more common to write:

some_dict.get("some key")

However, in this case there is a problem with data types. I'd avoid mixing whatever type we expect to find in the dictionary (apparently str) with NoneType.

"pb": 1000 ** 5
}

segments = strfmt.lower().split(" ")
Contributor:

There can easily be an AttributeError exception here. We use strfmt as if it were a str, but in the code above we may clearly be calling this function with None as the first argument...

In a dynamically typed language, we have to keep our usage of data types sane.

Contributor Author:

Thanks, I now make sure to pass a string type to this method. Corrected this in other places as well.


recommended_memory_gb = max(self.recommended_memory_gb.get(group, 0) for group in group_names)

if self.to_gigabytes(memoryfree) < recommended_memory_gb:
Contributor:

Why don't we store the recommended values in the unit we want to compare to (MB)?

recommended_memory_gb = {
    "nodes": 8.0,
    "masters": 16.0,
    "etcd": 20.0
Contributor:

It is good practice to include a comma at the end of every line in multiline definitions like this.
It makes it easier to add / remove / move items without having to make changes to other lines.

recommended_memory_gb = max(self.recommended_memory_gb.get(group, 0) for group in group_names)

if self.to_gigabytes(memoryfree) < recommended_memory_gb:
    kind = "master" if "masters" in group_names else "node"
Contributor:

This is incorrect. There are many other possibilities for group_names, we can't call everything else a "node" (there are etcd, nfs, ...).

Contributor Author:

Thanks, will fix. I originally referenced this check to determine this logic.

Contributor:

The logic there is different, it is additive: start with an empty set and add items based on the group names. Here we are incorrectly assuming that everything that is not master is a node.
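For contrast, the additive pattern being described looks roughly like this (group names per the inventory; the kind labels are illustrative):

def host_kinds(group_names):
    # Start empty and add kinds per group membership, instead of assuming
    # that anything that is not a master must be a node.
    kinds = set()
    for group, kind in (("masters", "master"), ("nodes", "node"), ("etcd", "etcd")):
        if group in group_names:
            kinds.add(kind)
    return kinds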

if info.get("Driver", None) == self.supported_docker_dev_driver:
disktotal = status.get("Data Space Total", None)
diskavail = status.get("Data Space Available", None)
diskleft_percent = self.to_bytes(diskavail) / self.to_bytes(disktotal) * 100
Contributor:

There could be division by zero here...
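A guarded computation would cover both the zero total and the None values flagged earlier; a sketch:

def space_left_percent(diskavail_bytes, disktotal_bytes):
    # Treat a missing or zero total as "unknown" rather than dividing by it.
    if not disktotal_bytes:
        return None
    return 100.0 * diskavail_bytes / disktotal_bytes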

diskleft_percent = self.to_bytes(diskavail) / self.to_bytes(disktotal) * 100

# fail if less than 5% of available dataspace space left
if int(diskleft_percent) < self.minimum_docker_space_percent:
Contributor:

Be mindful of rounding problems here.

# fail if less than 5% of available dataspace space left
if int(diskleft_percent) < self.minimum_docker_space_percent:
    msg = "The amount of data space remaining (%s%%) for the docker storage driver is below %s%%" \
        % (int(diskleft_percent), self.minimum_docker_space_percent)
Contributor:

We should not use %s (formatting of strings) with int values.

And we should prefer using str.format instead of the % operator.
(multiple occurrences, check all)
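Applying that suggestion to the message above might look like this (values unchanged; the precision is illustrative):

msg = ("The amount of data space remaining ({:.1f}%) for the docker "
       "storage driver is below {:.1f}%").format(
           diskleft_percent, self.minimum_docker_space_percent)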

recommended_diskspace_gb = self.recommended_master_space_gb

if "nodes" in group_names:
    recommended_diskspace_gb = self.recommended_node_space_gb
Contributor:

This logic is wrong.

  • There are more group names than masters and nodes.
  • If "nodes" in group_names, we should not overwrite recommended_diskspace_gb -- what if "masters" is also in group_names?!

if "nodes" in group_names:
recommended_diskspace_gb = self.recommended_node_space_gb

if float(diskfree) < recommended_diskspace_gb:
Contributor:

Why the conversion to float?!


# if running containerized installation check
# that available space is not below 5% instead
if is_containerized:
Contributor:

For ease of reading and understanding what this check does, please make the "run" method short and clear on its intents.

For example, we could have run_containerized and run_non_containerized methods to which run delegates depending on is_containerized.
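A sketch of that delegation (assuming get_var supports nested keys, as elsewhere in openshift_checks, and that the two helper methods exist):

def run(self, tmp, task_vars):
    # Keep run() a thin dispatcher so each path reads as a single intent.
    is_containerized = get_var(task_vars, "openshift", "common", "is_containerized")
    if is_containerized:
        return self.run_containerized(task_vars)
    return self.run_non_containerized(task_vars)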

if mnt.get("mount") == "/":
root_size_available = mnt.get("size_available")
if mnt.get("mount") == "/var":
return mnt.get("size_available")
Contributor:

Why do we return early on /var, but keep iterating when we see /?!

We need to clarify the intent here, what is this function promising to return? "get disk space" is not a great name, we can try to improve it once the expected output is clear.

From the recommended available storage, is that 100% towards Docker storage or do we need to look at Docker storage + other mount points?

Contributor Author:

I return immediately if /var exists at all. If not, / is the fallback mount path. I updated this method to be a bit more legible.

Contributor Author:

From the recommended available storage, is that 100% towards Docker storage or do we need to look at Docker storage + other mount points?

I meant for this to be a check on the mount point for the docker volume. At least in my case, the mount point for its volume was /, so I checked the overall disk space remaining at that point.

return {"changed": False}

@staticmethod
def get_disk_space(ansible_mounts):
Contributor:

A docstring here would make it clear what the ansible_mounts parameter should be for the function to work properly. It's a list of hashes I guess?

ansible_mounts = []
ansible_mounts.append({'size_available': 12345, 'mount': '/'})
ansible_mounts.append({'size_available': 34512, 'mount': '/var'})

Right? (plus other possible params)

Contributor Author:

Thanks for the feedback. ansible_mounts is a map like you described. I updated the method to be more legible and added a docstring.

tags = ["preflight"]

recommended_memory_mb = {
"nodes": 8000.0,
Contributor:

8000 or 8 * 1024?

Contributor Author:

I think 8000 is fine, unless we prefer MiB instead?

Contributor:

Our docs use the decimal prefixes, let's go with powers of 10 then.
https://docs.openshift.org/latest/install_config/install/prerequisites.html

Contributor (@tbielawa, Mar 28, 2017):

I scanned over the output from docker info and saw that it reports different measurements in different prefix unit systems.

...
Storage Driver: devicemapper
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Data Space Used: 1.42 GB
 Data Space Total: 44.43 GB
 Data Space Available: 43.01 GB
 Metadata Space Used: 1.507 MB
 Metadata Space Total: 511.7 MB
 Metadata Space Available: 510.2 MB
 Thin Pool Minimum Free Space: 4.443 GB
...
Total Memory: 15.11 GiB
...

Sizes are reported using the SI base-10 system (kB, MB, GB) and memory is reported using the NIST base-2 system (KiB, GiB, etc).

As was pointed out before, I maintain a library, 'bitmath' (200 unit tests, packaged in RHEL/Fedora, etc.), that handles all of that automatically (and more).

Here's a quick gist I made that takes code from this PR and shows the bitmath equivalent.

I think that trying to get bitmath available as an RPM in the RHSM channels would be a big effort, and getting it into Atomic is nigh impossible. I'm OK w/ vendoring/slipping the bitmath library into this project.

But that's a bigger conversation we should probably have as a team.

edit#2: If you want to try bitmath, you can pip install bitmath or dnf/yum install python-bitmath (or, depending on your distribution version, python2-bitmath or python3-bitmath).
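For illustration, normalizing docker info's mixed-prefix strings with bitmath might look like this (a hedged sketch based on bitmath's parse_string API):

import bitmath

data_total = bitmath.parse_string("44.43 GB")    # SI, base-10
total_memory = bitmath.parse_string("15.11 GiB") # NIST, base-2

# Convert both to a common unit before comparing or reporting.
print(data_total.to_GiB(), total_memory.to_GiB())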

Member:

This is memory, though, which isn't coming from docker info... it comes from ansible setup. Whatever our docs say, nobody measures RAM in GB (only disk manufacturers get to pull that crap), so it seems to me MiB/GiB should be used.

return {"changed": False}

@staticmethod
def to_gigabytes(size_in_mb):
Contributor:

Dead code?

Contributor Author:

Thanks, will remove

recommended_memory_mb = max(self.recommended_memory_mb.get(group, 0) for group in group_names)

if memoryfree < recommended_memory_mb:
    msg = "Available memory ({available} GB) below recommended storage. Minimum required memory: {recommended} GB".format(available=memoryfree, recommended=recommended_memory_mb)
Contributor:

Too long line...

The reported units are wrong.

Contributor Author:

Will fix, taking care of linting issues towards the end


segments = strfmt.lower().split(" ")
if not units.get(segments[1], False):
    return None
Contributor:

Types: why does this function sometimes return a number / float, sometimes NoneType?

Contributor Author:

Thanks, will address this throughout the code

"tib": 1024 ** 4,
"tb": 1000 ** 4,
"pib": 1024 ** 5,
"pb": 1000 ** 5
Contributor:

I'm still puzzled about why we're doing these conversions this way...

But anyway, a nit but part of learning "the Python way":

1000**5,
  • The ** operator is typically not surrounded by spaces.
  • Add a trailing comma to this type of 'lists'.

Contributor Author (@juanvallejo, Mar 28, 2017):

I was not sure what specific units to expect, so this at least covers standard cases. @tbielawa has written a great python module that we could use in cases such as this one if we are willing to include it in our openshift-ansible image / make it a requirement on the host machine. wdyt?

Contributor:

I'd prefer not to do any conversions :-)
"the best code is no code at all"

Contributor Author:

While I am certainly for this, I am not sure to what extent we can guarantee that the units returned by devicemapper's Data Space Total and Data Space Available will be consistent in all cases.


@staticmethod
def to_gigabytes(total_bytes):
    return total_bytes / 1000000000
Contributor:

Here we also need some clarity on whether we should be using base 10^3 or 2^10.
https://en.wikipedia.org/wiki/Byte#Unit_multiples

Would be easier to read a formula than counting zeros:

total_bytes / 1000000000
total_bytes / (10**3)**3
total_bytes / 1000**3

def containerized_volume_check(self, task_vars):
    docker_facts = self.module_executor("docker_facts", {}, task_vars)
    info = docker_facts.get("docker_facts")
    status = dict(info.get("DriverStatus"))
Contributor:

>>> dict(None)
TypeError: 'NoneType' object is not iterable
>>> info = None
>>> info.get("DriverStatus")
AttributeError: 'NoneType' object has no attribute 'get'

failed, msg = self.containerized_volume_check(task_vars)
if failed:
    return {"failed": True, "msg": msg}
Contributor:

The way it stands now, we will call both containerized_volume_check and noncontainerized_volume_check when is_containerized is true and failed is false...

Contributor Author:

Thanks for catching that :)

if "masters" in group_names:
return self.recommended_master_space_gb

return self.recommended_node_space_gb
Contributor:

This is also assuming that everything that is not a master is a node. If we expect to run this check on non-master-non-node hosts, we should follow the docs and use the proper storage required for each type of host.

Contributor Author:

hm, will go ahead and use the info from https://docs.openshift.com/container-platform/3.4/install_config/install/prerequisites.html#hardware; however, it only contains requirements for masters, nodes, and etcd.

@tbielawa (Contributor):

I'm not sure if it is showing up anymore, due to the folded 'Hide Outdated' segments above, but I did post another comment with more information on bitmath and the problem of converting/parsing inconsistent units. Direct link to the comment above:

#3518 (comment)

from openshift_checks import OpenShiftCheck, get_var


class DiskAvailability(OpenShiftCheck):
Contributor:

The name of the file is inconsistent with the check name and class name.

def main():
    client = AnsibleDockerClient(
        argument_spec=dict(
        ),
Member:

This looks odd. Why not make it one line...

name = "disk_availability"
tags = ["preflight"]

recommended_diskspace = {
Member:

How about a comment about why it's base 10, and a link to origin docs on where this recommendation comes from?

}

minimum_docker_space_percent = 5.0
supported_docker_dev_driver = "devicemapper"
Member:

Seems conceptually like it should be a list (of one, currently).

Contributor Author:

Yeah, I was thinking of removing the check for the storage driver from this PR altogether, and just handling it as part of the docker_storage_check PR https://github.com/openshift/openshift-ansible/pull/3787/files#diff-c3d0f0c6a0d94ad7d6c5a62da695118dR17

Contributor Author:

Actually, remembering a very similar conversation I had with @rhcarvalho about this, it would be better to have the storage driver check be its own preflight check, allowing this PR and #3787 to focus on a single thing. WDYT?

Member:

Err, belatedly... yes.


if is_containerized:
    failed, msg = self.containerized_volume_check(task_vars)
    if failed:
Member:

Could this not be simplified to:

if is_containerized:
    failed, msg = self.containerized_volume_check(task_vars)
else:
    failed, msg = self.noncontainerized_volume_check(ansible_mounts, task_vars)
return {"failed": failed, "changed": False, "msg": msg}

Member:

Also, even in the case of non-containerized components, https://docs.openshift.org/latest/install_config/install/prerequisites.html indicates nodes should have 15 GB of free docker storage. Seems like the code could be reused for that.

failed, msg = self.noncontainerized_volume_check(ansible_mounts, task_vars)
return {"failed": failed, "msg": msg}

def containerized_volume_check(self, task_vars):
Member:

A docstring is in order here, e.g.

"""A contained component uses only docker storage, so check docker storage info against requirements."""

Contributor Author:

Thanks, per the comments on #3518 (comment), I have decided to leave all aspects of the containerized check to #3787. (I will further refine that PR, rename it so the name is consistent with the check in this PR, and split out the storage driver check from it into its own check). Any of your feedback on #3787 is definitely welcome :)

openshift_diskfree_bytes = self.get_openshift_disk_availability(ansible_mounts)

if openshift_diskfree_bytes < recommended_diskspace_bytes:
    msg = ("Available disk space ({diskfree} GB) for OpenShift volume below recommended storage. "
Member:

"for volume containing /var" - there is no "OpenShift volume"

if not units.get(segments[1]):
    return 0

byte_size = float(segments[0]) * float(units.get(segments[1]))
Member:

why float? bytes are integers...



class DiskAvailability(OpenShiftCheck):
    """Check that recommended disk space is available."""
Member:

Not sure how else we should annotate it, but it should be clear that this is disk space prior to an install. If running preflight checks before an upgrade, the space may already be in use, and that should be fine; we should probably not even run this check in that case. We would do a different kind of calculation for hosts that are already installed and running.

tags = ["preflight"]

recommended_memory_mb = {
"nodes": 8000.0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is memory though which isn't coming from docker info... comes from ansible setup. Whatever our docs say, nobody measures RAM in GB (only disk manufacturers get to pull that crap), so it seems to me MiB/GiB should be used.


def run(self, tmp, task_vars):
    group_names = get_var(task_vars, "group_names", default=[])
    memoryfree = get_var(task_vars, "ansible_memtotal_mb")
Member:

nit, shouldn't memoryfree be named total_memory or similar?

@juanvallejo force-pushed the jvallejo/ram_disk_space_checker branch from 81fef87 to 1af86d7 (April 6, 2017 21:06)
@juanvallejo changed the title from "add ram and storage preflight check" to "add ram and non-containerized storage preflight check" (Apr 6, 2017)
@rhcarvalho (Contributor):

@ncdc thanks for taking a look! I have seen that, but I still do not understand it.

  • INFO: Error running cluster/log-dump.sh...: looks like an error but is not, according to @stevekuznetsov on email:

    This is a red-herring message that has been around for a long time. Some part of the Kubernetes e2e framework is trying to run a script that doesn't exist. It does not cause a failure and happens on every e2e run. I think the previous consensus was that we don't want to carry yet another UPSTREAM commit to get rid of the function in our vendored k8s code as it's benign. I can understand how this is confusing, though, and we can revisit that decision.

  • SUCCESS!, PASS, Test Suite Passed: that means we're good, right? :-)

  • [ERROR] PID 26302: hack/lib/cleanup.sh:74: [[ -n "${OS_DEBUG:-}" ]] exited with status 1.: what should we make of this?! Failed to dump container logs?

@stevekuznetsov (Contributor):

The stacktrace isn't being helpful here unfortunately, see openshift/origin#13759 -- I'll work on the stacktrace being better soon. The error may have been in container cleanup? Unclear, unfortunately.

juanvallejo and others added 5 commits April 17, 2017 13:29
- only support a fixed list of recommended values for now, no
overwriting via Ansible variables (keep it simple, add features as
needed).

- implement is_active: run this check only for hosts that have a
recommended disk space.

- test priority of mount paths / and /var.
- Expose only is_active and no other method.
- Move general comment to module docstring.
- Fix required memory for etcd hosts (10 -> 20 GB), as per
documentation.
- Some changes to make the code more similar to the similar
DiskAvailability check.
- Do not raise exception for hosts that do not have a recommended memory
value (those are ignored anyway through `is_active`, so that was
essentially dead code).
- Test that the required memory is the max of the recommended memories
for all groups assigned to a host. E.g. if a host is master and node, we
should check that it has enough memory to be a master, because the
memory requirement for a master is higher than for a node.
@juanvallejo force-pushed the jvallejo/ram_disk_space_checker branch from 9cb85c9 to 2aca8a4 (April 17, 2017 17:29)
@openshift-bot:

Evaluated for openshift ansible test up to 2aca8a4

@rhcarvalho (Contributor):

aos-ci-test

@openshift-bot:

success: "aos-ci-jenkins/OS_3.5_NOT_containerized, aos-ci-jenkins/OS_3.5_NOT_containerized_e2e_tests" for 2aca8a4 (logs)

@openshift-bot:

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 2aca8a4 (logs)

@openshift-bot:

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 2aca8a4 (logs)

@openshift-bot:

success: "aos-ci-jenkins/OS_3.5_containerized, aos-ci-jenkins/OS_3.5_containerized_e2e_tests" for 2aca8a4 (logs)

@rhcarvalho (Contributor):

[merge]

@juanvallejo (Contributor Author):

merge test flaked on openshift/origin#13271. re[merge]

@juanvallejo (Contributor Author) commented Apr 18, 2017

merge flaked on openshift/origin#13271
re[merge]

@openshift-bot:

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_openshift_ansible/49/) (Base Commit: 00095b2)

@rhcarvalho (Contributor):

Test failure is a bug in the CI job -- openshift/origin#13806

[merge]

@rhcarvalho (Contributor):

@sdodson do we have documented anywhere what the required checks are before a PR can be merged? Now in this PR I do not know if we need to re-trigger aos-ci-test.

It would be useful to operate like Origin -- where [merge] automatically runs the tests it needs, without requiring 2 different bot commands. Is the plan going in that direction?

@stevekuznetsov (Contributor):

The configuration for branches is here. We do aim to move more and more tests away from AOS CI and onto the Origin CI, and aid in that effort is greatly appreciated, but it's not trivial and will take some time.

@rhcarvalho (Contributor):

@stevekuznetsov thanks for the pointer. Which branch should we refer to for the latest version of that file / what is currently enforced in in-flight PRs?

@stevekuznetsov (Contributor):

@sdodson (Member) commented Apr 19, 2017

re[merge]
EXPOSE THE KUBECONFIG known issue

@openshift-bot:

Evaluated for openshift ansible merge up to 2aca8a4

@openshift-bot:

continuous-integration/openshift-jenkins/merge FAILURE (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/254/) (Base Commit: 9ace041)

@rhcarvalho (Contributor):

Last flake was openshift/origin#13814.

Since the code here is not related to installs or upgrades, is strictly contained in the openshift_health_checker role, the merge failed at least 8 times with known flakes linked above, and aos-ci-test and Travis are green, we'll use the temporary exception to merge it manually.

@rhcarvalho merged commit 3820dab into openshift:master Apr 20, 2017
@juanvallejo deleted the jvallejo/ram_disk_space_checker branch April 20, 2017 13:30