[CPDEV-101898] - do not create empty /etc/kubernetes/nodes-k8s-versions.txt file #663

Imadzuma · 2024-05-23T07:19:40Z

Description

During the upgrade procedure, kubemarine creates a special temporary file with node versions: /etc/kubernetes/nodes-k8s-versions.txt.
This file is needed to resume from the same step after a failed upgrade: https://github.com/Netcracker/KubeMarine/blob/main/documentation/Maintenance.md#nodes-saved-versions-before-upgrade
The file is created by a single complex command: https://github.com/Netcracker/KubeMarine/blob/v0.30.0/kubemarine/kubernetes/__init__.py#L1002
However, if something goes wrong when kubemarine calls the kubectl get nodes command (e.g. etcd restarts for some reason), no exception is raised and an empty /etc/kubernetes/nodes-k8s-versions.txt is created.
The exception only appears later, when this file is parsed:

2024-05-16 15:44:30.854 +0300 INFO *** TASK prepull_images ***
2024-05-16 15:44:30.854 +0300 DEBUG Prepulling Kubernetes images...
2024-05-16 15:44:32.460 +0300 CRITICAL FAILURE!
2024-05-16 15:44:32.460 +0300 CRITICAL TASK FAILED prepull_images
2024-05-16 15:44:32.464 +0300 CRITICAL KME0001: Unexpected exception
2024-05-16 15:44:32.464 +0300 Traceback (most recent call last):
2024-05-16 15:44:32.464 +0300   File "/usr/local/lib/python3.12/site-packages/kubemarine/core/flow.py", line 381, in run_tasks_recursive
2024-05-16 15:44:32.464 +0300     task(cluster)
2024-05-16 15:44:32.464 +0300   File "/usr/local/lib/python3.12/site-packages/kubemarine/procedures/upgrade.py", line 44, in prepull_images
2024-05-16 15:44:32.464 +0300     fix_cri_socket(cluster)
2024-05-16 15:44:32.464 +0300   File "/usr/local/lib/python3.12/site-packages/kubemarine/procedures/upgrade.py", line 286, in fix_cri_socket
2024-05-16 15:44:32.464 +0300     upgrade_group = kubernetes.get_group_for_upgrade(cluster)
2024-05-16 15:44:32.464 +0300                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-05-16 15:44:32.464 +0300   File "/usr/local/lib/python3.12/site-packages/kubemarine/kubernetes/__init__.py", line 1052, in get_group_for_upgrade
2024-05-16 15:44:32.464 +0300     nodes_for_upgrade = autodetect_non_upgraded_nodes(cluster, version)
2024-05-16 15:44:32.464 +0300                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-05-16 15:44:32.464 +0300   File "/usr/local/lib/python3.12/site-packages/kubemarine/kubernetes/__init__.py", line 1013, in autodetect_non_upgraded_nodes
2024-05-16 15:44:32.464 +0300     raise Exception('Remote result did not returned any lines containing node info')
2024-05-16 15:44:32.464 +0300 Exception: Remote result did not returned any lines containing node info

The problem is made worse by the fact that the file is not deleted after such a failure, so if the upgrade procedure is restarted once the cluster is healthy again, it keeps failing because of the empty /etc/kubernetes/nodes-k8s-versions.txt.
To resolve the issue, the file has to be removed manually, but this is not obvious to the user: the file is completely empty and lacks even the first-line comment describing its purpose.
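
The failure mode can be reproduced in isolation. The sketch below is only an illustration and assumes that the original command had the general shape of a shell pipeline redirected to a file (the exact command is at the link above). `false` stands in for a failing kubectl call, and the sed expression, the `run_pipeline` and `demo` helpers, and the temporary path are all hypothetical:

```python
import os
import subprocess
import tempfile


def run_pipeline(producer: str, path: str) -> int:
    """Run `<producer> | sed ... > <path>` in sh and return the shell's exit status."""
    # Without `set -o pipefail`, sh reports the status of the LAST command (sed),
    # and the redirection creates the target file before anything is written.
    command = f"{producer} | sed 's/^/node: /' > '{path}'"
    return subprocess.run(["sh", "-c", command]).returncode


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "nodes-k8s-versions.txt")
        # `false` is a stand-in for a kubectl call that fails (e.g. etcd is down).
        status = run_pipeline("false", path)
        return status, os.path.exists(path), os.path.getsize(path)


if __name__ == "__main__":
    # Exit status, whether the file exists, and its size in bytes.
    print(demo())
```

Under these assumptions, the caller sees a zero exit status even though the producer failed, and an empty file is left behind.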

Solution

The complex command is split into several commands that are called separately, so if the kubectl get nodes command fails, kubemarine raises an exception that is handled, and the empty file is not created;
Parsing of the kubectl get nodes result is moved into the kubemarine code instead of using the sed command.
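
The idea of the second point can be sketched as follows. This is not the code from the PR: it is a minimal illustration that assumes the kubectl output consists of `NAME VERSION` lines (for example, from `kubectl get nodes --no-headers -o custom-columns=...`). The function names, the header text and the file layout are hypothetical:

```python
from typing import Dict

# Hypothetical first-line comment; the real file's header text may differ.
HEADER = "# Node versions saved before upgrade"


def parse_node_versions(kubectl_stdout: str) -> Dict[str, str]:
    """Parse 'NAME VERSION' lines and fail loudly if no node info is present."""
    versions: Dict[str, str] = {}
    for line in kubectl_stdout.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"Unexpected line in kubectl output: {line!r}")
        name, version = parts
        versions[name] = version
    if not versions:
        # Raised BEFORE anything is written, so no empty file can appear.
        raise ValueError("kubectl output did not contain any node info")
    return versions


def render_versions_file(versions: Dict[str, str]) -> str:
    """Build the file content only from successfully parsed data."""
    lines = [HEADER] + [f"{name} {version}" for name, version in sorted(versions.items())]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "node-1   v1.28.4\nnode-2   v1.28.4\n"
    print(render_versions_file(parse_node_versions(sample)), end="")
```

Because the content is validated in Python before the file is written, an empty or malformed kubectl result surfaces as an exception at the point of failure instead of as an empty file that is discovered on the next run.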

Test Cases

TestCase 1

Test Configuration:

Hardware:
OS: any;
Inventory: any;

Steps:

1. Run kubemarine install and wait for the installation to complete successfully;
2. Run kubemarine upgrade;
3. Restart etcd while kubemarine runs prepull_images (the best way is to use breakpoints to fail etcd right before the autodetect_non_upgraded_nodes function);
4. Wait until kubemarine upgrade finishes (successfully or not);
5. Rerun kubemarine upgrade once etcd has restarted.

Results:

| Before | After |
| --- | --- |
| The empty `/etc/kubernetes/nodes-k8s-versions.txt` is created on the first control-plane after step 4 | No `/etc/kubernetes/nodes-k8s-versions.txt` after step 4 |
| `kubemarine upgrade` on step 5 fails | `kubemarine upgrade` on step 5 is successful |
| `kubemarine upgrade` on steps 2-4 always fails | `kubemarine upgrade` on steps 2-4 can finish successfully if the failed etcd restarts quickly and does not affect other commands |

Checklist

I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
Integration CI passed
Unit tests. If yes, list the new/changed tests with a brief description
There are no merge conflicts

Unit tests

Indicate new or changed unit tests and what they do, if any.

kubemarine/kubernetes/__init__.py

Co-authored-by: ilia1243 <[email protected]>

do not create empty /etc/kubernetes/nodes-k8s-versions.txt file

bbeb1fc

Imadzuma requested review from koryaga, ilia1243 and alexarefev May 23, 2024 07:19

ilia1243 reviewed May 23, 2024


ilia1243 approved these changes May 23, 2024

Update kubemarine/kubernetes/__init__.py

5a2bd63

Co-authored-by: ilia1243 <[email protected]>

koryaga added the bug Something isn't working label May 23, 2024

koryaga merged commit 5d6b161 into main May 23, 2024
44 checks passed

koryaga deleted the bugfix/nodes-versions-upgrade-temp-file branch May 23, 2024 09:38
