[CPDEV-101898] - do not create empty /etc/kubernetes/nodes-k8s-versions.txt file #663

Merged
merged 2 commits into from
May 23, 2024

Imadzuma
Contributor

Description

During the upgrade procedure, kubemarine creates a special temporary file with node versions: /etc/kubernetes/nodes-k8s-versions.txt.
This file is needed to resume from the same step after a failed upgrade: https://github.com/Netcracker/KubeMarine/blob/main/documentation/Maintenance.md#nodes-saved-versions-before-upgrade
The file is created by a single complex command: https://github.com/Netcracker/KubeMarine/blob/v0.30.0/kubemarine/kubernetes/__init__.py#L1002
However, if something goes wrong while kubemarine runs the kubectl get nodes command (e.g. etcd restarts for some reason), no exception is raised and an empty /etc/kubernetes/nodes-k8s-versions.txt is created.
The exception only appears later, when this file is parsed:

2024-05-16 15:44:30.854 +0300 INFO *** TASK prepull_images ***
2024-05-16 15:44:30.854 +0300 DEBUG Prepulling Kubernetes images...
2024-05-16 15:44:32.460 +0300 CRITICAL FAILURE!
2024-05-16 15:44:32.460 +0300 CRITICAL TASK FAILED prepull_images
2024-05-16 15:44:32.464 +0300 CRITICAL KME0001: Unexpected exception
2024-05-16 15:44:32.464 +0300 Traceback (most recent call last):
2024-05-16 15:44:32.464 +0300   File "/usr/local/lib/python3.12/site-packages/kubemarine/core/flow.py", line 381, in run_tasks_recursive
2024-05-16 15:44:32.464 +0300     task(cluster)
2024-05-16 15:44:32.464 +0300   File "/usr/local/lib/python3.12/site-packages/kubemarine/procedures/upgrade.py", line 44, in prepull_images
2024-05-16 15:44:32.464 +0300     fix_cri_socket(cluster)
2024-05-16 15:44:32.464 +0300   File "/usr/local/lib/python3.12/site-packages/kubemarine/procedures/upgrade.py", line 286, in fix_cri_socket
2024-05-16 15:44:32.464 +0300     upgrade_group = kubernetes.get_group_for_upgrade(cluster)
2024-05-16 15:44:32.464 +0300                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-05-16 15:44:32.464 +0300   File "/usr/local/lib/python3.12/site-packages/kubemarine/kubernetes/__init__.py", line 1052, in get_group_for_upgrade
2024-05-16 15:44:32.464 +0300     nodes_for_upgrade = autodetect_non_upgraded_nodes(cluster, version)
2024-05-16 15:44:32.464 +0300                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-05-16 15:44:32.464 +0300   File "/usr/local/lib/python3.12/site-packages/kubemarine/kubernetes/__init__.py", line 1013, in autodetect_non_upgraded_nodes
2024-05-16 15:44:32.464 +0300     raise Exception('Remote result did not returned any lines containing node info')
2024-05-16 15:44:32.464 +0300 Exception: Remote result did not returned any lines containing node info 
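
The silent failure can be reproduced without a cluster. The sketch below is only an illustration and assumes that the original command was a shell pipeline ending in a redirect; the real command is the one linked above and is not reproduced here. `false` stands in for a failing kubectl get nodes, `cat` stands in for the sed post-processing, and the helper name is hypothetical. It needs a POSIX shell.

```python
import os
import shlex
import subprocess
import tempfile


def run_unchecked_pipeline(target: str) -> int:
    """Run a pipeline whose first command fails and return the shell's exit code."""
    # Without `set -o pipefail`, the shell reports only the status of the
    # last command in the pipeline, so the failure of `false` is lost.
    command = f"false | cat > {shlex.quote(target)}"
    return subprocess.run(command, shell=True).returncode


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "nodes-k8s-versions.txt")
    exit_code = run_unchecked_pipeline(target)
    file_exists = os.path.exists(target)
    file_size = os.path.getsize(target)
    # The pipeline "succeeds" and leaves an empty file behind.
    print(exit_code, file_exists, file_size)
```

The redirect creates the target file before any command in the pipeline produces output, which is why the file exists even though nothing was written to it.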

The problem is aggravated by the fact that the file is not deleted after such a failure, so if the upgrade procedure is restarted once the cluster is healthy again, it keeps failing because of the empty /etc/kubernetes/nodes-k8s-versions.txt.
To resolve the issue, the file has to be removed manually, but this is not obvious to the user: the file is completely empty and lacks even the first-line comment that describes its purpose.
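
For clusters already stuck in this state, the manual cleanup could be scripted roughly as follows. This is a sketch, not part of kubemarine: the helper names are hypothetical, and it assumes that comment lines start with `#` and that a file with no other lines carries no saved versions. On a real node the path would be /etc/kubernetes/nodes-k8s-versions.txt on the first control-plane, and removing it would require root permissions.

```python
import os


def has_node_entries(path: str) -> bool:
    """Return True if the file holds at least one non-blank, non-comment line."""
    with open(path, encoding="utf-8") as handle:
        return any(
            line.strip() and not line.lstrip().startswith("#")
            for line in handle
        )


def remove_if_stale(path: str) -> bool:
    """Delete the file if it exists but has no node entries; report whether it was removed."""
    if os.path.isfile(path) and not has_node_entries(path):
        os.remove(path)
        return True
    return False
```

A file that still lists at least one node is left untouched, so the helper does not discard versions saved by a legitimately interrupted upgrade.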

Solution

  • The complex command is split into several commands that are called separately, so if the kubectl get nodes command fails, kubemarine raises an exception that is handled, and the empty file is not created;
  • Parsing of the kubectl get nodes result is moved into the kubemarine code instead of using the sed command;
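
The idea behind both points can be sketched as below. This is not the code from this PR: all names are hypothetical, the file layout (a comment line followed by "name version" pairs) is an assumption, and the parser assumes the default kubectl get nodes table with the columns NAME, STATUS, ROLES, AGE, VERSION. The sample output and versions are made up for illustration.

```python
from typing import Callable, Dict

# Hypothetical first-line comment; the real wording may differ.
HEADER_COMMENT = "# Node versions saved before upgrade (illustrative header)"

# Made-up sample of the default `kubectl get nodes` table.
SAMPLE_OUTPUT = """NAME       STATUS   ROLES           AGE   VERSION
master-1   Ready    control-plane   12d   v1.28.4
worker-1   Ready    worker          12d   v1.27.8
"""


def parse_node_versions(kubectl_output: str) -> Dict[str, str]:
    """Map node name to version; raise if no node lines are present."""
    versions: Dict[str, str] = {}
    for line in kubectl_output.splitlines():
        columns = line.split()
        # Skip blank or short lines and the header row.
        if len(columns) < 5 or columns[0] == "NAME":
            continue
        versions[columns[0]] = columns[4]
    if not versions:
        raise ValueError("kubectl output contained no node lines")
    return versions


def render_versions_file(versions: Dict[str, str]) -> str:
    """Build the file content: a header comment plus one line per node."""
    lines = [HEADER_COMMENT]
    lines += [f"{name} {version}" for name, version in versions.items()]
    return "\n".join(lines) + "\n"


def save_node_versions(run_kubectl: Callable[[], str],
                       write_file: Callable[[str], None]) -> None:
    """Fetch, parse, and only then write, so a failure never leaves an empty file."""
    output = run_kubectl()                  # expected to raise on a non-zero exit
    versions = parse_node_versions(output)  # raises before anything is written
    write_file(render_versions_file(versions))
```

Because the write is the last step, any failure in the kubectl call or in parsing surfaces as an exception before the file is touched.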

Test Cases

TestCase 1

Test Configuration:

  • Hardware:
  • OS: any;
  • Inventory: any;

Steps:

  1. Run kubemarine install and wait for the installation to succeed;
  2. Run kubemarine upgrade;
  3. Restart etcd while kubemarine runs prepull_images (the best way is to use breakpoints to fail etcd right before the autodetect_non_upgraded_nodes function);
  4. Wait until kubemarine upgrade finishes (successfully or not);
  5. Rerun kubemarine upgrade once etcd has restarted;

Results:

| Before | After |
| --- | --- |
| The empty /etc/kubernetes/nodes-k8s-versions.txt is created on the first control-plane after step 4 | No /etc/kubernetes/nodes-k8s-versions.txt after step 4 |
| kubemarine upgrade on step 5 fails | kubemarine upgrade on step 5 is successful |
| kubemarine upgrade on steps 2-4 always fails | kubemarine upgrade on steps 2-4 can finish successfully if the failed etcd restarts quickly and does not affect other commands |

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • Integration CI passed
  • Unit tests. If Yes list of new/changed tests with brief description
  • There are no merge conflicts

Unit tests

Indicate new or changed unit tests and what they do, if any.

@koryaga koryaga added the bug Something isn't working label May 23, 2024
@koryaga koryaga merged commit 5d6b161 into main May 23, 2024
44 checks passed
@koryaga koryaga deleted the bugfix/nodes-versions-upgrade-temp-file branch May 23, 2024 09:38