Add troubleshooting of node maintenance mode #619
Open: w13915984028 wants to merge 3 commits into harvester:main from w13915984028:doc6264
@@ -21,10 +21,36 @@ Because Harvester is built on top of Kubernetes and uses etcd as its database, t

## Node Maintenance

In the following scenarios, you may need to migrate or shut down the workloads on a node, and possibly shut down the node itself:

- Replacing, adding, or removing hardware
- Changing the network settings
- Troubleshooting
- Removing a node

Harvester provides the `Node Maintenance` feature, which runs a series of checks and operations automatically.

Admin users can click **Enable Maintenance Mode** to evict all VMs from a node. Harvester uses the `VM live migration` feature to migrate the VMs to other nodes automatically. Note that at least two active nodes are required to use this feature.

![node-maintenance.png](/img/v1.2/host/node-maintenance.png)

After a while, the target node enters `Maintenance Mode`.

![node-enter-maintenance-mode.png](/img/v1.3/troubleshooting/node-enter-maintenance-mode.png)

:::info important

Check the [known limitations and workarounds](../troubleshooting/host.md#an-enable-maintenance-mode-node-stucks-on-cordoned-state) before you enable maintenance mode, or when you encounter issues.

If you have manually attached any volume to this node, it may block `Node Maintenance`. See [Manually Attached Volumes](../troubleshooting/host.md#manually-attached-volumes) and set the appropriate global option.

If you have any single-replica volumes, they may block `Node Maintenance`. See [Single-Replica Volumes](../troubleshooting/host.md#single-replica-volumes) and set the appropriate global option.

:::

## Cordoning a Node

Cordoning a node marks it as unschedulable. This feature is useful for performing short tasks on the node during small maintenance windows, like reboots, upgrades, or decommissions. When you’re done, power the node back on and make it schedulable again by uncordoning it.

@@ -42,6 +68,8 @@ Before removing a node from a Harvester cluster, determine if the remaining node

If the remaining nodes do not have enough resources, VMs might fail to migrate and volumes might degrade when you remove a node.

If some volumes were created from a customized `StorageClass` with [Number of Replicas](../advanced/storageclass.md#number-of-replicas) set to **1**, back up those single-replica volumes or redeploy the related workloads to other nodes in advance so that the volumes are scheduled to other nodes. Otherwise, those volumes cannot be rebuilt or restored from other nodes after this node is removed.

:::

### 1. Check if the node can be removed from the cluster.

@@ -522,4 +550,4 @@ status:
```

The `harvester-node-manager` pod(s) in the `harvester-system` namespace may also contain some hints as to why it is not rendering a file to a node.
This pod is part of a daemonset, so it may be worth checking the pod that is running on the node of interest.

@@ -0,0 +1,141 @@
---
sidebar_position: 6
sidebar_label: Host
title: "Host"
---

<head>
<link rel="canonical" href="https://docs.harvesterhci.io/v1.3/troubleshooting/host"/>
</head>

## Node in Maintenance Mode Becomes Stuck in Cordoned State

When you enable `Maintenance Mode` on a node using the Harvester UI, the node can become stuck in the `Cordoned` state. The menu then shows the **Enable Maintenance Mode** option instead of **Disable Maintenance Mode**.

![node-stuck-cordoned.png](/img/v1.3/troubleshooting/node-stuck-cordoned.png)

The Harvester pod logs contain messages similar to the following:

```
time="2024-08-05T19:03:02Z" level=info msg="evicting pod longhorn-system/instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7"
time="2024-08-05T19:03:02Z" level=info msg="error when evicting pods/\"instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7\" -n \"longhorn-system\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget."

time="2024-08-05T19:03:07Z" level=info msg="evicting pod longhorn-system/instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7"
time="2024-08-05T19:03:07Z" level=info msg="error when evicting pods/\"instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7\" -n \"longhorn-system\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget."

time="2024-08-05T19:03:12Z" level=info msg="evicting pod longhorn-system/instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7"
time="2024-08-05T19:03:12Z" level=info msg="error when evicting pods/\"instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7\" -n \"longhorn-system\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget."
```

The Longhorn Instance Manager uses a PodDisruptionBudget (PDB) to protect itself from accidental eviction, which could result in loss of volume data. When the above error occurs, the `instance-manager` pod is still serving volumes or replicas.

The following sections describe the known causes and their corresponding workarounds.

### Manually Attached Volumes

A volume that is attached to a node using the [embedded Longhorn UI](./harvester.md#access-embedded-rancher-and-longhorn-dashboards) can cause the error, because the volume is attached to a node name instead of a pod name.

You can check this in the [embedded Longhorn UI](./harvester.md#access-embedded-rancher-and-longhorn-dashboards).

![attached-volume.png](/img/v1.3/troubleshooting/attached-volume.png)

As shown above, the manually attached volume is attached to a node name instead of a pod name.

You can also use the CLI to retrieve the details of the `VolumeAttachment` CRD object.

Example of a volume that was attached using the Longhorn UI:

```
- apiVersion: longhorn.io/v1beta2
  kind: VolumeAttachment
  ...
  spec:
    attachmentTickets:
      longhorn-ui:
        id: longhorn-ui
        nodeID: node-name
        ...
    volume: pvc-9b35136c-f59e-414b-aa55-b84b9b21ff89
```

Example of a volume that was attached using the Longhorn CSI driver:

```
- apiVersion: longhorn.io/v1beta2
  kind: VolumeAttachment
  spec:
    attachmentTickets:
      csi-b5097155cddde50b4683b0e659923e379cbfc3873b5b2ee776deb3874102e9bf:
        id: csi-b5097155cddde50b4683b0e659923e379cbfc3873b5b2ee776deb3874102e9bf
        nodeID: node-name
        ...
    volume: pvc-3c6403cd-f1cd-4b84-9b46-162f746b9667
```

:::note

Manually attaching a volume to a node is not recommended.

Harvester automatically attaches and detaches volumes based on operations such as creating or migrating a VM.

:::

#### Workaround 1: Set `Detach Manually Attached Volumes When Cordoned` to `True`

The Longhorn setting [Detach Manually Attached Volumes When Cordoned](https://longhorn.io/docs/1.6.0/references/settings/#detach-manually-attached-volumes-when-cordoned) determines whether Longhorn automatically detaches volumes that are manually attached to a cordoned node. When the setting is disabled, such volumes block node draining.

The default value of this setting depends on the embedded Longhorn version:

| Harvester version | Embedded Longhorn version | Default value |
| --- | --- | --- |
| v1.3.1 | v1.6.0 | `true` |
| v1.4.0 | v1.7.0 | `false` |

Set this option to `true` using the [embedded Longhorn UI](./harvester.md#access-embedded-rancher-and-longhorn-dashboards).

#### Workaround 2: Manually Detach the Volume

Detach the volume using the [embedded Longhorn UI](./harvester.md#access-embedded-rancher-and-longhorn-dashboards).

![detached-volume.png](/img/v1.3/troubleshooting/detached-volume.png)

Once the volume is detached, you can successfully enable `Maintenance Mode` on the node.

![node-enter-maintenance-mode.png](/img/v1.3/troubleshooting/node-enter-maintenance-mode.png)

### Single-Replica Volumes

Harvester allows you to create customized `StorageClasses` that describe how Longhorn must provision volumes. If necessary, you can create a `StorageClass` with the [Number of Replicas](../advanced/storageclass.md#number-of-replicas) parameter set to `1`.

When a volume is created using such a `StorageClass` and is attached to a node using the CSI driver or other methods, the single replica stays on that node even after the volume is detached.

You can check this using the `Volume` CRD object.

```
- apiVersion: longhorn.io/v1beta2
  kind: Volume
  ...
  spec:
    ...
    numberOfReplicas: 1 # the replica number
    ...
  status:
    ...
    ownerID: nodeName
    ...
    state: attached
```

#### Workaround: Set `Node Drain Policy`

The Longhorn [Node Drain Policy](https://longhorn.io/docs/1.6.0/references/settings/#node-drain-policy) is set to `block-if-contains-last-replica` by default. This option forces Longhorn to block node draining when the node contains the last healthy replica of a volume.

To address the issue, change the value to `allow-if-replica-is-stopped` using the [embedded Longhorn UI](./harvester.md#access-embedded-rancher-and-longhorn-dashboards).

:::info important

If you plan to remove the node after `Maintenance Mode` is enabled, back up those single-replica volumes or redeploy the related workloads to other nodes in advance so that the volumes are scheduled to other nodes.

:::

Starting with Harvester v1.4.0, the `Node Drain Policy` is set to `allow-if-replica-is-stopped` by default.