
[BUG] No validation on Maintenance Strategy when Edit as YAML #6835

Open
albinsun opened this issue Oct 21, 2024 · 4 comments
Labels: area/admission-webhook, area/node-maintenance, backport-needed/1.4.1, kind/bug, reproduce/always, severity/3 (function working but has a major issue w/ workaround)

Comments

@albinsun

Describe the bug

While validating #5069, we found that there is no validation of the Maintenance Strategy when using Edit as YAML.
A VM with an invalid configuration is created successfully.

Warning

If the wrongly configured VM is located on the first node (node-0), entering maintenance mode on node-0 will cause the node to become stuck in the Cordoned state.
See Additional context.

To Reproduce

Steps to reproduce the behavior:

  1. Set up a 3-node witness cluster
  2. Create the image ubuntu-22.04-server-cloudimg-amd64.img
  3. Create the network mgmt-vlan1
  4. Virtual Machines => Create => enter Name, CPU, Memory and Image => Edit as YAML => add harvesterhci.io/maintain-mode-strategy: xxx under metadata.labels => Create
    (screenshot)
  5. ❌ VM creation should FAIL with a validation error, but it succeeds:
    • The VM is Running (it should NOT be), with an empty Maintenance Strategy
      (screenshot)
    • Edit YAML
      (screenshot)

Expected behavior

VM creation should FAIL with a validation error, since there are only 4 valid values:

  1. Migrate (default)
  2. ShutdownAndRestartAfterEnable
  3. ShutdownAndRestartAfterDisable
  4. Shutdown
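The expected check can be sketched as a small Go snippet. This is a minimal illustration, not Harvester's actual implementation; the label key matches the reproduction steps above, but the helper name `validateStrategy` is hypothetical.

```go
package main

import "fmt"

// Label key as used in the reproduction steps above.
const maintainModeStrategyLabel = "harvesterhci.io/maintain-mode-strategy"

// The four valid maintenance-mode strategies listed in this issue.
var validStrategies = map[string]bool{
	"Migrate":                        true,
	"ShutdownAndRestartAfterEnable":  true,
	"ShutdownAndRestartAfterDisable": true,
	"Shutdown":                       true,
}

// validateStrategy is a hypothetical helper: it rejects any value outside
// the four valid strategies, which is the validation this bug report says
// is missing when the label is set via Edit as YAML.
func validateStrategy(value string) error {
	if !validStrategies[value] {
		return fmt.Errorf("invalid maintenance strategy %q", value)
	}
	return nil
}

func main() {
	fmt.Println(validateStrategy("Migrate")) // valid, no error
	fmt.Println(validateStrategy("xxx"))     // the value from the repro steps: should be rejected
}
```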

Support bundle

Environment

  • Harvester ISO version: v1.4.0-rc3
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): 3-node AMD64 QEMU/KVM witness cluster, including 1 witness node

Additional context

If the wrongly configured VM is located on the first node (node-0), entering maintenance mode on node-0 will cause the node to become stuck in the Cordoned state.

  1. Manually move the wrongly configured VM to node-0
    (screenshot)
  2. Enter maintenance mode on node-0; it gets stuck in Cordoned
    (screenshot)
    (screenshot)
  3. A workaround is to manually stop or migrate the problematic VM
    • Stop the VM
      (screenshot)
    • node-0 can then enter maintenance mode
      (screenshot)
@albinsun albinsun added kind/bug Issues that are defects reported by users or that we know have reached a real release severity/3 Function working but has a major issue w/ workaround need-reprioritize reproduce/always Reproducible 100% of the time area/node-maintenance labels Oct 21, 2024
@albinsun albinsun added this to the v1.4.0 milestone Oct 21, 2024
@bk201 bk201 modified the milestones: v1.4.0, v1.5.0 Oct 23, 2024
@innobead innobead assigned m-ildefons and unassigned votdev Dec 12, 2024
m-ildefons added a commit to m-ildefons/harvester that referenced this issue Dec 12, 2024
The Pod deletion filter identifies the Pods belonging to VMs, which need
to be deleted during a node drain by looking at their maintenance-mode
strategy label. The default value for this label is `Migrate`, which
indicates that the Pod should be deleted during the node drain.
Pods with an invalid label, i.e. where the value of the maintenance-mode
strategy label is not one of:
  - Migrate
  - ShutdownAndRestartAfterEnable
  - ShutdownAndRestartAfterDisable
  - Shutdown
are now also being treated with the default behavior.
This fixes the problem that if a VM contains an invalid value in this
label, nodes can become stuck in `Cordoned` state when transitioning to
maintenance mode, as the node drain controller won't shut down the VM,
but it's also not migrated away from the node. Therefore the VM keeps
running, preventing the node from completely transitioning into
maintenance mode.
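The filter behavior described in this commit message can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the actual Harvester code: the function name `shouldDeletePod` is hypothetical, and only the decision logic from the commit message is modeled.

```go
package main

import "fmt"

const maintainModeStrategyLabel = "harvesterhci.io/maintain-mode-strategy"

// shutdownStrategies are the values for which the VM is shut down by its
// own controller, so the drain filter must NOT delete the Pod itself.
var shutdownStrategies = map[string]bool{
	"ShutdownAndRestartAfterEnable":  true,
	"ShutdownAndRestartAfterDisable": true,
	"Shutdown":                       true,
}

// shouldDeletePod mirrors the fixed filter logic: a missing label, the
// value `Migrate`, or any unknown value all fall back to the default
// `Migrate` behavior, i.e. the Pod is deleted during the node drain.
func shouldDeletePod(podLabels map[string]string) bool {
	value, ok := podLabels[maintainModeStrategyLabel]
	if !ok {
		return true // no label: default Migrate behavior
	}
	if shutdownStrategies[value] {
		return false // VM will be shut down by its controller instead
	}
	return true // Migrate or invalid value: treat as Migrate
}

func main() {
	// An invalid value no longer leaves the Pod dangling during a drain.
	fmt.Println(shouldDeletePod(map[string]string{maintainModeStrategyLabel: "bogus"}))
}
```

Before this fix, the invalid-value case fell through neither branch, leaving the Pod running and the node stuck in `Cordoned`.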

related-to: harvester#6835

Signed-off-by: Moritz Röhrich <[email protected]>
m-ildefons added a commit to m-ildefons/harvester that referenced this issue Dec 12, 2024
Add checks for maintenance-mode strategy to VM webhooks.

The admission webhooks for the VirtualMachine resource need to make
sure that the maintenance-mode strategy for the VM is set to a sane
value.
To do this, there are two checks:

The first one is in the mutating webhook, which is executed first, and
it just makes sure that the label, which defines the maintenance-mode
strategy, is set. If the label is not set, the mutating webhook will set
it with the default value of `Migrate`. If the label is set, the
mutating webhook will not modify it, even if it has an invalid value.
This ensures that the maintenance-mode strategy is never set
unintentionally to a wrong value.
The second check will ensure that in this case the request is rejected
with an error message, so the user can correct the value of the
maintenance-mode strategy label.

The second check happens in the validating webhook. This check ensures
that the maintenance-mode strategy label is set to a valid
maintenance-mode strategy, i.e. one of the values:
  - Migrate
  - ShutdownAndRestartAfterEnable
  - ShutdownAndRestartAfterDisable
  - Shutdown
This webhook will deny a CREATE or UPDATE, if the new VirtualMachine
resource does not contain the maintenance-mode strategy label at all, or
if it contains an invalid value.
The only exception is an UPDATE to a VirtualMachine resource that
already contains an invalid value in the maintenance-mode strategy label
and where the value does not change. In this case, the webhook will
accept the request. This is crucial, in case the controller needs to
deal with an existing VM that has an invalid value in this label (e.g.
on a cluster that has been upgraded from an old version, before this
label was checked by the admission webhook). In this case, the
controller still needs to be able to perform UPDATE operations on the
resource, to operate the VM.

Together, these two checks ensure that no VirtualMachine resource can be
created with an invalid maintenance-mode strategy, or with no
maintenance-mode strategy at all.
They also make sure that the maintenance-mode strategy can not be
removed or changed to an invalid value for existing VirtualMachine
resources.

related-to: harvester#6835

Signed-off-by: Moritz Röhrich <[email protected]>
@m-ildefons
Contributor

There are several problems here:

  1. When you create a VirtualMachine from the UI and you leave the setting at the default (Migrate), the label isn't set at all.
    This is actually not a big deal, since it can be fixed in the admission webhook by letting the mutating webhook add the label if it isn't already there.

  2. When you create a VirtualMachine from the UI and you do set the setting, or add the label in the YAML, then the label isn't transferred to the Pod. The UI (or the user) would need to set the label under .spec.template.metadata.labels, not under .metadata.labels in the VirtualMachine resource, or the admission webhook needs to make sure the label is transferred into the template.

  3. The node-drain controller only deletes Pods for VirtualMachines when the Pod has the label and it is set to Migrate, omitting the default behavior when the label isn't set at all. It also assumes that all VirtualMachines that should be shut down have the label set and its value is one of the other three valid settings. This leaves dangling all Pods of VirtualMachines that have an invalid value in the label or don't have the label set at all.

  4. The admission webhook doesn't check if the label is set, or if it has a valid value.

  5. (Optional) Go through the Terraform provider and check that the schema for the virtual machine resources enforces that only valid values are parsed from .tf files. Otherwise the .tf files should be rejected with a helpful error message.
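Point 2 above (transferring the label into the Pod template) can be sketched in Go. This is a simplified stand-in under stated assumptions: the `vm` struct and `propagateStrategyLabel` helper are hypothetical, while the real resource type comes from the KubeVirt API.

```go
package main

import "fmt"

const maintainModeStrategyLabel = "harvesterhci.io/maintain-mode-strategy"

// vm is a stripped-down stand-in for the VirtualMachine resource;
// the real type lives in the KubeVirt API package.
type vm struct {
	Labels         map[string]string // .metadata.labels
	TemplateLabels map[string]string // .spec.template.metadata.labels
}

// propagateStrategyLabel sketches the mutating-webhook behavior: default
// the label to Migrate when absent, then mirror it into the Pod template
// so the node-drain controller can see it on the Pod.
func propagateStrategyLabel(v *vm) {
	if v.Labels == nil {
		v.Labels = map[string]string{}
	}
	if _, ok := v.Labels[maintainModeStrategyLabel]; !ok {
		v.Labels[maintainModeStrategyLabel] = "Migrate"
	}
	if v.TemplateLabels == nil {
		v.TemplateLabels = map[string]string{}
	}
	v.TemplateLabels[maintainModeStrategyLabel] = v.Labels[maintainModeStrategyLabel]
}

func main() {
	v := &vm{Labels: map[string]string{maintainModeStrategyLabel: "Shutdown"}}
	propagateStrategyLabel(v)
	fmt.Println(v.TemplateLabels[maintainModeStrategyLabel])
}
```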

m-ildefons added a commit to m-ildefons/harvester that referenced this issue Dec 13, 2024
Add checks for maintenance-mode strategy to VM webhooks.

The admission webhooks for the VirtualMachine resource need to make
sure that the maintenance-mode strategy for the VM is set to a sane
value.
To do this, there are two checks:

The first one is in the mutating webhook, which is executed first, and
it just makes sure that the label, which defines the maintenance-mode
strategy, is set. If the label is not set, the mutating webhook will set
it with the default value of `Migrate`. If the label is set, the
mutating webhook will not modify it, even if it has an invalid value.
This ensures that the maintenance-mode strategy is never set
unintentionally to a wrong value.
The second check will ensure that in this case the request is rejected
with an error message, so the user can correct the value of the
maintenance-mode strategy label.
The mutating webhook will also ensure that the maintenance-mode strategy
label is copied from the `.metadata.labels` to
`.spec.template.metadata.labels`. This is necessary to ensure that the
Pod in which the virtual machine will run will be labeled correctly.

The second check happens in the validating webhook. This check ensures
that the maintenance-mode strategy label is set to a valid
maintenance-mode strategy, i.e. one of the values:
  - Migrate
  - ShutdownAndRestartAfterEnable
  - ShutdownAndRestartAfterDisable
  - Shutdown
This webhook will deny a CREATE or UPDATE, if the new VirtualMachine
resource does not contain the maintenance-mode strategy label at all, or
if it contains an invalid value.
The only exception is an UPDATE to a VirtualMachine resource that
already contains an invalid value in the maintenance-mode strategy label
and where the value does not change. In this case, the webhook will
accept the request. This is crucial, in case the controller needs to
deal with an existing VM that has an invalid value in this label (e.g.
on a cluster that has been upgraded from an old version, before this
label was checked by the admission webhook). In this case, the
controller still needs to be able to perform UPDATE operations on the
resource, to operate the VM.

Together, these two checks ensure that no VirtualMachine resource can be
created with an invalid maintenance-mode strategy, or with no
maintenance-mode strategy at all.
They also make sure that the maintenance-mode strategy can not be
removed or changed to an invalid value for existing VirtualMachine
resources.

related-to: harvester#6835

Signed-off-by: Moritz Röhrich <[email protected]>
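The UPDATE exception described in the commit message above can be sketched as follows. This is an illustrative reconstruction, not the actual webhook code: the function name `validateUpdate` is hypothetical, and only the accept/reject rule from the commit message is modeled.

```go
package main

import "fmt"

const maintainModeStrategyLabel = "harvesterhci.io/maintain-mode-strategy"

var validStrategies = map[string]bool{
	"Migrate":                        true,
	"ShutdownAndRestartAfterEnable":  true,
	"ShutdownAndRestartAfterDisable": true,
	"Shutdown":                       true,
}

// validateUpdate sketches the rule: reject a missing label or an invalid
// value, except when an already-invalid value is left unchanged, so
// controllers can still operate pre-existing VMs (e.g. after an upgrade
// from a version that did not check this label).
func validateUpdate(oldLabels, newLabels map[string]string) error {
	newVal, ok := newLabels[maintainModeStrategyLabel]
	if !ok {
		return fmt.Errorf("label %s must be set", maintainModeStrategyLabel)
	}
	if validStrategies[newVal] {
		return nil
	}
	// Grandfather clause: an unchanged invalid value is tolerated.
	if oldVal, ok := oldLabels[maintainModeStrategyLabel]; ok && oldVal == newVal {
		return nil
	}
	return fmt.Errorf("invalid maintenance-mode strategy %q", newVal)
}

func main() {
	old := map[string]string{maintainModeStrategyLabel: "xxx"}
	fmt.Println(validateUpdate(old, old)) // unchanged invalid value: accepted
}
```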
@m-ildefons
Contributor

The maintenance-mode strategy was introduced in v1.4.0. @innobead should this be fixed on master branch only, i.e. for v1.5.0, or should this also be backported to fix it for v1.4.1?

@innobead
Copy link
Contributor

> The maintenance-mode strategy was introduced in v1.4.0. @innobead should this be fixed on master branch only, i.e. for v1.5.0, or should this also be backported to fix it for v1.4.1?

Yes, fix this in master and add a backport label; the corresponding backport issues will be created.

@harvesterhci-io-github-bot
Collaborator

added backport-needed/1.4.1 issue: #7190.

6 participants