
[BUG] No validation on Maintenance Strategy when Edit as YAML #6835

Open
albinsun opened this issue Oct 21, 2024 · 4 comments
Labels: area/admission-webhook, area/node-maintenance, backport-needed/1.4.1, kind/bug, reproduce/always, severity/3 (function working but has a major issue w/ workaround)

Comments

@albinsun

Describe the bug

While validating #5069, we found that there is no validation of the Maintenance Strategy when using Edit as YAML.
A VM with an invalid configuration is created successfully.

Warning

If the wrongly configured VM is located on the first node (node-0), entering maintenance mode on node-0 will cause the node to become stuck in the Cordoned state.
See Additional context.

To Reproduce

Steps to reproduce the behavior:

  1. Set up a 3-node witness cluster
  2. Create the image ubuntu-22.04-server-cloudimg-amd64.img
  3. Create the network mgmt-vlan1
  4. Virtual Machines => Create => enter Name, CPU, Memory and Image => Edit as YAML => add harvesterhci.io/maintain-mode-strategy: xxx under metadata.labels => Create
    (screenshot)
  5. ❌ VM creation should FAIL with a validation error, but it succeeds:
    • The VM is Running (it should NOT be), with an empty Maintenance Strategy
      (screenshot)
    • Edit YAML
      (screenshot)

Expected behavior

VM creation should FAIL with a validation error, since there are only 4 valid values:

  1. Migrate (default)
  2. ShutdownAndRestartAfterEnable
  3. ShutdownAndRestartAfterDisable
  4. Shutdown
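The expected check can be sketched as a small Go snippet. This is a minimal illustration, not Harvester's actual implementation; the label key matches the reproduction steps above, but the helper name `validateStrategy` is hypothetical.

```go
package main

import "fmt"

// Label key as used in the reproduction steps above.
const maintainModeStrategyLabel = "harvesterhci.io/maintain-mode-strategy"

// The four valid maintenance-mode strategies listed in this issue.
var validStrategies = map[string]bool{
	"Migrate":                        true,
	"ShutdownAndRestartAfterEnable":  true,
	"ShutdownAndRestartAfterDisable": true,
	"Shutdown":                       true,
}

// validateStrategy is a hypothetical helper: it rejects any value outside
// the four valid strategies, which is the validation this bug report says
// is missing when the label is set via Edit as YAML.
func validateStrategy(value string) error {
	if !validStrategies[value] {
		return fmt.Errorf("invalid maintenance strategy %q", value)
	}
	return nil
}

func main() {
	fmt.Println(validateStrategy("Migrate")) // valid, no error
	fmt.Println(validateStrategy("xxx"))     // the value from the repro steps: should be rejected
}
```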

Support bundle

Environment

  • Harvester ISO version: v1.4.0-rc3
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): 3-node AMD64 QEMU/KVM witness cluster, including 1 witness node

Additional context

If the wrongly configured VM is located on the first node (node-0), entering maintenance mode on node-0 will cause the node to become stuck in the Cordoned state.

  1. Manually move the wrongly configured VM to node-0
    (screenshot)
  2. Enter maintenance mode on node-0; it gets stuck in Cordoned
    (screenshot)
    (screenshot)
  3. A workaround is to manually stop or migrate the problematic VM
    • Stop the VM
      (screenshot)
    • node-0 can then enter maintenance mode
      (screenshot)
@albinsun albinsun added kind/bug Issues that are defects reported by users or that we know have reached a real release severity/3 Function working but has a major issue w/ workaround need-reprioritize reproduce/always Reproducible 100% of the time area/node-maintenance labels Oct 21, 2024
@albinsun albinsun added this to the v1.4.0 milestone Oct 21, 2024
@bk201 bk201 modified the milestones: v1.4.0, v1.5.0 Oct 23, 2024
@innobead innobead assigned m-ildefons and unassigned votdev Dec 12, 2024
m-ildefons added a commit to m-ildefons/harvester that referenced this issue Dec 12, 2024
The Pod deletion filter identifies the Pods belonging to VMs, which need
to be deleted during a node drain by looking at their maintenance-mode
strategy label. The default value for this label is `Migrate`, which
indicates that the Pod should be deleted during the node drain.
Pods with an invalid label, i.e. where the value of the maintenance-mode
strategy label is not one of:
  - Migrate
  - ShutdownAndRestartAfterEnable
  - ShutdownAndRestartAfterDisable
  - Shutdown
are now also being treated with the default behavior.
This fixes the problem that if a VM contains an invalid value in this
label, nodes can become stuck in `Cordoned` state when transitioning to
maintenance mode, as the node drain controller won't shut down the VM,
but it's also not migrated away from the node. Therefore the VM keeps
running, preventing the node from completely transitioning into
maintenance mode.
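The filter behavior described in this commit message can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the actual Harvester code: the function name `shouldDeletePod` is hypothetical, and only the decision logic from the commit message is modeled.

```go
package main

import "fmt"

const maintainModeStrategyLabel = "harvesterhci.io/maintain-mode-strategy"

// shutdownStrategies are the values for which the VM is shut down by its
// own controller, so the drain filter must NOT delete the Pod itself.
var shutdownStrategies = map[string]bool{
	"ShutdownAndRestartAfterEnable":  true,
	"ShutdownAndRestartAfterDisable": true,
	"Shutdown":                       true,
}

// shouldDeletePod mirrors the fixed filter logic: a missing label, the
// value `Migrate`, or any unknown value all fall back to the default
// `Migrate` behavior, i.e. the Pod is deleted during the node drain.
func shouldDeletePod(podLabels map[string]string) bool {
	value, ok := podLabels[maintainModeStrategyLabel]
	if !ok {
		return true // no label: default Migrate behavior
	}
	if shutdownStrategies[value] {
		return false // VM will be shut down by its controller instead
	}
	return true // Migrate or invalid value: treat as Migrate
}

func main() {
	// An invalid value no longer leaves the Pod dangling during a drain.
	fmt.Println(shouldDeletePod(map[string]string{maintainModeStrategyLabel: "bogus"}))
}
```

Before this fix, the invalid-value case fell through neither branch, leaving the Pod running and the node stuck in `Cordoned`.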

related-to: harvester#6835

Signed-off-by: Moritz Röhrich <[email protected]>
m-ildefons added a commit to m-ildefons/harvester that referenced this issue Dec 12, 2024
Add checks for maintenance-mode strategy to VM webhooks.

The admission webhooks for the VirtualMachine resource need to make
sure that the maintenance-mode strategy for the VM is set to a sane
value.
To do this, there are two checks:

The first one is in the mutating webhook, which is executed first, and
it just makes sure that the label, which defines the maintenance-mode
strategy, is set. If the label is not set, the mutating webhook will set
it with the default value of `Migrate`. If the label is set, the
mutating webhook will not modify it, even if it has an invalid value.
This ensures that the maintenance-mode strategy is never set
unintentionally to a wrong value.
The second check will ensure that in this case the request is rejected
with an error message, so the user can correct the value of the
maintenance-mode strategy label.

The second check happens in the validating webhook. This check ensures
that the maintenance-mode strategy label is set to a valid
maintenance-mode strategy, i.e. one of the values:
  - Migrate
  - ShutdownAndRestartAfterEnable
  - ShutdownAndRestartAfterDisable
  - Shutdown
This webhook will deny a CREATE or UPDATE, if the new VirtualMachine
resource does not contain the maintenance-mode strategy label at all, or
if it contains an invalid value.
The only exception is an UPDATE to a VirtualMachine resource that
already contains an invalid value in the maintenance-mode strategy label
and where the value does not change. In this case, the webhook will
accept the request. This is crucial, in case the controller needs to
deal with an existing VM that has an invalid value in this label (e.g.
on a cluster that has been upgraded from an old version, before this
label was checked by the admission webhook). In this case, the
controller still needs to be able to perform UPDATE operations on the
resource, to operate the VM.

Together, these two checks ensure that no VirtualMachine resource can be
created with an invalid maintenance-mode strategy, or with no
maintenance-mode strategy at all.
They also make sure that the maintenance-mode strategy can not be
removed or changed to an invalid value for existing VirtualMachine
resources.

related-to: harvester#6835

Signed-off-by: Moritz Röhrich <[email protected]>
@m-ildefons
Contributor

There are several problems here:

  1. When you create a VirtualMachine from the UI and you leave the setting at the default (Migrate), the label isn't set at all.
    This is actually not a big deal, since it can be fixed in the admission webhook by letting the mutating webhook add the label if it isn't already there.

  2. When you create a VirtualMachine from the UI and you do set the setting, or add the label in the YAML, then the label isn't transferred to the Pod. The UI (or the user) would need to set the label under .spec.template.metadata.labels, not under .metadata.labels in the VirtualMachine resource, or the admission webhook needs to make sure the label is transferred into the template.

  3. The node-drain controller only deletes Pods for VirtualMachines when the Pod has the label and it is set to Migrate, omitting the default behavior when the label isn't set at all. It also assumes that all VirtualMachines that should be shut down have the label set and its value is one of the other three valid settings. This leaves dangling all Pods of VirtualMachines that have an invalid value in the label or don't have the label set at all.

  4. The admission webhook doesn't check if the label is set, or if it has a valid value.

  5. (Optional) Go through the Terraform provider and check that the schema for the virtual machine resources enforces that only valid values are parsed from .tf files. Otherwise the .tf files should be rejected with a helpful error message.
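Point 2 above (transferring the label into the Pod template) can be sketched in Go. This is a simplified stand-in under stated assumptions: the `vm` struct and `propagateStrategyLabel` helper are hypothetical, while the real resource type comes from the KubeVirt API.

```go
package main

import "fmt"

const maintainModeStrategyLabel = "harvesterhci.io/maintain-mode-strategy"

// vm is a stripped-down stand-in for the VirtualMachine resource;
// the real type lives in the KubeVirt API package.
type vm struct {
	Labels         map[string]string // .metadata.labels
	TemplateLabels map[string]string // .spec.template.metadata.labels
}

// propagateStrategyLabel sketches the mutating-webhook behavior: default
// the label to Migrate when absent, then mirror it into the Pod template
// so the node-drain controller can see it on the Pod.
func propagateStrategyLabel(v *vm) {
	if v.Labels == nil {
		v.Labels = map[string]string{}
	}
	if _, ok := v.Labels[maintainModeStrategyLabel]; !ok {
		v.Labels[maintainModeStrategyLabel] = "Migrate"
	}
	if v.TemplateLabels == nil {
		v.TemplateLabels = map[string]string{}
	}
	v.TemplateLabels[maintainModeStrategyLabel] = v.Labels[maintainModeStrategyLabel]
}

func main() {
	v := &vm{Labels: map[string]string{maintainModeStrategyLabel: "Shutdown"}}
	propagateStrategyLabel(v)
	fmt.Println(v.TemplateLabels[maintainModeStrategyLabel])
}
```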

m-ildefons added a commit to m-ildefons/harvester that referenced this issue Dec 13, 2024
Add checks for maintenance-mode strategy to VM webhooks.

The admission webhooks for the VirtualMachine resource need to make
sure that the maintenance-mode strategy for the VM is set to a sane
value.
To do this, there are two checks:

The first one is in the mutating webhook, which is executed first, and
it just makes sure that the label, which defines the maintenance-mode
strategy, is set. If the label is not set, the mutating webhook will set
it with the default value of `Migrate`. If the label is set, the
mutating webhook will not modify it, even if it has an invalid value.
This ensures that the maintenance-mode strategy is never set
unintentionally to a wrong value.
The second check will ensure that in this case the request is rejected
with an error message, so the user can correct the value of the
maintenance-mode strategy label.
The mutating webhook will also ensure that the maintenance-mode strategy
label is copied from the `.metadata.labels` to
`.spec.template.metadata.labels`. This is necessary to ensure that the
Pod in which the virtual machine will run will be labeled correctly.

The second check happens in the validating webhook. This check ensures
that the maintenance-mode strategy label is set to a valid
maintenance-mode strategy, i.e. one of the values:
  - Migrate
  - ShutdownAndRestartAfterEnable
  - ShutdownAndRestartAfterDisable
  - Shutdown
This webhook will deny a CREATE or UPDATE, if the new VirtualMachine
resource does not contain the maintenance-mode strategy label at all, or
if it contains an invalid value.
The only exception is an UPDATE to a VirtualMachine resource that
already contains an invalid value in the maintenance-mode strategy label
and where the value does not change. In this case, the webhook will
accept the request. This is crucial, in case the controller needs to
deal with an existing VM that has an invalid value in this label (e.g.
on a cluster that has been upgraded from an old version, before this
label was checked by the admission webhook). In this case, the
controller still needs to be able to perform UPDATE operations on the
resource, to operate the VM.

Together, these two checks ensure that no VirtualMachine resource can be
created with an invalid maintenance-mode strategy, or with no
maintenance-mode strategy at all.
They also make sure that the maintenance-mode strategy can not be
removed or changed to an invalid value for existing VirtualMachine
resources.

related-to: harvester#6835

Signed-off-by: Moritz Röhrich <[email protected]>
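The UPDATE exception described in the commit message above can be sketched as follows. This is an illustrative reconstruction, not the actual webhook code: the function name `validateUpdate` is hypothetical, and only the accept/reject rule from the commit message is modeled.

```go
package main

import "fmt"

const maintainModeStrategyLabel = "harvesterhci.io/maintain-mode-strategy"

var validStrategies = map[string]bool{
	"Migrate":                        true,
	"ShutdownAndRestartAfterEnable":  true,
	"ShutdownAndRestartAfterDisable": true,
	"Shutdown":                       true,
}

// validateUpdate sketches the rule: reject a missing label or an invalid
// value, except when an already-invalid value is left unchanged, so
// controllers can still operate pre-existing VMs (e.g. after an upgrade
// from a version that did not check this label).
func validateUpdate(oldLabels, newLabels map[string]string) error {
	newVal, ok := newLabels[maintainModeStrategyLabel]
	if !ok {
		return fmt.Errorf("label %s must be set", maintainModeStrategyLabel)
	}
	if validStrategies[newVal] {
		return nil
	}
	// Grandfather clause: an unchanged invalid value is tolerated.
	if oldVal, ok := oldLabels[maintainModeStrategyLabel]; ok && oldVal == newVal {
		return nil
	}
	return fmt.Errorf("invalid maintenance-mode strategy %q", newVal)
}

func main() {
	old := map[string]string{maintainModeStrategyLabel: "xxx"}
	fmt.Println(validateUpdate(old, old)) // unchanged invalid value: accepted
}
```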
@m-ildefons
Contributor

The maintenance-mode strategy was introduced in v1.4.0. @innobead should this be fixed on master branch only, i.e. for v1.5.0, or should this also be backported to fix it for v1.4.1?

@innobead
Copy link
Contributor

> The maintenance-mode strategy was introduced in v1.4.0. @innobead should this be fixed on master branch only, i.e. for v1.5.0, or should this also be backported to fix it for v1.4.1?

Yes, fix this in master and add a backport label; the corresponding backport issues will be created.

@harvesterhci-io-github-bot
Collaborator

added backport-needed/1.4.1 issue: #7190.

6 participants