remote_machine_controller cannot update machine #921

Open
dl-mai opened this issue Feb 13, 2025 · 0 comments
Labels
bug Something isn't working

dl-mai commented Feb 13, 2025

What happened?

When using the k0s infrastructure provider in conjunction with the RKE2 provider, machine provisioning often fails with:

2025-02-13T15:59:18+01:00	ERROR	Failed to update Machine	{"controller": "remotemachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "RemoteMachine", "RemoteMachine": {"name":"remote-test-cp-template-p54jk","namespace":"example-cluster"}, "namespace": "example-cluster", "name": "remote-test-cp-template-p54jk", "reconcileID": "2ec57921-7a95-4b9c-b5a0-41bd4160f320", "remotemachine": {"name":"remote-test-cp-template-p54jk","namespace":"example-cluster"}, "machine": "remote-test-6d6gz", "error": "Operation cannot be fulfilled on machines.cluster.x-k8s.io \"remote-test-6d6gz\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/k0sproject/k0smotron/internal/controller/infrastructure.(*RemoteMachineController).Reconcile
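
For context, this error is the standard Kubernetes optimistic-concurrency conflict: the write carried a resourceVersion that is no longer current. A conflict-safe update normally re-reads the object on every retry attempt, roughly like this minimal sketch (illustrative only, not the k0smotron code; function and variable names are placeholders):

package example

import (
	"context"

	"k8s.io/client-go/util/retry"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateMachine re-reads the Machine on every attempt so each retry carries
// the latest resourceVersion; without the re-read, every retry replays the
// same stale version and keeps hitting the conflict shown in the log above.
func updateMachine(ctx context.Context, c client.Client, key client.ObjectKey, mutate func(*clusterv1.Machine)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		machine := &clusterv1.Machine{}
		if err := c.Get(ctx, key, machine); err != nil {
			return err
		}
		mutate(machine)
		return c.Update(ctx, machine)
	})
}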

Involved Areas

No response

What did you expect to happen?

The Machine should be in state Running.

Steps to reproduce

Apply this manifest to the CAPI management cluster:

---
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: RKE2ControlPlane
metadata:
  name: remote-test
  namespace: example-cluster
spec:
  files:
    - path: "/var/lib/rancher/rke2/server/manifests/coredns-config.yaml"
      owner: "root:root"
      permissions: "0640"
      content: |
        apiVersion: helm.cattle.io/v1
        kind: HelmChartConfig
        metadata:
          name: rke2-coredns
          namespace: kube-system
        spec:
          valuesContent: |-
            tolerations:
              - key: "node.cloudprovider.kubernetes.io/uninitialized"
                value: "true"
                effect: "NoSchedule"
    - path: "/var/lib/rancher/rke2/server/manifests/kubevip.yaml"
      owner: "root:root"
      permissions: "0640"
      content: |
        apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: kube-vip
          namespace: kube-system
        ---
        apiVersion: rbac.authorization.k8s.io/v1
        kind: ClusterRole
        metadata:
          annotations:
            rbac.authorization.kubernetes.io/autoupdate: "true"
          name: system:kube-vip-role
        rules:
          - apiGroups: [""]
            resources: ["services", "services/status", "nodes"]
            verbs: ["list","get","watch", "update"]
          - apiGroups: ["coordination.k8s.io"]
            resources: ["leases"]
            verbs: ["list", "get", "watch", "update", "create"]
        ---
        kind: ClusterRoleBinding
        apiVersion: rbac.authorization.k8s.io/v1
        metadata:
          name: system:kube-vip-binding
        roleRef:
          apiGroup: rbac.authorization.k8s.io
          kind: ClusterRole
          name: system:kube-vip-role
        subjects:
        - kind: ServiceAccount
          name: kube-vip
          namespace: kube-system
        ---
        apiVersion: v1
        kind: Pod
        metadata:
          creationTimestamp: null
          name: kube-vip
          namespace: kube-system
        spec:
          tolerations:
          - effect: NoSchedule
            key: node.cloudprovider.kubernetes.io/uninitialized
            operator: Exists
          containers:
          - args:
            - manager
            env:
            - name: cp_enable
              value: "true"
            - name: vip_interface
              value: ens2
            - name: address
              value: 192.168.178.199
            - name: port
              value: "6443"
            - name: vip_arp
              value: "true"
            - name: vip_leaderelection
              value: "true"
            - name: vip_leaseduration
              value: "15"
            - name: vip_renewdeadline
              value: "10"
            - name: vip_retryperiod
              value: "2"
            image: ghcr.io/kube-vip/kube-vip:v0.5.5
            imagePullPolicy: IfNotPresent
            name: kube-vip
            resources: {}
            securityContext:
              capabilities:
                add:
                - NET_ADMIN
                - NET_RAW
            volumeMounts:
            - mountPath: /etc/rancher/rke2/rke2.yaml
              name: kubeconfig
          hostAliases:
          - hostnames:
            - kubernetes
            ip: 127.0.0.1
          hostNetwork: true
          serviceAccountName: kube-vip
          volumes:
          - hostPath:
              path: /etc/rancher/rke2/rke2.yaml
              type: File
            name: kubeconfig
    - path: /root/create_provider_id.sh
      content: |
        #!/bin/sh
        target_dir=/etc/rancher/rke2/config.yaml.d/
        mkdir -p $target_dir
        cat <<EOF >>$target_dir/kubelet-provider-id.yaml
        kubelet-arg:
          - provider-id=remote-machine://$(hostname -I | cut -d' ' -f1):22
        EOF

      owner: root:root
      permissions: '0755'
  replicas: 2
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: RemoteMachineTemplate
      name: remote-test-cp-template
      namespace: example-cluster
  preRKE2Commands:
    - bash -c /root/create_provider_id.sh
  postRKE2Commands:
  version: v1.30.9+rke2r1
  agentConfig: {}
  serverConfig:
    cni: calico
    kubeAPIServer:
      extraArgs:
        - --anonymous-auth=true
    tlsSan:
      - example-cluster-k8s-api.home
      - 192.168.168.199
  registrationMethod: "internal-first"
  rolloutStrategy:
    type: "RollingUpdate"
    rollingUpdate:
      maxSurge: 1

---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: RemoteCluster
metadata:
  name: remote-test
  namespace: example-cluster
spec:
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: remote-test-cluster
  namespace: example-cluster
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - 10.244.0.0/16
    serviceDomain: cluster.local
    services:
      cidrBlocks:
        - 10.128.0.0/12
  controlPlaneEndpoint:
    host: 192.168.178.199
    port: 6443
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: RKE2ControlPlane
    name: remote-test
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: RemoteCluster
    name: remote-test
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: RemoteMachineTemplate
metadata:
  name: remote-test-cp-template
  namespace: example-cluster
spec:
  template:
    spec:
      pool: default
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: PooledRemoteMachine
metadata:
  name: remote-test-0
  namespace: example-cluster
spec:
  pool: default
  machine:
    address: 192.168.178.42
    port: 22
    user: root
    sshKeyRef:
      name: footloose-key
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: PooledRemoteMachine
metadata:
  name: remote-test-1
  namespace: example-cluster
spec:
  pool: default
  machine:
    address: 192.168.178.46
    port: 22
    user: root
    sshKeyRef:
      name: footloose-key
---
apiVersion: v1
kind: Secret
metadata:
  name: footloose-key
  namespace: example-cluster
data:
  value: <sshkey>

k0smotron version

1.4.1

k0s version

Used RKE2

Anything else we need to know?

I debugged the issue by running k0smotron from my IDE against the cluster. The line where the error is thrown already has a retry for 409 conflicts, but I think the actual issue is here:

https://github.com/k0sproject/k0smotron/blob/main/internal/controller/infrastructure/remote_machine_controller.go#L266

It uses client.Merge, and the retries still fail. When I change it to client.MergeFrom(machine), provisioning succeeds consistently.
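
For illustration, a minimal sketch of the two patch styles against a controller-runtime client (this is not the actual k0smotron code; the Machine field and function names are placeholders):

package example

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// patchWithMerge mirrors the failing pattern: client.Merge serializes the
// whole in-memory Machine as the JSON merge patch, including its (possibly
// stale) metadata.resourceVersion, which acts as an optimistic-lock
// precondition on the API server and produces the conflict above whenever
// another controller has updated the Machine in the meantime.
func patchWithMerge(ctx context.Context, c client.Client, machine *clusterv1.Machine, providerID string) error {
	machine.Spec.ProviderID = &providerID
	return c.Patch(ctx, machine, client.Merge)
}

// patchWithMergeFrom mirrors the suggested fix: snapshot the object, mutate
// it, and let client.MergeFrom(snapshot) send only the diff, which carries no
// resourceVersion precondition.
func patchWithMergeFrom(ctx context.Context, c client.Client, machine *clusterv1.Machine, providerID string) error {
	orig := machine.DeepCopy()
	machine.Spec.ProviderID = &providerID
	return c.Patch(ctx, machine, client.MergeFrom(orig))
}

The difference is that client.Merge sends the whole object as the patch body, so a stale metadata.resourceVersion inside it keeps triggering the conflict, while client.MergeFrom sends only the fields changed relative to the snapshot, so unrelated concurrent updates to the Machine no longer collide.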
