DHCP and bond #4574

Closed
p3lim opened this issue Nov 22, 2021 · 6 comments

Comments

p3lim commented Nov 22, 2021

Bug Report

Description

We have DHCP servers that distribute hostnames to machines based on their static leases, and thus their MAC addresses. We also have a bond set up in Talos which is the recipient of those leases.
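
For context, the relevant part of the machine config looks roughly like this (a minimal sketch; the bond mode and the member NIC names are illustrative assumptions, not necessarily our exact values):

machine:
  network:
    interfaces:
      - interface: bond0
        dhcp: true             # hostname and address come from the DHCP static lease
        bond:
          mode: 802.3ad        # assumed bond mode, for illustration
          interfaces:
            - eth0             # assumed member NICs
            - eth1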

We've noticed that occasionally the kubelet process gets stuck after a boot with the status:

Condition failed: 1 error occurred:

  • resource HostnameStatuses.net.talos.dev(network/hostname@undefined) doesn't exist.

While talosctl get hostname eventually shows the correct information as provided by DHCP, the kubelet stays stuck. In 0.14.0-alpha.1 we'll be able to restart the kubelet to work around this, but we'd rather see it fixed properly: the kubelet should keep restarting itself for as long as it isn't ready.
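
For reference, the manual restart mentioned above would presumably be done via the service subcommand (a sketch; the node address is a placeholder):

# restart the stuck kubelet on a node (a workaround, not a fix)
talosctl -n <node> service kubelet restart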

Logs

Please let me know if you want any specific logs; this issue is intermittent.

Environment

  • Talos version: 0.14.0-alpha.0
  • Kubernetes version: 1.23.0-alpha.3
  • Platform: x86_64
smira (Member) commented Nov 22, 2021

Do you have the output of talosctl dmesg, please?

And talosctl logs controller-runtime?
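
Both can be captured per node along these lines (node address is a placeholder):

# kernel ring buffer from the node
talosctl -n <node> dmesg > dmesg.log
# logs of the controller-runtime service
talosctl -n <node> logs controller-runtime > controller-runtime.log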

p3lim (Author) commented Nov 22, 2021

I'll see if I can grab that the next time I encounter it.

uhthomas (Contributor) commented:

Not sure if this is the exact same issue, but I've also run into trouble with bonds and DHCP.

There are 5 nodes, all of which are acquiring addresses from a DHCP server.

Sidero has recorded these addresses for them:

❯ k get servers
NAME                                   HOSTNAME     ACCEPTED   CORDONED   ALLOCATED   CLEAN   POWER     AGE
4c4c4544-0042-5610-804b-b8c04f445831   10.0.0.243   true                  true        false   unknown   7d
4c4c4544-0047-4410-8034-b9c04f575631   10.0.0.242   true                  true        false   unknown   128m
4c4c4544-0054-3010-8056-c7c04f424232   10.0.0.248   true                  true        false   unknown   8d
4c4c4544-0054-3510-8057-c7c04f424232   10.0.0.249   true                  true        false   unknown   8d
4c4c4544-0057-4210-804c-c7c04f423432   10.0.0.246   true                  true        false   unknown   7d1h

However, once provisioned they have different addresses:

❯ k get no -owide
NAME            STATUS   ROLES           AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
talos-6xb-myy   Ready    control-plane   15m     v1.26.1   10.0.0.241    <none>        Talos (v1.3.0)   5.15.83-talos    containerd://1.6.12
talos-8op-x2k   Ready    control-plane   3m30s   v1.26.1   10.0.0.243    <none>        Talos (v1.3.0)   5.15.83-talos    containerd://1.6.12
talos-nlu-hin   Ready    control-plane   23m     v1.26.1   10.0.0.250    <none>        Talos (v1.3.0)   5.15.83-talos    containerd://1.6.12
talos-rh9-xsk   Ready    control-plane   23m     v1.26.1   10.0.0.249    <none>        Talos (v1.3.0)   5.15.83-talos    containerd://1.6.12
talos-x7c-56v   Ready    control-plane   16m     v1.26.1   10.0.0.246    <none>        Talos (v1.3.0)   5.15.83-talos    containerd://1.6.12
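
As a side note, one way to cross-check what a node actually holds against Sidero's records is to list its address resources; a sketch, using one of the node IPs above:

# list the runtime address resources Talos knows about on this node
talosctl -n 10.0.0.246 get addresses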

etcd eventually becomes healthy.

❯ talosctl -n 10.0.0.246 health
discovered nodes: ["10.0.0.241" "10.0.0.243" "10.0.0.250" "10.0.0.100" "10.0.0.246"]
waiting for etcd to be healthy: ...
waiting for etcd to be healthy: OK
waiting for etcd members to be consistent across nodes: ...
waiting for etcd members to be consistent across nodes: OK
waiting for etcd members to be control plane nodes: ...
waiting for etcd members to be control plane nodes: OK
waiting for apid to be ready: ...
waiting for apid to be ready: OK
waiting for all nodes memory sizes: ...
waiting for all nodes memory sizes: OK
waiting for all nodes disk sizes: ...
waiting for all nodes disk sizes: OK
waiting for kubelet to be healthy: ...
waiting for kubelet to be healthy: OK
waiting for all nodes to finish boot sequence: ...
waiting for all nodes to finish boot sequence: OK
waiting for all k8s nodes to report: ...
waiting for all k8s nodes to report: OK
waiting for all k8s nodes to report ready: ...
waiting for all k8s nodes to report ready: OK
waiting for all control plane static pods to be running: ...
waiting for all control plane static pods to be running: OK
waiting for all control plane components to be ready: ...
waiting for all control plane components to be ready: OK
waiting for kube-proxy to report ready: ...
waiting for kube-proxy to report ready: OK
waiting for coredns to report ready: ...
waiting for coredns to report ready: OK
waiting for all k8s nodes to report schedulable: ...
waiting for all k8s nodes to report schedulable: OK

Sidero then believes the control plane is not ready and clusterctl move fails.

❯ clusterctl move --kubeconfig-context=admin@cluster-bootstrap --to-kubeconfig=$HOME/.kube/config --to-kubeconfig-context=admin@cluster -v10
No default config file available
Performing move...
Discovering Cluster API objects
MachineDeployment Count=1
Secret Count=15
ConfigMap Count=1
ServerClass Count=1
ServerBinding Count=5
TalosConfigTemplate Count=1
TalosControlPlane Count=1
MetalCluster Count=1
MetalMachineTemplate Count=2
Cluster Count=1
MetalMachine Count=5
Environment Count=1
Machine Count=5
MachineSet Count=1
Server Count=5
TalosConfig Count=5
Total objects Count=51
Excluding secret from move (not linked with any Cluster) name="siderolink"
Error: failed to get object graph: failed to check for provisioned infrastructure: [cannot start the move operation while the control plane for "/, Kind=" default/cluster is not yet initialized, cannot start the move operation while "/, Kind=" default/cluster-cp-dbnpd is still provisioning the node, cannot start the move operation while "/, Kind=" default/cluster-cp-wrc8j is still provisioning the node, cannot start the move operation while "/, Kind=" default/cluster-cp-49wdk is still provisioning the node, cannot start the move operation while "/, Kind=" default/cluster-cp-89sgs is still provisioning the node, cannot start the move operation while "/, Kind=" default/cluster-cp-mmg97 is still provisioning the node]
sigs.k8s.io/cluster-api/cmd/clusterctl/client/cluster.(*objectMover).Move
        sigs.k8s.io/cluster-api/cmd/clusterctl/client/cluster/mover.go:96
sigs.k8s.io/cluster-api/cmd/clusterctl/client.(*clusterctlClient).move
        sigs.k8s.io/cluster-api/cmd/clusterctl/client/move.go:125
sigs.k8s.io/cluster-api/cmd/clusterctl/client.(*clusterctlClient).Move
        sigs.k8s.io/cluster-api/cmd/clusterctl/client/move.go:97
sigs.k8s.io/cluster-api/cmd/clusterctl/cmd.runMove
        sigs.k8s.io/cluster-api/cmd/clusterctl/cmd/move.go:101
sigs.k8s.io/cluster-api/cmd/clusterctl/cmd.glob..func16
        sigs.k8s.io/cluster-api/cmd/clusterctl/cmd/move.go:59
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/[email protected]/command.go:916
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/[email protected]/command.go:1044
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/[email protected]/command.go:968
sigs.k8s.io/cluster-api/cmd/clusterctl/cmd.Execute
        sigs.k8s.io/cluster-api/cmd/clusterctl/cmd/root.go:99
main.main
        sigs.k8s.io/cluster-api/cmd/clusterctl/main.go:27
runtime.main
        runtime/proc.go:250
runtime.goexit
        runtime/asm_amd64.s:1594

smira (Member) commented Feb 21, 2023

@uhthomas I don't think it's related to the issue here; I think clusterctl move has issues with Sidero in general which need to be analyzed and fixed.

uhthomas (Contributor) commented:

I see. Do you have any advice on how I can debug my problem further? @smira

smira (Member) commented Feb 21, 2023

> I see. Do you have any advice on how I can debug my problem further? @smira

I don't have any exact advice, but I would probably dig into the error message (why it is there, what exactly is wrong), look into the logs, etc.
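
For example, starting from the object names in the move output above (a sketch; contexts as used there):

# see how far Cluster API thinks the cluster has progressed
clusterctl describe cluster cluster -n default --kubeconfig-context=admin@cluster-bootstrap

# the error mentions Machines still provisioning; inspect their conditions
kubectl --context=admin@cluster-bootstrap get machines -n default
kubectl --context=admin@cluster-bootstrap describe machine cluster-cp-dbnpd -n default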

smira closed this as not planned (won't fix, can't repro, duplicate, stale) on Dec 1, 2023
github-actions bot locked the issue as resolved and limited conversation to collaborators on Jun 8, 2024