DHCP and bond #4574

Closed
p3lim opened this issue Nov 22, 2021 · 6 comments

Comments

p3lim commented Nov 22, 2021

Bug Report

Description

We have DHCP servers that distribute hostnames to machines based on their static leases, and thus their MAC addresses. We also have a bond set up in Talos which is the recipient of those leases.
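
For context, the relevant part of the machine config looks roughly like this (a minimal sketch; the bond mode and the member NIC names are illustrative assumptions, not necessarily our exact values):

machine:
  network:
    interfaces:
      - interface: bond0
        dhcp: true             # hostname and address come from the DHCP static lease
        bond:
          mode: 802.3ad        # assumed bond mode, for illustration
          interfaces:
            - eth0             # assumed member NICs
            - eth1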

We've noticed that occasionally the kubelet process gets stuck after a boot with the status:

Condition failed: 1 error occurred:

  • resource HostnameStatuses.net.talos.dev(network/hostname@undefined) doesn't exist.

While talosctl get hostname eventually shows the correct information as provided by DHCP, the kubelet stays stuck. In 0.14.0-alpha.1 we'll be able to restart the kubelet to work around this, but we'd rather see it fixed properly: the kubelet should keep restarting itself for as long as it isn't ready.
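
For reference, the manual restart mentioned above would presumably be done via the service subcommand (a sketch; the node address is a placeholder):

# restart the stuck kubelet on a node (a workaround, not a fix)
talosctl -n <node> service kubelet restart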

Logs

Please let me know if you want any specific logs; this issue is intermittent.

Environment

  • Talos version: 0.14.0-alpha.0
  • Kubernetes version: 1.23.0-alpha.3
  • Platform: x86_64
smira (Member) commented Nov 22, 2021

Do you have the output of talosctl dmesg, please?

And talosctl logs controller-runtime?
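
Both can be captured per node along these lines (node address is a placeholder):

# kernel ring buffer from the node
talosctl -n <node> dmesg > dmesg.log
# logs of the controller-runtime service
talosctl -n <node> logs controller-runtime > controller-runtime.log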

p3lim (Author) commented Nov 22, 2021

I'll see if I can grab that the next time I encounter it.

uhthomas (Contributor) commented:

Not sure if this is the exact same issue, but I've also run into trouble with bonds and DHCP.

There are 5 nodes, all of which are acquiring addresses from a DHCP server.

Sidero has recorded these addresses for them:

❯ k get servers
NAME                                   HOSTNAME     ACCEPTED   CORDONED   ALLOCATED   CLEAN   POWER     AGE
4c4c4544-0042-5610-804b-b8c04f445831   10.0.0.243   true                  true        false   unknown   7d
4c4c4544-0047-4410-8034-b9c04f575631   10.0.0.242   true                  true        false   unknown   128m
4c4c4544-0054-3010-8056-c7c04f424232   10.0.0.248   true                  true        false   unknown   8d
4c4c4544-0054-3510-8057-c7c04f424232   10.0.0.249   true                  true        false   unknown   8d
4c4c4544-0057-4210-804c-c7c04f423432   10.0.0.246   true                  true        false   unknown   7d1h

However, once provisioned they have different addresses:

❯ k get no -owide
NAME            STATUS   ROLES           AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
talos-6xb-myy   Ready    control-plane   15m     v1.26.1   10.0.0.241    <none>        Talos (v1.3.0)   5.15.83-talos    containerd://1.6.12
talos-8op-x2k   Ready    control-plane   3m30s   v1.26.1   10.0.0.243    <none>        Talos (v1.3.0)   5.15.83-talos    containerd://1.6.12
talos-nlu-hin   Ready    control-plane   23m     v1.26.1   10.0.0.250    <none>        Talos (v1.3.0)   5.15.83-talos    containerd://1.6.12
talos-rh9-xsk   Ready    control-plane   23m     v1.26.1   10.0.0.249    <none>        Talos (v1.3.0)   5.15.83-talos    containerd://1.6.12
talos-x7c-56v   Ready    control-plane   16m     v1.26.1   10.0.0.246    <none>        Talos (v1.3.0)   5.15.83-talos    containerd://1.6.12
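
As a side note, one way to cross-check what a node actually holds against Sidero's records is to list its address resources; a sketch, using one of the node IPs above:

# list the runtime address resources Talos knows about on this node
talosctl -n 10.0.0.246 get addresses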

etcd eventually becomes healthy.

❯ talosctl -n 10.0.0.246 health
discovered nodes: ["10.0.0.241" "10.0.0.243" "10.0.0.250" "10.0.0.100" "10.0.0.246"]
waiting for etcd to be healthy: ...
waiting for etcd to be healthy: OK
waiting for etcd members to be consistent across nodes: ...
waiting for etcd members to be consistent across nodes: OK
waiting for etcd members to be control plane nodes: ...
waiting for etcd members to be control plane nodes: OK
waiting for apid to be ready: ...
waiting for apid to be ready: OK
waiting for all nodes memory sizes: ...
waiting for all nodes memory sizes: OK
waiting for all nodes disk sizes: ...
waiting for all nodes disk sizes: OK
waiting for kubelet to be healthy: ...
waiting for kubelet to be healthy: OK
waiting for all nodes to finish boot sequence: ...
waiting for all nodes to finish boot sequence: OK
waiting for all k8s nodes to report: ...
waiting for all k8s nodes to report: OK
waiting for all k8s nodes to report ready: ...
waiting for all k8s nodes to report ready: OK
waiting for all control plane static pods to be running: ...
waiting for all control plane static pods to be running: OK
waiting for all control plane components to be ready: ...
waiting for all control plane components to be ready: OK
waiting for kube-proxy to report ready: ...
waiting for kube-proxy to report ready: OK
waiting for coredns to report ready: ...
waiting for coredns to report ready: OK
waiting for all k8s nodes to report schedulable: ...
waiting for all k8s nodes to report schedulable: OK

Sidero then believes the control plane is not ready and clusterctl move fails.

❯ clusterctl move --kubeconfig-context=admin@cluster-bootstrap --to-kubeconfig=$HOME/.kube/config --to-kubeconfig-context=admin@cluster -v10
No default config file available
Performing move...
Discovering Cluster API objects
MachineDeployment Count=1
Secret Count=15
ConfigMap Count=1
ServerClass Count=1
ServerBinding Count=5
TalosConfigTemplate Count=1
TalosControlPlane Count=1
MetalCluster Count=1
MetalMachineTemplate Count=2
Cluster Count=1
MetalMachine Count=5
Environment Count=1
Machine Count=5
MachineSet Count=1
Server Count=5
TalosConfig Count=5
Total objects Count=51
Excluding secret from move (not linked with any Cluster) name="siderolink"
Error: failed to get object graph: failed to check for provisioned infrastructure: [cannot start the move operation while the control plane for "/, Kind=" default/cluster is not yet initialized, cannot start the move operation while "/, Kind=" default/cluster-cp-dbnpd is still provisioning the node, cannot start the move operation while "/, Kind=" default/cluster-cp-wrc8j is still provisioning the node, cannot start the move operation while "/, Kind=" default/cluster-cp-49wdk is still provisioning the node, cannot start the move operation while "/, Kind=" default/cluster-cp-89sgs is still provisioning the node, cannot start the move operation while "/, Kind=" default/cluster-cp-mmg97 is still provisioning the node]
sigs.k8s.io/cluster-api/cmd/clusterctl/client/cluster.(*objectMover).Move
        sigs.k8s.io/cluster-api/cmd/clusterctl/client/cluster/mover.go:96
sigs.k8s.io/cluster-api/cmd/clusterctl/client.(*clusterctlClient).move
        sigs.k8s.io/cluster-api/cmd/clusterctl/client/move.go:125
sigs.k8s.io/cluster-api/cmd/clusterctl/client.(*clusterctlClient).Move
        sigs.k8s.io/cluster-api/cmd/clusterctl/client/move.go:97
sigs.k8s.io/cluster-api/cmd/clusterctl/cmd.runMove
        sigs.k8s.io/cluster-api/cmd/clusterctl/cmd/move.go:101
sigs.k8s.io/cluster-api/cmd/clusterctl/cmd.glob..func16
        sigs.k8s.io/cluster-api/cmd/clusterctl/cmd/move.go:59
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/[email protected]/command.go:916
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/[email protected]/command.go:1044
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/[email protected]/command.go:968
sigs.k8s.io/cluster-api/cmd/clusterctl/cmd.Execute
        sigs.k8s.io/cluster-api/cmd/clusterctl/cmd/root.go:99
main.main
        sigs.k8s.io/cluster-api/cmd/clusterctl/main.go:27
runtime.main
        runtime/proc.go:250
runtime.goexit
        runtime/asm_amd64.s:1594

smira (Member) commented Feb 21, 2023

@uhthomas I don't think it's related to the issue here; I think clusterctl move has issues with Sidero in general which need to be analyzed and fixed.

uhthomas (Contributor) commented:

I see. Do you have any advice on how I can debug my problem further? @smira

smira (Member) commented Feb 21, 2023

> I see. Do you have any advice on how I can debug my problem further? @smira

I don't have any exact advice, but I would probably dig into the error message (why it is there, what exactly is wrong), look into the logs, etc.
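
For example, starting from the object names in the move output above (a sketch; contexts as used there):

# see how far Cluster API thinks the cluster has progressed
clusterctl describe cluster cluster -n default --kubeconfig-context=admin@cluster-bootstrap

# the error mentions Machines still provisioning; inspect their conditions
kubectl --context=admin@cluster-bootstrap get machines -n default
kubectl --context=admin@cluster-bootstrap describe machine cluster-cp-dbnpd -n default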

smira closed this as not planned (won't fix, can't repro, duplicate, stale) on Dec 1, 2023
github-actions bot locked the issue as resolved and limited conversation to collaborators on Jun 8, 2024