Networking problems after second kubeone run #1379

Closed
exolab opened this issue Jun 12, 2021 · 7 comments · Fixed by #1386
Labels
triage/support Indicates an issue that is a support question.

Comments


exolab commented Jun 12, 2021

I am using kubeone to set up a cluster on Hetzner cloud. After the initial run, things mostly work. There is a problem reaching kube-dns-upstream from pods on the worker nodes, but that can be fixed by doing a rolling restart of coredns.
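
For reference, the rolling restart of coredns can be done with something along these lines, assuming the default coredns deployment in kube-system (adjust the name if your setup differs):

kubectl -n kube-system rollout restart deployment coredns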

However, after I run kubeone for a second time, canal gets redeployed, which then renders the cluster unusable. Every request seems to be taking 10 seconds (which of course makes debugging a pain).

This is how I execute kubeone (on both runs). kubeone.yaml and tf.json are identical on both runs. I am using kubeone 1.2.2 (and have also tried 1.2.1) and have tried with kubernetes 1.20 as well.

kubeone apply --auto-approve --debug --manifest kubeone.yaml -t tf.json
-- kubeone.yaml
apiVersion: kubeone.io/v1beta1
kind: KubeOneCluster
versions:
  kubernetes: 1.19.11
cloudProvider:
  hetzner: {}
  external: true
containerRuntime:
  containerd: {}
-- tf.json
{
  "kubeone_api": {
    "sensitive": false,
    "type": [
      "object",
      {
        "endpoint": "string"
      }
    ],
    "value": {
      "endpoint": "x.x.x.x"
    }
  },
  "kubeone_hosts": {
    "sensitive": false,
    "type": [
      "object",
      {
        "control_plane": [
          "object",
          {
            "bastion": "string",
            "cloud_provider": "string",
            "cluster_name": "string",
            "network_id": "string",
            "private_address": [
              "tuple",
              [
                "string",
                "string",
                "string"
              ]
            ],
            "public_address": "dynamic",
            "ssh_agent_socket": "string",
            "ssh_port": "number",
            "ssh_private_key_file": "string",
            "ssh_user": "string"
          }
        ]
      }
    ],
    "value": {
      "control_plane": {
        "bastion": "xxx.xxx.xxx.xxx",
        "cloud_provider": "hetzner",
        "cluster_name": "test-001",
        "network_id": "xxx",
        "private_address": [
          "x.x.x.x",
          "x.x.x.x",
          "x.x.x.x"
        ],
        "public_address": null,
        "ssh_agent_socket": "env:SSH_AUTH_SOCK",
        "ssh_port": 22,
        "ssh_private_key_file": "/builds/infrastructure/infrastructure.tmp/TF_VAR_CLUSTER_PRIVATE_KEY",
        "ssh_user": "root"
      }
    }
  },
  "kubeone_workers": {
    "sensitive": false,
    "type": [
      "object",
      {
        "test-001-pool1": [
          "object",
          {
            "providerSpec": [
              "object",
              {
                "cloudProviderSpec": [
                  "object",
                  {
                    "firewall": "string",
                    "image": "string",
                    "labels": [
                      "object",
                      {
                        "test-001-workers": "string"
                      }
                    ],
                    "location": "string",
                    "networks": [
                      "tuple",
                      [
                        "string"
                      ]
                    ],
                    "serverType": "string"
                  }
                ],
                "operatingSystem": "string",
                "operatingSystemSpec": [
                  "object",
                  {
                    "distUpgradeOnBoot": "bool"
                  }
                ],
                "sshPublicKeys": [
                  "tuple",
                  [
                    "string"
                  ]
                ]
              }
            ],
            "replicas": "number"
          }
        ]
      }
    ],
    "value": {
      "test-001-pool1": {
        "providerSpec": {
          "cloudProviderSpec": {
            "image": "ubuntu-20.04",
            "labels": {
              "test-001-workers": "pool1"
            },
            "location": "nbg1",
            "networks": [
              "x"
            ],
            "serverType": "ccx12"
          },
          "operatingSystem": "ubuntu",
          "operatingSystemSpec": {
            "distUpgradeOnBoot": false
          },
          "sshPublicKeys": [
            "ssh-ed25519 xxxxxxx"
          ]
        },
        "replicas": 1
      }
    }
  }
}

Does anyone have an idea what might be causing this? What logs should I look at specifically for debugging?

exolab added the triage/support label Jun 12, 2021

kron4eg (Member) commented Jun 13, 2021

@exolab please try this PR (you will have to compile it): #1380. This PR will be released as kubeone v1.2.3.
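
For anyone following along, checking out and building the PR branch can look roughly like this (the make build target is an assumption based on the repo's usual Go tooling; adjust to whatever the Makefile actually provides):

git clone https://github.com/kubermatic/kubeone.git
cd kubeone
git fetch origin pull/1380/head:pr-1380
git checkout pr-1380
make build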

exolab (Author) commented Jun 15, 2021

@kron4eg Thank you for the swift reply and PR. We did build kubeone ourselves and have since migrated to the released v1.2.3. However, we still see the same thing happening.

  1. We deploy the cluster.
  2. We start a container on a worker node, log in, and can run nslookup google.com just fine; we get fast responses
  3. We run kubeone a second time. All the canal pods get terminated and restarted.
  4. We try nslookup again and get no response: ;; connection timed out; no servers could be reached
  5. Looking at the node-local-dns logs on the worker node (log command sketched below), we see that the kube-dns-upstream service cannot be reached: [ERROR] plugin/errors: 2 route53.amazonaws.com.cluster.local. A: dial tcp 10.106.52.115:53: i/o timeout

This only happens after the second kubeone run.
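
For reference, the lookup in steps 2 and 4 is a plain nslookup google.com from inside the test pod, and the node-local-dns logs in step 5 can be pulled with something along these lines (the DaemonSet name node-local-dns is an assumption based on the usual addon manifest; adjust if it is named differently):

kubectl -n kube-system logs ds/node-local-dns --tail=50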

kron4eg (Member) commented Jun 15, 2021

We run kubeone a second time. All the canal pods get terminated and restarted.

Are you using kubeone apply each time, or kubeone install?

kron4eg (Member) commented Jun 15, 2021

I can't reproduce this issue. Can you please share the deployment used to spawn the pods?

exolab (Author) commented Jun 15, 2021

We run kubeone a second time. All the canal pods get terminated and restarted.

Are you using kubeone apply each time, or kubeone install?

We are using apply in both cases...

exolab (Author) commented Jun 15, 2021

I can't reproduce this issue. Can you please share the deployment used to spawn the pods?

I am not entirely sure what you mean. This is how we are spawning the pod on the worker node:

kubectl run -i --tty dnsutils --image gcr.io/kubernetes-e2e-test-images/dnsutils:1.3 --kubeconfig terraform/credentials/kubeconfig --restart=Never -- sh

exolab (Author) commented Jun 16, 2021

@kron4eg I can confirm that deploying a fresh cluster using the most recent patched version and then running kubeone apply a second time no longer leads to the problem I had.

Thank you so much for your impressively swift reaction and resolution, @kron4eg!
