Networking problems after second kubeone run #1379

Closed
exolab opened this issue Jun 12, 2021 · 7 comments · Fixed by #1386
Labels
triage/support Indicates an issue that is a support question.

Comments


exolab commented Jun 12, 2021

I am using kubeone to set up a cluster on Hetzner cloud. After the initial run, things mostly work. There is a problem reaching kube-dns-upstream from pods on the worker nodes, but that can be fixed by doing a rolling restart of coredns.
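
For reference, the rolling restart of coredns can be done with something along these lines, assuming the default coredns deployment in kube-system (adjust the name if your setup differs):

kubectl -n kube-system rollout restart deployment coredns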

However, after I run kubeone for a second time, canal gets redeployed, which then renders the cluster unusable. Every request seems to be taking 10 seconds (which of course makes debugging a pain).

This is how I execute kubeone (on both runs). kubeone.yaml and tf.json are identical on both runs. I am using kubeone 1.2.2 (and have also tried 1.2.1) and have tried with kubernetes 1.20 as well.

kubeone apply --auto-approve --debug --manifest kubeone.yaml -t tf.json
-- kubeone.yaml
apiVersion: kubeone.io/v1beta1
kind: KubeOneCluster
versions:
  kubernetes: 1.19.11
cloudProvider:
  hetzner: {}
  external: true
containerRuntime:
  containerd: {}
-- tf.json
{
  "kubeone_api": {
    "sensitive": false,
    "type": [
      "object",
      {
        "endpoint": "string"
      }
    ],
    "value": {
      "endpoint": "x.x.x.x"
    }
  },
  "kubeone_hosts": {
    "sensitive": false,
    "type": [
      "object",
      {
        "control_plane": [
          "object",
          {
            "bastion": "string",
            "cloud_provider": "string",
            "cluster_name": "string",
            "network_id": "string",
            "private_address": [
              "tuple",
              [
                "string",
                "string",
                "string"
              ]
            ],
            "public_address": "dynamic",
            "ssh_agent_socket": "string",
            "ssh_port": "number",
            "ssh_private_key_file": "string",
            "ssh_user": "string"
          }
        ]
      }
    ],
    "value": {
      "control_plane": {
        "bastion": "xxx.xxx.xxx.xxx",
        "cloud_provider": "hetzner",
        "cluster_name": "test-001",
        "network_id": "xxx",
        "private_address": [
          "x.x.x.x",
          "x.x.x.x",
          "x.x.x.x"
        ],
        "public_address": null,
        "ssh_agent_socket": "env:SSH_AUTH_SOCK",
        "ssh_port": 22,
        "ssh_private_key_file": "/builds/infrastructure/infrastructure.tmp/TF_VAR_CLUSTER_PRIVATE_KEY",
        "ssh_user": "root"
      }
    }
  },
  "kubeone_workers": {
    "sensitive": false,
    "type": [
      "object",
      {
        "test-001-pool1": [
          "object",
          {
            "providerSpec": [
              "object",
              {
                "cloudProviderSpec": [
                  "object",
                  {
                    "firewall": "string",
                    "image": "string",
                    "labels": [
                      "object",
                      {
                        "test-001-workers": "string"
                      }
                    ],
                    "location": "string",
                    "networks": [
                      "tuple",
                      [
                        "string"
                      ]
                    ],
                    "serverType": "string"
                  }
                ],
                "operatingSystem": "string",
                "operatingSystemSpec": [
                  "object",
                  {
                    "distUpgradeOnBoot": "bool"
                  }
                ],
                "sshPublicKeys": [
                  "tuple",
                  [
                    "string"
                  ]
                ]
              }
            ],
            "replicas": "number"
          }
        ]
      }
    ],
    "value": {
      "test-001-pool1": {
        "providerSpec": {
          "cloudProviderSpec": {
            "image": "ubuntu-20.04",
            "labels": {
              "test-001-workers": "pool1"
            },
            "location": "nbg1",
            "networks": [
              "x"
            ],
            "serverType": "ccx12"
          },
          "operatingSystem": "ubuntu",
          "operatingSystemSpec": {
            "distUpgradeOnBoot": false
          },
          "sshPublicKeys": [
            "ssh-ed25519 xxxxxxx"
          ]
        },
        "replicas": 1
      }
    }
  }
}

Does anyone have an idea what might be causing this? What logs should I look at specifically for debugging?

exolab added the triage/support label Jun 12, 2021

kron4eg (Member) commented Jun 13, 2021

@exolab please try this PR (you will have to compile it): #1380. This PR will be released as kubeone v1.2.3.
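
For anyone following along, checking out and building the PR branch can look roughly like this (the make build target is an assumption based on the repo's usual Go tooling; adjust to whatever the Makefile actually provides):

git clone https://github.com/kubermatic/kubeone.git
cd kubeone
git fetch origin pull/1380/head:pr-1380
git checkout pr-1380
make build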

exolab (Author) commented Jun 15, 2021

@kron4eg Thank you for the swift reply and PR. We did build kubeone ourselves and have since migrated to the released v1.2.3. However, we still see the same thing happening.

  1. We deploy the cluster.
  2. We start a container on a worker node, log in, and can run nslookup google.com just fine; we get fast responses
  3. We run kubeone a second time. All the canal pods get terminated and restarted.
  4. We try nslookup again and get no response: ;; connection timed out; no servers could be reached
  5. Looking at the node-local-dns logs on the worker node (log command sketched below), we see that the kube-dns-upstream service cannot be reached: [ERROR] plugin/errors: 2 route53.amazonaws.com.cluster.local. A: dial tcp 10.106.52.115:53: i/o timeout

This only happens after the second kubeone run.
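
For reference, the lookup in steps 2 and 4 is a plain nslookup google.com from inside the test pod, and the node-local-dns logs in step 5 can be pulled with something along these lines (the DaemonSet name node-local-dns is an assumption based on the usual addon manifest; adjust if it is named differently):

kubectl -n kube-system logs ds/node-local-dns --tail=50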

kron4eg (Member) commented Jun 15, 2021

We run kubeone a second time. All the canal pods get terminated and restarted.

Are you using kubeone apply each time, or kubeone install?

kron4eg (Member) commented Jun 15, 2021

I can't reproduce this issue. Can you please share the deployment used to spawn the pods?

exolab (Author) commented Jun 15, 2021

We run kubeone a second time. All the canal pods get terminated and restarted.

Are you using kubeone apply each time, or kubeone install?

We are using apply in both cases...

exolab (Author) commented Jun 15, 2021

I can't reproduce this issue. Can you please share the deployment used to spawn the pods?

I am not entirely sure what you mean. This is how we are spawning the pod on the worker node:

kubectl run -i --tty dnsutils --image gcr.io/kubernetes-e2e-test-images/dnsutils:1.3 --kubeconfig terraform/credentials/kubeconfig --restart=Never -- sh

exolab (Author) commented Jun 16, 2021

@kron4eg I can confirm that deploying a fresh cluster using the most recent patched version and then running kubeone apply a second time no longer leads to the problem I had.

Thank you so much for your impressively swift reaction and resolution, @kron4eg!
