
Migrating from kube-aws 0.14 to 0.15 issues #1837

Closed
paalkr opened this issue Feb 19, 2020 · 5 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@paalkr
Contributor

paalkr commented Feb 19, 2020

@andersosthus and I have test-migrated a few clusters from kube-aws 0.14.3 to 0.15.1/0.15.2, and we have discovered a few issues:

  1. "v0.15.x branch: Make cloud-controller-manager an experimental feature" (#1832) introduced a problem with the cloud-controller-manager ("v0.15.1 - Cloud-Controller-Manager - configure-cloud-routes is not disabled", #1833). The issue was fixed in "Added back in --configure-cloud-routes=false" (#1834) and included in the kube-aws 0.15.2 release, thanks @davidmccormick.

  2. Using etcd.memberIdentityProvider: eni introduces a problem when cleaning up the etcd stack after migration, because the control-plane stack imports the Etcd0PrivateIP, Etcd1PrivateIP and Etcd2PrivateIP exports. These exports are not part of the rendered etcd CloudFormation stack in 0.15. A temporary workaround is to edit etcd.json.tmpl after doing a render stack and temporarily add back the missing exports, as in the excerpt and command sketch below. This makes sure the update can continue; once it has gone through, the added values can be removed and a new update issued.

  },
  "Outputs": {
    "Etcd0PrivateIP": {
      "Description": "The private IP for etcd node 0",
      "Value": "10.9.151.115",
      "Export": {
        "Name": {
          "Fn::Sub": "${AWS::StackName}-Etcd0PrivateIP"
        }
      }
    },
    "Etcd1PrivateIP": {
      "Description": "The private IP for etcd node 1",
      "Value": "10.9.180.114",
      "Export": {
        "Name": {
          "Fn::Sub": "${AWS::StackName}-Etcd1PrivateIP"
        }
      }
    },
    "Etcd2PrivateIP": {
      "Description": "The private IP for etcd node 2",
      "Value": "10.9.219.3",
      "Export": {
        "Name": {
          "Fn::Sub": "${AWS::StackName}-Etcd2PrivateIP"
        }
      }
    },    
    "StackName": {
      "Description": "The name of this stack which is used by node pool stacks to import outputs from this stack",
      "Value": { "Ref": "AWS::StackName" }
    }
    {{range $index, $etcdInstance := $.EtcdNodes }},
    "{{$etcdInstance.LogicalName}}FQDN": {
      "Description": "The FQDN for etcd node {{$index}}",
      "Value": {{$etcdInstance.AdvertisedFQDN}}
    }
    {{- end}}
    {{range $n, $r := .ExtraCfnOutputs -}}
    ,
    {{quote $n}}: {{toJSON $r}}
    {{- end}}
  }
}
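
The workaround flow then looks roughly like this (a sketch; command names follow the kube-aws CLI, and depending on version the update step is kube-aws apply or kube-aws update):

  # regenerate the stack templates
  kube-aws render stack
  # re-add the Etcd0/1/2PrivateIP exports to stack-templates/etcd.json.tmpl as in the excerpt above
  kube-aws apply
  # once the control-plane stack no longer needs the imports, remove the temporary exports and update again
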
  3. The export-existing-etcd-state.service responsible for exporting data from the old etcd cluster and preparing the export files on disk in /var/run/coreos/etcdadm/snapshots takes so long that the CloudFormation stack might roll back. Even on a close to "empty" cluster the migration can take many minutes; migrating a very small cluster with only a few resources took 45 minutes.

The etcd stack rollback timeout is based on the CreateTimeout of the controller https://github.com/kubernetes-incubator/kube-aws/blob/b34d9b69069321111d3ca3e24c53fdba8ccecd2c/builtin/files/stack-templates/etcd.json.tmpl#L365, which is a little confusing: you actually have to increase controller.createTimeout to give the etcd migration more time.
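
For example, a minimal cluster.yaml sketch (an assumption on my part that the value uses the ISO 8601 duration format kube-aws uses for timeouts):

  controller:
    # counter-intuitively, this value also sets the rollback window for the etcd stack
    createTimeout: PT1H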

CloudFormation does not allow more than 60 minutes of wait time here, so I fear that the etcd migration process will not work for larger clusters.
Using a WaitCondition https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-waitcondition.html that receives a signal from the migration script once it finishes might be a functional approach, for example:

  WaitForEtcdMigration:
    Type: AWS::CloudFormation::WaitCondition
    CreationPolicy:
      ResourceSignal:
        Timeout: PT2H # can be more than 60 minutes
        Count: 1
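
The migration script would then send a success signal to the WaitCondition when the export/restore completes. A minimal sketch with cfn-signal (assuming the helper is available on the etcd nodes; STACK_NAME and AWS_REGION are placeholders):

  # tell CloudFormation the etcd migration finished successfully
  cfn-signal --success true \
    --stack "${STACK_NAME}" \
    --resource WaitForEtcdMigration \
    --region "${AWS_REGION}"
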
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 19, 2020
@dominicgunn
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 28, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 26, 2020
@jorge07
Contributor

jorge07 commented Aug 27, 2020

I think this is important enough to /remove-lifecycle stale

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 26, 2020