
Migrating from kube-aws 0.14 to 0.15 issues #1837

Closed
paalkr opened this issue Feb 19, 2020 · 5 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@paalkr
Contributor

paalkr commented Feb 19, 2020

@andersosthus and I have test-migrated a few clusters from kube-aws 0.14.3 to 0.15.1/0.15.2, and we have discovered a few issues:

  1. "v0.15.x branch: Make cloud-controller-manager an experimental feature" (#1832) introduced a problem with the cloud-controller-manager ("v0.15.1 - Cloud-Controller-Manager - configure-cloud-routes is not disabled", #1833). The issue was fixed in "Added back in --configure-cloud-routes=false" (#1834) and included in the kube-aws 0.15.2 release, thanks @davidmccormick.

  2. Using etcd.memberIdentityProvider: eni introduces a problem when cleaning up the etcd stack after migration, because the control-plane stack imports the Etcd0PrivateIP, Etcd1PrivateIP and Etcd2PrivateIP exports. These exports are not part of the rendered etcd CloudFormation stack in 0.15. A temporary workaround is to edit etcd.json.tmpl after doing a render stack and temporarily add back the missing exports, as in the excerpt and command sketch below. This makes sure the update can continue; once it has gone through, the added values can be removed and a new update issued.

  },
  "Outputs": {
    "Etcd0PrivateIP": {
      "Description": "The private IP for etcd node 0",
      "Value": "10.9.151.115",
      "Export": {
        "Name": {
          "Fn::Sub": "${AWS::StackName}-Etcd0PrivateIP"
        }
      }
    },
    "Etcd1PrivateIP": {
      "Description": "The private IP for etcd node 1",
      "Value": "10.9.180.114",
      "Export": {
        "Name": {
          "Fn::Sub": "${AWS::StackName}-Etcd1PrivateIP"
        }
      }
    },
    "Etcd2PrivateIP": {
      "Description": "The private IP for etcd node 2",
      "Value": "10.9.219.3",
      "Export": {
        "Name": {
          "Fn::Sub": "${AWS::StackName}-Etcd2PrivateIP"
        }
      }
    },    
    "StackName": {
      "Description": "The name of this stack which is used by node pool stacks to import outputs from this stack",
      "Value": { "Ref": "AWS::StackName" }
    }
    {{range $index, $etcdInstance := $.EtcdNodes }},
    "{{$etcdInstance.LogicalName}}FQDN": {
      "Description": "The FQDN for etcd node {{$index}}",
      "Value": {{$etcdInstance.AdvertisedFQDN}}
    }
    {{- end}}
    {{range $n, $r := .ExtraCfnOutputs -}}
    ,
    {{quote $n}}: {{toJSON $r}}
    {{- end}}
  }
}
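
The workaround flow then looks roughly like this (a sketch; command names follow the kube-aws CLI, and depending on version the update step is kube-aws apply or kube-aws update):

  # regenerate the stack templates
  kube-aws render stack
  # re-add the Etcd0/1/2PrivateIP exports to stack-templates/etcd.json.tmpl as in the excerpt above
  kube-aws apply
  # once the control-plane stack no longer needs the imports, remove the temporary exports and update again
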
  3. The export-existing-etcd-state.service responsible for exporting data from the old etcd cluster and preparing the export files on disk in /var/run/coreos/etcdadm/snapshots takes so long that the CloudFormation stack might roll back. Even on a close to "empty" cluster the migration can take many minutes; migrating a very small cluster with only a few resources took 45 minutes.

The etcd stack rollback timeout is based on the CreateTimeout of the controller https://github.com/kubernetes-incubator/kube-aws/blob/b34d9b69069321111d3ca3e24c53fdba8ccecd2c/builtin/files/stack-templates/etcd.json.tmpl#L365, which is a little confusing: you actually have to increase controller.createTimeout to give the etcd migration more time.
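
For example, a minimal cluster.yaml sketch (an assumption on my part that the value uses the ISO 8601 duration format kube-aws uses for timeouts):

  controller:
    # counter-intuitively, this value also sets the rollback window for the etcd stack
    createTimeout: PT1H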

CloudFormation does not allow more than 60 minutes of wait time here, so I fear that the etcd migration process will not work for larger clusters.
Using a WaitCondition https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-waitcondition.html that receives a signal from the migration script once it finishes might be a functional approach, for example:

  WaitForEtcdMigration:
    Type: AWS::CloudFormation::WaitCondition
    CreationPolicy:
      ResourceSignal:
        Timeout: PT2H # can be more than 60 minutes
        Count: 1
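
The migration script would then send a success signal to the WaitCondition when the export/restore completes. A minimal sketch with cfn-signal (assuming the helper is available on the etcd nodes; STACK_NAME and AWS_REGION are placeholders):

  # tell CloudFormation the etcd migration finished successfully
  cfn-signal --success true \
    --stack "${STACK_NAME}" \
    --resource WaitForEtcdMigration \
    --region "${AWS_REGION}"
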
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 19, 2020
@dominicgunn
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 28, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 26, 2020
@jorge07
Contributor

jorge07 commented Aug 27, 2020

I think this is important enough to /remove-lifecycle stale

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 26, 2020