pkg/asset/machines/worker: Structured workers #1481
Conversation
Force-pushed c0bc4dc to 009963e
e2e-aws passed for c0bc4dce1, and I've just pushed c0bc4dce1 -> 009963e27, adding additional commits pushing this into Terraform and removing infrastructure from unused (at install-time) subnets.
Force-pushed 009963e to c06f889
Force-pushed ba76eea to 8516a53
I've pushed 009963e -> 8516a53, rebasing onto master and fixing a segfault. Turns out …
Force-pushed e558f55 to cf2e333
Force-pushed cf2e333 to 75f00f1
/lgtm @wking feel free to /hold cancel if #1481 (comment) doesn't resonate with you. :)
/hold
Force-pushed 6723bbd to c323003
/hold cancel Ok, I've pushed 6723bbd15 -> c32300383, inserting 549a704b15 to limit default MachineSets to the number needed to satisfy the configured …
There we go:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1481/pull-ci-openshift-installer-master-e2e-aws/4737/artifacts/e2e-aws/installer/.openshift_install.log | grep 'aws_nat.*Creation'
time="2019-03-28T06:50:04Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[1]: Creation complete after 1m36s (ID: nat-0d67c2bffb6621025)"
time="2019-03-28T06:50:14Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[2]: Creation complete after 1m46s (ID: nat-058c9c84f360675ca)2019-03-28T06:50:14.803Z [DEBUG] plugin.terraform-provider-aws: <key>expirationDate</key>"
time="2019-03-28T06:50:35Z" level=debug msg="module.vpc.aws_nat_gateway.nat_gw[0]: Creation complete after 2m6s (ID: nat-0e2918fa6c686c519)" And we're all green. |
/test e2e-aws
Looks like, of the tests that made it to the install, we're around 50/50.
/hold We should not be decreasing the default deployment; we have a way for customers or CI to pick a subset of AZs.
The final consumer was removed in 361c082 (govcloud: remove terraform configs, 2018-03-23, coreos/tectonic-installer#3134).
…hines Since f2eacf3 (asset/machines/master: allow adding MachineConfigs for control-plane machinepool, 2019-03-25, openshift#1150), we no longer need to filter a File slice to get a Master object's Machine files. Drop the obsoleted Machines() implementation and rename the previous StructuredMachines implementation to take its place.
Like the earlier 7a396d9 (pkg/asset/machines: Convert Master to a WriteableAsset, 2019-02-07, openshift#1211), but for Worker. This is groundwork for extracting configured availability zones when generating AWS Terraform variables, which will in turn allow us to avoid provisioning availability zones unless we are going to populate them with machines.
…(Set)s

This commit updates our Terraform variables to include the worker subnets, and then switches on that (and the master zones) in Terraform to avoid creating subnet infrastructure (NAT gateways, routes, etc.) in zones that have no Machine(Set)s. This helps address limit issues in high-zone regions like us-east-1, as seen in the limits.md change. Note that without a reduction in our default MachineSet creation, the installer defaults will still not work on us-east-1 without a limit bump.

The drawback is that users are now on the hook to provision their own subnets in other zones if they decide that they want to grow into a new zone as a day-2 Machine(Set) operation. For now, they'll have to provide their own infrastructure for that, and our user-provided-infrastructure docs should give them sufficient grounding to do so. It's possible that in the future the machine-API or other infrastructure operator could dynamically provision subnets in zones that were not populated at install-time, but I can't hazard a guess as to how likely that will be.

The HCL functions for combining the zone lists are documented in [1,2].

[1]: https://www.terraform.io/docs/configuration-0-11/interpolation.html#concat-list1-list2-
[2]: https://www.terraform.io/docs/configuration-0-11/interpolation.html#distinct-list-
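The commit message above cites Terraform 0.11's concat and distinct interpolation functions for combining the zone lists. Here is a minimal HCL sketch of that approach, assuming illustrative names throughout (the zone-list variables, vpc_id, cidr_block, and the subnet resource are stand-ins, not the installer's actual configuration):

```hcl
# Zones that will hold control-plane and compute Machine(Set)s.
variable "master_availability_zones" {
  type = "list"
}

variable "worker_availability_zones" {
  type = "list"
}

variable "vpc_id" {}

variable "cidr_block" {
  default = "10.0.0.0/16"
}

locals {
  # Union of every zone that will actually be populated with machines.
  new_az_list = "${distinct(concat(var.master_availability_zones, var.worker_availability_zones))}"
}

# One private subnet (and, by extension, one NAT gateway, route table,
# etc.) per populated zone, instead of one per zone in the region.
resource "aws_subnet" "private" {
  count             = "${length(local.new_az_list)}"
  vpc_id            = "${var.vpc_id}"
  availability_zone = "${element(local.new_az_list, count.index)}"
  cidr_block        = "${cidrsubnet(var.cidr_block, 4, count.index)}"
}
```

Sizing the per-zone resources with count = length(...) is what lets unpopulated zones drop out entirely: anything keyed off the combined list simply never gets created for them.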
Force-pushed c323003 to 644f705
I'm not convinced yet ;). But we seem to be on the same page for the rest of this PR, so I've pushed c32300383 -> 644f705, rebasing onto master, dropping 549a704b1, and adjusting the docs accordingly. I'll circle back with 549a704b1 in a follow-up PR, and we can hash this out there.
/lgtm
/retest
[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Forgot to lift the hold. /hold cancel
/retest
This is very similar to the earlier e8921c3 (ci-operator/templates/openshift: Get e2e-aws out of us-east-1b, 2019-03-22, openshift#3204). This time, however, I'm not changing the zones where the machines will run. By default, the installer will provision zone infrastructure in all available zones, but since openshift/installer@644f705286 (data/aws/vpc: Only create subnet infrastucture for zones with Machine(Set)s, 2019-03-27, openshift/installer#1481), users who explicitly set zones in their install-config will no longer have unused zones provisioned with subnets, NAT gateways, EIPs, and other related infrastructure.

This infrastructure reduction has two benefits in CI:

1. We don't have to pay for resources that we won't use, and we will have more room under our EIP limits (although we haven't bumped into that one in a while, because we're VPC-constrained).

2. We should see reduced rates of clusters failing install because of AWS rate limiting, with results like [1]:

   aws_route.to_nat_gw.3: Error creating route: timeout while waiting for state to become 'success' (timeout: 2m0s)

   The reduction is because:

   i. We'll be making fewer requests for these resources, because we won't need to create (and subsequently tear down) as many of them. This will reduce our overall AWS-API load somewhat, although the reduction will be incremental because we have so many other resources which are not associated with zones.

   ii. Throttling timeouts for these per-zone resources are the ones that tend to break Terraform [2]. So even if the rate of timeouts per API request remains unchanged, a given cluster will only have half as many (three vs. the old six) per-zone chances of hitting one of the timeouts. This should give us something close to a 50% reduction in clusters hitting throttling timeouts.

The drawback is that we're diverging further from the stock "I just called 'openshift-install create cluster' without providing an install-config.yaml" experience.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_console-operator/187/pull-ci-openshift-console-operator-master-e2e-aws-operator/575/artifacts/e2e-aws-operator/installer/.openshift_install.log

[2]: With a cache of build-log.txt from the past ~48 hours:

    $ grep -hr 'timeout while waiting for state' ~/.cache/openshift-deck-build-logs >timeouts
    $ wc -l timeouts
    362 timeouts
    $ grep aws_route_table_association timeouts | wc -l
    214
    $ grep 'aws_route\.to_nat_gw' timeouts | wc -l
    102

So (102+214)/362 is 87% of our timeouts, with the remainder being almost entirely related to the internet gateway (which is not per-zone).
Building on #1150 (which will hopefully squeak through CI any moment now), this pull request is structuring our Worker (as 7a396d9, #1211, did for Master). This is groundwork for extracting configured availability zones when generating AWS Terraform variables, which will in turn allow us to avoid provisioning availability zones unless we are going to populate them with machines.
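For a concrete picture of the groundwork described here, a hedged sketch of the kind of Terraform variables the structured Worker asset would make possible; the key names are illustrative, not the installer's actual tfvars schema:

```hcl
# Hypothetical tfvars excerpt, read straight off the generated
# Machine(Set) objects rather than re-derived from the install-config.
# Zones absent from both lists would get no subnets, NAT gateways, or
# EIPs at install time.
master_availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
worker_availability_zones = ["us-east-1a", "us-east-1b"]
```

Terraform would then take the distinct union of the two lists (as in the concat/distinct sketch after the vpc commit message above) and size its per-zone resources from that.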