-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[OCPCLOUD-1171] Implement CCCMO render to run CCM on bootstrap node #4947
Conversation
# Source config map is located in https://github.com/openshift/cluster-cloud-controller-manager-operator/blob/master/manifests/0000_26_cloud-controller-manager-operator_01_images.configmap.yaml | ||
bootkube_podman_run \ | ||
--name copy-cloud-controller-manager-images \ | ||
--rm -it \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably don't need an interactive terminal on the bootstrap machine, right? And since in the happy case we will shortly remove the whole machine, I think we want to keep the metadata for debug logging. Can we drop this line?
# Copy the CCCMO images configMap to resolve pod images from internal registry in CCCMO render run | ||
# Source config map is located in https://github.com/openshift/cluster-cloud-controller-manager-operator/blob/master/manifests/0000_26_cloud-controller-manager-operator_01_images.configmap.yaml | ||
bootkube_podman_run \ | ||
--name copy-cloud-controller-manager-images \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: existing style seems to be an extra level of indentation for these wrapped lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error: error creating container storage: the container name "copy-cloud-controller-manager-images" is already in use by "5ca8294f8b2b2089541e267b4d46aeed08295f1ca16e14a45dd80807bf8e03ca". You have to remove that container to be able to reuse that name.: that name is already in use
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my comment below, that is the root cause, https://github.com/openshift/installer/pull/4947/files#r662375716
--dest-dir=/assets/cloud-controller-manager-bootstrap \ | ||
--cluster-infrastructure-file=/assets/manifests/cluster-infrastructure-02-config.yml | ||
|
||
cp cloud-controller-manager-bootstrap/bootstrap/* bootstrap-manifests// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: //
-> /
cp cloud-controller-manager-bootstrap/bootstrap/* bootstrap-manifests// | ||
|
||
touch cloud-controller-manager-bootstrap.done | ||
record_service_stage_success |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: inconsistent indent, although that appears to be the existing style for record_service_stage_success
in this file.
Since a fair amount of thought and testing has gone into this, I think it's worth explaining the deadlock that this is avoiding in more detail. Which operators and pods? Why would their being unable to come up on the tainted control plane node's block the cloud-controller-manager from coming up? |
@wking In my testing I saw CCCMO stuck in containerCreating state, as it was not able to pull the image. It might be that we should start with image-registry, the rest is under still under investigation. Would definitely document the findings |
Image-registry is for local images, and the release image and its references all come from external registries during bootstrapping. The local registry is one of the last components to come up, because it runs on computer node's, which are in turn provisioned by machine-API components hosted on control-plane nodes. |
@Danil-Grigorev Just want to make sure that you are aware that, this PR, in its current state, shouldn't merge until the CCCMO image has been added to the full payload, and are accounting for that in your planning around this. If you want to merge this before CCCMO is added to the full payload, you will need to ensure that the bootstrap script accounts for when there is no CCCMO image |
@JoelSpeed We would definitely look into this once again once render implementation would get merged in CCCMO and release addition request would be satisfied. The whole feature won't work until OpenStack, AWS and Azure cloud-provider repos won't be included in the release too. |
This allows master nodes to be initialized with CCM replica from bootstrap, speeding up process of cluster installation. Without CCM running on bootstrap node, most of the cluster operator and operand pods would have to tolerate new taint and a different configuration of node.kubernetes.io/not-ready (NoSchedule) taint: - effect: NoSchedule key: node.cloudprovider.kubernetes.io/uninitialized value: "true" - effect: NoSchedule key: node.kubernetes.io/not-ready Without this bootstrap addition or taint toleration the cluster with CCM would never become ready. To test this feature on AWS build installer and deploy on AWS. To create a cluster run: ```bash ./hack/build.sh ./bin/openshift-install create manifests cat <<EOF > manifests/manifest_feature_gate.yaml --- apiVersion: config.openshift.io/v1 kind: FeatureGate metadata: annotations: include.release.openshift.io/self-managed-high-availability: "true" include.release.openshift.io/single-node-developer: "true" release.openshift.io/create-only: "true" name: cluster spec: customNoUpgrade: enabled: - ExternalCloudProvider - CSIMigrationAWS - CSIMigrationOpenStack featureSet: CustomNoUpgrade EOF OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=quay.io/dgrigore/release:cccmo-render ./bin/openshift-install create cluster ```
dd6b99b
to
233896e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ran into a bug while testing
# Copy rendered resources to manifests folder | ||
cp cloud-controller-manager-bootstrap/bootstrap/* bootstrap-manifests/ | ||
# Copy cloud config to /etc/kubernetes/bootstrap-configs | ||
cp cloud-controller-manager-bootstrap/config/* /etc/kubernetes/bootstrap-configs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will fail if no cloud config is generated, when i run on aws and there is no cloud config, i see this in the bootkube.service output:
Jul 01 13:29:18 ip-10-0-12-89 bootkube.sh[2207]: cp: cannot stat 'cloud-controller-manager-bootstrap/config/*': No such file or directory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this patch is allowing me to proceed
diff --git a/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template b/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template
index a36dbee28..3c3c61d6c 100755
--- a/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template
+++ b/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template
@@ -245,8 +245,10 @@ then
-c "cat /release-manifests/0000_26_cloud-controller-manager-operator_01_images.configmap.yaml" > manifests/ccm-images.yaml
ADDITIONAL_FLAGS=""
+ CLOUD_CONFIG_EXISTS=""
if [ -f "$PWD/manifests/cloud-provider-config.yaml" ]; then
ADDITIONAL_FLAGS="--cloud-config-file=/assets/manifests/cloud-provider-config.yaml"
+ CLOUD_CONFIG_EXISTS="yes"
fi
bootkube_podman_run \
@@ -262,7 +264,9 @@ then
# Copy rendered resources to manifests folder
cp cloud-controller-manager-bootstrap/bootstrap/* bootstrap-manifests/
# Copy cloud config to /etc/kubernetes/bootstrap-configs
- cp cloud-controller-manager-bootstrap/config/* /etc/kubernetes/bootstrap-configs
+ if [ ! -z "$CLOUD_CONFIG_EXISTS" ]; then
+ cp cloud-controller-manager-bootstrap/config/* /etc/kubernetes/bootstrap-configs
+ fi
touch cloud-controller-manager-bootstrap.done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh... Bash weirdness. Apparently this throws an error you saw if the directory is empty, even if there is wildcard at the source for the copy.
cp cloud-controller-manager-bootstrap/config/* /etc/kubernetes/bootstrap-configs
This works instead, even if the directory is empty:
cp -r cloud-controller-manager-bootstrap/config/. /etc/kubernetes/bootstrap-configs
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
bootkube_podman_run \ | ||
--name copy-cloud-controller-manager-images \ | ||
--entrypoint /bin/bash "${RELEASE_IMAGE_DIGEST}" \ | ||
-c "cat /release-manifests/0000_26_cloud-controller-manager-operator_01_images.configmap.yaml" > manifests/ccm-images.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/release-manifests/0000_26_cloud-controller-manager-operator_01_images.configmap.yaml: No such file or directory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was it a CI build? Seems like CCCMO is not present in the payload you were using
# Copy the CCCMO images configMap to resolve pod images from internal registry in CCCMO render run | ||
# Source config map is located in https://github.com/openshift/cluster-cloud-controller-manager-operator/blob/master/manifests/0000_26_cloud-controller-manager-operator_01_images.configmap.yaml | ||
bootkube_podman_run \ | ||
--name copy-cloud-controller-manager-images \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error: error creating container storage: the container name "copy-cloud-controller-manager-images" is already in use by "5ca8294f8b2b2089541e267b4d46aeed08295f1ca16e14a45dd80807bf8e03ca". You have to remove that container to be able to reuse that name.: that name is already in use
233896e
to
a45cffc
Compare
@Danil-Grigorev: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/close We have decided to take an alternate approach to bootstrapping which means this is no longer required. See openshift/enhancements#463 for more details on the updated approach |
@JoelSpeed: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This allows master nodes to be initialized with CCM replica from
bootstrap, speeding up process of cluster installation.
Without CCM running on bootstrap node, most of the cluster operator
and operand pods would have to tolerate new taint and a different
configuration of node.kubernetes.io/not-ready (NoSchedule) taint:
key: node.cloudprovider.kubernetes.io/uninitialized
value: "true"
Without this bootstrap addition or taint toleration the cluster with CCM
would never become ready.
To test this feature on AWS build installer and deploy on AWS.
To create a cluster run:
Gist with files CCCMO render outputs for AWS: https://gist.github.com/Danil-Grigorev/5732fd375dba7affff54eda267c4cd47