Intermittent Hangs at crane.Push() on Registry Push #2104
Comments
I did some testing on this and here's what I found:
While I have not found a smoking gun for this, the testing I've done seems to indicate it might be related to the default RKE2 CNI.
Yeah, that is what we are leaning toward after some internal testing as well. A potentially interesting data point: do you ever see this issue with https://docs.zarf.dev/docs/the-zarf-cli/cli-commands/zarf_package_mirror-resources#examples? (For the internal registry you can take the first example and swap the passwords and the package; if you don't have git configured, just omit that set of flags.)
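For anyone wanting to try that, a rough sketch of the internal-registry variant (the package file, address, and password are placeholders; the address assumes the default registry NodePort reachable from where you run the command, and the flag names match the ones used later in this thread):

```bash
# Mirror a package's images straight into the internal Zarf registry over its NodePort.
# Package file, address, and password below are placeholders; adjust for your cluster.
REGISTRY_PUSH_PASSWORD="<your zarf-push password>"   # placeholder

zarf package mirror-resources zarf-package-example-amd64.tar.zst \
  --registry-url 127.0.0.1:31999 \
  --registry-push-username zarf-push \
  --registry-push-password "${REGISTRY_PUSH_PASSWORD}"
```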
(A potential addition to the theory is that other things in the cluster may be stressing it as well.)
Also, what is the node role layout for your clusters? I have heard reports that if all nodes are control plane nodes, the issue is not seen.
We've always had agent nodes when we saw this issue, whether with 1 or 3 control plane nodes. We've never seen this issue on single node clusters. Haven't tried a cluster with only 3 control plane nodes and no agent nodes.
Just to add another data point from what we've seen - we can deploy OK with multi-node clusters but only if the nodes are all RKE2 servers. As soon as we make one an agent, the Zarf registry runs there and we see this behavior as well.
Additional agent nodes are OK, but we've tainted those so the Zarf registry doesn't run there.
I can confirm that adding a nodeSelector and taint/toleration to schedule the zarf registry pod(s) on the RKE2 control plane node(s) does work around the issue.
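For anyone else trying the same thing, a hedged sketch of one way to do it after init (the `zarf` namespace, deployment name, and RKE2 label/taint keys are assumptions; verify them against your cluster first):

```bash
# Pin the Zarf registry to control-plane nodes and tolerate a control-plane taint if one exists.
# The "zarf" namespace, deployment name, and label/taint keys below are assumptions --
# check them with `kubectl get nodes --show-labels` and `kubectl get deploy -A` before applying.
kubectl --namespace zarf patch deployment zarf-docker-registry --type merge --patch '{
  "spec": {"template": {"spec": {
    "nodeSelector": {"node-role.kubernetes.io/control-plane": "true"},
    "tolerations": [{
      "key": "node-role.kubernetes.io/control-plane",
      "operator": "Exists",
      "effect": "NoSchedule"
    }]
  }}}
}'
```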
…2190)

## Description
This PR fixes error channel handling for Zarf tunnels so lost pod connections don't result in infinite spins. This should mostly resolve 2104, though not marking it "Fixes" as, depending on how many pod connection errors occur, a deployment could still run out of retries.

## Related Issue
Relates to #2104

## Type of change
- [X] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Other (security config, docs update, etc)

## Checklist before merging
- [ ] Test, docs, adr added or updated as needed
- [X] [Contributor Guide Steps](https://github.com/defenseunicorns/zarf/blob/main/CONTRIBUTING.md#developer-workflow) followed
Just wanted to chime in and say that this problem is still reproducible with the changes in #2190.
Just noting we're still encountering this on RKE2 with EBS-backed PVCs. Not really any additional details on how/why we encountered this, but we were able to work around it by pushing the image that was hanging "manually"/via a small zarf package. EDIT: To clarify, this was a zarf package that we built with a single component containing the single image that commonly stalled on deploy. Then we created/deployed it and, once finished, we deployed our "real" zarf package and it sped past the image push. Not sure why this worked better, but it seemed to consistently help when we hit stalling images.
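For anyone wanting to replicate that workaround, a minimal sketch of such a single-image package (the package name and image reference are placeholders; the output file name assumes an amd64 build):

```bash
# Wrap the image that keeps stalling in its own tiny package, deploy it first,
# then deploy the full package afterwards. The image reference below is a placeholder.
cat > zarf.yaml <<'EOF'
kind: ZarfPackageConfig
metadata:
  name: stalling-image-only
components:
  - name: stalling-image
    required: true
    images:
      - registry.example.com/some/stalling-image:1.2.3
EOF

zarf package create . --confirm
zarf package deploy zarf-package-stalling-image-only-amd64.tar.zst --confirm
```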
This is a super longstanding issue upstream that we've been band-aiding for a few years (in Kubernetes land). The root of the issue is that SPDY is long dead but is still used for all streaming functionality in Kubernetes. The current port forward logic depends on SPDY and an implementation that is overdue for a rewrite. KEP 4006 should be an actual fix as we replace SPDY. We are currently building mitigations into Zarf to try and address this. What we really need is an environment where we can replicate the issue and test different fixes. If anyone has any ideas... Historically we've been unable to reproduce this.
This should be mitigated now in https://github.com/defenseunicorns/zarf/releases/tag/v0.32.4 - leaving this open until we get more community feedback though (and again, this is a mitigation, not a true fix; that will have to happen upstream).
(Also thanks to @benmountjoy111 and @docandrew for the .pcap files!)
## Description
This adds `--backoff` and `--retries` to package operations to allow those to be configured.

## Related Issue
Relates to #2104

## Type of change
- [ ] Bug fix (non-breaking change which fixes an issue)
- [X] New feature (non-breaking change which adds functionality)
- [ ] Other (security config, docs update, etc)

## Checklist before merging
- [x] Test, docs, adr added or updated as needed
- [X] [Contributor Guide Steps](https://github.com/defenseunicorns/zarf/blob/main/CONTRIBUTING.md#developer-workflow) followed

Signed-off-by: Eddie Zaneski <[email protected]>
Co-authored-by: Eddie Zaneski <[email protected]>
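A usage sketch with the new flags (the values and their formats here are illustrative; check `zarf package deploy --help` on your version for the exact syntax):

```bash
# Give flaky image pushes more retries and a longer backoff before giving up.
# The package name and flag values are illustrative; the --backoff value assumes
# a duration string -- confirm the expected format with --help.
zarf package deploy zarf-package-example-amd64.tar.zst \
  --retries 10 \
  --backoff 10s \
  --confirm
```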
Sadly, I do not think this solves the issue. I am still experiencing timeouts when publishing images. I am noticing that Zarf is now explicitly timing out instead of just hanging forever, though.
kubernetes/kubernetes#117493 should fix this upstream. Hopefully we can get it merged and backported.
Following up here to see if there's any more clarity on the exact issue we're facing... Based on the above comments, it seems like the current suspicion is that the issue originates from the kubectl port-forward/tunneling? Is that accurate, @eddiezane? In some testing on our environment we've consistently had failures with large image pushes. This is happening in the context of a UDS bundle, so not directly Zarf, but it's effectively just looping through each package to deploy. Our common error currently looks like the one above with timeouts. We have, however, had success pushing the images two different ways:
I think where I'm confused in all this is that I'd assume either of these workarounds would hit the same limitations with port-forwarding/tunneling. Is there anything to glean from this experience that might help explain the issue better or why these methods seem to work far more consistently? As @YrrepNoj mentioned above, we're able to hit this pretty much 100% consistently with our bundle deploy containing the Leapfrog images and haven't found any success outside of these workarounds.
I am seeing the same behavior as @RyanTepera1 using RKE2. One additional thing I've noticed is that killing the zarf-docker-registry pod while a push is stalled speeds it back up. For example, pushing this sonarqube image took nearly 5 minutes to get from 39% to 41%, but after killing the zarf-docker-registry pod, it pushes the image in less than a minute.
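For reference, bouncing the registry mid-push can be done along these lines (the `zarf` namespace is an assumption; the pod name prefix is the one mentioned above):

```bash
# Delete the registry pod; its Deployment recreates it, and the stalled push
# recovered quickly for the commenter above. Namespace "zarf" is an assumption --
# adjust if your registry runs elsewhere.
kubectl --namespace zarf delete \
  $(kubectl --namespace zarf get pods --output name | grep zarf-docker-registry)
```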
Moving the zarf-docker-registry pods to one of the RKE2 master nodes as suggested here did not improve performance for my deployment. I tried this with
@YrrepNoj provided the following workaround using Docker Push instead.
I'm running into the same issue... I'm trying to load in the rook/ceph images. I also tried with vanilla images, as uds is baking in the repo1 images, and I'm still getting the same error. Images I am using:
Related to #2864
Performance has significantly improved for me using `zarf package mirror-resources`.
@philiversen is correct; using `zarf package mirror-resources` from a package action works well. For example:

```yaml
variables:
  - name: CONTROL_PLANE_ONE_ADDRESS
    prompt: true
    description: This is the IP/Hostname of a control-plane

components:
  - ...
    actions:
      onDeploy:
        before:
          - cmd: |-
              ./zarf package mirror-resources zarf-init-amd64-*.tar.zst \
                --registry-url ${ZARF_VAR_CONTROL_PLANE_ONE_ADDRESS}:${ZARF_NODEPORT} \
                --registry-push-username zarf-push \
                --registry-push-password ${ZARF_REGISTRY_AUTH_PUSH}
```
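For completeness, a package built from a definition like the one above could then be deployed non-interactively along these lines (the package file name and address are placeholders, and `--set` is assumed to be available for supplying the prompted variable):

```bash
# Deploy the wrapper package, supplying the control-plane address up front
# instead of answering the interactive prompt. Package name and address are placeholders.
zarf package deploy zarf-package-mirror-wrapper-amd64.tar.zst \
  --set CONTROL_PLANE_ONE_ADDRESS=10.0.0.10 \
  --confirm
```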
Environment
Device and OS: Rocky 8 EC2
App version: 0.29.2
Kubernetes distro being used: RKE2 v1.26.9+rke2r1
Other: Bigbang v2.11.1
Steps to reproduce
1. Run `zarf package deploy zarf-package-mvp-cluster-amd64-v5.0.0-alpha.7.tar.zst --confirm -l=debug`.
2. The deploy intermittently hangs at `crane.Push()`. A retry usually works.

Expected result
That the `zarf package deploy ...` command wouldn't get hung up, and would continue along.

Actual Result
The `zarf package deploy ...` command gets hung up.

Visual Proof (screenshots, videos, text, etc)
Severity/Priority
There is a workaround: keep retrying until the process succeeds.
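A crude way to automate that retry loop (the package name is the one from the steps above; note this only helps once a hung deploy actually exits or is killed):

```bash
# Re-run the deploy until it exits successfully.
until zarf package deploy zarf-package-mvp-cluster-amd64-v5.0.0-alpha.7.tar.zst --confirm -l=debug; do
  echo "Deploy did not finish cleanly; retrying in 10s..."
  sleep 10
done
```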
Additional Context
This looks exactly like #1568, which was closed.
We have a multi-node cluster on AWS EC2; our package size is about 2.9 GB. Here are a few things that we noticed after some extensive testing: