Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci:[CNI] ACN PR Pipeline speed improvements #2077

Merged
merged 13 commits into from
Aug 14, 2023
Merged

Conversation

jpayne3506
Copy link
Contributor

@jpayne3506 jpayne3506 commented Jul 25, 2023

Reason for Change:

Improves the time it takes to run the ACN PR Pipeline by roughly 30 minutes by:

  • Moving cluster creation out of the individual E2E steps ~15 minutes
  • Lowering Scaleup for windows load test ~ 4 minutes
  • Decreasing the retry time for faster updates ~ 2 minutes
  • Removing echo-external-node deployment after Cilium connectivity test ~ 9 minutes (17 minutes total for the individual step)

Issue Fixed:

Requirements:

Notes:
To re-run a failed E2E stage, start from the appropriate cluster creation stage.

@jpayne3506 jpayne3506 added cni Related to CNI. ci Infra or tooling. labels Jul 25, 2023
@jpayne3506 jpayne3506 requested review from vipul-21 and camrynl July 25, 2023 21:30
@jpayne3506 jpayne3506 requested a review from a team as a code owner July 25, 2023 21:30
@vipul-21
Copy link
Contributor

There is a caveat to this approach in terms of usability. We no longer can simply rerun the E2E test in the pipeline(in case it fails due to transient issue, which it does a lot with goldpinger and stuff). To restart we will need to rerun the cluster step associated with that E2E test.
Maybe we pull out the delete cluster part as well ? ( won't be testing on new setup on retry though)

@jpayne3506
Copy link
Contributor Author

won't be testing on new setup on retry though)

In another PR we could test using "counter" as a mechanism for people that are re-running tests often within the pipeline. Would need fine tuning, but I believe that deletion should be a priority as the speedups/optimizations are worthless if the cluster sits around for any additional time (possibly 15 days) than needed.

@vipul-21
Copy link
Contributor

won't be testing on new setup on retry though)

In another PR we could test using "counter" as a mechanism for people that are re-running tests often within the pipeline. Would need fine tuning, but I believe that deletion should be a priority as the speedups/optimizations are worthless if the cluster sits around for any additional time (possibly 15 days) than needed.

I agree completely that delete is priority 1. To clarify by “pulling out delete cluster part” i meant removing from respective e2e step template and into a separate template like creat cluster. (And not move out of pipeline)

@jpayne3506 jpayne3506 force-pushed the jpayne3506/prspeedup branch from c519c45 to 702bb07 Compare July 26, 2023 18:48
@jpayne3506 jpayne3506 requested a review from tamilmani1989 July 26, 2023 19:02

- template: singletenancy/cilium-overlay/cilium-overlay-e2e-job-template.yaml
parameters:
name: "cilium_overlay_cilium_e2e"
displayName: Cilium on AKS Overlay
pipelineBuildImage: "$(BUILD_IMAGE)"
testDropgz: ""
clusterType: overlay-no-kube-proxy-up
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cilium-overlay-up


- template: singletenancy/aks-swift/e2e-job-template.yaml
parameters:
name: "aks_swift_e2e"
displayName: AKS Swift Ubuntu
pipelineBuildImage: "$(BUILD_IMAGE)"
testDropgz: ""
clusterType: byocni-up
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

swift-podsubnet-up

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can go ahead and rename the first two to cilium-podsubnet-up and cilium-overlay-up. I think we could keep this one as byocni-up, because swift-podsubnet-up might be misleading. We have an existing target called swift-up which brings up a swift cluster.

If you don't want to keep byocni-up -- Maybe we could go with swift-byocni-up, to clearly indicate that it's byocni but just need to make sure it matches in all other places.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

swift-byocni-up makes sense

@@ -158,6 +149,9 @@ steps:
- script: |
echo "validate pod IP assignment and check systemd-networkd restart"
kubectl get pod -owide -A
# Deleting echo-external-node deployment until cilium version matches TODO. https://github.com/cilium/cilium-cli/issues/67 is addressing the change.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what this test is for?

Copy link
Contributor Author

@jpayne3506 jpayne3506 Jul 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It tries to echo (curl in this case) to an external, non-cilium, node. The deployment is stuck in pending due to its node selector looking for "cilium.io/no-schedule": "true". We cannot skip the test at this point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for now its fine, we should contribute to cilium add skip check to this test case @vipul-21

@jpayne3506 jpayne3506 force-pushed the jpayne3506/prspeedup branch 17 times, most recently from 7140910 to bd0e76d Compare July 31, 2023 19:18
@jpayne3506 jpayne3506 force-pushed the jpayne3506/prspeedup branch from 8c943e7 to 52d659a Compare August 11, 2023 18:39
@jpayne3506 jpayne3506 force-pushed the jpayne3506/prspeedup branch from 52d659a to e2396eb Compare August 14, 2023 02:49
@jpayne3506 jpayne3506 enabled auto-merge (squash) August 14, 2023 02:49
@jpayne3506 jpayne3506 merged commit 2740187 into master Aug 14, 2023
@jpayne3506 jpayne3506 deleted the jpayne3506/prspeedup branch August 14, 2023 17:08
@rbtr
Copy link
Contributor

rbtr commented Sep 12, 2023

I just discovered this change because I was not getting the clusters I was expecting out of the hack/Makefile anymore.

@jpayne3506 it is incredibly confusing and misleading to have overlay-cilium-up and cilium-overlay-up, and swift-cilium-up and cilium-podsubnet-up.
Especially since these new targets are not even Cilium (--network-plugin none, no --network-dataplane arg) and are simply BYO CNI with the addition of the kube-proxy config.
You probably could have modified the existing BYO CNI targets to add the kube-proxy flag if an env var was set. Short of that, these must be clearly named as "byocni-nokubeproxy" flavors of the existing targets. Lmk if you want to fix this, otherwise we can revert it.
cc @camrynl and @vipul-21 since you approved

@vipul-21
Copy link
Contributor

@rbtr we did raise the same questions during the PR review: #2077 (comment), but afaik @tamilmani1989 wanted to use the cilium name.

@rbtr
Copy link
Contributor

rbtr commented Sep 12, 2023

cilium in the target name is lying, and you should push back on that. (TM was okay with "byocni".)

But again, we might not have needed separate make targets for this anyway since the difference is just the addition of the kube-proxy flag. This is the kind of modification that can be done off of an env var

@camrynl
Copy link
Contributor

camrynl commented Sep 12, 2023

@jpayne3506 is going to update the makefile with more explicit names for each target

@rbtr
Copy link
Contributor

rbtr commented Sep 12, 2023

thanks!

jpayne3506 added a commit that referenced this pull request Feb 27, 2024
* Move Create Cluster

* Lower Windows scaleup

* Delete echo-external-node for Cilium validate

* Change cluster creation from job to stage

* Decrease k8utils retry delay time

* Adjust submodule pipeline

* Remove test comments

* Addressing Comments

* Add delete cluster template

* Single Stage Delete

* ACN merge commitID fix

* Addressing Comments

* Refactor dualstack from #2098
jpayne3506 added a commit that referenced this pull request Feb 27, 2024
* Move Create Cluster

* Lower Windows scaleup

* Delete echo-external-node for Cilium validate

* Change cluster creation from job to stage

* Decrease k8utils retry delay time

* Adjust submodule pipeline

* Remove test comments

* Addressing Comments

* Add delete cluster template

* Single Stage Delete

* ACN merge commitID fix

* Addressing Comments

* Refactor dualstack from #2098
jpayne3506 added a commit that referenced this pull request Mar 1, 2024
* ci:[CNI] ACN PR Pipeline speed improvements (#2077)

* Move Create Cluster

* Lower Windows scaleup

* Delete echo-external-node for Cilium validate

* Change cluster creation from job to stage

* Decrease k8utils retry delay time

* Adjust submodule pipeline

* Remove test comments

* Addressing Comments

* Add delete cluster template

* Single Stage Delete

* ACN merge commitID fix

* Addressing Comments

* Refactor dualstack from #2098

* ci: Changing Service Connection for PR pipeline (#2153)

* ci: change service connection

* add: change ACN PR azureSubscription

* ci: Replace manual install of kubectl with ADO KubectlInstaller task (#2307)

* ci: remove kubectl install

* ci: add KubectlInstaller for kubectl

* ci: Add k8s conformance tests to Cilium CI/CD (#2348)

* ci: add k8s conformance

* ci: remove kubetest2 k8se2e

* chore: remove test comments

* ci: add k8se2e to nightly pipeline

* ci: Replace k8s conformance tests within aks-swift CI (#2590)

ci: k8se2e for aks-swift

* ci: v4overlay conformance test cases (#2274)

v4overlay conformance test cases

* fix: Match Cilium CLI to Cilium Agent (#2365)

* fix: Match Cilium CLI to Cilium Agent

* ci: Capture single digit versions

* ci: Update makefile cluster creation (#2358)

ci: Update aks cluster creation

---------

Co-authored-by: Paul Yu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci Infra or tooling. cni Related to CNI.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants