run regression tests in github action in kind #2997

Closed · kdorosh opened this issue May 13, 2020 · 6 comments · Fixed by #3066 or #3355

Labels: Type: Enhancement (New feature or request)

Comments

kdorosh (Contributor) commented May 13, 2020

Is your feature request related to a problem? Please describe.
Running the regression tests in kind allows for a clean slate for each CI run

Describe alternatives you've considered
Spin up a new GKE cluster for each CI run

Additional context
Would make CI a lot more resilient

kdorosh added the Type: Enhancement (New feature or request) label on May 13, 2020
ashleywang1 (Contributor) commented May 26, 2020

Regression tests are those that require the env variable RUN_KUBE2E_TESTS=1.

There are 2 kinds of regression tests found in the test/kube2e/... folder: those that require CLUSTER_LOCK_TESTS=1 (the cluster-lock tests), and those that don't.
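
For reference, a rough sketch of how each group can be invoked, assuming the suites are run directly with the ginkgo CLI (the repo's actual Make targets may wrap this differently):

    # cluster-lock regression tests (hypothetical direct invocation)
    RUN_KUBE2E_TESTS=1 CLUSTER_LOCK_TESTS=1 ginkgo -r test/kube2e

    # non-cluster-lock regression tests: same invocation without CLUSTER_LOCK_TESTS
    RUN_KUBE2E_TESTS=1 ginkgo -r test/kube2e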

Running the cluster-lock tests in a GitHub Action, along with setup, takes ~22m 9s (see: https://github.com/solo-io/gloo/runs/709923886).
Running the non-cluster-lock tests in GKE, along with setup and the normal tests, takes ~23m 22s (see: https://console.cloud.google.com/cloud-build/builds/37ff754e-cb38-4662-b97f-418b8c032614;tab=detail?project=solo-public).

There are 2 issues with running all regression tests in the GitHub Action:
1 - The GitHub Action appears to be slower: without the non-cluster-lock regression tests, the GKE build finishes in ~15m 29s (https://console.cloud.google.com/cloud-build/builds/0d2a38e8-ad25-4fc4-8bf0-9ba94c27396a;tab=detail?project=solo-public).
2 - The following error appears when running the non-cluster-lock tests in the GitHub Action but not in GKE:

{"level":"error","ts":1590173021.9420695,"logger":"gateway.v1.event_loop.gateway.gateway-validation-webhook","caller":"k8sadmisssion/validating_admission_webhook.go:262","msg":"Validation failed: Validating v1.VirtualService failed: validating *v1.VirtualService {echo-vs gateway-test-2862-1}: failed to communicate with Gloo Proxy validation server: All attempts fail:\n...}

Attempts were made to reproduce this error locally with https://github.com/nektos/act, but we use the setup-helm and setup-kind GitHub actions, which act does not handle well because they need the unzip and docker binaries.
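
One possible workaround (untested here, so treat it as an assumption): point act at a runner image that already ships the docker and unzip binaries. The image name below is purely illustrative:

    # -P maps the "ubuntu-latest" platform to a custom runner image
    act pull_request -P ubuntu-latest=catthehacker/ubuntu:act-latest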

If the end goal here is to improve CI time, we should:

  • run the cluster-lock regression tests in a Github Action
  • run the normal tests and the non-cluster-lock regression tests in GKE.

ashleywang1 (Contributor) commented May 27, 2020

I'll leave this issue open, as we still want to figure out how to run the non-cluster-lock regression tests in a GitHub Action.

Here is another example with extra outputs for the regression test failure: https://github.com/solo-io/gloo/pull/3066/checks?check_run_id=711013005

The reproducible error looks like:

{"level":"error","ts":1590173021.9420695,"logger":"gateway.v1.event_loop.gateway.gateway-validation-webhook","caller":"k8sadmisssion/validating_admission_webhook.go:262","msg":"Validation failed: Validating v1.VirtualService failed: validating *v1.VirtualService {echo-vs gateway-test-2862-1}: failed to communicate with Gloo Proxy validation server: All attempts fail:\n...}

One possible cause is that gloo has an internal error:

LOGS FROM gateway-test-1697-1.gloo-xx:
{"level":"error","ts":1590531877.6860743,"logger":"gloo.v1.event_loop.gloo","caller":"syncer/setup_syncer.go:570","msg":"err in metrics server","version":"kind","error":"context canceled","stacktrace":"github.com/solo-io/gloo/projects/gloo/pkg/syncer.RunGlooWithExtensions.func6\n\t/home/runner/work/gloo/gloo/projects/gloo/pkg/syncer/setup_syncer.go:570"} 

EItanya (Contributor) commented May 27, 2020

Does this error also occur every time when running locally in kind, or only in GitHub Actions, @ashleywang1?

ashleywang1 (Contributor) commented

I'm also really concerned about this flake, seen in a GitHub Action that only runs the cluster-lock regression tests:

It ran for 45 minutes, failed, and didn't output any logs: https://github.com/solo-io/gloo/pull/3066/checks?check_run_id=713448708

ashleywang1 (Contributor) commented May 27, 2020

The error with the regression tests turned out to be this:

      creating kube resource echo-vs: admission webhook "gateway.gateway-test-7231-1.svc" denied the request: Validating v1.VirtualService failed: validating *v1.VirtualService {echo-vs gateway-test-7231-1}: failed to communicate with Gloo Proxy validation server: All attempts fail:
      #1: All attempts fail:
      #1: rpc error: code = Internal desc = grpc: error while marshaling: proto: core.ResourceRef does not implement Marshal
      #2: All attempts fail:

Here is where the error comes from: https://github.com/grpc/grpc-go/blob/6b9bf4296edc5fae722a5dff887a954ffc599b12/rpc_util.go#L547

After talking with @EItanya, it seemed that we were "somehow using the wrong marshaller or the wrong version of the correct marshaller", specifically the gogo proto Go library.

The reason is that we were running the following setup step:

make update-deps
go get github.com/onsi/ginkgo/ginkgo@latest
./ci/kind.sh

We noticed a diff in go.mod and go.sum that included protobuf dependencies:

+github.com/golang/protobuf v1.4.0-rc.1/go.mod h1:ceaxUfeHdC40wWswd/P6IGgMaK3YpKi5j83Wpe3EHw8=
+github.com/golang/protobuf v1.4.0-rc.1.0.20200221234624-67d41d38c208/go.mod h1:xKAWHe0F5eneWXFV3EuXVDTCmh+JuBKY0li0aMyXATA=
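
A hedged way to confirm what the build actually resolves after a setup step like that, using standard Go module commands (module paths taken from the diff above and the gogo library mentioned earlier):

    # print the resolved versions of both protobuf libraries
    go list -m github.com/golang/protobuf github.com/gogo/protobuf
    # explain which dependency chain pulls in golang/protobuf
    go mod why -m github.com/golang/protobuf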

Changing the setup step to the following fixed the issue (with GO111MODULE=off, go get installs ginkgo in GOPATH mode and leaves go.mod and go.sum untouched):

./ci/kind.sh
GO111MODULE=off go get -u github.com/onsi/ginkgo/ginkgo

kdorosh (Contributor, Author) commented May 29, 2020

Reopening, as we are moving to kind in stages. Right now the cluster-lock tests are the only ones running in kind.
