E2E test restarting Autoscaler and Activator #2170

Closed
wants to merge 16 commits

Conversation

josephburnett
Contributor

Fixes #2080

Proposed Changes

  • Add an E2E test which restarts the Autoscaler and verifies scale-from-zero still works.
  • Same for the Activator.
  • Add a controller check in DiagnoseMe to verify the controllers are running.

Note: this will pass once #2159 is checked in.

Release Note

NONE

@googlebot

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

@knative-prow-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 5, 2018
Contributor

@knative-prow-robot left a comment


@josephburnett: 2 warnings.

In response to this:

Fixes #2080

Proposed Changes

  • Add an E2E test which restarts the Autoscaler and verifies scale-from-zero still works.
  • Same for the Activator.
  • Add a controller check in DiagnoseMe to verify the controllers are running.

Note: this will pass once #2159 is checked in.

Release Note

NONE

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

}

if successfulRequests != totalRequests {
return fmt.Errorf("Error making requests for scale up. Got %d successful requests. Wanted %d.",
Contributor


Golint errors: error strings should not be capitalized or end with punctuation or a newline. More info.
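For instance, a golint-clean version of that error string (a sketch of the suggested fix, not the exact wording the PR ended up with) would be:

// golint-friendly: lower-case start, no trailing punctuation.
return fmt.Errorf("error making requests for scale up: got %d successful requests, wanted %d",
	successfulRequests, totalRequests)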


ctx.logger.Infof("Waiting for all requests to complete.")
if err := group.Wait(); err != nil {
return fmt.Errorf("Error making requests for scale up: %v.", err)
Contributor


Golint errors: error strings should not be capitalized or end with punctuation or a newline. More info.

@josephburnett
Contributor Author

/hold

Waiting for #2159 to be submitted.

@knative-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 5, 2018
@knative-metrics-robot

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File                                                 Old Coverage   New Coverage   Delta
pkg/activator/revision.go                            82.5%          82.1%          -0.4
pkg/reconciler/v1alpha1/autoscaling/kpa_scaler.go    68.0%          79.5%          11.5

@knative-prow-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 8, 2018
@josephburnett
Contributor Author

/retest

Contributor

@markusthoemmes left a comment


Two kinda small comments; I'm a little concerned about the Sleep call.

Great overall though, I like the readability of the tests themselves! Will be great to have those.

if err != nil {
logger.Fatalf("Error during initial scale up: %v", err)
ctx.logger.Fatalf("Error during initial scale up: %v", err)
Contributor


Should this be ctx.t.Fatalf? Otherwise this won't show up as a failed test, will it? Applies to the next occurrence as well (line 159).

Contributor Author


Yes, that should be t.Fatal.
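A minimal sketch of that fix, assuming testContext exposes the *testing.T as ctx.t (the field name is an assumption):

if err != nil {
	// Fail through the testing.T so the failure is attributed to the test run.
	ctx.t.Fatalf("Error during initial scale up: %v", err)
}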

}
ctx.logger.Infof("Deployment %q has been bounced.", name)
time.Sleep(10 * time.Second)
Contributor


Should we wait for the pod to become ready? What else is this sleep waiting for?
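For reference, waiting on deployment availability with client-go could look roughly like the sketch below; the helper name, namespace handling, and polling values are illustrative and not taken from the PR.

package e2e

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDeploymentAvailable polls until the named deployment reports at
// least one available replica, or the timeout expires.
func waitForDeploymentAvailable(client kubernetes.Interface, namespace, name string) error {
	return wait.PollImmediate(time.Second, 2*time.Minute, func() (bool, error) {
		// Newer client-go versions take a context on Get; older ones omit it.
		dep, err := client.AppsV1().Deployments(namespace).Get(context.Background(), name, metav1.GetOptions{})
		if err != nil {
			return false, nil // keep polling through transient errors
		}
		return dep.Status.AvailableReplicas > 0, nil
	})
}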

Contributor Author


In the case of the autoscaler, allowing for the pods to reconnect. That's not represented in pod readiness, so I just sleep.

Contributor


Can you add a comment here on why there is a sleep? Also, if there is a better solution or suggestion, we should create an issue and link it here to remove the sleep.

Contributor Author


I've tested this again without the sleep, and it passes. So I've just removed this.


logger.Infof("All %d requests succeeded.", totalRequests)
return nil
type testContext struct {
Contributor


Do you think this is only applicable to the autoscale test or we should move it to common test/e2e.go?

Contributor Author


That's from #2164. Now that it's submitted, I'll merge and remove the WIP label.

@srinivashegde86
Contributor

The changes seem similar to #2164

Are you waiting on one of them to merge before the other?

@knative-prow-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 17, 2018
@josephburnett
Contributor Author

@srinivashegde86 and @markusthoemmes, I've removed the part you were concerned about. Turns out it isn't necessary.

@srinivashegde86
Contributor

/approve

Changes LGTM. I will let Markus add the lgtm label once he takes a look.

@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: josephburnett, srinivashegde86

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 17, 2018
@josephburnett
Contributor Author

Looks like a TestBuildAndServe flake.

@josephburnett
Contributor Author

/test pull-knative-serving-integration-tests

@josephburnett
Contributor Author

Now TestHelloWorld failed. I don't think I broke it that badly!

@bbrowning
Contributor

bbrowning commented Oct 17, 2018

If you look at the raw build log - for example https://storage.googleapis.com/knative-prow/pr-logs/pull/knative_serving/2170/pull-knative-serving-integration-tests/1052657354991996928/build-log.txt - you'll see "panic: test timed out after 20m0s". It looks like with the new tests, the e2e tests are just taking longer than 20 minutes. You can bump this to a higher number at:

-v -tags=e2e -count=1 -timeout=20m \

@josephburnett
Contributor Author

Passes with 40 minute timeout. I know that's a big bump, but I don't want to adjust the cluster-wide scale-to-zero threshold and I'm doing three tests that require scaling to zero. I can probably improve that by adding annotations to override scale-to-zero values, but that's a bit much for this pull request.

@markusthoemmes and @srinivashegde86 could you take a look and LGTM?

logger.Errorf("Unable to get %v deployment: %v", name, err)
continue
}
if dep.Status.AvailableReplicas != 1 {
Contributor


This will need some adjustment in #2171. @mattmoor fyi

Member


activator is not a controller, neither is webhook. Can we just remove this?
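For context on the check being discussed, here is a rough sketch of the kind of availability probe the PR adds to DiagnoseMe; the namespace, deployment names, and logger interface are assumptions rather than code copied from the PR.

package e2e

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// errorLogger stands in for whatever logger DiagnoseMe uses; only Errorf is needed here.
type errorLogger interface {
	Errorf(format string, args ...interface{})
}

// checkControllers reports any system deployment that does not have exactly one available replica.
func checkControllers(client kubernetes.Interface, logger errorLogger) {
	for _, name := range []string{"controller", "autoscaler", "activator"} {
		dep, err := client.AppsV1().Deployments("knative-serving").Get(context.Background(), name, metav1.GetOptions{})
		if err != nil {
			logger.Errorf("Unable to get %v deployment: %v", name, err)
			continue
		}
		if dep.Status.AvailableReplicas != 1 {
			logger.Errorf("Deployment %v has %d available replicas, want 1", name, dep.Status.AvailableReplicas)
		}
	}
}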

@markusthoemmes
Contributor

/lgtm

@markusthoemmes
Contributor

CLA-wise, I guess you'll have to rebase out my old commits?

@@ -81,7 +81,7 @@ kubectl create namespace serving-tests
options=""
(( EMIT_METRICS )) && options="-emitmetrics"
report_go_test \
-v -tags=e2e -count=1 -timeout=20m \
-v -tags=e2e -count=1 -timeout=40m \
Contributor


We need to split these tests. 30+ minutes for the whole test suite is starting to hurt productivity.

Contributor Author

@josephburnett Oct 24, 2018


The controller reboot tests cannot run in parallel with other tests. But most of the other tests can. Should we separate the longer tests from the short ones?

Alternatively, we can implement controllers for each namespace (#2107) and run tests in different namespaces.

Contributor Author


@evankanderson, @mattmoor, @vaikas-google what do you think about running separate controllers in each namespace (configurable)?

Member


That is a long lead time (longer than we should wait).

@adrcunha what is your recommendation for splitting things up?

Contributor


The controller reboot tests cannot run in parallel with other tests. But most of the other tests can. Should we separate the longer tests from the short ones?

Don't the conformance and e2e tests already run in parallel on the same cluster? Or am I misremembering the current test setup?

Contributor


I think you're right @bbrowning, the "DiagnoseMe" output actually proves that there are occasionally tests running in parallel to the AutoscaleUpDown test.

Contributor Author


@adrcunha, what should we do here? Split out the reboot tests that have to run serially?

Contributor


Quickest fix is to run the reboot tests after the other e2e/conformance tests. But I feel that we actually need to split our e2e test job to avoid the 30+ minute run time.

Contributor


I added the required infra to run these tests separately from the e2e job, and the upgrade tests are already using it.

@bbrowning
Contributor

Passes with 40 minute timeout. I know that's a big bump, but I don't want to adjust the cluster-wide scale-to-zero threshold and I'm doing three tests that require scaling to zero. I can probably improve that by adding annotations to override scale-to-zero values, but that's a bit much for this pull request.

Actually, is there any downside to lowering the cluster-wide scale-to-zero threshold for all tests to something less than 5 minutes? We don't want to lower it during a test, but is there a downside to lowering it before starting the tests?

@markusthoemmes
Contributor

/hold

Putting in a hold so this doesn't accidentally slip in.

@knative-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 31, 2018
@mattmoor
Member

@bbrowning We used to do precisely this and it made all of the tests flakier. We could consider revisiting this as I can think of a number of sources of potential flakes that have been squashed lately.

cc @josephburnett @adrcunha thoughts?

@josephburnett
Contributor Author

josephburnett commented Oct 31, 2018

We could consider revisiting this as I can think of a number of sources of potential flakes that have been squashed lately. @mattmoor

Before we go back to modifying the config map, we should fix #2155 which was the source of some BlueGreen failures (@tanzeeb). But after that, yes, let's revisit config map modification.

@markusthoemmes
Contributor

Can we revive this now that our tests run in sequence?

@adrcunha
Contributor

adrcunha commented Feb 8, 2019

Better to put them in a separate job; I'm afraid they'll make the integration test runtime too long.

@mattmoor
Member

mattmoor commented May 6, 2019

Closing this stale PR; reopen if still interesting.

@mattmoor closed this May 6, 2019
Labels
  • approved Indicates a PR has been approved by an approver from all required OWNERS files.
  • do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.
  • lgtm Indicates that a PR is ready to be merged.
  • size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Scale up from 0 doesn't happen after upgrades
9 participants