E2E test restarting Autoscaler and Activator #2170
Conversation
So there's good news and bad news. 👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there. 😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request. Note to project maintainer: This is a terminal state, meaning the …
@josephburnett: 2 warnings.
In response to this:
Fixes #2080
Proposed Changes
- Add an E2E test which restarts the Autoscaler and verifies scale-from-zero still works.
- Same for the Activator.
- Add a controller check in DiagnoseMe to verify the controllers are running. Note: this will pass once #2159 is checked in.
Release Note
NONE
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
test/e2e/autoscale_test.go
Outdated
	}
	if successfulRequests != totalRequests {
		return fmt.Errorf("Error making requests for scale up. Got %d successful requests. Wanted %d.",
Golint errors: error strings should not be capitalized or end with punctuation or a newline. More info.
test/e2e/autoscale_test.go
Outdated
	ctx.logger.Infof("Waiting for all requests to complete.")
	if err := group.Wait(); err != nil {
		return fmt.Errorf("Error making requests for scale up: %v.", err)
Golint errors: error strings should not be capitalized or end with punctuation or a newline. More info.
/hold Waiting for #2159 to be submitted.
The following is the coverage report on pkg/.
/retest
Two kinda small comments, I'm a little concerned about the Sleep call.
Great overall though, I like the readability of the tests themselves! Will be great to have those.
test/e2e/autoscale_test.go
Outdated
	if err != nil {
-		logger.Fatalf("Error during initial scale up: %v", err)
+		ctx.logger.Fatalf("Error during initial scale up: %v", err)
Should this be ctx.t.Fatalf? Otherwise this won't show up as a failed test, will it? Applies to the next occurrence as well (line 159).
Yes, that should be t.Fatal.
test/e2e/autoscale_test.go
Outdated
	}
	ctx.logger.Infof("Deployment %q has been bounced.", name)
	time.Sleep(10 * time.Second)
Should we wait for the pod to become ready? What else is this sleep waiting for?
In the case of the autoscaler, allowing for the pods to reconnect. That's not represented in pod readiness, so I just sleep.
Can you add a comment here on why there is a sleep? Also, if there is a better solution or suggestion, we should create an issue and link it here so the sleep can be removed.
I've tested this again without the sleep, and it passes. So I've just removed this.
test/e2e/autoscale_test.go
Outdated
	logger.Infof("All %d requests succeeded.", totalRequests)
	return nil
type testContext struct {
Do you think this is only applicable to the autoscale test or we should move it to common test/e2e.go?
That's from #2164. Now that it's submitted, I'll merge and remove the WIP label.
The changes seem similar to #2164. Are you waiting on one of them to merge before the other?
@srinivashegde86 and @markusthoemmes, I've removed the part you were concerned about. Turns out it isn't necessary.
/approve Changes LGTM. I will let markus add the lgtm label once he takes a look.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: josephburnett, srinivashegde86 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing …
Looks like a TestBuildAndServe flake.
/test pull-knative-serving-integration-tests
Now TestHelloWorld failed. I don't think I broke it that badly!
If you look at the raw build log - for example https://storage.googleapis.com/knative-prow/pr-logs/pull/knative_serving/2170/pull-knative-serving-integration-tests/1052657354991996928/build-log.txt - you'll see Line 84 in d841ed1
Passes with a 40-minute timeout. I know that's a big bump, but I don't want to adjust the cluster-wide scale-to-zero threshold and I'm doing three tests that require scaling to zero. I can probably improve that by adding annotations to override scale-to-zero values, but that's a bit much for this pull request. @markusthoemmes and @srinivashegde86 could you take a look and LGTM?
		logger.Errorf("Unable to get %v deployment: %v", name, err)
		continue
	}
	if dep.Status.AvailableReplicas != 1 {
activator is not a controller, neither is webhook. Can we just remove this?
/lgtm
CLA-wise, I guess you'll have to rebase out my old commits?
@@ -81,7 +81,7 @@ kubectl create namespace serving-tests
 options=""
 (( EMIT_METRICS )) && options="-emitmetrics"
 report_go_test \
-  -v -tags=e2e -count=1 -timeout=20m \
+  -v -tags=e2e -count=1 -timeout=40m \
We need to split these tests. 30+ minutes for the whole test suite is starting to hurt productivity.
The controller reboot tests cannot run in parallel with other tests. But most of the other tests can. Should we separate the longer tests from the short ones?
Alternatively, we can implement controllers for each namespace: #2107 and run tests in different namespaces.
@evankanderson, @mattmoor, @vaikas-google what do you think about running separate controllers in each namespace (configurable)?
That is a long-lead change (longer than we should wait).
@adrcunha what is your recommendation for splitting things up?
The controller reboot tests cannot run in parallel with other tests. But most of the other tests can. Should we separate the longer tests from the short ones?
Don't the conformance and e2e tests already run in parallel on the same cluster? Or am I misremembering the current test setup?
I think you're right @bbrowning, the "DiagnoseMe" output actually proves that there are occasionally tests running in parallel to the AutoscaleUpDown test.
@adrcunha, what should we do here? Split out the reboot tests that have to run serially?
Quickest fix is to run the reboot tests after the other e2e/conformance tests. But I feel that we actually need to split our e2e test job to avoid the 30+ minute run time.
I added the required infra to run these tests separately from the e2e job, and the upgrade tests are already using it.
Actually, is there any downside to lowering the cluster-wide scale-to-zero threshold for all tests to something less than 5 minutes? We don't want to lower it during a test, but is there a downside to lowering it before starting the tests?
/hold Putting in a hold so this doesn't accidentally slip in.
@bbrowning We used to do precisely this and it made all of the tests flakier. We could consider revisiting this, as I can think of a number of sources of potential flakes that have been squashed lately. cc @josephburnett @adrcunha thoughts?
Before we go back to modifying the config map, we should fix #2155, which was the source of some BlueGreen failures (@tanzeeb). But after that, yes, let's revisit config map modification.
Can we revive this now that our tests run in sequence?
Better to put them in a separate job; I'm afraid they'll make the integration tests' runtime too long.
Closing this stale PR; reopen if still interesting.