E2E test restarting Autoscaler and Activator #2170

Closed
wants to merge 16 commits

Conversation

josephburnett
Contributor

Fixes #2080

Proposed Changes

  • Add an E2E test which restarts the Autoscaler and verifies scale-from-zero still works.
  • Same for the Activator.
  • Add a controller check in DiagnoseMe to verify the controllers are running.

Note: this will pass once #2159 is checked in.

Release Note

NONE

@googlebot

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

@knative-prow-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 5, 2018
Contributor

@knative-prow-robot left a comment


@josephburnett: 2 warnings.

In response to this:

Fixes #2080

Proposed Changes

  • Add an E2E test which restarts the Autoscaler and verifies scale-from-zero still works.
  • Same for the Activator.
  • Add a controller check in DiagnoseMe to verify the controllers are running.

Note: this will pass once #2159 is checked in.

Release Note

NONE

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

}

if successfulRequests != totalRequests {
return fmt.Errorf("Error making requests for scale up. Got %d successful requests. Wanted %d.",
Contributor


Golint errors: error strings should not be capitalized or end with punctuation or a newline. More info.
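For instance, a golint-clean version of that error string (a sketch of the suggested fix, not the exact wording the PR ended up with) would be:

// golint-friendly: lower-case start, no trailing punctuation.
return fmt.Errorf("error making requests for scale up: got %d successful requests, wanted %d",
	successfulRequests, totalRequests)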


ctx.logger.Infof("Waiting for all requests to complete.")
if err := group.Wait(); err != nil {
return fmt.Errorf("Error making requests for scale up: %v.", err)
Contributor


Golint errors: error strings should not be capitalized or end with punctuation or a newline. More info.

@josephburnett
Contributor Author

/hold

Waiting for #2159 to be submitted.

@knative-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 5, 2018
@knative-metrics-robot

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File                                                 Old Coverage   New Coverage   Delta
pkg/activator/revision.go                            82.5%          82.1%          -0.4
pkg/reconciler/v1alpha1/autoscaling/kpa_scaler.go    68.0%          79.5%          11.5

@knative-prow-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 8, 2018
@josephburnett
Contributor Author

/retest

Contributor

@markusthoemmes left a comment


Two kinda small comments; I'm a little concerned about the Sleep call.

Great overall though, I like the readability of the tests themselves! Will be great to have those.

if err != nil {
logger.Fatalf("Error during initial scale up: %v", err)
ctx.logger.Fatalf("Error during initial scale up: %v", err)
Contributor


Should this be ctx.t.Fatalf? Otherwise this won't show up as a failed test, will it? Applies to the next occurrence as well (line 159).

Contributor Author


Yes, that should be t.Fatal.
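A minimal sketch of that fix, assuming testContext exposes the *testing.T as ctx.t (the field name is an assumption):

if err != nil {
	// Fail through the testing.T so the failure is attributed to the test run.
	ctx.t.Fatalf("Error during initial scale up: %v", err)
}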

}
ctx.logger.Infof("Deployment %q has been bounced.", name)
time.Sleep(10 * time.Second)
Contributor


Should we wait for the pod to become ready? What else is this sleep waiting for?
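For reference, waiting on deployment availability with client-go could look roughly like the sketch below; the helper name, namespace handling, and polling values are illustrative and not taken from the PR.

package e2e

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDeploymentAvailable polls until the named deployment reports at
// least one available replica, or the timeout expires.
func waitForDeploymentAvailable(client kubernetes.Interface, namespace, name string) error {
	return wait.PollImmediate(time.Second, 2*time.Minute, func() (bool, error) {
		// Newer client-go versions take a context on Get; older ones omit it.
		dep, err := client.AppsV1().Deployments(namespace).Get(context.Background(), name, metav1.GetOptions{})
		if err != nil {
			return false, nil // keep polling through transient errors
		}
		return dep.Status.AvailableReplicas > 0, nil
	})
}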

Contributor Author


In the case of the autoscaler, allowing for the pods to reconnect. That's not represented in pod readiness, so I just sleep.

Contributor


Can you add a comment here on why there is a sleep? Also, if there is a better solution or suggestion, we should create an issue and link it here to remove the sleep.

Contributor Author


I've tested this again without the sleep, and it passes. So I've just removed this.


logger.Infof("All %d requests succeeded.", totalRequests)
return nil
type testContext struct {
Contributor


Do you think this is only applicable to the autoscale test or we should move it to common test/e2e.go?

Contributor Author


That's from #2164. Now that it's submitted, I'll merge and remove the WIP label.

@srinivashegde86
Contributor

The changes seem similar to #2164

Are you waiting on one of them to merge before the other?

@knative-prow-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 17, 2018
@josephburnett
Contributor Author

@srinivashegde86 and @markusthoemmes, I've removed the part you were concerned about. Turns out it isn't necessary.

@srinivashegde86
Contributor

/approve

Changes LGTM. I will let Markus add the lgtm label once he takes a look.

@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: josephburnett, srinivashegde86

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 17, 2018
@josephburnett
Contributor Author

Looks like a TestBuildAndServe flake.

@josephburnett
Contributor Author

/test pull-knative-serving-integration-tests

@josephburnett
Contributor Author

Now TestHelloWorld failed. I don't think I broke it that badly!

@bbrowning
Contributor

bbrowning commented Oct 17, 2018

If you look at the raw build log - for example https://storage.googleapis.com/knative-prow/pr-logs/pull/knative_serving/2170/pull-knative-serving-integration-tests/1052657354991996928/build-log.txt - you'll see "panic: test timed out after 20m0s". It looks like with the new tests, the e2e tests are just taking longer than 20 minutes. You can bump this to a higher number at:

-v -tags=e2e -count=1 -timeout=20m \

@josephburnett
Contributor Author

Passes with 40 minute timeout. I know that's a big bump, but I don't want to adjust the cluster-wide scale-to-zero threshold and I'm doing three tests that require scaling to zero. I can probably improve that by adding annotations to override scale-to-zero values, but that's a bit much for this pull request.

@markusthoemmes and @srinivashegde86 could you take a look and LGTM?

logger.Errorf("Unable to get %v deployment: %v", name, err)
continue
}
if dep.Status.AvailableReplicas != 1 {
Contributor


This will need some adjustment in #2171. @mattmoor fyi

Member


activator is not a controller, neither is webhook. Can we just remove this?
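For context on the check being discussed, here is a rough sketch of the kind of availability probe the PR adds to DiagnoseMe; the namespace, deployment names, and logger interface are assumptions rather than code copied from the PR.

package e2e

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// errorLogger stands in for whatever logger DiagnoseMe uses; only Errorf is needed here.
type errorLogger interface {
	Errorf(format string, args ...interface{})
}

// checkControllers reports any system deployment that does not have exactly one available replica.
func checkControllers(client kubernetes.Interface, logger errorLogger) {
	for _, name := range []string{"controller", "autoscaler", "activator"} {
		dep, err := client.AppsV1().Deployments("knative-serving").Get(context.Background(), name, metav1.GetOptions{})
		if err != nil {
			logger.Errorf("Unable to get %v deployment: %v", name, err)
			continue
		}
		if dep.Status.AvailableReplicas != 1 {
			logger.Errorf("Deployment %v has %d available replicas, want 1", name, dep.Status.AvailableReplicas)
		}
	}
}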

@markusthoemmes
Contributor

/lgtm

@markusthoemmes
Contributor

CLA-wise, I guess you'll have to rebase out my old commits?

@@ -81,7 +81,7 @@ kubectl create namespace serving-tests
options=""
(( EMIT_METRICS )) && options="-emitmetrics"
report_go_test \
-v -tags=e2e -count=1 -timeout=20m \
-v -tags=e2e -count=1 -timeout=40m \
Contributor


We need to split these tests. 30+ minutes for the whole test suite is starting to hurt productivity.

Contributor Author

@josephburnett Oct 24, 2018


The controller reboot tests cannot run in parallel with other tests. But most of the other tests can. Should we separate the longer tests from the short ones?

Alternatively, we can implement controllers for each namespace (#2107) and run tests in different namespaces.

Contributor Author


@evankanderson, @mattmoor, @vaikas-google what do you think about running separate controllers in each namespace (configurable)?

Member


That is a long lead time (longer than we should wait).

@adrcunha what is your recommendation for splitting things up?

Contributor


The controller reboot tests cannot run in parallel with other tests. But most of the other tests can. Should we separate the longer tests from the short ones?

Don't the conformance and e2e tests already run in parallel on the same cluster? Or am I misremembering the current test setup?

Contributor


I think you're right @bbrowning, the "DiagnoseMe" output actually proves that there are occasionally tests running in parallel to the AutoscaleUpDown test.

Contributor Author


@adrcunha, what should we do here? Split out the reboot tests that have to run serially?

Contributor


Quickest fix is to run the reboot tests after the other e2e/conformance tests. But I feel that we actually need to split our e2e test job to avoid the 30+ minute run time.

Contributor


I added the required infra to run these tests separately from the e2e job, and the upgrade tests are already using it.

@bbrowning
Contributor

Passes with 40 minute timeout. I know that's a big bump, but I don't want to adjust the cluster-wide scale-to-zero threshold and I'm doing three tests that require scaling to zero. I can probably improve that by adding annotations to override scale-to-zero values, but that's a bit much for this pull request.

Actually, is there any downside to lowering the cluster-wide scale-to-zero threshold for all tests to something less than 5 minutes? We don't want to lower it during a test, but is there a downside to lowering it before starting the tests?

@markusthoemmes
Contributor

/hold

Putting in a hold so this doesn't accidentally slip in.

@knative-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 31, 2018
@mattmoor
Member

@bbrowning We used to do precisely this and it made all of the tests flakier. We could consider revisiting this as I can think of a number of sources of potential flakes that have been squashed lately.

cc @josephburnett @adrcunha thoughts?

@josephburnett
Contributor Author

josephburnett commented Oct 31, 2018

We could consider revisiting this as I can think of a number of sources of potential flakes that have been squashed lately. @mattmoor

Before we go back to modifying the config map, we should fix #2155 which was the source of some BlueGreen failures (@tanzeeb). But after that, yes, let's revisit config map modification.

@markusthoemmes
Contributor

Can we revive this now that our tests run in sequence?

@adrcunha
Contributor

adrcunha commented Feb 8, 2019

Better to put them in a separate job; I'm afraid they'll make the integration test runtime too long.

@mattmoor
Member

mattmoor commented May 6, 2019

Closing this stale PR; reopen if still interesting.

@mattmoor closed this May 6, 2019
Labels
  • approved Indicates a PR has been approved by an approver from all required OWNERS files.
  • do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.
  • lgtm Indicates that a PR is ready to be merged.
  • size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Scale up from 0 doesn't happen after upgrades
9 participants