release-21.1: roachprod: improve the way cockroach is run #64641

Merged
tbg merged 3 commits into cockroachdb:release-21.1 from the backport21.1-64177-64436-64560 branch
May 14, 2021

@tbg tbg commented May 4, 2021

To be merged only post-21.1.0.

Closes #64967.


Backport:

Please see individual PRs for details.

/cc @cockroachdb/release

RaduBerinde and others added 3 commits May 4, 2021 10:41
While running some stress tests with TPCH, I observed two big problems
with roachprod:
 - the out-of-memory behavior is very bad: instead of the process
   being killed, the system enters a thrashing mode where everything
   in the VM slows to a crawl (to the point where just sshing in can
   take minutes).
 - when the cockroach process exits, the exit code is not recorded
   anywhere, making it impossible in some cases to figure out why it
   stopped. In my particular case, we were exiting with exit code 8
   (which is `exit.TimeoutAfterFatalError()`) because writing to the
   logs was unacceptably slow.

This commit attempts to improve things on both these fronts. Instead
of running with `--background`, we use `systemd-run` to run cockroach
as a service unit. This has several advantages:
 - we have much better monitoring infrastructure via
   `systemctl status cockroach`
 - we can now run code after the exit, allowing us to record it in
   various logs.
 - we can set a strict cgroups memory limit (set to `95%`) so that the
   process gets oom-killed before the system starts to thrash.
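
The approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual roachprod start script: the helper names `build_unit_cmd` and `run_and_record`, the wrapper script path, and the log file are hypothetical. `systemd-run --unit`, `--same-dir`, and `-p MemoryMax=` are existing systemd options, but the exact set of flags roachprod passes is an assumption here.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print (not execute) a systemd-run command line that
# starts a wrapper script as a transient service unit with a memory cap.
build_unit_cmd() {
  local unit="$1" mem_max="$2" wrapper="$3"
  printf 'sudo systemd-run --unit=%s --same-dir -p MemoryMax=%s bash %s\n' \
    "$unit" "$mem_max" "$wrapper"
}

# Hypothetical wrapper body: run the given command in the foreground, then
# append its exit code to a log so the reason for the exit is recorded.
run_and_record() {
  local log="$1"
  shift
  local code=0
  "$@" || code=$?
  echo "process exited with code ${code}" >> "$log"
  return "$code"
}
```

For instance, `build_unit_cmd cockroach 95% ./start.sh` prints a command that would run `./start.sh` as the `cockroach` unit, after which `systemctl status cockroach` can be used to inspect it. Inside the wrapper, `run_and_record` shows why running under a unit helps: the process stays in the foreground, so code can run after it exits.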

As part of the commit, we also print out information about the status
of cockroach when logging in.

Fixes cockroachdb#64176.

Release note: None

Upgrade the VM image for AWS roachprod VMs from Ubuntu 16.04 to
20.04. This fixes problems recently introduced in the start command.

Steps, recorded here in case someone in the future wants to do the
same and looks at git history:
  1. Found a new image using the AWS web console, under AMIs.
  2. Modified the image_name in `vm/aws/terraform/aws-region/main.tf`.
  3. Installed terraform 0.11; inside `vm/aws/terraform` I ran:
    - `terraform init`
    - `terraform apply`
    - `terraform output --json > ../config.json`
  4. Regenerated `embedded.go` using `make generate PKG=./pkg/cmd/roachprod/...`
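
Steps 3 and 4 can be collected into a small helper, sketched here under stated assumptions: the function name `print_regen_steps` is hypothetical, the terraform directory is passed in as an argument rather than guessed, and the commands are assumed to be run from the repository root. The helper only prints the commands so they can be reviewed before anything is executed.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the regeneration commands from the steps above.
# $1 is the path to the terraform directory (vm/aws/terraform in the steps).
print_regen_steps() {
  local tf_dir="$1"
  # Subshell so the directory change does not affect the later make call.
  printf "(cd '%s' && terraform init && terraform apply && terraform output --json > ../config.json)\n" "$tf_dir"
  # Regenerate embedded.go from the updated config.json.
  printf 'make generate PKG=./pkg/cmd/roachprod/...\n'
}
```

Calling `print_regen_steps <path-to-terraform-dir>` prints two lines; note that `terraform apply` asks for confirmation, so the printed commands are meant to be run interactively.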

Release note: None

Among likely many other nightly failures, this:

Fixes cockroachdb#64457

Release note: None
@tbg tbg requested review from RaduBerinde, a team and otan and removed request for a team May 4, 2021 08:42
@cockroach-teamcity
Member

This change is Reviewable

@tbg
Member Author

tbg commented May 13, 2021

@RaduBerinde @nvanbenschoten @andreimatei are we aware of any problems caused by the systemd setup? I would like to merge this next week after spot checking the health of the master runs again (it's been quiet... maybe too quiet).

@RaduBerinde
Member

I'm not aware of any new problems.

@nvanbenschoten
Member

Neither am I.

@tbg tbg merged commit 7bf671c into cockroachdb:release-21.1 May 14, 2021
@tbg tbg deleted the backport21.1-64177-64436-64560 branch May 14, 2021 11:33
@tbg
Member Author

tbg commented May 14, 2021

YOLO it is then. I'll keep an eye out.
