Merge #62015 #62039
62015: cli: Add some more warning comments to unsafe-remove-dead-replicas r=knz a=bdarnell

The comments have always said this tool is meant to be used under the
supervision of a CRL engineer, but they didn't otherwise make the risks
and downsides clear. Add more explicit warnings, which can also serve
as guidance for the supervising engineer.

Release note: None

62039: roachtest: stabilize tpccbench/nodes=3/cpu=16 r=irfansharif a=irfansharif

Fixes #61973. With tracing, our top-line TPC-C performance took a hit.
Given that the TPC-C line searcher starts off at the estimated max,
we're now starting off in "overloaded" territory; this makes for a very
unhappy roachtest.

Ideally we'd have something like #62010, or even admission control, to
make this test less noisy. Until then, we can start off at a lower max
warehouse count.

This "fix" is still not a panacea: the entire tpccbench suite as written
tries to nudge the warehouse count up until efficiency drops below 85%.
Unfortunately, with our current infrastructure that's a stand-in for
"the point where nodes are overloaded and VMs are no longer reachable".
See #61974.
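
For a rough sense of the search shape, here is a sketch in Go; it is
not the actual tpccbench harness, and runTPCC, the step size, and the
efficiency numbers are made up for illustration. It starts at the
estimated max and nudges the warehouse count until efficiency falls
below 85%.

    package main

    import "fmt"

    // passingEfficiency mirrors the sub-85% cutoff described above.
    const passingEfficiency = 0.85

    // runTPCC is a made-up stand-in for running the workload at a given
    // warehouse count and measuring the resulting efficiency.
    func runTPCC(warehouses int) float64 {
        return 1900.0 / float64(warehouses) // placeholder measurement
    }

    // findMaxWarehouses starts at the estimated max and nudges the
    // warehouse count up (or down, if the starting point is already
    // overloaded) to bracket the point where efficiency dips below the
    // cutoff.
    func findMaxWarehouses(estimatedMax, step int) int {
        w := estimatedMax
        if runTPCC(w) < passingEfficiency {
            // Already in overloaded territory: walk down until we pass.
            for w > step && runTPCC(w) < passingEfficiency {
                w -= step
            }
            return w
        }
        // Starting point passes: walk up until the next step would fail.
        for runTPCC(w+step) >= passingEfficiency {
            w += step
        }
        return w
    }

    func main() {
        fmt.Println("max passing warehouses ~", findMaxWarehouses(2100, 100))
    }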

---

A longer-term approach to these tests could instead be as follows.
We could start our search at whatever the max warehouse count is (making
sure we've re-configured the max warehouses accordingly). These tests
could then PASS/FAIL at that given warehouse count and, only on FAIL,
capture the extent of the regression by probing lower warehouse counts.
This is in contrast to what we're doing today, where we capture how high
we can go (and by design risk going into overload territory, with no
protections against it).
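
Sketched in the same spirit (the efficiency-measuring run function is a
hypothetical stand-in, as above), that alternative might look like:

    // passFailThenProbe treats maxWarehouses as a PASS/FAIL threshold and
    // only on FAIL probes lower warehouse counts to report the extent of
    // the regression. run is a stand-in for measuring efficiency at a
    // given warehouse count.
    func passFailThenProbe(
        run func(warehouses int) float64, maxWarehouses, step int,
    ) (pass bool, achieved int) {
        const passingEfficiency = 0.85
        if run(maxWarehouses) >= passingEfficiency {
            return true, maxWarehouses
        }
        // FAIL: find the highest warehouse count that still passes.
        for w := maxWarehouses - step; w > 0; w -= step {
            if run(w) >= passingEfficiency {
                return false, w
            }
        }
        return false, 0
    }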

Doing so would let us use this test suite to capture regressions from a
given baseline, rather than hoping our roachperf dashboards capture
unexpected perf improvements (if they're expected, we should update max
warehouses accordingly). In the steady state, we'd want the roachperf
dashboards to be mostly flatlined, with step increases whenever we re-up
the max warehouse count to incorporate various system-wide performance
improvements.

Release note: None

Co-authored-by: Ben Darnell <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
3 people committed Mar 16, 2021
3 parents 44ab6bc + d29a7be + 72c96fa commit 86d6a45
Showing 2 changed files with 28 additions and 4 deletions.
30 changes: 27 additions & 3 deletions pkg/cli/debug.go
@@ -874,20 +874,44 @@ var debugUnsafeRemoveDeadReplicasCmd = &cobra.Command{
 This command is UNSAFE and should only be used with the supervision of
 a Cockroach Labs engineer. It is a last-resort option to recover data
 after multiple node failures. The recovered data is not guaranteed to
-be consistent.
+be consistent. If a suitable backup exists, restore it instead of
+using this tool.

 The --dead-store-ids flag takes a comma-separated list of dead store
 IDs and scans this store for any ranges whose only live replica is on
 this store. These range descriptors will be edited to forcibly remove
 the dead stores, allowing the range to recover from this single
 replica.

-This command will prompt for confirmation before committing its changes.
 It is safest to run this command while all nodes are stopped. In some
 circumstances it may be possible to run it while some nodes are still
 running provided all nodes containing replicas of nodes that have lost
 quorum are stopped.

+It is recommended to take a filesystem-level backup or snapshot of the
+nodes to be affected before running this command (remember that it is
+not safe to take a filesystem-level backup of a running node, but it is
+possible while the node is stopped)
+
+WARNINGS
+
+This tool will cause previously committed data to be lost. It does not
+preserve atomicity of transactions, so further inconsistencies and
+undefined behavior may result. Before proceeding at the yes/no prompt,
+review the ranges that are affected to consider the possible impact
+of inconsistencies. Further remediation may be necessary after running
+this tool, including dropping and recreating affected indexes, or in the
+worst case creating a new backup or export of this cluster's data for
+restoration into a brand new cluster. Because of the latter possibilities,
+this tool is a slower means of disaster recovery than restoring from
+a backup.
+
+Must only be used when the dead stores are lost and unrecoverable. If
+the dead stores were to rejoin the cluster after this command was
+used, data may be corrupted.
+
+This command will prompt for confirmation before committing its changes.
+
 After this command is used, the node should not be restarted until at
 least 10 seconds have passed since it was stopped. Restarting it too
 early may lead to things getting stuck (if it happens, it can be fixed
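
As a rough illustration of the edit the help text above describes (these
are simplified stand-in types, not CockroachDB's actual descriptors or
recovery code): for a range whose only live replica is on this store,
the dead stores are filtered out of the replica set so the surviving
replica can make progress on its own.

    // replicaDesc and rangeDesc are simplified stand-ins for the real
    // replica and range descriptor types.
    type replicaDesc struct {
        NodeID, StoreID, ReplicaID int
    }

    type rangeDesc struct {
        RangeID  int
        Replicas []replicaDesc
    }

    // removeDeadStores returns a copy of the descriptor whose replica set
    // no longer references any of the dead stores.
    func removeDeadStores(d rangeDesc, deadStoreIDs map[int]bool) rangeDesc {
        var live []replicaDesc
        for _, r := range d.Replicas {
            if !deadStoreIDs[r.StoreID] {
                live = append(live, r)
            }
        }
        d.Replicas = live
        return d
    }
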
2 changes: 1 addition & 1 deletion pkg/cmd/roachtest/tpcc.go
@@ -411,7 +411,7 @@ func registerTPCC(r *testRegistry) {
 CPUs: 16,

 LoadWarehouses: gceOrAws(cloud, 2500, 3000),
-EstimatedMax: gceOrAws(cloud, 2200, 2500),
+EstimatedMax: gceOrAws(cloud, 2100, 2500),
 })
 registerTPCCBenchSpec(r, tpccBenchSpec{
 Nodes: 12,
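
For context on the values above: gceOrAws presumably just picks the
per-cloud number (a guess at the helper's shape, not copied from
roachtest), which is why this change only lowers the GCE estimate from
2200 to 2100 and leaves AWS at 2500.

    // gceOrAws, as assumed here, returns the GCE or AWS value depending
    // on which cloud the test is running on.
    func gceOrAws(cloud string, gce, aws int) int {
        if cloud == "aws" {
            return aws
        }
        return gce
    }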
