Merge #62015 #62039
62015: cli: Add some more warning comments to unsafe-remove-dead-replicas r=knz a=bdarnell

The comments have always said this tool is meant to be used under the
supervision of a CRL engineer, but they didn't otherwise make the risks
and downsides clear. Add more explicit warnings, which can also serve
as guidance for the supervising engineer.

Release note: None

62039: roachtest: stabilize tpccbench/nodes=3/cpu=16 r=irfansharif a=irfansharif

Fixes #61973. With tracing, our top-line TPC-C performance took a hit.
Given that the TPC-C line searcher starts off at the estimated max,
we're now starting off in "overloaded" territory; this makes for a very
unhappy roachtest.

Ideally we'd have something like #62010, or even admission control, to
make this test less noisy. Until then, we can start off at a lower max
warehouse count.

This "fix" is still not a panacea: the entire tpccbench suite as written
tries to nudge the warehouse count up until efficiency drops below 85%.
Unfortunately, with our current infrastructure that's a stand-in for
"the point where nodes are overloaded and VMs are no longer reachable".
See #61974.
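
For a rough sense of the search shape, here is a sketch in Go; it is
not the actual tpccbench harness, and runTPCC, the step size, and the
efficiency numbers are made up for illustration. It starts at the
estimated max and nudges the warehouse count until efficiency falls
below 85%.

    package main

    import "fmt"

    // passingEfficiency mirrors the sub-85% cutoff described above.
    const passingEfficiency = 0.85

    // runTPCC is a made-up stand-in for running the workload at a given
    // warehouse count and measuring the resulting efficiency.
    func runTPCC(warehouses int) float64 {
        return 1900.0 / float64(warehouses) // placeholder measurement
    }

    // findMaxWarehouses starts at the estimated max and nudges the
    // warehouse count up (or down, if the starting point is already
    // overloaded) to bracket the point where efficiency dips below the
    // cutoff.
    func findMaxWarehouses(estimatedMax, step int) int {
        w := estimatedMax
        if runTPCC(w) < passingEfficiency {
            // Already in overloaded territory: walk down until we pass.
            for w > step && runTPCC(w) < passingEfficiency {
                w -= step
            }
            return w
        }
        // Starting point passes: walk up until the next step would fail.
        for runTPCC(w+step) >= passingEfficiency {
            w += step
        }
        return w
    }

    func main() {
        fmt.Println("max passing warehouses ~", findMaxWarehouses(2100, 100))
    }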

---

A longer-term approach to these tests could instead be as follows.
We could start our search at whatever the max warehouse count is (making
sure we've re-configured the max warehouses accordingly). These tests
could then PASS/FAIL at that given warehouse count and, only on FAIL,
capture the extent of the regression by probing lower warehouse counts.
This is in contrast to what we're doing today, where we capture how high
we can go (and by design risk going into overload territory, with no
protections against it).
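
Sketched in the same spirit (the efficiency-measuring run function is a
hypothetical stand-in, as above), that alternative might look like:

    // passFailThenProbe treats maxWarehouses as a PASS/FAIL threshold and
    // only on FAIL probes lower warehouse counts to report the extent of
    // the regression. run is a stand-in for measuring efficiency at a
    // given warehouse count.
    func passFailThenProbe(
        run func(warehouses int) float64, maxWarehouses, step int,
    ) (pass bool, achieved int) {
        const passingEfficiency = 0.85
        if run(maxWarehouses) >= passingEfficiency {
            return true, maxWarehouses
        }
        // FAIL: find the highest warehouse count that still passes.
        for w := maxWarehouses - step; w > 0; w -= step {
            if run(w) >= passingEfficiency {
                return false, w
            }
        }
        return false, 0
    }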

Doing so would let us use this test suite to capture regressions from a
given baseline, rather than hoping our roachperf dashboards capture
unexpected perf improvements (if they're expected, we should update max
warehouses accordingly). In the steady state, we'd want the roachperf
dashboards to be mostly flatlined, with step increases whenever we re-up
the max warehouse count to incorporate various system-wide performance
improvements.

Release note: None

Co-authored-by: Ben Darnell <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
3 people committed Mar 16, 2021
3 parents 44ab6bc + d29a7be + 72c96fa commit 86d6a45
Showing 2 changed files with 28 additions and 4 deletions.
30 changes: 27 additions & 3 deletions pkg/cli/debug.go
@@ -874,20 +874,44 @@ var debugUnsafeRemoveDeadReplicasCmd = &cobra.Command{
 This command is UNSAFE and should only be used with the supervision of
 a Cockroach Labs engineer. It is a last-resort option to recover data
 after multiple node failures. The recovered data is not guaranteed to
-be consistent.
+be consistent. If a suitable backup exists, restore it instead of
+using this tool.

 The --dead-store-ids flag takes a comma-separated list of dead store
 IDs and scans this store for any ranges whose only live replica is on
 this store. These range descriptors will be edited to forcibly remove
 the dead stores, allowing the range to recover from this single
 replica.

-This command will prompt for confirmation before committing its changes.
 It is safest to run this command while all nodes are stopped. In some
 circumstances it may be possible to run it while some nodes are still
 running provided all nodes containing replicas of nodes that have lost
 quorum are stopped.

+It is recommended to take a filesystem-level backup or snapshot of the
+nodes to be affected before running this command (remember that it is
+not safe to take a filesystem-level backup of a running node, but it is
+possible while the node is stopped)
+
+WARNINGS
+
+This tool will cause previously committed data to be lost. It does not
+preserve atomicity of transactions, so further inconsistencies and
+undefined behavior may result. Before proceeding at the yes/no prompt,
+review the ranges that are affected to consider the possible impact
+of inconsistencies. Further remediation may be necessary after running
+this tool, including dropping and recreating affected indexes, or in the
+worst case creating a new backup or export of this cluster's data for
+restoration into a brand new cluster. Because of the latter possibilities,
+this tool is a slower means of disaster recovery than restoring from
+a backup.
+
+Must only be used when the dead stores are lost and unrecoverable. If
+the dead stores were to rejoin the cluster after this command was
+used, data may be corrupted.
+
+This command will prompt for confirmation before committing its changes.
+
 After this command is used, the node should not be restarted until at
 least 10 seconds have passed since it was stopped. Restarting it too
 early may lead to things getting stuck (if it happens, it can be fixed
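
As a rough illustration of the edit the help text above describes (these
are simplified stand-in types, not CockroachDB's actual descriptors or
recovery code): for a range whose only live replica is on this store,
the dead stores are filtered out of the replica set so the surviving
replica can make progress on its own.

    // replicaDesc and rangeDesc are simplified stand-ins for the real
    // replica and range descriptor types.
    type replicaDesc struct {
        NodeID, StoreID, ReplicaID int
    }

    type rangeDesc struct {
        RangeID  int
        Replicas []replicaDesc
    }

    // removeDeadStores returns a copy of the descriptor whose replica set
    // no longer references any of the dead stores.
    func removeDeadStores(d rangeDesc, deadStoreIDs map[int]bool) rangeDesc {
        var live []replicaDesc
        for _, r := range d.Replicas {
            if !deadStoreIDs[r.StoreID] {
                live = append(live, r)
            }
        }
        d.Replicas = live
        return d
    }
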
2 changes: 1 addition & 1 deletion pkg/cmd/roachtest/tpcc.go
@@ -411,7 +411,7 @@ func registerTPCC(r *testRegistry) {
 CPUs: 16,

 LoadWarehouses: gceOrAws(cloud, 2500, 3000),
-EstimatedMax: gceOrAws(cloud, 2200, 2500),
+EstimatedMax: gceOrAws(cloud, 2100, 2500),
 })
 registerTPCCBenchSpec(r, tpccBenchSpec{
 Nodes: 12,
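
For context on the values above: gceOrAws presumably just picks the
per-cloud number (a guess at the helper's shape, not copied from
roachtest), which is why this change only lowers the GCE estimate from
2200 to 2100 and leaves AWS at 2500.

    // gceOrAws, as assumed here, returns the GCE or AWS value depending
    // on which cloud the test is running on.
    func gceOrAws(cloud string, gce, aws int) int {
        if cloud == "aws" {
            return aws
        }
        return gce
    }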
