
sql: zeroqps during schema change with reduced bulk_io_write.max_rate #36806

Closed
awoods187 opened this issue Apr 12, 2019 · 22 comments · Fixed by #36884
Labels: A-schema-changes · C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) · S-2 (Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors)

Comments

@awoods187 (Contributor)

Describe the problem

While testing the new index backfill, I ran SET CLUSTER SETTING kv.bulk_io_write.max_rate = '1mb'. Once the schema change started, QPS dropped to zero across the cluster:

[screenshots: Admin UI graphs showing QPS dropping to zero during the schema change]

To Reproduce

export CLUSTER=andy-3node
roachprod create $CLUSTER -n 4 --clouds=aws  --aws-machine-type-ssd=c5d.4xlarge
roachprod run $CLUSTER --  "sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-3 cockroach 0c83360778c511ab79103aefd8f5e3a115990144
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=2500 --db=tpcc"

Then run the test:

roachprod run $CLUSTER:4 "./workload run tpcc --ramp=5m --warehouses=2500 --active-warehouses=2100 --duration=1h --split --scatter {pgurl:1-3}"
SET CLUSTER SETTING kv.bulk_io_write.max_rate = '1mb'
CREATE UNIQUE INDEX ON tpcc.order (o_entry_d, o_w_id, o_d_id, o_carrier_id, o_id);

Environment:
19.1 branch
v19.1.0-rc.2-40-g0c83360

@awoods187 added the C-bug, S-2, and A-schema-changes labels on Apr 12, 2019
@awoods187 (Contributor, Author)

Even after setting this to SET CLUSTER SETTING kv.bulk_io_write.max_rate = '10mb';, I don't see any progress:

[screenshot: index backfill job showing no progress]

              job_id | 442413495208083457
            job_type | SCHEMA CHANGE
         description | CREATE UNIQUE INDEX ON tpcc.public."order" (o_entry_d, o_w_id, o_d_id, o_carrier_id, o_id)
           statement |
           user_name | root
              status | running
      running_status | NULL
             created | 2019-04-12 15:51:24.028409+00:00
             started | 2019-04-12 15:51:24.485747+00:00
            finished | NULL
            modified | 2019-04-12 15:51:24.485748+00:00
  fraction_completed | 0
               error |
      coordinator_id | NULL

@awoods187 (Contributor, Author)

I stopped the workload and restarted it to try to get this test working again, but it's still seeing zero QPS.

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   6m37s        0            0.0            0.0      0.0      0.0      0.0      0.0 delivery
   6m37s        0            0.0            0.0      0.0      0.0      0.0      0.0 newOrder
   6m37s        0            0.0            0.0      0.0      0.0      0.0      0.0 orderStatus
   6m37s        0            0.0            0.0      0.0      0.0      0.0      0.0 payment
   6m37s        0            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
   6m38s        0            0.0            0.0      0.0      0.0      0.0      0.0 delivery
   6m38s        0            0.0            0.0      0.0      0.0      0.0      0.0 newOrder
   6m38s        0            0.0            0.0      0.0      0.0      0.0      0.0 orderStatus
   6m38s        0            0.0            0.0      0.0      0.0      0.0      0.0 payment
   6m38s        0            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
   6m39s        0            0.0            0.0      0.0      0.0      0.0      0.0 delivery
   6m39s        0            0.0            0.0      0.0      0.0      0.0      0.0 newOrder
   6m39s        0            0.0            0.0      0.0      0.0      0.0      0.0 orderStatus
   6m39s        0            0.0            0.0      0.0      0.0      0.0      0.0 payment
   6m39s        0            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel

Followed by:

Error: error in newOrder: ERROR: batch timestamp 1555086289.937642305,0 must be after replica GC threshold 1555086309.751605705,0 (SQLSTATE XX000)
Error:  exit status 1

@awoods187 (Contributor, Author)

Attached: goroutine dump.txt

@jordanlewis (Member)

Looks like some stuff is blocked resolving intents?

Shouldn't these calls have timeouts? Or some other way to cancel themselves?

goroutine 8487038 [select, 16 minutes]:
github.com/cockroachdb/cockroach/pkg/storage/intentresolver.(*IntentResolver).ResolveIntents(0xc000cbb860, 0x3a39940, 0xc025b93bc0, 0xc02ec3c240, 0x4, 0x4, 0x1594c61e93870001, 0x1594c61e9387405d, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/intentresolver/intent_resolver.go:919 +0x9bb
github.com/cockroachdb/cockroach/pkg/storage/intentresolver.(*IntentResolver).cleanupFinishedTxnIntents(0xc000cbb860, 0x3a39980, 0xc0000540a0, 0x1c7f, 0xc06c88e100, 0xc02ec3c240, 0x4, 0x4, 0x1594c6886d53ccc1, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/intentresolver/intent_resolver.go:808 +0xed
github.com/cockroachdb/cockroach/pkg/storage/intentresolver.(*IntentResolver).CleanupTxnIntentsAsync.func1(0x3a39980, 0xc0000540a0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/intentresolver/intent_resolver.go:610 +0x188
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunLimitedAsyncTask.func1(0xc0001e01b0, 0x3a39980, 0xc0000540a0, 0x33223c4, 0x2a, 0xc000a54ea0, 0x0, 0x0, 0xc01284eab0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:385 +0x110
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunLimitedAsyncTask
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:379 +0x23a
goroutine 7568842 [select, 40 minutes]:
github.com/cockroachdb/cockroach/pkg/storage/intentresolver.(*IntentResolver).ResolveIntents(0xc000cbb860, 0x3a39940, 0xc06132f4c0, 0xc039adf200, 0x4, 0x4, 0x1594c4ee28950001, 0x1594c4ee289504d2, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/intentresolver/intent_resolver.go:919 +0x9bb
github.com/cockroachdb/cockroach/pkg/storage/intentresolver.(*IntentResolver).cleanupFinishedTxnIntents(0xc000cbb860, 0x3a39980, 0xc0000540a0, 0x1c8c, 0xc036efc500, 0xc039adf200, 0x4, 0x4, 0x1594c53a489a7390, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/intentresolver/intent_resolver.go:808 +0xed
github.com/cockroachdb/cockroach/pkg/storage/intentresolver.(*IntentResolver).CleanupTxnIntentsAsync.func1(0x3a39980, 0xc0000540a0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/intentresolver/intent_resolver.go:610 +0x188
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunLimitedAsyncTask.func1(0xc0001e01b0, 0x3a39980, 0xc0000540a0, 0x33223c4, 0x2a, 0xc000a54ea0, 0x0, 0x0, 0xc018fd6630)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:385 +0x110
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunLimitedAsyncTask
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:379 +0x23a

@ajwerner (Contributor)

Yes, they should probably have a timeout, especially given that intent resolution is best-effort. I can make a PR for that when I get back.

That being said, this seems more like a symptom than a cause. Why is intent resolution blocked? Are there unavailable ranges?

@tbg (Member) commented Apr 15, 2019

Agreed that a timeout here would be good. Also agreed that this isn't the root cause of the problem. Here's something closer to the root cause (from the goroutine dump):

1: select [46 minutes] [Created by stop.(*Stopper).RunWorker @ stopper.go:193]
    rate           rate.go:253                   (*Limiter).WaitN(*Limiter(#407), Context(#56), 2097152, 0x0, 0x0)
    storage        syncing_write.go:51           limitBulkIOWrite(Context(#56), *Limiter(#407), 2097152)
    storage        syncing_write.go:107          writeFileSyncing(Context(#56), string(#1942, len=63), []byte(#5969 len=3155094 cap=3162112), Engine(#69), FileMode(0x1a4), ...)
    storage        replica_sideload_disk.go:192  (*diskSideloadStorage).Put(*diskSideloadStorage(#967), Context(#56), 4525, 8, []byte(#5969 len=3155094 cap=3162112), 0x0, 0x0)
    storage        replica_sideload.go:160       maybeSideloadEntriesImpl(Context(#56), []Entry(#5555 len=1 cap=1), SideloadStorage(#59), func(#1912), 0x734f12, ...)
    storage        replica_sideload.go:84        (*Replica).maybeSideloadEntriesRaftMuLocked(*Replica(#2927), Context(#56), []Entry(#5554 len=1 cap=1), #4199, 0x1, 0x1, 0x0, ...)
    storage        replica_raft.go:682           (*Replica).handleRaftReadyRaftMuLocked(*Replica(#2927), Context(#56), IncomingSnapshot(0x0), 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    storage        store.go:3648                 (*Store).processRequestQueue.func1(*Store(#56), RangeID(#4393), #56)
    storage        store.go:3295                 (*Store).withReplicaForRequest(*Store(#843), Context(#56), *RaftMessageRequest(#4922), RaftMessageResponseStream(#4921))
    storage        store.go:3636                 (*Store).processRequestQueue(*Store(#843), Context(#56), *RaftMessageResponse(0xcf))

An SST write has been waiting 46 minutes on the rate limiter. Nothing else is using this limiter. So... is the limiter maybe broken? That doesn't seem super likely to me, but from the goroutines, that looks to be the case.

As a side effect of this stall, raft processing for this replica was blocked, so any requests entering it would get stuck as well. This could explain the zero QPS, and I assume the blocked intent-resolution requests also had an intent on that range and got stuck there.

@tbg (Member) commented Apr 15, 2019

cc @bdarnell -- this seems like another problem worth fixing before we release with the distributed backfill on.

@vivekmenezes (Contributor) commented Apr 15, 2019

FYI, I'll be checking whether setting SET CLUSTER SETTING kv.bulk_io_write.addsstable_max_rate = 0.001 avoids this problem. If it doesn't, then for our users we'll have to recommend turning off the new backfill as a remedy in case they hit problems.

@bdarnell (Contributor)

@ajwerner There was a timeout on intent resolution before, but it got lost in #34803.

> So... is the limiter maybe broken?

The kv.bulk_io_write.max_rate setting defaults to MaxInt64. The rate limiter internally tracks things with a float64. This seems quite likely to result in precision-related math problems.
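To make the precision point concrete, here is a standalone sketch (plain Go, not CockroachDB code): MaxInt64 is not even representable as a float64 — it rounds up to exactly 2^63, and the neighboring representable values are over a thousand apart.

package main

import (
	"fmt"
	"math"
)

func main() {
	// math.MaxInt64 (2^63 - 1) rounds up to exactly 2^63 when converted to
	// float64; the next representable float64 below it is 1024 smaller.
	f := float64(math.MaxInt64)
	fmt.Printf("%.0f\n", f)                    // 9223372036854775808
	fmt.Printf("%.0f\n", math.Nextafter(f, 0)) // 9223372036854774784
}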

@ajwerner (Contributor)

Typing up a diff to add back timeouts now.

@bdarnell (Contributor)

Does this need to go on the release blockers list (#35554)?

@vivekmenezes (Contributor)

No. As far as controlling schema changes is concerned, we already have #36430, which @awoods187 is presently testing to see if it works. We'll either reopen that issue or mark it fixed.

This issue is still being investigated.

@bdarnell (Contributor)

The subject line omitted a crucial detail: the zero QPS is the result of setting bulk_io_write.max_rate to a non-default value. Since it's not on by default, it's not a release blocker, but it does mean this rate limiter is problematic and we can't rely on it as an emergency mitigation for other problems (probably because it slows things down in the wrong place, in the raft critical path).

The new rate limiter looks like it's in a better place, although I'm not sure that we've actually solved the problem unless we give it a good default.

@bdarnell changed the title from "sql: zeroqps during schema change on 19.1 branch" to "sql: zeroqps during schema change with reduced bulk_io_write.max_rate" on Apr 15, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this issue Apr 15, 2019
This PR adds a configuration option for a timeout for batch request sends in
the request batcher and adopts this option in the intent resolver. Prior to
the introduction of batching in cockroachdb#34803 intent resolution was performed with a
maximum timeout of 30s. This PR uses that same constant and additionally
applies it to batches of requests to gc transaction records.

Informs cockroachdb#36806.

Release note: None
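For illustration, a minimal sketch of the pattern that commit message describes — bounding each batch send with a context deadline so that a send stuck behind an unavailable or stalled range eventually gives up instead of pinning its goroutine forever. The function and constant names are assumptions, not the actual requestbatcher API; only the 30s figure comes from the discussion above.

package example

import (
	"context"
	"time"
)

// sendTimeout mirrors the 30s intent-resolution timeout mentioned above.
const sendTimeout = 30 * time.Second

// sendWithTimeout wraps a single batch send in a deadline so a blocked send
// returns context.DeadlineExceeded instead of hanging indefinitely, as in
// the goroutine stacks earlier in this issue.
func sendWithTimeout(ctx context.Context, send func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, sendTimeout)
	defer cancel()
	return send(ctx)
}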
@vivekmenezes (Contributor)

I've updated #36430 with the latest test results. We can follow up there on what good defaults could be.

ajwerner added a commit to ajwerner/cockroach that referenced this issue Apr 16, 2019
craig bot pushed a commit that referenced this issue Apr 16, 2019
36845: requestbatcher,intentresolver: add timeout in requestbatcher and adopt r=ajwerner a=ajwerner


Co-authored-by: Andrew Werner <[email protected]>
@vivekmenezes (Contributor)

@ajwerner feel free to close this issue if you're satisfied.

@vivekmenezes assigned ajwerner and unassigned thoszhang on Apr 16, 2019
@awoods187 (Contributor, Author)

@vivekmenezes I thought we identified that this knob doesn't work as intended?

@vivekmenezes (Contributor)

We should discuss whether we should take this knob out. @dt has thoughts on why this knob might still be needed despite the new knob that controls the AddSSTable rate; I believe it is to allow control over disk IOPS.

@dt (Member) commented Apr 16, 2019

We still need a disk IO limiter, not just a file-count limiter, since e.g. RESTORE ingests relatively fewer, bigger SSTs, and in IO-constrained environments we've seen a sustained, unlimited write starve out other traffic, where turning that knob down to 25 MB or so was a big stability win.

It needs to be below raft, at least for now, since it needs to a) slow down the application of a single 32 MB file (so it sleeps between chunks), and b) slow down a node that, by following many leaders that might each have a healthy individual rate, ends up at an unhealthy aggregate rate. It's still better to fall behind on the one range doing the bulk ingestion than to have the whole store get stuck waiting for IO.

I think we might need a lower bound on it, though, to prevent slowing raft down too far (e.g. a max wait of 5s or so), and ideally we'd somehow propagate information from followers to leaders so they could do better admission control rather than forcing the follower to simply choose to fall behind. A sketch of the chunked, below-raft limiting being discussed here follows.
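For context, a rough sketch of what such a below-raft, per-chunk byte limiter looks like. This is illustrative only; the chunk size, package, and function names are assumptions, not the actual limitBulkIOWrite/writeFileSyncing code.

package example

import (
	"context"
	"io"

	"golang.org/x/time/rate"
)

// bulkIOWriteChunk is the number of bytes requested from the limiter per
// WaitN call (an assumed value for this sketch).
const bulkIOWriteChunk = 1 << 20

// writeLimited writes data in chunks, blocking on the byte-rate limiter
// before each chunk. With a sane max_rate this spreads the IO out; with the
// token deficit described later in this issue, WaitN can sleep for hours,
// stalling raft application for the replica.
func writeLimited(ctx context.Context, lim *rate.Limiter, w io.Writer, data []byte) error {
	for len(data) > 0 {
		n := bulkIOWriteChunk
		if n > len(data) {
			n = len(data)
		}
		if err := lim.WaitN(ctx, n); err != nil {
			return err
		}
		if _, err := w.Write(data[:n]); err != nil {
			return err
		}
		data = data[n:]
	}
	return nil
}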

@thoszhang (Contributor)

I can easily reproduce the problem where the nodes are all blocked on (*Limiter).WaitN by running fixtures import with tpcc-100 and then building the index. Logging reveals that the rate limiter waits for 1-2 hours on the first call to limitBulkIOWrite in the backfill.

@tbg was right that it's a rate limiter problem: during the import, we incur an enormous deficit of tokens in the rate limiter due to floating-point underflow and integer casts. The issue isn't the int64 default value for the rate per se, but the fact that the chunk sizes and burst value (both 2 MB) are many orders of magnitude smaller than the default rate (MaxInt64 bytes/s), implying that the full burst is replenished in an interval of 2^(21-63) = 2^-42 seconds (around 2 × 10^-13 s), which this function can't handle:

https://github.com/golang/time/blob/master/rate/rate.go#L365-L368

This function is rounding all intervals down to 0 whenever it's used, meaning that at the default rate, (1) there is no rate-limiting, and (2) tokens never get replenished. So when we finally change the rate to 1 MB/s, there's a huge deficit of tokens in the rate limiter (on the order of 10e9 after writing a few thousand chunks during importing tpcc-100), and the rate limiter correctly computes that it would take 1-2 hours to recover.

To fix this, I think we can just set the default for kv.bulk_io_write.max_rate to some value lower than 10e9 * 2 MB so nobody runs into this problem at the outset. I'm currently verifying that this works.
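The failure mode can be demonstrated against golang.org/x/time/rate directly. This is a standalone sketch, not CockroachDB code; the 2 MB chunk/burst and MaxInt64 default come from the discussion above, while the chunk count and the resulting delay are just illustrative numbers for this sketch.

package main

import (
	"fmt"
	"math"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	const chunk = 2 << 20 // 2 MiB, matching the burst/chunk size above

	// Default configuration: effectively unlimited rate, 2 MiB burst.
	lim := rate.NewLimiter(rate.Limit(math.MaxInt64), chunk)

	// Simulate a few thousand 2 MiB chunk writes during the import. Each
	// reservation is granted immediately (the computed wait rounds to zero
	// nanoseconds), but the replenishment interval also rounds to zero, so
	// the token bucket goes deeply negative.
	for i := 0; i < 4000; i++ {
		lim.ReserveN(time.Now(), chunk)
	}

	// Now "SET CLUSTER SETTING kv.bulk_io_write.max_rate = '1mb'".
	lim.SetLimit(1 << 20)

	// The next write has to pay off the accumulated deficit first: expect a
	// delay on the order of hours (~8 GB of deficit repaid at 1 MiB/s).
	r := lim.ReserveN(time.Now(), chunk)
	fmt.Println(r.Delay())
}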

@bdarnell (Contributor)

The rate limiter has a special case for MaxFloat64 (we're using MaxInt64 instead). We should change to use MaxFloat64 (or the alias rate.Inf) when we want "unlimited".
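For reference, a tiny sketch of what that looks like with the limiter's own sentinel (standalone Go, not CockroachDB code): with rate.Inf, reservations are granted immediately and no token bookkeeping is done, so none of the rounding problems above can occur.

package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// rate.Inf means "unlimited": WaitN never blocks.
	lim := rate.NewLimiter(rate.Inf, 2<<20)

	start := time.Now()
	if err := lim.WaitN(context.Background(), 2<<20); err != nil {
		panic(err)
	}
	fmt.Println(time.Since(start)) // effectively zero
}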

@thoszhang (Contributor)

The problem is that kv.bulk_io_write.max_rate is a byte size setting, so we'd have to change its type, which seems undesirable.
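Purely as a hypothetical sketch (not what was ultimately merged, which simply lowered the default): one could keep the int64 byte-size setting and translate "effectively unlimited" values into rate.Inf at the point where the limiter is configured. The cutoff and function name below are assumptions for illustration only.

package example

import "golang.org/x/time/rate"

// limitFromMaxRate maps an int64 byte-size setting onto a rate.Limit,
// treating non-positive or absurdly large values as "unlimited" via rate.Inf
// so the float64 token math never sees MaxInt64. The 1 TiB/s cutoff is an
// assumption, not a CockroachDB constant.
func limitFromMaxRate(maxRateBytesPerSec int64) rate.Limit {
	const unlimited = 1 << 40 // 1 TiB/s
	if maxRateBytesPerSec <= 0 || maxRateBytesPerSec >= unlimited {
		return rate.Inf
	}
	return rate.Limit(maxRateBytesPerSec)
}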

craig bot pushed a commit that referenced this issue Apr 17, 2019
36884: storage: decrease default kv.bulk_io_write.max_rate r=lucy-zhang a=lucy-zhang

Previously `kv.bulk_io_write.max_rate` had a default value of `MaxInt64`, which
caused rounding problems in the rate limiter due to very small time intervals.
This lowers the default to 1 TB/s.

Fixes #36806.
(See #36806 (comment))

Release note (bug fix): The default value of `kv.bulk_io_write.max_rate` is now
1 TB/s, to help prevent incorrect rate limiting behavior due to rounding.

Co-authored-by: Lucy Zhang <[email protected]>
@craig (bot) closed this as completed in #36884 on Apr 17, 2019