
sql: zeroqps during schema change with reduced bulk_io_write.max_rate #36806

Closed
awoods187 opened this issue Apr 12, 2019 · 22 comments · Fixed by #36884
Labels: A-schema-changes · C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) · S-2 (Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors)

Comments

@awoods187 (Contributor)

Describe the problem

While testing the new index backfill, I ran SET CLUSTER SETTING kv.bulk_io_write.max_rate = '1mb'. Once the schema change started, QPS dropped to zero across the cluster:

[screenshots: Admin UI graphs showing QPS dropping to zero during the schema change]

To Reproduce

export CLUSTER=andy-3node
roachprod create $CLUSTER -n 4 --clouds=aws  --aws-machine-type-ssd=c5d.4xlarge
roachprod run $CLUSTER --  "sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-3 cockroach 0c83360778c511ab79103aefd8f5e3a115990144
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=2500 --db=tpcc"

Then run the test:

roachprod run $CLUSTER:4 "./workload run tpcc --ramp=5m --warehouses=2500 --active-warehouses=2100 --duration=1h --split --scatter {pgurl:1-3}"
SET CLUSTER SETTING kv.bulk_io_write.max_rate = '1mb'
CREATE UNIQUE INDEX ON tpcc.order (o_entry_d, o_w_id, o_d_id, o_carrier_id, o_id);

Environment:
19.1 branch
v19.1.0-rc.2-40-g0c83360

@awoods187 added the C-bug, S-2, and A-schema-changes labels on Apr 12, 2019
@awoods187 (Contributor, Author)

Even after setting this to SET CLUSTER SETTING kv.bulk_io_write.max_rate = '10mb';, I don't see any progress:

[screenshot: index backfill job showing no progress]

              job_id | 442413495208083457
            job_type | SCHEMA CHANGE
         description | CREATE UNIQUE INDEX ON tpcc.public."order" (o_entry_d, o_w_id, o_d_id, o_carrier_id, o_id)
           statement |
           user_name | root
              status | running
      running_status | NULL
             created | 2019-04-12 15:51:24.028409+00:00
             started | 2019-04-12 15:51:24.485747+00:00
            finished | NULL
            modified | 2019-04-12 15:51:24.485748+00:00
  fraction_completed | 0
               error |
      coordinator_id | NULL

@awoods187 (Contributor, Author)

I stopped the workload and restarted it to try to get this test working again, but it's still seeing zero QPS.

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   6m37s        0            0.0            0.0      0.0      0.0      0.0      0.0 delivery
   6m37s        0            0.0            0.0      0.0      0.0      0.0      0.0 newOrder
   6m37s        0            0.0            0.0      0.0      0.0      0.0      0.0 orderStatus
   6m37s        0            0.0            0.0      0.0      0.0      0.0      0.0 payment
   6m37s        0            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
   6m38s        0            0.0            0.0      0.0      0.0      0.0      0.0 delivery
   6m38s        0            0.0            0.0      0.0      0.0      0.0      0.0 newOrder
   6m38s        0            0.0            0.0      0.0      0.0      0.0      0.0 orderStatus
   6m38s        0            0.0            0.0      0.0      0.0      0.0      0.0 payment
   6m38s        0            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
   6m39s        0            0.0            0.0      0.0      0.0      0.0      0.0 delivery
   6m39s        0            0.0            0.0      0.0      0.0      0.0      0.0 newOrder
   6m39s        0            0.0            0.0      0.0      0.0      0.0      0.0 orderStatus
   6m39s        0            0.0            0.0      0.0      0.0      0.0      0.0 payment
   6m39s        0            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel

Followed by:

Error: error in newOrder: ERROR: batch timestamp 1555086289.937642305,0 must be after replica GC threshold 1555086309.751605705,0 (SQLSTATE XX000)
Error:  exit status 1

@awoods187 (Contributor, Author)

Attached: goroutine dump.txt

@jordanlewis (Member)

Looks like some stuff is blocked resolving intents?

Shouldn't these calls have timeouts? Or some other way to cancel themselves?

goroutine 8487038 [select, 16 minutes]:
github.com/cockroachdb/cockroach/pkg/storage/intentresolver.(*IntentResolver).ResolveIntents(0xc000cbb860, 0x3a39940, 0xc025b93bc0, 0xc02ec3c240, 0x4, 0x4, 0x1594c61e93870001, 0x1594c61e9387405d, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/intentresolver/intent_resolver.go:919 +0x9bb
github.com/cockroachdb/cockroach/pkg/storage/intentresolver.(*IntentResolver).cleanupFinishedTxnIntents(0xc000cbb860, 0x3a39980, 0xc0000540a0, 0x1c7f, 0xc06c88e100, 0xc02ec3c240, 0x4, 0x4, 0x1594c6886d53ccc1, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/intentresolver/intent_resolver.go:808 +0xed
github.com/cockroachdb/cockroach/pkg/storage/intentresolver.(*IntentResolver).CleanupTxnIntentsAsync.func1(0x3a39980, 0xc0000540a0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/intentresolver/intent_resolver.go:610 +0x188
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunLimitedAsyncTask.func1(0xc0001e01b0, 0x3a39980, 0xc0000540a0, 0x33223c4, 0x2a, 0xc000a54ea0, 0x0, 0x0, 0xc01284eab0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:385 +0x110
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunLimitedAsyncTask
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:379 +0x23a
goroutine 7568842 [select, 40 minutes]:
github.com/cockroachdb/cockroach/pkg/storage/intentresolver.(*IntentResolver).ResolveIntents(0xc000cbb860, 0x3a39940, 0xc06132f4c0, 0xc039adf200, 0x4, 0x4, 0x1594c4ee28950001, 0x1594c4ee289504d2, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/intentresolver/intent_resolver.go:919 +0x9bb
github.com/cockroachdb/cockroach/pkg/storage/intentresolver.(*IntentResolver).cleanupFinishedTxnIntents(0xc000cbb860, 0x3a39980, 0xc0000540a0, 0x1c8c, 0xc036efc500, 0xc039adf200, 0x4, 0x4, 0x1594c53a489a7390, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/intentresolver/intent_resolver.go:808 +0xed
github.com/cockroachdb/cockroach/pkg/storage/intentresolver.(*IntentResolver).CleanupTxnIntentsAsync.func1(0x3a39980, 0xc0000540a0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/intentresolver/intent_resolver.go:610 +0x188
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunLimitedAsyncTask.func1(0xc0001e01b0, 0x3a39980, 0xc0000540a0, 0x33223c4, 0x2a, 0xc000a54ea0, 0x0, 0x0, 0xc018fd6630)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:385 +0x110
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunLimitedAsyncTask
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:379 +0x23a

@ajwerner (Contributor)

Yes, they should probably have a timeout, especially given that intent resolution is best-effort. I can make a PR for that when I get back.

That being said, this seems more like a symptom than a cause. Why is intent resolution blocked? Are there unavailable ranges?

@tbg (Member) commented Apr 15, 2019

Agreed that a timeout here would be good. Also agreed that this isn't the root cause of the problem. Here's something closer to the root cause (from the goroutine dump):

1: select [46 minutes] [Created by stop.(*Stopper).RunWorker @ stopper.go:193]
    rate           rate.go:253                   (*Limiter).WaitN(*Limiter(#407), Context(#56), 2097152, 0x0, 0x0)
    storage        syncing_write.go:51           limitBulkIOWrite(Context(#56), *Limiter(#407), 2097152)
    storage        syncing_write.go:107          writeFileSyncing(Context(#56), string(#1942, len=63), []byte(#5969 len=3155094 cap=3162112), Engine(#69), FileMode(0x1a4), ...)
    storage        replica_sideload_disk.go:192  (*diskSideloadStorage).Put(*diskSideloadStorage(#967), Context(#56), 4525, 8, []byte(#5969 len=3155094 cap=3162112), 0x0, 0x0)
    storage        replica_sideload.go:160       maybeSideloadEntriesImpl(Context(#56), []Entry(#5555 len=1 cap=1), SideloadStorage(#59), func(#1912), 0x734f12, ...)
    storage        replica_sideload.go:84        (*Replica).maybeSideloadEntriesRaftMuLocked(*Replica(#2927), Context(#56), []Entry(#5554 len=1 cap=1), #4199, 0x1, 0x1, 0x0, ...)
    storage        replica_raft.go:682           (*Replica).handleRaftReadyRaftMuLocked(*Replica(#2927), Context(#56), IncomingSnapshot(0x0), 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    storage        store.go:3648                 (*Store).processRequestQueue.func1(*Store(#56), RangeID(#4393), #56)
    storage        store.go:3295                 (*Store).withReplicaForRequest(*Store(#843), Context(#56), *RaftMessageRequest(#4922), RaftMessageResponseStream(#4921))
    storage        store.go:3636                 (*Store).processRequestQueue(*Store(#843), Context(#56), *RaftMessageResponse(0xcf))

An SST write has been waiting 46 minutes on the rate limiter. Nothing else is using this limiter. So... is the limiter maybe broken? That doesn't seem super likely to me, but from the goroutines, that looks to be the case.

As a side effect of this stall, raft processing for this replica was blocked, so any requests entering it would get stuck as well. This could explain the zero QPS, and I assume the blocked intent-resolution requests also had an intent on that range and got stuck there.

@tbg (Member) commented Apr 15, 2019

cc @bdarnell -- this seems like another problem worth fixing before we release with the distributed backfill on.

@vivekmenezes (Contributor) commented Apr 15, 2019

FYI, I'll be checking whether setting SET CLUSTER SETTING kv.bulk_io_write.addsstable_max_rate = 0.001 avoids this problem. If it doesn't, then for our users we'll have to recommend turning off the new backfill as a remedy in case they hit problems.

@bdarnell (Contributor)

@ajwerner There was a timeout on intent resolution before, but it got lost in #34803.

> So... is the limiter maybe broken?

The kv.bulk_io_write.max_rate setting defaults to MaxInt64. The rate limiter internally tracks things with a float64. This seems quite likely to result in precision-related math problems.
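To make the precision point concrete, here is a standalone sketch (plain Go, not CockroachDB code): MaxInt64 is not even representable as a float64 — it rounds up to exactly 2^63, and the neighboring representable values are over a thousand apart.

package main

import (
	"fmt"
	"math"
)

func main() {
	// math.MaxInt64 (2^63 - 1) rounds up to exactly 2^63 when converted to
	// float64; the next representable float64 below it is 1024 smaller.
	f := float64(math.MaxInt64)
	fmt.Printf("%.0f\n", f)                    // 9223372036854775808
	fmt.Printf("%.0f\n", math.Nextafter(f, 0)) // 9223372036854774784
}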

@ajwerner (Contributor)

Typing up a diff to add back timeouts now.

@bdarnell (Contributor)

Does this need to go on the release blockers list (#35554)?

@vivekmenezes (Contributor)

No. As far as controlling schema changes is concerned, we already have #36430, which @awoods187 is presently testing to see if it works. We'll either reopen that issue or mark it fixed.

This issue is still being investigated.

@bdarnell (Contributor)

The subject line omitted a crucial detail: the zero QPS is the result of setting bulk_io_write.max_rate to a non-default value. Since it's not on by default, it's not a release blocker, but it does mean this rate limiter is problematic and we can't rely on it as an emergency mitigation for other problems (probably because it slows things down in the wrong place, in the raft critical path).

The new rate limiter looks like it's in a better place, although I'm not sure that we've actually solved the problem unless we give it a good default.

@bdarnell changed the title from "sql: zeroqps during schema change on 19.1 branch" to "sql: zeroqps during schema change with reduced bulk_io_write.max_rate" on Apr 15, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this issue Apr 15, 2019
This PR adds a configuration option for a timeout for batch request sends in
the request batcher and adopts this option in the intent resolver. Prior to
the introduction of batching in cockroachdb#34803 intent resolution was performed with a
maximum timeout of 30s. This PR uses that same constant and additionally
applies it to batches of requests to gc transaction records.

Informs cockroachdb#36806.

Release note: None
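For illustration, a minimal sketch of the pattern that commit message describes — bounding each batch send with a context deadline so that a send stuck behind an unavailable or stalled range eventually gives up instead of pinning its goroutine forever. The function and constant names are assumptions, not the actual requestbatcher API; only the 30s figure comes from the discussion above.

package example

import (
	"context"
	"time"
)

// sendTimeout mirrors the 30s intent-resolution timeout mentioned above.
const sendTimeout = 30 * time.Second

// sendWithTimeout wraps a single batch send in a deadline so a blocked send
// returns context.DeadlineExceeded instead of hanging indefinitely, as in
// the goroutine stacks earlier in this issue.
func sendWithTimeout(ctx context.Context, send func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, sendTimeout)
	defer cancel()
	return send(ctx)
}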
@vivekmenezes (Contributor)

I've updated #36430 with the latest test results. We can follow up there on what good defaults could be.

ajwerner added a commit to ajwerner/cockroach that referenced this issue Apr 16, 2019
craig bot pushed a commit that referenced this issue Apr 16, 2019
36845: requestbatcher,intentresolver: add timeout in requestbatcher and adopt r=ajwerner a=ajwerner


Co-authored-by: Andrew Werner <[email protected]>
@vivekmenezes (Contributor)

@ajwerner feel free to close this issue if you're satisfied.

@vivekmenezes assigned ajwerner and unassigned thoszhang on Apr 16, 2019
@awoods187 (Contributor, Author)

@vivekmenezes I thought we identified that this knob doesn't work as intended?

@vivekmenezes (Contributor)

We should discuss whether we should take this knob out. @dt has thoughts on why this knob might still be needed despite the new knob that controls the AddSSTable rate; I believe it is to allow control over disk IOPS.

@dt (Member) commented Apr 16, 2019

We still need a disk IO limiter, not just a file-count limiter, since e.g. RESTORE ingests relatively fewer, bigger SSTs, and in IO-constrained environments we've seen a sustained, unlimited write starve out other traffic, where turning that knob down to 25 MB or so was a big stability win.

It needs to be below raft, at least for now, since it needs to a) slow down the application of a single 32 MB file (so it sleeps between chunks), and b) slow down a node that, by following many leaders that might each have a healthy individual rate, ends up at an unhealthy aggregate rate. It's still better to fall behind on the one range doing the bulk ingestion than to have the whole store get stuck waiting for IO.

I think we might need a lower bound on it, though, to prevent slowing raft down too far (e.g. a max wait of 5s or so), and ideally we'd somehow propagate information from followers to leaders so they could do better admission control rather than forcing the follower to simply choose to fall behind. A sketch of the chunked, below-raft limiting being discussed here follows.
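For context, a rough sketch of what such a below-raft, per-chunk byte limiter looks like. This is illustrative only; the chunk size, package, and function names are assumptions, not the actual limitBulkIOWrite/writeFileSyncing code.

package example

import (
	"context"
	"io"

	"golang.org/x/time/rate"
)

// bulkIOWriteChunk is the number of bytes requested from the limiter per
// WaitN call (an assumed value for this sketch).
const bulkIOWriteChunk = 1 << 20

// writeLimited writes data in chunks, blocking on the byte-rate limiter
// before each chunk. With a sane max_rate this spreads the IO out; with the
// token deficit described later in this issue, WaitN can sleep for hours,
// stalling raft application for the replica.
func writeLimited(ctx context.Context, lim *rate.Limiter, w io.Writer, data []byte) error {
	for len(data) > 0 {
		n := bulkIOWriteChunk
		if n > len(data) {
			n = len(data)
		}
		if err := lim.WaitN(ctx, n); err != nil {
			return err
		}
		if _, err := w.Write(data[:n]); err != nil {
			return err
		}
		data = data[n:]
	}
	return nil
}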

@thoszhang (Contributor)

I can easily reproduce the problem where the nodes are all blocked on (*Limiter).WaitN by running fixtures import with tpcc-100 and then building the index. Logging reveals that the rate limiter waits for 1-2 hours on the first call to limitBulkIOWrite in the backfill.

@tbg was right that it's a rate limiter problem: during the import, we incur an enormous deficit of tokens in the rate limiter due to floating-point underflow and integer casts. The issue isn't the int64 default value for the rate per se, but the fact that the chunk sizes and burst value (both 2 MB) are many orders of magnitude smaller than the default rate (MaxInt64 bytes/s), implying that the full burst is replenished in an interval of 2^(21-63) = 2^-42 seconds (around 2 × 10^-13 s), which this function can't handle:

https://github.com/golang/time/blob/master/rate/rate.go#L365-L368

This function is rounding all intervals down to 0 whenever it's used, meaning that at the default rate, (1) there is no rate-limiting, and (2) tokens never get replenished. So when we finally change the rate to 1 MB/s, there's a huge deficit of tokens in the rate limiter (on the order of 10e9 after writing a few thousand chunks during importing tpcc-100), and the rate limiter correctly computes that it would take 1-2 hours to recover.

To fix this, I think we can just set the default for kv.bulk_io_write.max_rate to some value lower than 10e9 * 2 MB so nobody runs into this problem at the outset. I'm currently verifying that this works.
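The failure mode can be demonstrated against golang.org/x/time/rate directly. This is a standalone sketch, not CockroachDB code; the 2 MB chunk/burst and MaxInt64 default come from the discussion above, while the chunk count and the resulting delay are just illustrative numbers for this sketch.

package main

import (
	"fmt"
	"math"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	const chunk = 2 << 20 // 2 MiB, matching the burst/chunk size above

	// Default configuration: effectively unlimited rate, 2 MiB burst.
	lim := rate.NewLimiter(rate.Limit(math.MaxInt64), chunk)

	// Simulate a few thousand 2 MiB chunk writes during the import. Each
	// reservation is granted immediately (the computed wait rounds to zero
	// nanoseconds), but the replenishment interval also rounds to zero, so
	// the token bucket goes deeply negative.
	for i := 0; i < 4000; i++ {
		lim.ReserveN(time.Now(), chunk)
	}

	// Now "SET CLUSTER SETTING kv.bulk_io_write.max_rate = '1mb'".
	lim.SetLimit(1 << 20)

	// The next write has to pay off the accumulated deficit first: expect a
	// delay on the order of hours (~8 GB of deficit repaid at 1 MiB/s).
	r := lim.ReserveN(time.Now(), chunk)
	fmt.Println(r.Delay())
}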

@bdarnell (Contributor)

The rate limiter has a special case for MaxFloat64 (we're using MaxInt64 instead). We should change to use MaxFloat64 (or the alias rate.Inf) when we want "unlimited".
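For reference, a tiny sketch of what that looks like with the limiter's own sentinel (standalone Go, not CockroachDB code): with rate.Inf, reservations are granted immediately and no token bookkeeping is done, so none of the rounding problems above can occur.

package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// rate.Inf means "unlimited": WaitN never blocks.
	lim := rate.NewLimiter(rate.Inf, 2<<20)

	start := time.Now()
	if err := lim.WaitN(context.Background(), 2<<20); err != nil {
		panic(err)
	}
	fmt.Println(time.Since(start)) // effectively zero
}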

@thoszhang (Contributor)

The problem is that kv.bulk_io_write.max_rate is a byte size setting, so we'd have to change its type, which seems undesirable.
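Purely as a hypothetical sketch (not what was ultimately merged, which simply lowered the default): one could keep the int64 byte-size setting and translate "effectively unlimited" values into rate.Inf at the point where the limiter is configured. The cutoff and function name below are assumptions for illustration only.

package example

import "golang.org/x/time/rate"

// limitFromMaxRate maps an int64 byte-size setting onto a rate.Limit,
// treating non-positive or absurdly large values as "unlimited" via rate.Inf
// so the float64 token math never sees MaxInt64. The 1 TiB/s cutoff is an
// assumption, not a CockroachDB constant.
func limitFromMaxRate(maxRateBytesPerSec int64) rate.Limit {
	const unlimited = 1 << 40 // 1 TiB/s
	if maxRateBytesPerSec <= 0 || maxRateBytesPerSec >= unlimited {
		return rate.Inf
	}
	return rate.Limit(maxRateBytesPerSec)
}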

craig bot pushed a commit that referenced this issue Apr 17, 2019
36884: storage: decrease default kv.bulk_io_write.max_rate r=lucy-zhang a=lucy-zhang

Previously `kv.bulk_io_write.max_rate` had a default value of `MaxInt64`, which
caused rounding problems in the rate limiter due to very small time intervals.
This lowers the default to 1 TB/s.

Fixes #36806.
(See #36806 (comment))

Release note (bug fix): The default value of `kv.bulk_io_write.max_rate` is now
1 TB/s, to help prevent incorrect rate limiting behavior due to rounding.

Co-authored-by: Lucy Zhang <[email protected]>
@craig (bot) closed this as completed in #36884 on Apr 17, 2019