
colflow: prevent deadlocks when many queries spill to disk at same time #84398

Merged: 2 commits merged into cockroachdb:master on Jul 15, 2022

Conversation

@yuzefovich (Member) commented Jul 13, 2022

colflow: prevent deadlocks when many queries spill to disk at same time

This commit fixes a long-standing issue which could cause
memory-intensive queries to deadlock on acquiring the file descriptors
quota when vectorized execution spills to disk. This bug has been
present since the introduction of disk-spilling (over two and a half
years ago, introduced in #45318 and partially mitigated in #45892), but
we haven't seen this in any user reports, only in tpch_concurrency
roachtest runs, so the severity seems pretty minor.

Consider the following query plan:

   Node 1                   Node 2

TableReader              TableReader
    |                         |
HashRouter                HashRouter
    |     \  ___________ /    |
    |      \/__________       |
    |      /           \      |
HashAggregator         HashAggregator

and let's imagine that each hash aggregator has to spill to disk. This
would require acquiring the file descriptors quota. Now, imagine that
because the hash aggregators are spilling, each of the hash routers has
slow outputs, causing them to spill too. As a result, this query plan
can require `A + 2 * R` FDs on a single node to succeed, where `A` is
the quota for a single hash aggregator (16, given the default value of
the `COCKROACH_VEC_MAX_OPEN_FDS` environment variable, which is 256) and
`R` is the quota for a single router output (2). This means we can
estimate that 20 FDs are needed on each node for the query to finish
execution, with the 16 for the hash aggregator being acquired first.
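To make the per-node accounting concrete, here is a minimal, purely illustrative Go sketch of the estimate above; the constant names are hypothetical and are not CockroachDB identifiers.

```go
package main

import "fmt"

func main() {
	// Numbers from the description above; hypothetical names, not
	// CockroachDB identifiers.
	const (
		nodeFDQuota       = 256 // default COCKROACH_VEC_MAX_OPEN_FDS
		aggQuota          = 16  // A: FDs a spilling hash aggregator acquires
		routerOutputQuota = 2   // R: FDs a spilling router output acquires
	)
	// Each node runs one hash aggregator and two hash router outputs that
	// can spill, so the worst case per node is A + 2*R.
	fmt.Println(aggQuota + 2*routerOutputQuota) // prints 20
}
```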

Now imagine that this query is run with a concurrency of 16. We can end
up in a situation where all hash aggregators have spilled, fully
exhausting the global limit on each node (16 queries * 16 FDs = 256, the
entire default quota), so whenever the hash router outputs need to
spill, they block forever since no FDs will ever be released until a
query is canceled or a node is shut down. In other words, we have a
deadlock.

This commit fixes this situation by introducing a retry mechanism that
backs off exponentially when trying to acquire the FD quota, until a
timeout is reached. The randomization provided by the `retry` package
should be sufficient so that some of the queries succeed while others
result in an error.
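As a rough sketch of the shape of this fix (not the actual CockroachDB code, which uses the internal `retry` package), the following standard-library Go snippet shows quota acquisition with jittered exponential backoff and a timeout; `tryAcquireFDs` is a hypothetical stand-in for the real non-blocking acquisition.

```go
package main

import (
	"errors"
	"math/rand"
	"time"
)

// tryAcquireFDs is a hypothetical stand-in for the non-blocking attempt to
// acquire n file descriptors from the node-wide quota.
func tryAcquireFDs(n int) bool { return false }

// acquireWithBackoff retries the fast-path acquisition with jittered
// exponential backoff until it succeeds or the timeout expires, at which
// point the query errors out instead of blocking forever.
func acquireWithBackoff(n int, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	backoff := 100 * time.Millisecond
	const maxBackoff = 2 * time.Second
	for {
		if tryAcquireFDs(n) {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timed out acquiring file descriptor quota")
		}
		// Sleep for the backoff plus some jitter so that concurrent queries
		// don't all retry in lockstep.
		jitter := time.Duration(rand.Int63n(int64(backoff) / 2))
		time.Sleep(backoff + jitter)
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	_ = acquireWithBackoff(16, 5*time.Second)
}
```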

Unfortunately, I don't see a way to prevent this deadlock from
occurring in the first place without a possible increase in latency in
some cases. The difficulty is that we currently acquire FDs only once we
need them, i.e. once a particular component spills to disk. We could
acquire the maximum number of FDs that a query might need up-front,
before query execution starts, but that could lead to starvation of the
queries that ultimately won't spill to disk. That seems like a much
worse impact than receiving timeout errors on some analytical queries
when they are run with high concurrency. We're not an OLAP database, so
this behavior seems ok.

Fixes: #80290.

Release note (bug fix): Previously, CockroachDB could deadlock when
evaluating analytical queries if multiple queries had to spill to disk
at the same time. This is now fixed by making some of the queries error
out instead.

roachtest: remove some debugging printouts in tpch_concurrency

These printouts were added to track down the deadlock fixed in the
previous commit, so we no longer need them.

Release note: None

@cockroach-teamcity (Member) commented

This change is Reviewable

@DrewKimball (Collaborator) left a comment

> Unfortunately, I don't see a way to prevent this deadlock from occurring in the first place without a possible increase in latency in some cases.

Do you think it would be a good idea to expose the current number of acquired FDs and the estimated number a new query might use to admission control?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)


pkg/sql/colflow/vectorized_flow.go line 102 at r1 (raw file):

	// it possible for other queries to proceed.
	opts := retry.Options{
		InitialBackoff:      100 * time.Millisecond,

This feels like a pretty long initial backoff time; what made you choose it?

@yuzefovich (Member, Author) left a comment

Added a regression test with a single query. I think it is possible to come up with a single query, without using the testing knobs, that would never execute successfully (e.g. a query with 16 hash joins, each of which has to spill to disk, like a chain of operators where each one is the right input of the next); previously such a query would just deadlock, and now it'll always result in an error. This is an extreme edge case, so it's probably not worth spending too much time thinking about it.

> Do you think it would be a good idea to expose the current number of acquired FDs and the estimated number a new query might use to admission control?

I think the number of file descriptors would be a bad resource for admission control to pace. Imagine we execute a query with a hash join: with the current setting, it would be estimated to require 16 FDs. Let's assume that 99% of executions of this query don't spill to disk. Currently we can run such a query with a concurrency of 1600 (256 / 16 / 1%) with no extra latency (subject to memory limits, if the tables are small), but if we started estimating the FDs up front and asked admission control to pace these query executions, then we'd probably only be able to sustain on the order of 100 such queries (it would depend on what node-wide limit on the number of FDs we'd use - I'd expect larger than 256, but not more than 10000 slots).
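For reference, a tiny sketch of the back-of-the-envelope arithmetic above (the 1% spill rate is the assumption from this comment, and the names are made up for illustration):

```go
package main

import "fmt"

func main() {
	const (
		nodeFDQuota   = 256  // default COCKROACH_VEC_MAX_OPEN_FDS
		perQueryFDs   = 16   // upfront estimate for a single hash join
		spillFraction = 0.01 // assume only 1% of executions actually spill
	)
	// Today only the executions that actually spill hold FDs, so on average
	// we can sustain roughly quota / (estimate * spill fraction) concurrent
	// executions without hitting the limit.
	fmt.Println(nodeFDQuota / perQueryFDs / spillFraction) // prints 1600
}
```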

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @DrewKimball)


pkg/sql/colflow/vectorized_flow.go line 102 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

> This feels like a pretty long initial backoff time; what made you choose it?

My thinking is that we just tried the fast path right above and couldn't acquire the quota, which means the FD limit has been used up. This code path occurs only when spilling to disk, so the query latency is likely to be on the order of hundreds of milliseconds at least; thus, 100ms looks like a good starting point (it allows other queries that spilled to finish and release the quota). Added some more commentary.

@DrewKimball (Collaborator) left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @DrewKimball)

@yuzefovich (Member, Author) left a comment

Made a fix to a possibly infinite retry loop (I was misunderstanding what MaxBackoff was about). Also added another tiny commit; would appreciate another quick look.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach and @DrewKimball)

@DrewKimball (Collaborator) left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @DrewKimball)

@yuzefovich (Member, Author) commented

TFTR!

bors r+

craig bot commented Jul 14, 2022

Build failed (retrying...):

craig bot commented Jul 15, 2022

Build succeeded:

craig bot merged commit e8818d5 into cockroachdb:master on Jul 15, 2022
blathers-crl bot commented Jul 15, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 10c2cf9 to blathers/backport-release-22.1-84398: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
