
release-22.1: colflow: prevent deadlocks when many queries spill to disk at same time #84657

Merged
merged 3 commits into cockroachdb:release-22.1 from backport22.1-84398 on Jul 21, 2022

Conversation

yuzefovich
Member

@yuzefovich yuzefovich commented Jul 19, 2022

Backport 2/2 commits from #84398.
Backport 1/1 commits from #84684.

/cc @cockroachdb/release


colflow: prevent deadlocks when many queries spill to disk at same time

This commit fixes a long-standing issue which could cause
memory-intensive queries to deadlock on acquiring the file descriptors
quota when vectorized execution spills to disk. This bug has been
present since the introduction of disk-spilling (over two and a half
years ago, introduced in #45318 and partially mitigated in #45892), but
we haven't seen this in any user reports, only in `tpch_concurrency`
roachtest runs, so the severity seems pretty minor.

Consider the following query plan:

```
   Node 1                   Node 2

TableReader              TableReader
    |                         |
HashRouter                HashRouter
    |     \  ___________ /    |
    |      \/__________       |
    |      /           \      |
HashAggregator         HashAggregator
```

and let's imagine that each hash aggregator has to spill to disk. This
would require acquiring the file descriptors quota. Now, imagine that
because of the hash aggregators' spilling, each of the hash routers has
slow outputs, causing them to spill too. As a result, this query plan can
require `A + 2 * R` FDs on a single node to succeed, where `A` is the
quota for a single hash aggregator (16, given the default value of the
`COCKROACH_VEC_MAX_OPEN_FDS` environment variable, which is 256) and `R`
is the quota for a single router output (2). This means we can estimate
that 20 FDs are needed on each node for the query to finish execution,
with the hash aggregator's 16 FDs being acquired first.
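
To make that arithmetic concrete, here is a tiny Go sketch of the estimate; the constant names are made up for illustration and do not correspond to identifiers in the codebase:

```go
package main

import "fmt"

func main() {
	const (
		aggregatorQuota   = 16 // A: FDs one hash aggregator may hold (with COCKROACH_VEC_MAX_OPEN_FDS=256)
		routerOutputQuota = 2  // R: FDs one hash router output may hold
		routerOutputs     = 2  // router outputs on this node that might spill
	)
	// Estimated FDs needed on each node for the plan above to finish.
	fmt.Println(aggregatorQuota + routerOutputs*routerOutputQuota) // 20
}
```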

Now imagine that this query is run with a concurrency of 16. We can end
up in a situation where all hash aggregators have spilled, fully
exhausting the global node limit on each node, so whenever the hash
router outputs need to spill, they block forever since no FDs will ever
be released until a query is canceled or a node is shut down. In other
words, we have a deadlock.

This commit fixes this situation by introducing a retry mechanism that
backs off exponentially when trying to acquire the FD quota, until it
times out. The randomization provided by the `retry` package should be
sufficient for some of the queries to succeed while others result in
an error.
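
As an illustration only (not the actual implementation in this commit), a retry loop with jittered exponential backoff and a bounded number of attempts could look roughly like this in Go; `tryAcquireFDs` is a hypothetical stand-in for the quota acquisition:

```go
package fdquota

import (
	"errors"
	"math/rand"
	"time"
)

// tryAcquireFDs is a hypothetical stand-in for a non-blocking attempt to
// acquire n file descriptors from the node-wide quota.
func tryAcquireFDs(n int64) bool { return false }

// acquireWithRetry attempts the acquisition up to maxRetries times, sleeping
// with jittered exponential backoff between attempts, and errors out instead
// of blocking forever (which is what previously allowed the deadlock).
func acquireWithRetry(n int64, maxRetries int) error {
	backoff := 100 * time.Millisecond
	for attempt := 0; ; attempt++ {
		if tryAcquireFDs(n) {
			return nil
		}
		if attempt >= maxRetries {
			return errors.New("timed out acquiring file descriptors for disk spilling")
		}
		// Sleep somewhere in [backoff/2, backoff) so that concurrent queries
		// wake up at different times.
		time.Sleep(backoff/2 + time.Duration(rand.Int63n(int64(backoff/2))))
		backoff *= 2
		if backoff > 4*time.Second {
			backoff = 4 * time.Second
		}
	}
}
```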

Unfortunately, I don't see a way to prevent this deadlock from occurring
in the first place without a possible increase in latency in some cases.
The difficult thing is that we currently acquire FDs only once we need
them, meaning once a particular component spills to disk. We could
acquire the maximum number of FDs that a query might need up-front,
before the query execution starts, but that could lead to starvation of
the queries that ultimately won't spill to disk. This seems like a much
worse impact than receiving timeout errors on some analytical queries
when run with high concurrency. We're not an OLAP database, so this
behavior seems ok.

Fixes: #80290.

Release note (bug fix): Previously, CockroachDB could deadlock when
evaluating analytical queries if multiple queries had to spill to disk
at the same time. This is now fixed by making some of the queries error
out instead. If a user knows that there is no deadlock and that some
analytical queries that have spilled are simply taking too long, blocking
other queries from spilling, and is ok with waiting longer, the user
can adjust the newly introduced `sql.distsql.acquire_vec_fds.max_retries`
cluster setting (using 0 to get the previous behavior of waiting
indefinitely until spilling resources open up).

roachtest: remove some debugging printouts in tpch_concurrency

These printouts were added to track down the deadlock fixed in the
previous commit, so we no longer need them.

Release note: None

colflow: introduce a cluster setting for max retries of FD acquisition

We recently introduced a mechanism for retrying the acquisition of file
descriptors needed by disk-spilling queries in order to get out of
a deadlock. We hard-coded the number of retries at 8, and this commit
makes that number configurable via a cluster setting (the idea is that
some users might be ok retrying for longer, so they will have an option
to do that). This cluster setting also serves as the escape hatch to the
previous behavior (indefinite wait on `Acquire`) when the setting is set
to zero.
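
A rough sketch of the escape-hatch semantics, using a hypothetical helper rather than the actual cluster-setting code:

```go
package fdquota

// maxRetriesFromSetting is a hypothetical helper that interprets the value of
// the sql.distsql.acquire_vec_fds.max_retries setting: 0 restores the previous
// behavior of waiting on Acquire indefinitely, while any other value bounds
// the number of retries.
func maxRetriesFromSetting(settingValue int64) (maxRetries int64, retryForever bool) {
	if settingValue == 0 {
		return 0, true
	}
	return settingValue, false
}
```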

Release note: None

Release justification: bug fix.

@blathers-crl

blathers-crl bot commented Jul 19, 2022

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@DrewKimball DrewKimball left a comment


I should have asked this on the original PR, but should there be a cluster/session setting to toggle this somehow, just in case it causes unexpected behavior on some production workloads?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach)

@yuzefovich
Member Author

Hm, we could introduce a cluster setting that determines the number of retries performed (currently hard-coded at 8) with 0 meaning an infinite number (the previous behavior). This would actually allow us to remove some of the testing knobs introduced here. I can't really imagine someone ever tuning that setting, but I see no harm either, so I'll open up a corresponding PR on master.

Collaborator

@DrewKimball DrewKimball left a comment


:lgtm:

I'll open up a corresponding PR on master.

Up to you - I just wanted to make sure we were considering the possibility.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach)

@yuzefovich
Member Author

I think it's worth having the setting as an escape mechanism, especially since we're backporting the fix.

@yuzefovich yuzefovich force-pushed the backport22.1-84398 branch from 3b465b3 to 9d9bf55 on July 20, 2022 18:56
@yuzefovich
Member Author

I included #84684 in this backport and adjusted the release note to mention the cluster setting, but decided not to squash the commits to better preserve the history. PTAL.

@yuzefovich yuzefovich merged commit ac6ee5e into cockroachdb:release-22.1 Jul 21, 2022
@yuzefovich yuzefovich deleted the backport22.1-84398 branch July 21, 2022 16:45