
Add best-effort limits on async file opens to reduce file handle counts #20055

Merged
merged 2 commits into from
Oct 24, 2023

Conversation

stuhood
Member

@stuhood stuhood commented Oct 18, 2023

As described in #19765, `2.17.x` uses more file handles than previous versions. Based on the location of the reported error, I suspect that this is due to the move from using the LMDB store for all files to using the filesystem-based store for large files (#18153).

In particular: rather than digesting files inside of `spawn_blocking` while capturing them into the LMDB store (which imposes the tokio runtime's limit on blocking threads), `fn store` moved to digesting them using tokio's async file APIs, which impose no such limit.

This change adds a semaphore around (some) file opens to provide a best-effort limit on the number of files opened for the purposes of being captured. It additionally (in the first commit) fixes an extraneous file handle that was being kept open during capture.

Fixes #19765.
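The pattern described above — a semaphore whose permit is held only while a file handle is actually open — can be sketched in Python with `asyncio`. This is an illustrative analogue of the Rust change (which uses a tokio semaphore in the Store), not the actual implementation; the limit of 256 and the function name are assumptions for the sketch:

```python
import asyncio
import hashlib

# Illustrative cap on concurrently open files; the real limit in the
# Rust change lives in the Store and is not necessarily this number.
_OPEN_FILES = asyncio.Semaphore(256)

async def digest_file(path: str) -> str:
    """Digest a file while holding an open-files permit.

    The permit is held only for the critical section in which the file
    handle is actually open, keeping the limit best-effort but tight
    around the scarce resource (file descriptors).
    """
    async with _OPEN_FILES:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
    return h.hexdigest()
```

Because the permit is released as soon as the handle is closed, unbounded numbers of digest futures can still be created; only the number of simultaneously open files is limited.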

@stuhood stuhood added needs-cherrypick category:bugfix Bug fixes for released features labels Oct 18, 2023
@stuhood stuhood added this to the 2.17.x milestone Oct 18, 2023
@stuhood
Member Author

stuhood commented Oct 18, 2023

Commits are useful to review independently.

@stuhood stuhood enabled auto-merge (squash) October 18, 2023 18:06
@stuhood stuhood disabled auto-merge October 18, 2023 18:06
@stuhood stuhood enabled auto-merge (squash) October 18, 2023 18:06
@tgolsson
Contributor

Question: from the discussion and diagnosis on Slack, I get the impression that the root cause is this `future::join_all` spawning an ungodly number of futures for large snapshots. Wouldn't it be more prudent to limit this to, say, 1024 parallel requests? It would not only achieve the same effect as far as I can tell, but should also be more efficient, since all active futures would be doing work.

let file_digests = future::try_join_all(
  files
    .into_iter()
    .map(|file| file_digester.store_by_digest(file))
    .collect::<Vec<_>>(),
)
.await
.map_err(|e| format!("Failed to digest inputs: {e:?}"))?;
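The bounded variant suggested here can be sketched in Python (the helper name and the 1024 default are illustrative assumptions; a Rust equivalent would be something in the spirit of `StreamExt::buffer_unordered`):

```python
import asyncio

async def bounded_try_join_all(coros, limit=1024):
    # Run all coroutines to completion, but allow at most `limit` of
    # them to be actively executing at any one time.
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    # gather propagates the first exception, analogous to try_join_all.
    return await asyncio.gather(*(run(c) for c in coros))
```

Note that this limits concurrency at the call site rather than at the resource, which is the trade-off discussed below.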

@tgolsson tgolsson disabled auto-merge October 18, 2023 19:19
@tgolsson
Contributor

(Disabled auto-merge as discussed on Slack to avoid racing #20034)

@stuhood
Member Author

stuhood commented Oct 18, 2023

Question: from the discussion and diagnosis on Slack, I get the impression that the root cause is this future::join_all spawning ungodly amounts of futures for large snapshots. Wouldn't it be more prudent to limit this to say 1024 parallel requests?

That join_all is not the only location which opens files, and so making the change there would apply the fix at a distance from the actual problem. The Store (which is primarily responsible for reading/writing files) is the source of the problem.

It will not only achieve the same effect as far as I can tell, but should also be more efficient since all active futures will be doing work.

I don't think that it will be more efficient, since it is less accurate in terms of what is covered by the Semaphore: the critical section is larger than necessary. For example, covering an entire file capture doesn't account for the fact that the LMDB store still only opens a bounded number of files to capture them, and so doesn't need to be limited in this way (see the comments).

@tgolsson
Contributor

Thank you, keeping it as a more local guarantee makes sense.

I see your point about the scope being larger, but I don't see how it necessarily makes a difference. It looks like in either case, shortly after the first use we'd reacquire the semaphore or bottleneck onto the sync threads. There may be meaningful work we can do in between or afterwards, but since we don't have that many hardware threads, that only matters if we reliably hit IO and can swap tasks. Not saying that doesn't happen, but it seems like we're putting a potentially large amount of pressure on that poor semaphore, and for these large snapshots most tasks will just be blocked either way.

Contributor

@huonw huonw left a comment


Ah, this seems like it explains the symptoms very well, thank you!

It looks like there's a few other file-opening/access calls in local.rs; do they need consideration?

if let Ok(mut file) = tokio::fs::File::open(self.get_path(fingerprint)).await {

let named_temp_file = self
.executor
.spawn_blocking(
move || {
Builder::new()
.suffix(".tmp")
.tempfile_in(dest_path2.parent().unwrap())
.map_err(|e| format!("Failed to create temp file: {e}"))
},
|e| Err(format!("temp file creation task failed: {e}")),
)
.await?;


For the discussion of scope, I don't have a strong perspective on this specific case, other than that it would be nice to get 2.17.1 and 2.18.0 out sooner if possible!

However, the point about (try_)join_all seems like a good one: any call to them seems like it risks resource exhaustion, so maybe all calls of those functions should at least be justified and/or replaced by ones that manage concurrency?

(That said, naive limits on concurrency risk deadlocks: e.g. if the first 1000 tasks depend on futures that are completed by the second 1000 tasks, then a join_all that only runs 1000 at a time will wait forever. I don't know which is worse...)
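The deadlock risk described here can be demonstrated with a tiny Python sketch (a two-task version of the 1000/1000 scenario; all names and the limit of 1 are illustrative):

```python
import asyncio

async def main() -> int:
    sem = asyncio.Semaphore(1)   # naive concurrency limit of 1
    evt = asyncio.Event()

    async def first():
        # Holds the only permit while waiting on work that `second`
        # is supposed to do.
        async with sem:
            await evt.wait()

    async def second():
        # Would unblock `first`, but can never acquire a permit.
        async with sem:
            evt.set()

    tasks = [asyncio.create_task(first()), asyncio.create_task(second())]
    done, pending = await asyncio.wait(tasks, timeout=0.1)
    for t in pending:
        t.cancel()
    # Neither task completed within the timeout: a deadlock caused
    # purely by the naive limit.
    return len(done)
```

Running `asyncio.run(main())` returns 0: with the limit in place, neither task can ever make progress.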

@stuhood
Member Author

stuhood commented Oct 23, 2023

It looks like there's a few other file-opening/access calls in local.rs; do they need consideration?

I don't think so, but I'd like to get this out to users so that they can try it: the funnel of requests is such that every request to store a Digest comes in through codepaths that are now covered by the semaphore. If the hypothesis from the description holds, then this should be sufficient.

@stuhood stuhood force-pushed the stuhood/async-fh-limits branch from 39b7f23 to 6bcea90 on October 24, 2023 03:09
@stuhood stuhood enabled auto-merge (squash) October 24, 2023 03:09
@stuhood
Member Author

stuhood commented Oct 24, 2023

However, the point about (try_)join_all seems like a good one: any call to them seems like it risks resource exhaustion, so maybe all calls of those functions should at least be justified and/or replaced by ones that manage concurrency?

The key word here is "resource": which resource is going to be exhausted? Which is the most constrained resource for any particular try_join_all? In some cases it will be files, in others cores, and in others it will be a combination of both. Per-resource limits applied near the resource "producer" avoid scattering those concerns too broadly to anywhere where concurrency might occur.

@stuhood stuhood merged commit 2a93fd4 into pantsbuild:main Oct 24, 2023
WorkerPants pushed a commit that referenced this pull request Oct 24, 2023
…ts (#20055)

As described in #19765, `2.17.x` uses more file handles than previous
versions. Based on the location of the reported error, I suspect that
this is due to the move from using the LMDB store for all files, to
using the filesystem-based store for large files (#18153).

In particular: rather than digesting files inside of `spawn_blocking`
while capturing them into the LMDB store (imposing the [limit of
blocking
threads](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.max_blocking_threads)
from the tokio runtime), `fn store` moved to digesting them using
tokio's async file APIs, which impose no such limit.

This change adds a semaphore to (some) file opens to provide a
best-effort limit on files opened for the purposes of being captured. It
additionally (in the first commit) fixes an extraneous file handle that
was being kept open during capture.

Fixes #19765.
@WorkerPants
Member

I tried to automatically cherry-pick this change back to each relevant milestone, so that it is available in those older releases of Pants.

✔️ 2.17.x

Successfully opened #20078.

✔️ 2.18.x

Successfully opened #20077.


Thanks again for your contributions!

🤖 Beep Boop here's my run link

@huonw
Contributor

huonw commented Oct 24, 2023

Makes sense.

Let me reframe my suggestion slightly: maybe we could audit the current batch of (try_)join_all calls for the resources they may create, as guidance for finding additional resource providers that could do with concurrency management. That said, that's probably a lot of effort, and we might feel comfortable continuing as we are: just solving individual issues as they occur in practice.

huonw pushed a commit that referenced this pull request Oct 24, 2023
…ts (Cherry-pick of #20055) (#20078)

Co-authored-by: Stu Hood <[email protected]>
huonw pushed a commit that referenced this pull request Oct 24, 2023
…ts (Cherry-pick of #20055) (#20077)

Co-authored-by: Stu Hood <[email protected]>
Labels
category:bugfix Bug fixes for released features
Development

Successfully merging this pull request may close these issues.

"Too many open files" errors on 2.17
4 participants