Add best-effort limits on async file opens to reduce file handle counts #20055
Conversation
Commits are useful to review independently.

Question: from the discussion and diagnosis on Slack, I get the impression that the root cause is this `future::join_all` spawning an ungodly number of futures for large snapshots. Wouldn't it be more prudent to limit this to, say, 1024 parallel requests? As far as I can tell it would not only achieve the same effect, but should also be more efficient, since all active futures would be doing work.

pants/src/rust/engine/fs/store/src/snapshot.rs, lines 97 to 104 in 501d537
(Disabled auto-merge as discussed on Slack to avoid racing #20034)
I don't think that it will be more efficient, since it is less accurate in terms of what is covered by the […]
Thank you, keeping it as a more local guarantee makes sense. I see your point about the scope being larger, but I don't see how it necessarily makes a difference. It looks like in either case, below the first use we'd very quickly reacquire the semaphore again or bottleneck down onto the sync threads. There's maybe meaningful work we can do in between or after that point, but since we don't have that many hardware threads, that only matters if we hit IO reliably and can swap tasks. Not saying that doesn't happen, but it seems like we're putting a potentially large amount of pressure on that poor semaphore, and most tasks will just be blocked either way (for these large snapshots).
Ah, this seems like it explains the symptoms very well, thank you!
It looks like there are a few other file-opening/access calls in local.rs; do they need consideration?

pants/src/rust/engine/fs/store/src/local.rs, line 458 in 39b7f23:

    if let Ok(mut file) = tokio::fs::File::open(self.get_path(fingerprint)).await {
pants/src/rust/engine/fs/store/src/local.rs, lines 291 to 302 in 39b7f23:

    let named_temp_file = self
      .executor
      .spawn_blocking(
        move || {
          Builder::new()
            .suffix(".tmp")
            .tempfile_in(dest_path2.parent().unwrap())
            .map_err(|e| format!("Failed to create temp file: {e}"))
        },
        |e| Err(format!("temp file creation task failed: {e}")),
      )
      .await?;
For the discussion of scope, I don't have a strong perspective on this specific case, other than that it would be nice to get 2.17.1 and 2.18.0 out sooner if possible!

However, the point about `(try_)join_all` seems like a good one: any call to them risks resource exhaustion, so maybe all calls of those functions should at least be justified and/or replaced by ones that manage concurrency?

(That said, naive limits on concurrency risk deadlocks: e.g. if the first 1000 tasks depend on futures that are completed by the second 1000 tasks, and a join_all only runs 1000 at a time, that'll wait forever. I don't know which is worse...)
I don't think so, but I'd like to get this out to users so that they can try it: the funnel of requests is such that every request to store a […]
(Branch updated from 39b7f23 to 6bcea90.)
The key word here is "resource": which resource is going to be exhausted? Which is the most constrained resource for any particular […]
…ts (#20055)

As described in #19765, `2.17.x` uses more file handles than previous versions. Based on the location of the reported error, I suspect that this is due to the move from using the LMDB store for all files to using the filesystem-based store for large files (#18153).

In particular: rather than digesting files inside of `spawn_blocking` while capturing them into the LMDB store (imposing the [limit of blocking threads](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.max_blocking_threads) from the tokio runtime), `fn store` moved to digesting them using tokio's async file APIs, which impose no such limit.

This change adds a semaphore to (some) file opens to provide a best-effort limit on files opened for the purposes of being captured. It additionally (in the first commit) fixes an extraneous file handle that was being kept open during capture.

Fixes #19765.
Makes sense. Let me reframe my suggestion slightly: maybe we could check the current batch of […]
…ts (Cherry-pick of #20055) (#20078). Co-authored-by: Stu Hood <[email protected]>

…ts (Cherry-pick of #20055) (#20077). Co-authored-by: Stu Hood <[email protected]>