
Add best-effort limits on async file opens to reduce file handle counts #20055

Merged
merged 2 commits into from
Oct 24, 2023

Conversation

stuhood
Member

@stuhood stuhood commented Oct 18, 2023

As described in #19765, `2.17.x` uses more file handles than previous versions. Based on the location of the reported error, I suspect that this is due to the move from using the LMDB store for all files to using the filesystem-based store for large files (#18153).

In particular: rather than digesting files inside of `spawn_blocking` while capturing them into the LMDB store (which imposes the tokio runtime's limit on blocking threads), `fn store` moved to digesting them using tokio's async file APIs, which impose no such limit.

This change adds a semaphore around (some) file opens to provide a best-effort limit on the number of files opened for the purposes of being captured. It additionally (in the first commit) fixes an extraneous file handle that was being kept open during capture.

Fixes #19765.
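The pattern described above — a semaphore whose permit is held only while a file handle is actually open — can be sketched in Python with `asyncio`. This is an illustrative analogue of the Rust change (which uses a tokio semaphore in the Store), not the actual implementation; the limit of 256 and the function name are assumptions for the sketch:

```python
import asyncio
import hashlib

# Illustrative cap on concurrently open files; the real limit in the
# Rust change lives in the Store and is not necessarily this number.
_OPEN_FILES = asyncio.Semaphore(256)

async def digest_file(path: str) -> str:
    """Digest a file while holding an open-files permit.

    The permit is held only for the critical section in which the file
    handle is actually open, keeping the limit best-effort but tight
    around the scarce resource (file descriptors).
    """
    async with _OPEN_FILES:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
    return h.hexdigest()
```

Because the permit is released as soon as the handle is closed, unbounded numbers of digest futures can still be created; only the number of simultaneously open files is limited.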

@stuhood stuhood added needs-cherrypick category:bugfix Bug fixes for released features labels Oct 18, 2023
@stuhood stuhood added this to the 2.17.x milestone Oct 18, 2023
@stuhood
Member Author

stuhood commented Oct 18, 2023

Commits are useful to review independently.

@stuhood stuhood enabled auto-merge (squash) October 18, 2023 18:06
@stuhood stuhood disabled auto-merge October 18, 2023 18:06
@stuhood stuhood enabled auto-merge (squash) October 18, 2023 18:06
@tgolsson
Contributor

Question: from the discussion and diagnosis on Slack, I get the impression that the root cause is this `future::join_all` spawning an ungodly number of futures for large snapshots. Wouldn't it be more prudent to limit this to, say, 1024 parallel requests? It would not only achieve the same effect as far as I can tell, but should also be more efficient, since all active futures would be doing work.

let file_digests = future::try_join_all(
  files
    .into_iter()
    .map(|file| file_digester.store_by_digest(file))
    .collect::<Vec<_>>(),
)
.await
.map_err(|e| format!("Failed to digest inputs: {e:?}"))?;
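The bounded variant suggested here can be sketched in Python (the helper name and the 1024 default are illustrative assumptions; a Rust equivalent would be something in the spirit of `StreamExt::buffer_unordered`):

```python
import asyncio

async def bounded_try_join_all(coros, limit=1024):
    # Run all coroutines to completion, but allow at most `limit` of
    # them to be actively executing at any one time.
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    # gather propagates the first exception, analogous to try_join_all.
    return await asyncio.gather(*(run(c) for c in coros))
```

Note that this limits concurrency at the call site rather than at the resource, which is the trade-off discussed below.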

@tgolsson tgolsson disabled auto-merge October 18, 2023 19:19
@tgolsson
Contributor

(Disabled auto-merge as discussed on Slack to avoid racing #20034)

@stuhood
Member Author

stuhood commented Oct 18, 2023

Question: from the discussion and diagnosis on Slack, I get the impression that the root cause is this future::join_all spawning ungodly amounts of futures for large snapshots. Wouldn't it be more prudent to limit this to say 1024 parallel requests?

That join_all is not the only location which opens files, and so making the change there would apply the fix at a distance from the actual problem. The Store (which is primarily responsible for reading/writing files) is the source of the problem.

It will not only achieve the same effect as far as I can tell, but should also be more efficient since all active futures will be doing work.

I don't think that it will be more efficient, since it is less accurate in terms of what is covered by the Semaphore: the critical section is larger than necessary. For example, covering an entire file capture doesn't account for the fact that the LMDB store still only opens a bounded number of files to capture them, and so doesn't need to be limited in this way (see the comments).

@tgolsson
Contributor

Thank you, keeping it as a more local guarantee makes sense.

I see your point about the scope being larger, but I don't see how it necessarily makes a difference. It looks like in either case, shortly after the first use we'd reacquire the semaphore or bottleneck onto the sync threads. There may be meaningful work we can do in between or afterwards, but since we don't have that many hardware threads, that only matters if we reliably hit IO and can swap tasks. Not saying that doesn't happen, but it seems like we're putting a potentially large amount of pressure on that poor semaphore, and for these large snapshots most tasks will just be blocked either way.

Contributor

@huonw huonw left a comment


Ah, this seems like it explains the symptoms very well, thank you!

It looks like there's a few other file-opening/access calls in local.rs; do they need consideration?

if let Ok(mut file) = tokio::fs::File::open(self.get_path(fingerprint)).await {

let named_temp_file = self
.executor
.spawn_blocking(
move || {
Builder::new()
.suffix(".tmp")
.tempfile_in(dest_path2.parent().unwrap())
.map_err(|e| format!("Failed to create temp file: {e}"))
},
|e| Err(format!("temp file creation task failed: {e}")),
)
.await?;


For the discussion of scope, I don't have a strong perspective on this specific case, other than that it would be nice to get 2.17.1 and 2.18.0 out sooner if possible!

However, the point about (try_)join_all seems like a good one: any call to them seems like it risks resource exhaustion, so maybe all calls of those functions should at least be justified and/or replaced by ones that manage concurrency?

(That said, naive limits on concurrency risk deadlocks: e.g. if the first 1000 tasks depend on futures that are completed by the second 1000 tasks, then a join_all that only runs 1000 at a time will wait forever. I don't know which is worse...)
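The deadlock risk described here can be demonstrated with a tiny Python sketch (a two-task version of the 1000/1000 scenario; all names and the limit of 1 are illustrative):

```python
import asyncio

async def main() -> int:
    sem = asyncio.Semaphore(1)   # naive concurrency limit of 1
    evt = asyncio.Event()

    async def first():
        # Holds the only permit while waiting on work that `second`
        # is supposed to do.
        async with sem:
            await evt.wait()

    async def second():
        # Would unblock `first`, but can never acquire a permit.
        async with sem:
            evt.set()

    tasks = [asyncio.create_task(first()), asyncio.create_task(second())]
    done, pending = await asyncio.wait(tasks, timeout=0.1)
    for t in pending:
        t.cancel()
    # Neither task completed within the timeout: a deadlock caused
    # purely by the naive limit.
    return len(done)
```

Running `asyncio.run(main())` returns 0: with the limit in place, neither task can ever make progress.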

@stuhood
Member Author

stuhood commented Oct 23, 2023

It looks like there's a few other file-opening/access calls in local.rs; do they need consideration?

I don't think so, but I'd like to get this out to users so that they can try it: the funnel of requests is such that every request to store a Digest comes in through codepaths that are now covered by the semaphore. If the hypothesis from the description holds, then this should be sufficient.

@stuhood stuhood force-pushed the stuhood/async-fh-limits branch from 39b7f23 to 6bcea90 on October 24, 2023 03:09
@stuhood stuhood enabled auto-merge (squash) October 24, 2023 03:09
@stuhood
Member Author

stuhood commented Oct 24, 2023

However, the point about (try_)join_all seems like a good one: any call to them seems like it risks resource exhaustion, so maybe all calls of those functions should at least be justified and/or replaced by ones that manage concurrency?

The key word here is "resource": which resource is going to be exhausted? Which is the most constrained resource for any particular try_join_all? In some cases it will be files, in others cores, and in others it will be a combination of both. Per-resource limits applied near the resource "producer" avoid scattering those concerns too broadly to anywhere where concurrency might occur.

@stuhood stuhood merged commit 2a93fd4 into pantsbuild:main Oct 24, 2023
WorkerPants pushed a commit that referenced this pull request Oct 24, 2023
…ts (#20055)

As described in #19765, `2.17.x` uses more file handles than previous
versions. Based on the location of the reported error, I suspect that
this is due to the move from using the LMDB store for all files, to
using the filesystem-based store for large files (#18153).

In particular: rather than digesting files inside of `spawn_blocking`
while capturing them into the LMDB store (imposing the [limit of
blocking
threads](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.max_blocking_threads)
from the tokio runtime), `fn store` moved to digesting them using
tokio's async file APIs, which impose no such limit.

This change adds a semaphore to (some) file opens to provide a
best-effort limit on files opened for the purposes of being captured. It
additionally (in the first commit) fixes an extraneous file handle that
was being kept open during capture.

Fixes #19765.
@WorkerPants
Member

I tried to automatically cherry-pick this change back to each relevant milestone, so that it is available in those older releases of Pants.

✔️ 2.17.x

Successfully opened #20078.

✔️ 2.18.x

Successfully opened #20077.


Thanks again for your contributions!

🤖 Beep Boop here's my run link

@huonw
Contributor

huonw commented Oct 24, 2023

Makes sense.

Let me reframe my suggestion slightly: maybe we could audit the current batch of (try_)join_all calls for the resources they may create, as guidance for finding additional resource providers that could do with concurrency management. That said, that's probably a lot of effort, and we might feel comfortable continuing as we are: just solving individual issues as they occur in practice.

huonw pushed a commit that referenced this pull request Oct 24, 2023
…ts (Cherry-pick of #20055) (#20078)

Co-authored-by: Stu Hood <[email protected]>
huonw pushed a commit that referenced this pull request Oct 24, 2023
…ts (Cherry-pick of #20055) (#20077)

Co-authored-by: Stu Hood <[email protected]>
Labels
category:bugfix Bug fixes for released features
Development

Successfully merging this pull request may close these issues.

"Too many open files" errors on 2.17
4 participants