Text file busy when attempting to execute script written into sandbox #10507
Comments
Grr, this blocks #10442: https://travis-ci.com/github/pantsbuild/pants/jobs/366894459#L776.
To reproduce, apply this diff:

diff --git a/src/python/pants/engine/process.py b/src/python/pants/engine/process.py
index 2dbb58714..0001946d2 100644
--- a/src/python/pants/engine/process.py
+++ b/src/python/pants/engine/process.py
@@ -336,12 +336,12 @@ class BinaryPaths(EngineAware):
@rule(desc="Find binary path")
async def find_binary(request: BinaryPathRequest) -> BinaryPaths:
# TODO(John Sirois): Replace this script with a statically linked native binary so we don't
- # depend on either /bin/bash being available on the Process host.
- # TODO(#10507): Running the script directly from a shebang sometimes results in a "Text file
- # busy" error.
- script_path = "./script.sh"
+ # depend on either `/usr/bin/env` or bash being available on the Process host.
+ script_path = "./find_binary.sh"
script_content = dedent(
"""
+ #!/usr/bin/env bash
+
set -euo pipefail
if command -v which > /dev/null; then
@@ -363,7 +363,7 @@ async def find_binary(request: BinaryPathRequest) -> BinaryPaths:
Process(
description=f"Searching for `{request.binary_name}` on PATH={search_path}",
input_digest=script_digest,
- argv=["/bin/bash", script_path, request.binary_name],
+ argv=[script_path, request.binary_name],
env={"PATH": search_path},
),
        )

Then run something like the goal that failed in the CI job linked above. I suspect that this only happens on Linux; I haven't been able to reproduce the error locally on macOS. On my Ubuntu VM, with this change I hit a different error (none of the interpreters were discovered in the first place) rather than Error Code 26 (ETXTBSY). In CI, we've hit Error Code 26 a few times.

The current hypothesis is that we're not syncing the materialized files adequately before executing them. We call this:

pants/src/rust/engine/process_execution/src/local.rs, lines 365 to 371 at 0e1c8de

If you follow that call, it ends up here:

pants/src/rust/engine/fs/store/src/lib.rs, lines 738 to 755 at 0e1c8de

The hypothesis is that for some reason that is not enough. We want to experiment with issuing an additional sync right before the spawn here:

pants/src/rust/engine/process_execution/src/local.rs, lines 419 to 421 at 0e1c8de

(I'm not sure what that snippet will be. Stu had an idea.)
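For illustration, here is a minimal Python sketch of the pattern under suspicion (this is not the actual Rust code in local.rs; the sandbox path, script body, and `do_fsync` knob are made up for the example): write an executable script into a sandbox, optionally fsync it, then execute it directly via its shebang.

```python
import os
import subprocess
import tempfile


def write_and_run(script_body: str, do_fsync: bool) -> str:
    """Write an executable script into a fresh sandbox dir, then exec it via its shebang."""
    sandbox = tempfile.mkdtemp(prefix="sandbox-")
    script_path = os.path.join(sandbox, "find_binary.sh")
    fd = os.open(script_path, os.O_WRONLY | os.O_CREAT, 0o755)
    try:
        os.write(fd, script_body.encode())
        if do_fsync:
            # The experiment: flush file data before the write fd is closed, to see
            # whether "Text file busy" (ETXTBSY, errno 26) still shows up afterwards.
            os.fsync(fd)
    finally:
        os.close(fd)
    # Execute the script directly, as the diff above does (no explicit /bin/bash).
    result = subprocess.run([script_path, "python3"], capture_output=True, text=True)
    return result.stdout


print(write_and_run('#!/usr/bin/env bash\ncommand -v "$1"\n', do_fsync=True))
```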
It seems likely that this is Docker-related. This thread is interesting: moby/moby#9547
Will see if the nuclear option (a global `sync`) helps.
That sounds plausible, but I'm not sure Docker is it. We saw this error on the …
### Problem

As described in #10507, we suspect that use of Docker and AUFS results in a need for heavier-handed measures for syncing files to the filesystem before executing them. Various other strategies were tried and ruled out in #10557.

### Solution

If we detect that we are probably running in Docker or an LXC container, spawn a `sync` process before executing.

### Result

Fixes #10507.

[ci skip-build-wheels]
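A hedged sketch of roughly that shape of workaround, in Python rather than the engine's Rust (the container-detection heuristics and function names here are illustrative assumptions, not the actual change):

```python
import os
import subprocess


def probably_in_container() -> bool:
    """Heuristic: Docker drops /.dockerenv, and /proc/1/cgroup often mentions docker or lxc."""
    if os.path.exists("/.dockerenv"):
        return True
    try:
        with open("/proc/1/cgroup") as f:
            contents = f.read()
    except OSError:
        return False
    return "docker" in contents or "lxc" in contents


def run_sandboxed(argv: list, cwd: str) -> subprocess.CompletedProcess:
    if probably_in_container():
        # The heavy-handed measure: force dirty pages out so the freshly
        # materialized files are consistent before we exec them.
        subprocess.run(["sync"], check=False)
    return subprocess.run(argv, cwd=cwd)
```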
Previously, our materialize_directory tail-end fsync could wipe out file metadata if the OS had already flushed the initial create. Thread the full create metadata through so that our tail-end fsync is faithful. Fixes pantsbuild#10507 [ci skip-build-wheels]
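Roughly the shape of that fix, sketched in Python (the real change is in the Rust store code; the names and modes here are illustrative): carry each file's create-time metadata along to the final sync pass instead of re-deriving it from whatever happens to be on disk.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CreatedFile:
    path: str
    mode: int  # the full create metadata, e.g. 0o755 for is_executable files


def materialize(files) -> list:
    """Write (path, content, is_executable) tuples out and record their create metadata."""
    created = []
    for path, content, is_executable in files:
        mode = 0o755 if is_executable else 0o644
        with open(path, "wb") as f:
            f.write(content)
        os.chmod(path, mode)
        created.append(CreatedFile(path, mode))
    return created


def tail_end_sync(created) -> None:
    # Because the create metadata travels with each path, the tail-end fsync pass
    # can re-assert it rather than clobbering it with some default.
    for cf in created:
        fd = os.open(cf.path, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
        os.chmod(cf.path, cf.mode)
```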
Still happens :/ https://travis-ci.com/github/pantsbuild/pants/jobs/369865059#L952
Yeah, see: https://stackoverflow.com/questions/37288453/calling-fsync2-after-close2

That and its pointers say that (open, write, close, open, fsync, close) != (open, fsync, close) on various Linux kernels. We'll need to always call fsync before closing the descriptor we wrote with.
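To make that distinction concrete, a small sketch (paths and modes invented): the fsync has to be issued on the still-open descriptor that performed the write, not on a descriptor obtained by reopening the file after close.

```python
import os


def write_durably(path: str, data: bytes) -> None:
    # Safe ordering: open, write, fsync, close. The fsync applies to the same
    # open file description that did the writing.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o755)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)


def write_then_reopen_and_fsync(path: str, data: bytes) -> None:
    # Suspect ordering: open, write, close, then open again just to fsync.
    # Per the StackOverflow discussion above, on some kernels this is not
    # guaranteed to be equivalent to fsyncing before the original close.
    with open(path, "wb") as f:
        f.write(data)
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```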
It's generically unsafe to fork+exec against binaries written out in a multithreaded program when concurrent forks are possible. Even if all files are opened for writing with O_CLOEXEC (which is the case for Rust) if thread1 opens a file for writing and then thread2 forks, thread2 will hold an open file descriptor. If thread2's subsequent exec is delayed past the fork+exec thread1 does against the file it wrote, then the thread1 fork+exec'd process will encounter ETXTBSY. OSX "solves" this by retrying some number of times when it hits ETXTBSY, but Linux does not attempt this hack. O_CLOFORK has been proposed and adopted by some unices, but not Linux. As such we need to make some tradeoff to allow this use case. This change introduces a lock around process spawns (fork+exec) to prevent interleaved fork / exec whenever there is a spawn that we know exec's a binary we wrote out. Fixes pantsbuild#10507 [ci skip-build-wheels]
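A simplified Python model of the race and of the lock described above (the real lock lives in the Rust engine; the names here are illustrative). The point is that "write an executable out and close it" and "spawn anything" must not interleave: a child forked mid-write inherits the open write fd and holds it until its own exec, so another process exec'ing the freshly written file in that window gets ETXTBSY.

```python
import os
import subprocess
import threading

# One process-wide lock guarding both "write an executable out" and "spawn a child".
_spawn_lock = threading.Lock()


def write_executable(path: str, data: bytes) -> None:
    # Hold the lock across open/write/close so that no concurrent spawn can fork
    # while the write fd is open and leak it into the child.
    with _spawn_lock:
        with open(path, "wb") as f:
            f.write(data)
        os.chmod(path, 0o755)


def spawn(argv) -> subprocess.Popen:
    # Hold the same lock across the fork+exec so that it cannot interleave with an
    # executable write happening on another thread of this process.
    with _spawn_lock:
        return subprocess.Popen(argv)
```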
Wrapping things up here:
#10577 works around (3), the fork+exec race described above, with a fork+exec lock.
It's generically unsafe to fork+exec against binaries written out in a multithreaded program when concurrent forks are possible. Even if all files are opened for writing with O_CLOEXEC (which is the case for Rust) if thread1 opens a file for writing and then thread2 forks, thread2 will hold an open file descriptor. If thread2's subsequent exec is delayed past the fork+exec thread1 does against the file it wrote, then the thread1 fork+exec'd process will encounter ETXTBSY. OSX "solves" this by retrying some number of times when it hits ETXTBSY, but Linux does not attempt this hack. O_CLOFORK has been proposed and adopted by some unices, but not Linux. As such we need to make some tradeoff to allow this use case. This change introduces a lock around process spawns (fork+exec) to prevent interleaved fork / exec whenever there is a spawn that we know exec's a binary we wrote out. Since forks can still happen in our process space outside of our control (in libraries), this change also introduces a bounded ETXTBSY retry loop as a fallback for binaries we wrote out. Fixes #10507
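And a sketch of the bounded retry fallback for forks outside our control (the attempt count and backoff here are invented, not the values used in #10577):

```python
import errno
import subprocess
import time


def spawn_with_etxtbsy_retries(argv, attempts: int = 10) -> subprocess.Popen:
    """Retry a bounded number of times if exec'ing a just-written binary hits ETXTBSY."""
    delay = 0.01
    for attempt in range(attempts):
        try:
            return subprocess.Popen(argv)
        except OSError as e:
            if e.errno != errno.ETXTBSY or attempt == attempts - 1:
                raise
            # Some child forked by code we don't control still holds the write fd;
            # give it a moment to exec or exit, then try again.
            time.sleep(delay)
            delay *= 2
    raise AssertionError("unreachable")
```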
For completeness, some postscript on #10577: Although controlling forks in a central location and serializing fork+exec against any binaries we write out solves the issue for code we control, it does not solve the general problem when we do not control all code (libraries we link against). As such, #10577 added a bounded retry loop to protect against uncontrolled forks. An alternative approach requiring no explicit locks or retry-loop workarounds would be to write out binaries in a separate multithreaded process that does no forking at all. We'd need to wire up a communication channel (pipes) to send digests that need to be written out and be able to await completion of the writing. That form of solution would probably pair well with our still-unused brfs FUSE filesystem. To use that filesystem, we'll want to keep it mounted at least as long as pantsd stays up, if not longer. It's easy to imagine a second Pants daemon with a lifetime >= pantsd's that:
Note that there is a fork+exec here, but only one, and it happens on daemon startup before an event loop is ever started. Note also that we finally leverage brfs, which should significantly improve digest materialization performance: no data copying, just symlinks that point directly into LMDB memory.
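A very rough sketch of that shape of design, using a multiprocessing pipe in place of whatever IPC a real daemon would use (everything here, including the digest/path/content message format, is hypothetical; brfs itself is not modeled):

```python
import multiprocessing as mp


def writer_loop(conn) -> None:
    # Single-purpose writer process: it never forks, so the fds of files it is
    # writing can never leak into an unrelated child between a fork and its exec.
    while True:
        msg = conn.recv()
        if msg is None:
            break
        digest, path, content = msg
        with open(path, "wb") as f:
            f.write(content)
        conn.send(digest)  # ack: this digest is fully written out


if __name__ == "__main__":
    parent_conn, child_conn = mp.Pipe()
    writer = mp.Process(target=writer_loop, args=(child_conn,))
    writer.start()  # the single fork, done once at startup
    # The main (multithreaded, forking) process sends digests to materialize and
    # awaits the ack before exec'ing anything that was written.
    parent_conn.send(("digest-abc123", "/tmp/script.sh", b"#!/bin/sh\necho hi\n"))
    assert parent_conn.recv() == "digest-abc123"
    parent_conn.send(None)
    writer.join()
```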
Would brfs even suffer from this, given that you never "open" the executable file for writing? It just "exists" from the get-go when read by a client?
You're right, it would not, since, critically, we would use it asymmetrically: write directly to LMDB (not through the brfs filesystem) and only read from the brfs filesystem. If we wrote through the brfs filesystem interface, we'd have the same problem.
Using a `Process` to invoke an `is_executable=True` script can result in a "Text file busy" error. We have tests that cover this behavior:

pants/src/python/pants/engine/process_test.py, lines 250 to 277 at 49f5a0a
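For reference, a rough sketch of what such a test can look like against the modern Pants testutil API (RuleRunner); the actual test in process_test.py at that commit may be structured differently, and the script content here is just an example:

```python
from pants.engine.fs import CreateDigest, Digest, FileContent
from pants.engine.process import Process, ProcessResult
from pants.testutil.rule_runner import QueryRule, RuleRunner


def test_executable_script_in_sandbox() -> None:
    rule_runner = RuleRunner(
        rules=[
            QueryRule(Digest, [CreateDigest]),
            QueryRule(ProcessResult, [Process]),
        ]
    )
    # Materialize an executable script into the process sandbox...
    digest = rule_runner.request(
        Digest,
        [
            CreateDigest(
                [FileContent("script.sh", b"#!/usr/bin/env bash\necho hello\n", is_executable=True)]
            )
        ],
    )
    # ...and invoke it directly via its shebang, which is where ETXTBSY showed up.
    result = rule_runner.request(
        ProcessResult,
        [Process(argv=("./script.sh",), input_digest=digest, description="run the script")],
    )
    assert result.stdout == b"hello\n"
```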