-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make process-wrapper wait for all children before exiting #10245
Comments
BTW I've done some research on how to fix this on Linux and macOS and have a pending change in review based on that. Read more here: https://jmmv.dev/2019/11/bazel-process-wrapper.html ;) |
This fix was rolled back because it broke some downstream projects in 0db5301 (which didn't mention this issue). The most likely problem is that, on Linux, with the use of the child subreaper feature, the process wrapper now takes charge of processes that escaped the process group—and doesn't terminate them, thus getting stuck while waiting for termination. |
Forgot to reopen this after my change was rolled back (months ago) a second time in c95d27a. I'm now trying to make progress here again, but this time will add an extra flag to make the new behavior conditional. Which is tricky if I want this to not add a ton of technical debt... |
This will make it easier to carry the process-wrapper configuration around. In particular, I want to add the ability to pass arbitrary extra flags to the tool, and carrying them around explicitly and separately from the process-wrapper path adds a lot of noise everywhere. This also cleans up the way we gain access to the process-wrapper binary and the checks to see if it is supported by homogenizing all logic into a single place and removing OS-specific checks from a variety of places. Part of #10245. RELNOTES: None. PiperOrigin-RevId: 310908570
The kill delay setting comes from the --local_termination_grace_seconds option (it's not specific to individual process-wrapper invocations) and we want every process-wrapper invocation to respect it. In the past, we were propagating this value through various layers so that we could reach the places where command lines were built... but we can now take advantage of the ProcessWrapper object to pass this information around. Part of #10245. RELNOTES: None. PiperOrigin-RevId: 311112402
This adds a new --process_wrapper_extra_flags repeated flag that takes extra flags to pass to the proces-wrapper. These flags are appended to the invocation built by Blaze, so we can use this to override builtin computed values. We'll use this to controlledly roll out the changes from 7828118 (and, later, 9c1853a), both of which required a Bazel release rollback due to unforeseen problems. Part of #10245. RELNOTES: None. PiperOrigin-RevId: 311162177
b07b799 partially fixes this, but there is still a lingering bug and the fix is now behind a flag. So reopening. |
Fix the new "wait for subprocesses" feature in the process-wrapper, which is still behind the -W flag, to discard any zombies collected by the use of the child subreaper feature if they happen to appear before the child we care about finishes. This prevents some tools from getting stuck. Partially addresses #10245 as the feature is still disabled by default. RELNOTES: None. PiperOrigin-RevId: 312536864
Passing -W (or --wait_fix) to the process wrapper enables a fix that has caused quite a bit of trouble in the past. Having the flag is good for a controlled rollout, but we should also have a mechanism to undo the effects of the rollout (so that when we flip the default in a Bazel release users still have a way out). To accomplish this, make --wait_fix take an optional argument that, if is "no", restores the old (broken) behavior. In preparation to address #10245. RELNOTES: None. PiperOrigin-RevId: 313577008
59183b6 added a very generic --process_wrapper_extra_flags flag to make this possible, allowing us to pass a flag to the process-wrapper to enable the bug fix without having to add custom code in Bazel. While a nice idea, this had the nasty side-effect of making the process-wrapper's command line interface public and part of our API. Undo that and instead add a --experimental_process_wrapper_wait_fix flag specifically for this rollout, marked experimental so that we can remove it once we are done. Note that this still benefits from the internal changes done to support "extra flags", as we reuse that machinery to pass this around. Part of #10245. RELNOTES[INC]: This removes the short-lived --process_wrapper_extra_flags flag, which was introduced primarily to roll out a bug fix. Unfortunately, this made us inadvertently expose all of the process-wrapper's command line interface to the public, which should not have happened. Given the corner case of the utility of this flag, the lack of documentation for it, and the fact that it only appeared in a single release, we are treating this as a bug instead of a backwards compatibility breakage. PiperOrigin-RevId: 314939927
This turns on the fix to properly wait for process groups in the process-wrapper, which has been problematic in the past. Part of #10245. Tested: Verified downstream Bazel projects: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/1530 Also checked that previous problems discovered internally at Google are resolved. RELNOTES: None. PiperOrigin-RevId: 317307293
This removes the --experimental_process_wrapper_wait_fix flag from Bazel and the --wait_fix from the process-wrapper, both of which were used to roll out the process tree wait fix. These settings are now the default and haven't shown issues, so it's safe to remove the flag. Note that we cannot cleanly remove all code paths in the process-wrapper because we still need to support old Linux versions that do not have the child subreaper feature. Fixes bazelbuild#10245. RELNOTES[INC]: The --experimental_process_wrapper_wait_fix flag (used purely to roll out a risky bug fix) has been removed. PiperOrigin-RevId: 324604007
This turns on the fix to properly wait for process groups in the process-wrapper, which has been problematic in the past. Part of bazelbuild/bazel#10245. Tested: Verified downstream Bazel projects: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/1530 Also checked that previous problems discovered internally at Google are resolved. RELNOTES: None. PiperOrigin-RevId: 317307293
This adds a new --process_wrapper_extra_flags repeated flag that takes extra flags to pass to the proces-wrapper. These flags are appended to the invocation built by Blaze, so we can use this to override builtin computed values. We'll use this to controlledly roll out the changes from bazelbuild/bazel@7828118 (and, later, bazelbuild/bazel@9c1853a), both of which required a Bazel release rollback due to unforeseen problems. Part of bazelbuild/bazel#10245. RELNOTES: None. PiperOrigin-RevId: 311162177
The kill delay setting comes from the --local_termination_grace_seconds option (it's not specific to individual process-wrapper invocations) and we want every process-wrapper invocation to respect it. In the past, we were propagating this value through various layers so that we could reach the places where command lines were built... but we can now take advantage of the ProcessWrapper object to pass this information around. Part of bazelbuild/bazel#10245. RELNOTES: None. PiperOrigin-RevId: 311112402
The
process-wrapper
, which we use to run all actions from Bazel, runs its direct child in a new process group. This allows theprocess-wrapper
to kill the whole process group while exiting to ensure there are no children processes left behind. (Note: this is not perfect because processes can "escape" the group, but it works reasonably well and making it "better" is out of scope for this bug--in part because the sandboxed strategy makes this a non-issue.)Unfortunately, the
process-wrapper
only waits to collect the status of its direct child before exiting. Therefore, there is a possibility of a race condition: by the time Bazel regains control afterprocess-wrapper
returns, there could be some subchild processes left behind trying to process the signal. Given that signal delivery is not synchronous, those processes still have a chance of having side-effects.With dynamic scheduling turned on and
--experimental_local_lockfree_output
, we risk those "still-dying" subprocesses to interfere with the outputs we download from remote execution. I think this is a remaining cause of build "flakiness" when enabling this feature combination. But to be honest, I'm not fully sure because theprocess-wrapper
sendsSIGKILL
to the group and I don't know if there is a chance for the "still-dying" subprocesses to race us in this case. Plus the logs I see do not confirm this theory.That said, the race exists in theory, so we should fix it by making the
process-wrapper
ensure that the whole process group is gone before returning.The text was updated successfully, but these errors were encountered: