Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make process-wrapper wait for all children before exiting #10245

Closed
jmmv opened this issue Nov 15, 2019 · 4 comments
Closed

Make process-wrapper wait for all children before exiting #10245

jmmv opened this issue Nov 15, 2019 · 4 comments
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Local-Exec Issues and PRs for the Execution (Local) team type: bug

Comments

@jmmv
Copy link
Contributor

jmmv commented Nov 15, 2019

The process-wrapper, which we use to run all actions from Bazel, runs its direct child in a new process group. This allows the process-wrapper to kill the whole process group while exiting to ensure there are no children processes left behind. (Note: this is not perfect because processes can "escape" the group, but it works reasonably well and making it "better" is out of scope for this bug--in part because the sandboxed strategy makes this a non-issue.)

Unfortunately, the process-wrapper only waits to collect the status of its direct child before exiting. Therefore, there is a possibility of a race condition: by the time Bazel regains control after process-wrapper returns, there could be some subchild processes left behind trying to process the signal. Given that signal delivery is not synchronous, those processes still have a chance of having side-effects.

With dynamic scheduling turned on and --experimental_local_lockfree_output, we risk those "still-dying" subprocesses to interfere with the outputs we download from remote execution. I think this is a remaining cause of build "flakiness" when enabling this feature combination. But to be honest, I'm not fully sure because the process-wrapper sends SIGKILL to the group and I don't know if there is a chance for the "still-dying" subprocesses to race us in this case. Plus the logs I see do not confirm this theory.

That said, the race exists in theory, so we should fix it by making the process-wrapper ensure that the whole process group is gone before returning.

@jmmv jmmv added type: bug P2 We'll consider working on this in future. (Assignee optional) team-Local-Exec Issues and PRs for the Execution (Local) team labels Nov 15, 2019
@jmmv jmmv added this to the Dynamic execution milestone Nov 15, 2019
@jmmv jmmv self-assigned this Nov 15, 2019
@jmmv
Copy link
Contributor Author

jmmv commented Nov 15, 2019

BTW I've done some research on how to fix this on Linux and macOS and have a pending change in review based on that. Read more here:

https://jmmv.dev/2019/11/bazel-process-wrapper.html
https://jmmv.dev/2019/11/wait-for-process-group.html
https://jmmv.dev/2019/11/wait-for-process-group-linux.html
https://jmmv.dev/2019/11/wait-for-process-group-darwin.html

;)

@jmmv
Copy link
Contributor Author

jmmv commented Dec 8, 2019

This fix was rolled back because it broke some downstream projects in 0db5301 (which didn't mention this issue).

The most likely problem is that, on Linux, with the use of the child subreaper feature, the process wrapper now takes charge of processes that escaped the process group—and doesn't terminate them, thus getting stuck while waiting for termination.

@jmmv jmmv reopened this Dec 8, 2019
@jmmv
Copy link
Contributor Author

jmmv commented May 6, 2020

Forgot to reopen this after my change was rolled back (months ago) a second time in c95d27a.

I'm now trying to make progress here again, but this time will add an extra flag to make the new behavior conditional. Which is tricky if I want this to not add a ton of technical debt...

@jmmv jmmv reopened this May 6, 2020
bazel-io pushed a commit that referenced this issue May 11, 2020
This will make it easier to carry the process-wrapper configuration around.
In particular, I want to add the ability to pass arbitrary extra flags to
the tool, and carrying them around explicitly and separately from the
process-wrapper path adds a lot of noise everywhere.

This also cleans up the way we gain access to the process-wrapper binary
and the checks to see if it is supported by homogenizing all logic into
a single place and removing OS-specific checks from a variety of places.

Part of #10245.

RELNOTES: None.
PiperOrigin-RevId: 310908570
bazel-io pushed a commit that referenced this issue May 12, 2020
The kill delay setting comes from the --local_termination_grace_seconds
option (it's not specific to individual process-wrapper invocations) and
we want every process-wrapper invocation to respect it.

In the past, we were propagating this value through various layers so
that we could reach the places where command lines were built... but we
can now take advantage of the ProcessWrapper object to pass this
information around.

Part of #10245.

RELNOTES: None.
PiperOrigin-RevId: 311112402
bazel-io pushed a commit that referenced this issue May 12, 2020
This adds a new --process_wrapper_extra_flags repeated flag that takes
extra flags to pass to the proces-wrapper.  These flags are appended to
the invocation built by Blaze, so we can use this to override builtin
computed values.

We'll use this to controlledly roll out the changes from 7828118
(and, later, 9c1853a), both of which required a Bazel release rollback
due to unforeseen problems.

Part of #10245.

RELNOTES: None.
PiperOrigin-RevId: 311162177
@jmmv jmmv reopened this May 18, 2020
@jmmv
Copy link
Contributor Author

jmmv commented May 18, 2020

b07b799 partially fixes this, but there is still a lingering bug and the fix is now behind a flag. So reopening.

bazel-io pushed a commit that referenced this issue May 20, 2020
Fix the new "wait for subprocesses" feature in the process-wrapper, which
is still behind the -W flag, to discard any zombies collected by the use
of the child subreaper feature if they happen to appear before the child
we care about finishes.  This prevents some tools from getting stuck.

Partially addresses #10245 as the
feature is still disabled by default.

RELNOTES: None.
PiperOrigin-RevId: 312536864
bazel-io pushed a commit that referenced this issue May 28, 2020
Passing -W (or --wait_fix) to the process wrapper enables a fix that has
caused quite a bit of trouble in the past.  Having the flag is good for
a controlled rollout, but we should also have a mechanism to undo the
effects of the rollout (so that when we flip the default in a Bazel
release users still have a way out).

To accomplish this, make --wait_fix take an optional argument that, if
is "no", restores the old (broken) behavior.

In preparation to address #10245.

RELNOTES: None.
PiperOrigin-RevId: 313577008
bazel-io pushed a commit that referenced this issue Jun 5, 2020
59183b6 added a very generic --process_wrapper_extra_flags flag to
make this possible, allowing us to pass a flag to the process-wrapper to
enable the bug fix without having to add custom code in Bazel.  While a
nice idea, this had the nasty side-effect of making the process-wrapper's
command line interface public and part of our API.

Undo that and instead add a --experimental_process_wrapper_wait_fix flag
specifically for this rollout, marked experimental so that we can remove
it once we are done.  Note that this still benefits from the internal
changes done to support "extra flags", as we reuse that machinery to pass
this around.

Part of #10245.

RELNOTES[INC]: This removes the short-lived --process_wrapper_extra_flags
flag, which was introduced primarily to roll out a bug fix.  Unfortunately,
this made us inadvertently expose all of the process-wrapper's command line
interface to the public, which should not have happened.  Given the corner
case of the utility of this flag, the lack of documentation for it, and the
fact that it only appeared in a single release, we are treating this as a
bug instead of a backwards compatibility breakage.

PiperOrigin-RevId: 314939927
bazel-io pushed a commit that referenced this issue Jun 19, 2020
This turns on the fix to properly wait for process groups in the
process-wrapper, which has been problematic in the past.

Part of #10245.

Tested:
  Verified downstream Bazel projects:
  https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/1530

  Also checked that previous problems discovered internally at Google
  are resolved.

RELNOTES: None.
PiperOrigin-RevId: 317307293
ehkloaj pushed a commit to ehkloaj/bazel that referenced this issue Aug 6, 2020
This removes the --experimental_process_wrapper_wait_fix flag from Bazel
and the --wait_fix from the process-wrapper, both of which were used to
roll out the process tree wait fix.  These settings are now the default
and haven't shown issues, so it's safe to remove the flag.

Note that we cannot cleanly remove all code paths in the process-wrapper
because we still need to support old Linux versions that do not have the
child subreaper feature.

Fixes bazelbuild#10245.

RELNOTES[INC]: The --experimental_process_wrapper_wait_fix flag (used
purely to roll out a risky bug fix) has been removed.

PiperOrigin-RevId: 324604007
luca-digrazia pushed a commit to luca-digrazia/DatasetCommitsDiffSearch that referenced this issue Sep 4, 2022
    This turns on the fix to properly wait for process groups in the
    process-wrapper, which has been problematic in the past.

    Part of bazelbuild/bazel#10245.

    Tested:
      Verified downstream Bazel projects:
      https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/1530

      Also checked that previous problems discovered internally at Google
      are resolved.

    RELNOTES: None.
    PiperOrigin-RevId: 317307293
luca-digrazia pushed a commit to luca-digrazia/DatasetCommitsDiffSearch that referenced this issue Sep 4, 2022
    This adds a new --process_wrapper_extra_flags repeated flag that takes
    extra flags to pass to the proces-wrapper.  These flags are appended to
    the invocation built by Blaze, so we can use this to override builtin
    computed values.

    We'll use this to controlledly roll out the changes from bazelbuild/bazel@7828118
    (and, later, bazelbuild/bazel@9c1853a), both of which required a Bazel release rollback
    due to unforeseen problems.

    Part of bazelbuild/bazel#10245.

    RELNOTES: None.
    PiperOrigin-RevId: 311162177
luca-digrazia pushed a commit to luca-digrazia/DatasetCommitsDiffSearch that referenced this issue Sep 4, 2022
    The kill delay setting comes from the --local_termination_grace_seconds
    option (it's not specific to individual process-wrapper invocations) and
    we want every process-wrapper invocation to respect it.

    In the past, we were propagating this value through various layers so
    that we could reach the places where command lines were built... but we
    can now take advantage of the ProcessWrapper object to pass this
    information around.

    Part of bazelbuild/bazel#10245.

    RELNOTES: None.
    PiperOrigin-RevId: 311112402
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Local-Exec Issues and PRs for the Execution (Local) team type: bug
Projects
None yet
Development

No branches or pull requests

1 participant