BenchExec subprocess hangs in __malloc_fork_lock_parent #656
There are no changes since then that should improve the behavior in this regard.

In particular, due to the high number of parallel executions you are using, I suspect that this is #435. Typically this happens only on highly loaded systems (many executions on few cores), and I thought it should be fixed completely for Python 3.7+ on Linux x86_64, but maybe it is not. Note that it does not always happen for the last task; it is just that BenchExec continues to execute other tasks until only the one with the deadlock remains.

How many cores does your machine have, how many cores do you allow to use per task, and are there other processes running at the same time that use significant CPU resources? To double check something, could you please run … Furthermore, it would be nice if you could follow our FAQ entry on this and provide some stack traces of the hanging processes.

BenchExec also has another mitigation for this issue, which is just disabled for systems for which we thought the issue would not occur. You could test whether this mitigation helps for you by applying 656-workaround.patch.txt to your BenchExec installation locally and trying to reproduce the problem. If that improves the situation, we could enable it for everybody. Apart from that, as a workaround, BenchExec should work more reliably the fewer parallel runs you use.
The machine has 16 cores; I have 15 parallel runs, each of which is limited to 1 core. So there is one CPU core to spare for anything else (BenchExec itself, ssh, htop, etc. – but nothing CPU-intensive).
Yes, it outputs …
Thanks for the link! I'll keep this in mind for the next time this happens.
It happened a third time now, so here are the stack traces.
Thanks. I think that 282458, 282459, and 282463 are the BenchExec main process and two helper processes; one of them is hanging while waiting for the deadlocked run. 282857 should be the problematic process that is dead-locked and needs to be killed in order to let BenchExec continue. Its stack trace starts at __malloc_fork_lock_parent, i.e., inside glibc's malloc fork handling.

From this stack trace my assumption is that something similar to #435 is happening, i.e., some inconsistent lock state in the cloned subprocess, which leads to a dead lock when the subprocess attempts an operation that relies on that lock state. However, the problematic lock seems to belong to glibc's malloc itself, so glibc should be the one that restores a consistent lock state in the new process. But as far as I understand the code, it just does not do that for processes created with clone()? So this all seems to come down to glibc's malloc not properly supporting children created with a raw clone().

The problem is that there seem to be no easy workarounds for this libc bug. We cannot manage malloc's lock state manually, and we cannot easily avoid calling into malloc in the cloned child.

So thank you very much for reporting this problem and providing the stack traces, but I am afraid that fixing it will be complex and take some time. Running BenchExec with fewer parallel executions and/or a smaller number of tasks should make the bug appear less often.
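To illustrate the suspected mechanism, here is a rough sketch of how a child created with a raw clone() syscall (rather than os.fork(), which lets glibc run its atfork handling) can end up with malloc's lock stuck. This is hypothetical illustration code, not BenchExec's actual implementation, and it is not a reliable reproducer:

```python
# Rough illustrative sketch only -- hypothetical code, not BenchExec's actual
# implementation, and not a reliable reproducer of the deadlock.
import ctypes
import os
import signal

libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.syscall.restype = ctypes.c_long
SYS_CLONE = 56  # x86_64 syscall number

def fork_via_raw_clone():
    """fork()-like clone() that bypasses glibc's atfork handling.

    Unlike os.fork(), this does not give glibc a chance to bring malloc's
    internal locks into a consistent state in the child.  If another thread
    happens to hold a malloc lock at this very moment, the child inherits it
    in a locked state, and nothing in the child will ever unlock it.
    """
    pid = libc.syscall(
        ctypes.c_long(SYS_CLONE),
        ctypes.c_ulong(int(signal.SIGCHLD)),  # flags: plain fork-like child
        ctypes.c_void_p(None),                # child stack (NULL = like fork)
        ctypes.c_void_p(None),                # parent_tid
        ctypes.c_void_p(None),                # child_tid
        ctypes.c_ulong(0),                    # tls
    )
    if pid < 0:
        raise OSError(ctypes.get_errno(), "clone() failed")
    return pid

pid = fork_via_raw_clone()
if pid == 0:
    # In the unlucky case, the next operation that needs malloc's internal
    # lock hangs here forever, e.g. glibc's fork preparation
    # (__malloc_fork_lock_parent) when this child tries to start the
    # grandchild process.
    os._exit(0)
os.waitpid(pid, 0)
```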
Idea: suppose the child process deadlocks soon enough after being created, i.e., before the grandchild is started (benchexec/benchexec/containerexecutor.py, line 859 at commit 9cfa48b).
Then we can simply retry the run. This might be a useful workaround in practice until we get a full solution.
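For illustration, here is a minimal sketch of that retry pattern. The names (run_in_container, the readiness pipe) are hypothetical, and multiprocessing stands in for BenchExec's actual clone()-based container setup:

```python
# Minimal sketch of the retry idea -- hypothetical names; multiprocessing is
# used only for brevity, BenchExec's real container setup works differently.
import multiprocessing
import os
import signal

CONTAINER_SETUP_TIMEOUT = 60  # seconds, as discussed above

def start_run_with_retry(run_in_container, max_attempts=3):
    """Start a run in a child process, retrying if container setup seems deadlocked.

    run_in_container(ready_conn) is assumed to set up the container in the child
    and to send a message on ready_conn *before* starting the benchmarked tool
    (the grandchild), so killing the child before that point is safe.
    """
    for attempt in range(1, max_attempts + 1):
        parent_conn, child_conn = multiprocessing.Pipe(duplex=False)
        child = multiprocessing.Process(target=run_in_container, args=(child_conn,))
        child.start()
        child_conn.close()  # parent keeps only the receiving end

        if parent_conn.poll(CONTAINER_SETUP_TIMEOUT):
            return child  # container came up in time, run proceeds normally

        # No readiness signal within the timeout: assume the glibc deadlock,
        # kill the child (the tool has not been started yet) and try again.
        print(f"Warning: container setup timed out (attempt {attempt}), retrying run")
        os.kill(child.pid, signal.SIGKILL)
        child.join()
    raise OSError("container setup deadlocked repeatedly, giving up")
```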
Such a fix would be highly appreciated by us at TUM; we still regularly encounter this bug on our 48-core machine!
When using BenchExec with concurrent execution of runs, we sometimes experience a deadlock in the child process that is forked from the main process. This is due to missing proper handling of clone() in glibc. With this workaround, we check whether the child process takes unusually long to create the container (the timeout is set to 60 s), and if so, we assume that the deadlock has occurred. Because almost everything related to container creation happens inside the child process, we can simply kill the child process and attempt to start the run again with a new child process. This is safe because we only do so up to a point where we are sure that the child process cannot have started the benchmarked tool yet. This is not guaranteed to catch all instances of the deadlock, only those that happen soon enough after clone() in the child; but in my tests this was always the case.
So this idea for a workaround was really not difficult to implement; I did so in https://github.com/sosy-lab/benchexec/tree/656-deadlock-workaround and it works in my (small) tests. It waits for 60 s, then prints a warning and retries the run.

@sim642 @michael-schwarz It would be really cool if you could give this branch a try on some real-world workloads and tell me whether it works as expected and actually triggers (look for the warning in the log). In theory it might trigger sometimes but still not catch all cases, but I think it is likely that it catches most or all of them.
Great, thank you! I will start a workload with this fix next time we have some free capacity on our large server (should be this week or beginning of next week at the latest)!
It seems to work! 🎉 For a workload this morning, the warning triggered and execution did not freeze. We are currently running it a lot, so I can keep an eye on our executions for the next few weeks and see if the issue resurfaces or whether this fixes it in all cases.
This workaround has been tested and found to be working well enough as an intermediate solution: #656 (comment)
Current state: A workaround for this problem was added to BenchExec 3.14. BenchExec will now attempt to detect the deadlock and continue benchmarking after 60 s. A full solution can only be implemented as part of a larger restructuring that is tracked in #875.
The following has happened to me twice now, so I guess it's not totally random.
I've run Goblint on SV-COMP SoftwareSystems-DeviceDriversLinux64-ReachSafety via BenchExec with the following additional options: …
What seems to happen is that it just gets stuck at the last task:
Task goblint-all-fast.sv-comp20_prop-reachsafety.SoftwareSystems-DeviceDriversLinux64-ReachSafety (2729/2729)
So far, this has always been at the last task, never earlier, and only on a big set like SoftwareSystems, not on ones like NoDataRace.
This is on Ubuntu 20.04 and BenchExec 3.4-dev (commit 0207553e). I guess I should update that, but are there any changes since then that would even influence this?