thaw and retry again in freezing state #2785
Conversation
Force-pushed from 09eabe8 to 8db6805
1. Just return nil if the freeze succeeds; retry in any other case. 2. Retrying the freeze while the cgroup is stuck in the FREEZING state almost always fails again, but if we thaw it first and then retry, the freeze is much more likely to succeed, so we'd better thaw and retry the freeze. Signed-off-by: xiadanni <[email protected]>
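A rough, self-contained sketch of the thaw-and-retry idea described in the commit message (not the PR's actual diff; the cgroup path, retry count, and millisecond sleep are illustrative only):

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// freezeWithThawRetry writes FROZEN to freezer.state and, whenever the cgroup
// is observed stuck in FREEZING, thaws it before trying again instead of
// just re-writing FROZEN.
func freezeWithThawRetry(cgroupPath string, retries int) error {
	state := filepath.Join(cgroupPath, "freezer.state")
	for i := 0; i < retries; i++ {
		if err := os.WriteFile(state, []byte("FROZEN"), 0o644); err != nil {
			return err
		}
		data, err := os.ReadFile(state)
		if err != nil {
			return err
		}
		switch strings.TrimSpace(string(data)) {
		case "FROZEN":
			// Freeze succeeded.
			return nil
		case "FREEZING":
			// Stuck mid-freeze: thaw and retry the freeze from scratch.
			if err := os.WriteFile(state, []byte("THAWED"), 0o644); err != nil {
				return err
			}
			time.Sleep(time.Millisecond)
		}
	}
	return errors.New("unable to freeze")
}

func main() {
	// Hypothetical cgroup v1 freezer path; adjust to a real container cgroup to try it.
	if err := freezeWithThawRetry("/sys/fs/cgroup/freezer/mycontainer", 1000); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```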
I'm not sure I follow the reasoning here -- why does adding a thaw increase the likelihood of the freeze succeeding? I'm pretty sure any increase in freeze success is caused by the fact that you've doubled the number of retries (the old code did 1k, now you're doing 2k with a thaw in the middle).
@Vanient this looks like a bug in the kernel. Do you have the call trace from when this happens? Something along the lines of

We can try to thaw here, but I think there's no guarantee that a single thaw is going to work. If possible, can you try to freeze the workload under cgroup v2?
So the solution would be to thaw then re-freeze many times, right? That's not what this patch does. (This is also a cgroupv1-only limitation because the freezer cgroup in cgroupv2 won't try to freeze processes such that they're left in
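For comparison, freezing under cgroup v2 goes through a single cgroup.freeze knob, with completion reported via the "frozen" key in cgroup.events, rather than a visible FREEZING state machine. A minimal sketch of what that looks like (the cgroup path is illustrative):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// freezeV2 freezes a cgroup v2 cgroup by writing "1" to cgroup.freeze and
// polling cgroup.events until it reports "frozen 1".
func freezeV2(cgroupPath string) error {
	if err := os.WriteFile(filepath.Join(cgroupPath, "cgroup.freeze"), []byte("1"), 0o644); err != nil {
		return err
	}
	for i := 0; i < 1000; i++ {
		data, err := os.ReadFile(filepath.Join(cgroupPath, "cgroup.events"))
		if err != nil {
			return err
		}
		if strings.Contains(string(data), "frozen 1") {
			return nil
		}
		time.Sleep(time.Millisecond)
	}
	return fmt.Errorf("cgroup %s did not report frozen", cgroupPath)
}

func main() {
	// Hypothetical unified-hierarchy path; adjust to a real cgroup to try it.
	if err := freezeV2("/sys/fs/cgroup/mycontainer"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```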
OK, I guess it's time for me to play this game again. I have 4 reproducers, of which the best (i.e. the one with the highest chance of failing) is the one doing runc exec in parallel with runc pause/resume. Throwing in an occasional thaw does indeed help! Here's the test code:

```bash
#!/bin/bash
# fuf -- freeze/unfreeze (with exec in parallel)
set -o pipefail
if [ $# -ne 2 ]; then
echo "Usage: fuf.sh <CID> <time>"
exit 0
fi
CID=$1
SLEEP=$2
RUNC=../runc
UPLOG=./update-$$.log
rm -f $UPLOG
# Check if CID exists and running
if ! $RUNC exec $CID true; then
echo "Invalid Container ID: $CID" >&2
exit 1
fi
KEEP="$(mktemp keep-going.XXXXXXX)"
# Run exec and pause/resume in parallel
(
t=0
while [ -f $KEEP ]; do
$RUNC exec $CID printf .
((t++))
done
sleep 0.2
printf "\n === exec times total: %4d\n" $t
) &
PID_EXEC=$!
(
t=0
f=0
while [ -f $KEEP ]; do
if $RUNC pause $CID 2>&1 | tee -a $UPLOG; then
$RUNC resume $CID && printf .
else
printf "F"
((f++))
fi
((t++))
done
sleep 0.2
printf "\n === pause/resume times total: %4d, failed: %d\n" $t $f
) &
PID_FUF=$!
sleep $2
rm -f "$KEEP"
wait $PID_EXEC $PID_FUF
MAX_RETRIES=$(awk '$4 == "after" && $6 == "retries\"" {
if ($5 > max) {max = $5}
}
END {print max}' < $UPLOG)
printf " === frozen after retries:\t %4d\n" $(grep -c 'frozen after ' $UPLOG)
printf " === max retries:\t\t %4d\n" $MAX_RETRIES
printf " === unable to freeze:\t %4d\n" $(grep -c 'unable to freeze' $UPLOG)
rm -f $UPLOG
runc exec $CID true && printf "\ncontainer exec works\n"
```

and a tiny patch to runc to show the number of retries:

```diff
diff --git a/libcontainer/cgroups/fs/freezer.go b/libcontainer/cgroups/fs/freezer.go
index 25aff4f8..1e4fdf3f 100644
--- a/libcontainer/cgroups/fs/freezer.go
+++ b/libcontainer/cgroups/fs/freezer.go
@@ -12,6 +12,7 @@ import (
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
"github.com/opencontainers/runc/libcontainer/configs"
+ "github.com/sirupsen/logrus"
"golang.org/x/sys/unix"
)
@@ -58,6 +63,9 @@ func (s *FreezerGroup) Set(path string, cgroup *configs.Cgroup) error {
case "FREEZING":
continue
case string(configs.Frozen):
+ if i > 1 {
+ logrus.Infof("frozen after %d retries", i)
+ }
return nil
default:
 		// should never happen
```

This is what a typical short run looks like (no changes to runc except for the debug patch above):

```
root@ubu2004:/home/kir/git/runc/tst# bash ./fuf.sh sd5 10s
...........time="2021-02-03T23:16:06Z" level=error msg="unable to freeze"
F..time="2021-02-03T23:16:06Z" level=info msg="frozen after 21 retries"
.....time="2021-02-03T23:16:06Z" level=info msg="frozen after 721 retries"
...time="2021-02-03T23:16:06Z" level=info msg="frozen after 17 retries"
................................time="2021-02-03T23:16:07Z" level=info msg="frozen after 22 retries"
........time="2021-02-03T23:16:07Z" level=info msg="frozen after 11 retries"
...time="2021-02-03T23:16:07Z" level=info msg="frozen after 351 retries"
.................time="2021-02-03T23:16:07Z" level=info msg="frozen after 10 retries"
........time="2021-02-03T23:16:08Z" level=info msg="frozen after 15 retries"
...time="2021-02-03T23:16:08Z" level=error msg="unable to freeze"
F..............................time="2021-02-03T23:16:08Z" level=error msg="unable to freeze"
F.........time="2021-02-03T23:16:09Z" level=info msg="frozen after 21 retries"
..........................time="2021-02-03T23:16:09Z" level=info msg="frozen after 530 retries"
.time="2021-02-03T23:16:09Z" level=info msg="frozen after 187 retries"
............time="2021-02-03T23:16:10Z" level=info msg="frozen after 315 retries"
...time="2021-02-03T23:16:10Z" level=info msg="frozen after 21 retries"
...........................................................time="2021-02-03T23:16:11Z" level=error msg="unable to freeze"
.F............time="2021-02-03T23:16:11Z" level=info msg="frozen after 2 retries"
..............time="2021-02-03T23:16:11Z" level=error msg="unable to freeze"
F.....................time="2021-02-03T23:16:12Z" level=info msg="frozen after 8 retries"
...time="2021-02-03T23:16:12Z" level=info msg="frozen after 13 retries"
...time="2021-02-03T23:16:12Z" level=error msg="unable to freeze"
F.............time="2021-02-03T23:16:12Z" level=error msg="unable to freeze"
F........time="2021-02-03T23:16:13Z" level=info msg="frozen after 11 retries"
.................time="2021-02-03T23:16:13Z" level=info msg="frozen after 617 retries"
...time="2021-02-03T23:16:13Z" level=info msg="frozen after 12 retries"
........time="2021-02-03T23:16:13Z" level=error msg="unable to freeze"
.F.......................time="2021-02-03T23:16:14Z" level=error msg="unable to freeze"
F....................................time="2021-02-03T23:16:15Z" level=error msg="unable to freeze"
F..time="2021-02-03T23:16:15Z" level=error msg="unable to freeze"
F.....time="2021-02-03T23:16:15Z" level=error msg="unable to freeze"
.F...time="2021-02-03T23:16:15Z" level=info msg="frozen after 3 retries"
...time="2021-02-03T23:16:15Z" level=error msg="unable to freeze"
F...........time="2021-02-03T23:16:15Z" level=error msg="unable to freeze"
F...........
=== exec times total: 158
=== pause/resume times total: 287, failed: 14
=== frozen after retries: 20
=== max retries: 721
=== unable to freeze: 14
container exec works
```
OK, I played a bit and observed that
I recommend the following patch:

```diff
@@ -38,13 +39,17 @@ func (s *FreezerGroup) Set(path string, cgroup *configs.Cgroup) error {
// (either via fork/clone or by writing new PIDs to
// cgroup.procs).
//
- // The number of retries below is chosen to have a decent
- // chance to succeed even in the worst case scenario (runc
- // pause/unpause with parallel runc exec).
+ // The numbers below are chosen to have a decent chance to
+ // succeed even in the worst case scenario (runc pause/unpause
+ // with parallel runc exec).
//
// Adding any amount of sleep in between retries did not
// increase the chances of successful freeze.
for i := 0; i < 1000; i++ {
+ if i % 50 == 49 {
+ _ = fscommon.WriteFile(path, "freezer.state", string(configs.Thawed))
+ time.Sleep(10 * time.Millisecond)
+ }
if err := fscommon.WriteFile(path, "freezer.state", string(configs.Frozen)); err != nil {
return err
 		}
```

Here are the results from the above test script running for 120s, before:
after:
As you can see, it slows things down a bit, but now it always succeeds and there's plenty of room left.
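Put together with the surrounding loop from the earlier debug diff, the recommended retry logic would read roughly as follows (a sketch under those assumptions, not the code that eventually landed; the path handling and error cases are simplified):

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// freezeWithPeriodicThaw keeps re-writing FROZEN, but every 50th attempt it
// briefly thaws the cgroup so that tasks stuck in an unfreezable state get a
// chance to run and reach a point where they can actually be frozen.
func freezeWithPeriodicThaw(cgroupPath string) error {
	state := filepath.Join(cgroupPath, "freezer.state")
	for i := 0; i < 1000; i++ {
		if i%50 == 49 {
			_ = os.WriteFile(state, []byte("THAWED"), 0o644)
			time.Sleep(10 * time.Millisecond)
		}
		if err := os.WriteFile(state, []byte("FROZEN"), 0o644); err != nil {
			return err
		}
		data, err := os.ReadFile(state)
		if err != nil {
			return err
		}
		switch strings.TrimSpace(string(data)) {
		case "FREEZING":
			continue
		case "FROZEN":
			return nil
		}
	}
	// Don't leave the cgroup half-frozen if we give up.
	_ = os.WriteFile(state, []byte("THAWED"), 0o644)
	return errors.New("unable to freeze")
}

func main() {
	// Hypothetical cgroup v1 freezer path; adjust to try it against a container.
	if err := freezeWithPeriodicThaw("/sys/fs/cgroup/freezer/mycontainer"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```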
@Vanient can you check if the patch from #2785 (comment) fixes things for you?
@kolyshkin Yeah I think that patch makes sense -- I think the main case we're hitting is processes being in an unfreezable state rather than processes joining the cgroup.
@kolyshkin Yeah, this is what I mean, good testing work. Your patch is better; could you open a new PR?
Closing this one in favor of #2791