
thaw and retry again in freezing state #2785

Closed
wants to merge 1 commit

Conversation

Contributor

Vanient commented Feb 3, 2021

1. Just return nil if the freeze succeeds; retry in any other case.
2. Retrying the freeze always fails while the cgroup stays in the FREEZING
   state, but if we thaw it and then retry, the freeze is much more likely
   to succeed, so we'd better thaw and retry the freeze.

Signed-off-by: xiadanni <[email protected]>

Vanient force-pushed the master branch 2 times, most recently from 09eabe8 to 8db6805 on February 3, 2021 09:07
Member

cyphar commented Feb 3, 2021

I'm not sure I follow the reasoning here -- why does adding a thaw increase the possibility of the freeze succeeding? I'm pretty sure any increase in freeze success is caused by the fact you've doubled the number of retries (old code did 1k, now you're doing 2k with a thaw in the middle).

cyphar requested a review from kolyshkin on February 3, 2021 11:39
Contributor Author

Vanient commented Feb 3, 2021

I'm not sure I follow the reasoning here -- why does adding a thaw increase the possibility of the freeze succeeding? I'm pretty sure any increase in freeze success is caused by the fact you've doubled the number of retries (old code did 1k, now you're doing 2k with a thaw in the middle).

If a process is being created when the freeze is requested, the freeze can only succeed after we thaw the cgroup and retry. It is useless to retry the freeze a thousand times in the FREEZING state without thawing.
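
Roughly, the idea looks like this in the cgroup v1 freezer code (a sketch of the approach, not the exact diff in this PR; it assumes the fscommon helpers and configs constants that libcontainer/cgroups/fs/freezer.go already uses):

// Sketch: thaw the cgroup whenever a freeze attempt gets stuck in
// FREEZING, then retry the freeze, instead of rewriting FROZEN alone.
func freezeWithThaw(path string) error {
	for i := 0; i < 1000; i++ {
		if err := fscommon.WriteFile(path, "freezer.state", string(configs.Frozen)); err != nil {
			return err
		}
		state, err := fscommon.ReadFile(path, "freezer.state")
		if err != nil {
			return err
		}
		switch strings.TrimSpace(state) {
		case string(configs.Frozen):
			return nil // freeze succeeded
		case "FREEZING":
			// Stuck mid-freeze (e.g. a task was being created): thaw
			// first so the next FROZEN write has a chance to succeed.
			if err := fscommon.WriteFile(path, "freezer.state", string(configs.Thawed)); err != nil {
				return err
			}
		}
	}
	return errors.New("unable to freeze")
}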

Contributor

dqminh commented Feb 3, 2021

@Vanient this looks like a bug in the kernel. Do you have the call trace from when this happens? Something along the lines of
Freezing of tasks failed after X seconds, with the call trace following it.

We can try to thaw here, but I think there's no guarantee that a single thaw is going to work. If possible, can you try to freeze the workload under cgroup v2 ?
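
For reference, the cgroup v2 freezer is part of the core interface: you write 1 to cgroup.freeze and wait for the frozen key in cgroup.events to flip to 1. A minimal stand-alone sketch (illustrative code, not runc's actual fs2 implementation):

package v2freeze

import (
	"fmt"
	"os"
	"strings"
	"time"
)

// FreezeV2 freezes the cgroup at cgPath (a cgroup v2 directory) and waits
// for the kernel to report completion via cgroup.events.
func FreezeV2(cgPath string) error {
	// Writing "1" to cgroup.freeze asks the kernel to freeze the whole subtree.
	if err := os.WriteFile(cgPath+"/cgroup.freeze", []byte("1"), 0o644); err != nil {
		return err
	}
	// Completion is reported via the "frozen" key in cgroup.events.
	for i := 0; i < 100; i++ {
		data, err := os.ReadFile(cgPath + "/cgroup.events")
		if err != nil {
			return err
		}
		if strings.Contains(string(data), "frozen 1") {
			return nil
		}
		time.Sleep(10 * time.Millisecond)
	}
	return fmt.Errorf("cgroup %s did not report frozen", cgPath)
}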

Member

cyphar commented Feb 3, 2021

If a process is being created when the freeze is requested, the freeze can only succeed after we thaw the cgroup and retry. It is useless to retry the freeze a thousand times in the FREEZING state without thawing.

So the solution would be to thaw then re-freeze many times, right? That's not what this patch does. (This is also a cgroupv1-only limitation because the freezer cgroup in cgroupv2 won't try to freeze processes such that they're left in TASK_UNINTERRUPTIBLE or similarly fragile states.)

Contributor

kolyshkin commented Feb 3, 2021

OK, I guess it's time for me to play this game again.

I have 4 reproducers, of which the best (i.e. the one with the highest chance of failing) is the one doing runc exec in parallel with runc pause/resume. Throwing in an occasional thaw does indeed help!

Here's the test code:

#!/bin/bash
# fuf -- freeze/unfreeze (with exec in parallel)

set -o pipefail

if [ $# -ne 2 ]; then
	echo "Usage: fuf.sh <CID> <time>"
	exit 0
fi

CID=$1
SLEEP=$2
RUNC=../runc

UPLOG=./update-$$.log
rm -f $UPLOG

# Check if CID exists and running
if ! $RUNC exec $CID true; then
	echo "Invalid Container ID: $CID" >&2
	exit 1
fi

KEEP="$(mktemp keep-going.XXXXXXX)"

# Run exec and pause/resume in parallel
(
	t=0
	while [ -f $KEEP ]; do
		$RUNC exec $CID printf .
		((t++))
	done
	sleep 0.2
	printf "\n === exec times total: %4d\n" $t
) &
PID_EXEC=$!

(
	t=0
	f=0
	while [ -f $KEEP ]; do
		if $RUNC pause $CID 2>&1 | tee -a $UPLOG; then
			$RUNC resume $CID && printf .
		else
			printf "F"
			((f++))
		fi
		((t++))
	done
	sleep 0.2
	printf "\n === pause/resume times total: %4d, failed: %d\n" $t $f
) &
PID_FUF=$!

sleep $2
rm -f "$KEEP"
wait $PID_EXEC $PID_FUF

MAX_RETRIES=$(awk '$4 == "after" && $6 == "retries\"" {
		if ($5 > max) {max = $5}
	}
	END {print max}' < $UPLOG)

printf " === frozen after retries:\t %4d\n" $(grep -c 'frozen after ' $UPLOG)
printf " === max retries:\t\t %4d\n" $MAX_RETRIES
printf " === unable to freeze:\t %4d\n" $(grep -c 'unable to freeze' $UPLOG)

rm -f $UPLOG

runc exec $CID true && printf "\ncontainer exec works\n"

and a tiny patch to runc to show the number of retries:

diff --git a/libcontainer/cgroups/fs/freezer.go b/libcontainer/cgroups/fs/freezer.go
index 25aff4f8..1e4fdf3f 100644
--- a/libcontainer/cgroups/fs/freezer.go
+++ b/libcontainer/cgroups/fs/freezer.go
@@ -12,6 +12,7 @@ import (
        "github.com/opencontainers/runc/libcontainer/cgroups"
        "github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
        "github.com/opencontainers/runc/libcontainer/configs"
+       "github.com/sirupsen/logrus"
        "golang.org/x/sys/unix"
 )
 
@@ -58,6 +63,9 @@ func (s *FreezerGroup) Set(path string, cgroup *configs.Cgroup) error {
                        case "FREEZING":
                                continue
                        case string(configs.Frozen):
+                               if i > 1 {
+                                       logrus.Infof("frozen after %d retries", i)
+                               }
                                return nil
                        default:
                                // should never happen

This is what a typical short run looks like (no changes to runc except for the above debug patch):

root@ubu2004:/home/kir/git/runc/tst# bash ./fuf.sh sd5 10s
...........time="2021-02-03T23:16:06Z" level=error msg="unable to freeze"
F..time="2021-02-03T23:16:06Z" level=info msg="frozen after 21 retries"
.....time="2021-02-03T23:16:06Z" level=info msg="frozen after 721 retries"
...time="2021-02-03T23:16:06Z" level=info msg="frozen after 17 retries"
................................time="2021-02-03T23:16:07Z" level=info msg="frozen after 22 retries"
........time="2021-02-03T23:16:07Z" level=info msg="frozen after 11 retries"
...time="2021-02-03T23:16:07Z" level=info msg="frozen after 351 retries"
.................time="2021-02-03T23:16:07Z" level=info msg="frozen after 10 retries"
........time="2021-02-03T23:16:08Z" level=info msg="frozen after 15 retries"
...time="2021-02-03T23:16:08Z" level=error msg="unable to freeze"
F..............................time="2021-02-03T23:16:08Z" level=error msg="unable to freeze"
F.........time="2021-02-03T23:16:09Z" level=info msg="frozen after 21 retries"
..........................time="2021-02-03T23:16:09Z" level=info msg="frozen after 530 retries"
.time="2021-02-03T23:16:09Z" level=info msg="frozen after 187 retries"
............time="2021-02-03T23:16:10Z" level=info msg="frozen after 315 retries"
...time="2021-02-03T23:16:10Z" level=info msg="frozen after 21 retries"
...........................................................time="2021-02-03T23:16:11Z" level=error msg="unable to freeze"
.F............time="2021-02-03T23:16:11Z" level=info msg="frozen after 2 retries"
..............time="2021-02-03T23:16:11Z" level=error msg="unable to freeze"
F.....................time="2021-02-03T23:16:12Z" level=info msg="frozen after 8 retries"
...time="2021-02-03T23:16:12Z" level=info msg="frozen after 13 retries"
...time="2021-02-03T23:16:12Z" level=error msg="unable to freeze"
F.............time="2021-02-03T23:16:12Z" level=error msg="unable to freeze"
F........time="2021-02-03T23:16:13Z" level=info msg="frozen after 11 retries"
.................time="2021-02-03T23:16:13Z" level=info msg="frozen after 617 retries"
...time="2021-02-03T23:16:13Z" level=info msg="frozen after 12 retries"
........time="2021-02-03T23:16:13Z" level=error msg="unable to freeze"
.F.......................time="2021-02-03T23:16:14Z" level=error msg="unable to freeze"
F....................................time="2021-02-03T23:16:15Z" level=error msg="unable to freeze"
F..time="2021-02-03T23:16:15Z" level=error msg="unable to freeze"
F.....time="2021-02-03T23:16:15Z" level=error msg="unable to freeze"
.F...time="2021-02-03T23:16:15Z" level=info msg="frozen after 3 retries"
...time="2021-02-03T23:16:15Z" level=error msg="unable to freeze"
F...........time="2021-02-03T23:16:15Z" level=error msg="unable to freeze"
F...........
 === exec times total:  158

 === pause/resume times total:  287, failed: 14
 === frozen after retries:	   20
 === max retries:		  721
 === unable to freeze:	    14

container exec works

@kolyshkin
Contributor

OK, I played a bit and observed that:

  • adding in a thaw helps
  • increasing the sleep between writing THAWED and FROZEN helps
  • adding a sleep between successive FROZEN writes does not help

I recommend the following patch:

@@ -38,13 +39,17 @@ func (s *FreezerGroup) Set(path string, cgroup *configs.Cgroup) error {
                // (either via fork/clone or by writing new PIDs to
                // cgroup.procs).
                //
-               // The number of retries below is chosen to have a decent
-               // chance to succeed even in the worst case scenario (runc
-               // pause/unpause with parallel runc exec).
+               // The numbers below are chosen to have a decent chance to
+               // succeed even in the worst case scenario (runc pause/unpause
+               // with parallel runc exec).
                //
                // Adding any amount of sleep in between retries did not
                // increase the chances of successful freeze.
                for i := 0; i < 1000; i++ {
+                       if i % 50 == 49 {
+                               _ = fscommon.WriteFile(path, "freezer.state", string(configs.Thawed))
+                               time.Sleep(10 * time.Millisecond)
+                       }
                        if err := fscommon.WriteFile(path, "freezer.state", string(configs.Frozen)); err != nil {
                                return err
                        }

Here are the results from the above test script running for 120s

before:

 === pause/resume times total: 3198, failed: 80

 === exec times total: 1809
 === frozen after retries:	  244
 === max retries:		  992

after:

 === exec times total: 1638

 === pause/resume times total: 2780, failed: 0
 === frozen after retries:	  298
 === max retries:		  149

As you can see, it slows things down a bit, but now it always succeeds and there's plenty of room left.
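
For context, with that hunk applied the retry loop in (*FreezerGroup).Set would look roughly like this (reconstructed from the two diffs above, so details may differ from the final version in #2791):

	for i := 0; i < 1000; i++ {
		if i%50 == 49 {
			// Occasional thaw plus a short sleep lets tasks stuck
			// mid-freeze make progress before the next attempt.
			_ = fscommon.WriteFile(path, "freezer.state", string(configs.Thawed))
			time.Sleep(10 * time.Millisecond)
		}
		if err := fscommon.WriteFile(path, "freezer.state", string(configs.Frozen)); err != nil {
			return err
		}
		state, err := fscommon.ReadFile(path, "freezer.state")
		if err != nil {
			return err
		}
		switch strings.TrimSpace(state) {
		case "FREEZING":
			continue
		case string(configs.Frozen):
			return nil
		default:
			// should never happen
			return fmt.Errorf("unexpected freezer state %q", state)
		}
	}
	return errors.New("unable to freeze")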

@kolyshkin
Contributor

@Vanient can you check if the patch from #2785 (comment) fixes things for you?

Member

cyphar commented Feb 4, 2021

@kolyshkin Yeah I think that patch makes sense -- I think the main case we're hitting is processes being in an unfreezeable state rather than processes joining the cgroup.

Contributor Author

Vanient commented Feb 4, 2021

@Vanient can you check if the patch from #2785 (comment) fixes things for you?

@kolyshkin Yeah, this is what I meant; good testing work. Your patch is better, could you open a new PR?

@kolyshkin
Contributor

Closing this one in favor of #2791

kolyshkin closed this on Feb 4, 2021