
thaw and retry again in freezing state #2785

Closed
wants to merge 1 commit

Conversation

Contributor

Vanient commented Feb 3, 2021

1. Just return nil if the freeze succeeds; retry in any other case.
2. Retrying the freeze always fails while the cgroup stays in the FREEZING
   state, but if we thaw it and then retry, the freeze is much more likely
   to succeed, so we'd better thaw and retry the freeze.

Signed-off-by: xiadanni <[email protected]>

Vanient force-pushed the master branch 2 times, most recently from 09eabe8 to 8db6805 on February 3, 2021 09:07
Member

cyphar commented Feb 3, 2021

I'm not sure I follow the reasoning here -- why does adding a thaw increase the possibility of the freeze succeeding? I'm pretty sure any increase in freeze success is caused by the fact you've doubled the number of retries (old code did 1k, now you're doing 2k with a thaw in the middle).

cyphar requested a review from kolyshkin on February 3, 2021 11:39
Contributor Author

Vanient commented Feb 3, 2021

I'm not sure I follow the reasoning here -- why does adding a thaw increase the possibility of the freeze succeeding? I'm pretty sure any increase in freeze success is caused by the fact you've doubled the number of retries (old code did 1k, now you're doing 2k with a thaw in the middle).

If a process is being created when the freeze is requested, the freeze can only succeed after we thaw the cgroup and retry. It is useless to retry the freeze a thousand times in the FREEZING state without thawing.
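
Roughly, the idea looks like this in the cgroup v1 freezer code (a sketch of the approach, not the exact diff in this PR; it assumes the fscommon helpers and configs constants that libcontainer/cgroups/fs/freezer.go already uses):

// Sketch: thaw the cgroup whenever a freeze attempt gets stuck in
// FREEZING, then retry the freeze, instead of rewriting FROZEN alone.
func freezeWithThaw(path string) error {
	for i := 0; i < 1000; i++ {
		if err := fscommon.WriteFile(path, "freezer.state", string(configs.Frozen)); err != nil {
			return err
		}
		state, err := fscommon.ReadFile(path, "freezer.state")
		if err != nil {
			return err
		}
		switch strings.TrimSpace(state) {
		case string(configs.Frozen):
			return nil // freeze succeeded
		case "FREEZING":
			// Stuck mid-freeze (e.g. a task was being created): thaw
			// first so the next FROZEN write has a chance to succeed.
			if err := fscommon.WriteFile(path, "freezer.state", string(configs.Thawed)); err != nil {
				return err
			}
		}
	}
	return errors.New("unable to freeze")
}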

Contributor

dqminh commented Feb 3, 2021

@Vanient this looks like a bug in the kernel. Do you have the call trace from when this happens? Something along the lines of
Freezing of tasks failed after X seconds, with the call trace following it.

We can try to thaw here, but I think there's no guarantee that a single thaw is going to work. If possible, can you try to freeze the workload under cgroup v2 ?
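
For reference, the cgroup v2 freezer is part of the core interface: you write 1 to cgroup.freeze and wait for the frozen key in cgroup.events to flip to 1. A minimal stand-alone sketch (illustrative code, not runc's actual fs2 implementation):

package v2freeze

import (
	"fmt"
	"os"
	"strings"
	"time"
)

// FreezeV2 freezes the cgroup at cgPath (a cgroup v2 directory) and waits
// for the kernel to report completion via cgroup.events.
func FreezeV2(cgPath string) error {
	// Writing "1" to cgroup.freeze asks the kernel to freeze the whole subtree.
	if err := os.WriteFile(cgPath+"/cgroup.freeze", []byte("1"), 0o644); err != nil {
		return err
	}
	// Completion is reported via the "frozen" key in cgroup.events.
	for i := 0; i < 100; i++ {
		data, err := os.ReadFile(cgPath + "/cgroup.events")
		if err != nil {
			return err
		}
		if strings.Contains(string(data), "frozen 1") {
			return nil
		}
		time.Sleep(10 * time.Millisecond)
	}
	return fmt.Errorf("cgroup %s did not report frozen", cgPath)
}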

Member

cyphar commented Feb 3, 2021

If a process is being created when the freeze is requested, the freeze can only succeed after we thaw the cgroup and retry. It is useless to retry the freeze a thousand times in the FREEZING state without thawing.

So the solution would be to thaw then re-freeze many times, right? That's not what this patch does. (This is also a cgroupv1-only limitation because the freezer cgroup in cgroupv2 won't try to freeze processes such that they're left in TASK_UNINTERRUPTIBLE or similarly fragile states.)

Contributor

kolyshkin commented Feb 3, 2021

OK, I guess it's time for me to play this game again.

I have 4 reproducers, of which the best (i.e. the one with the highest chance of failing) is the one doing runc exec in parallel with runc pause/resume. Throwing in an occasional thaw does indeed help!

Here's the test code:

#!/bin/bash
# fuf -- freeze/unfreeze (with exec in parallel)

set -o pipefail

if [ $# -ne 2 ]; then
	echo "Usage: fuf.sh <CID> <time>"
	exit 0
fi

CID=$1
SLEEP=$2
RUNC=../runc

UPLOG=./update-$$.log
rm -f $UPLOG

# Check if CID exists and running
if ! $RUNC exec $CID true; then
	echo "Invalid Container ID: $CID" >&2
	exit 1
fi

KEEP="$(mktemp keep-going.XXXXXXX)"

# Run exec and pause/resume in parallel
(
	t=0
	while [ -f $KEEP ]; do
		$RUNC exec $CID printf .
		((t++))
	done
	sleep 0.2
	printf "\n === exec times total: %4d\n" $t
) &
PID_EXEC=$!

(
	t=0
	f=0
	while [ -f $KEEP ]; do
		if $RUNC pause $CID 2>&1 | tee -a $UPLOG; then
			$RUNC resume $CID && printf .
		else
			printf "F"
			((f++))
		fi
		((t++))
	done
	sleep 0.2
	printf "\n === pause/resume times total: %4d, failed: %d\n" $t $f
) &
PID_FUF=$!

sleep $2
rm -f "$KEEP"
wait $PID_EXEC $PID_FUF

MAX_RETRIES=$(awk '$4 == "after" && $6 == "retries\"" {
		if ($5 > max) {max = $5}
	}
	END {print max}' < $UPLOG)

printf " === frozen after retries:\t %4d\n" $(grep -c 'frozen after ' $UPLOG)
printf " === max retries:\t\t %4d\n" $MAX_RETRIES
printf " === unable to freeze:\t %4d\n" $(grep -c 'unable to freeze' $UPLOG)

rm -f $UPLOG

runc exec $CID true && printf "\ncontainer exec works\n"

and a tiny patch to runc to show the number of retries:

diff --git a/libcontainer/cgroups/fs/freezer.go b/libcontainer/cgroups/fs/freezer.go
index 25aff4f8..1e4fdf3f 100644
--- a/libcontainer/cgroups/fs/freezer.go
+++ b/libcontainer/cgroups/fs/freezer.go
@@ -12,6 +12,7 @@ import (
        "github.com/opencontainers/runc/libcontainer/cgroups"
        "github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
        "github.com/opencontainers/runc/libcontainer/configs"
+       "github.com/sirupsen/logrus"
        "golang.org/x/sys/unix"
 )
 
@@ -58,6 +63,9 @@ func (s *FreezerGroup) Set(path string, cgroup *configs.Cgroup) error {
                        case "FREEZING":
                                continue
                        case string(configs.Frozen):
+                               if i > 1 {
+                                       logrus.Infof("frozen after %d retries", i)
+                               }
                                return nil
                        default:
                                // should never happen

This is what a typical short run looks like (no changes to runc except for the above debug patch):

root@ubu2004:/home/kir/git/runc/tst# bash ./fuf.sh sd5 10s
...........time="2021-02-03T23:16:06Z" level=error msg="unable to freeze"
F..time="2021-02-03T23:16:06Z" level=info msg="frozen after 21 retries"
.....time="2021-02-03T23:16:06Z" level=info msg="frozen after 721 retries"
...time="2021-02-03T23:16:06Z" level=info msg="frozen after 17 retries"
................................time="2021-02-03T23:16:07Z" level=info msg="frozen after 22 retries"
........time="2021-02-03T23:16:07Z" level=info msg="frozen after 11 retries"
...time="2021-02-03T23:16:07Z" level=info msg="frozen after 351 retries"
.................time="2021-02-03T23:16:07Z" level=info msg="frozen after 10 retries"
........time="2021-02-03T23:16:08Z" level=info msg="frozen after 15 retries"
...time="2021-02-03T23:16:08Z" level=error msg="unable to freeze"
F..............................time="2021-02-03T23:16:08Z" level=error msg="unable to freeze"
F.........time="2021-02-03T23:16:09Z" level=info msg="frozen after 21 retries"
..........................time="2021-02-03T23:16:09Z" level=info msg="frozen after 530 retries"
.time="2021-02-03T23:16:09Z" level=info msg="frozen after 187 retries"
............time="2021-02-03T23:16:10Z" level=info msg="frozen after 315 retries"
...time="2021-02-03T23:16:10Z" level=info msg="frozen after 21 retries"
...........................................................time="2021-02-03T23:16:11Z" level=error msg="unable to freeze"
.F............time="2021-02-03T23:16:11Z" level=info msg="frozen after 2 retries"
..............time="2021-02-03T23:16:11Z" level=error msg="unable to freeze"
F.....................time="2021-02-03T23:16:12Z" level=info msg="frozen after 8 retries"
...time="2021-02-03T23:16:12Z" level=info msg="frozen after 13 retries"
...time="2021-02-03T23:16:12Z" level=error msg="unable to freeze"
F.............time="2021-02-03T23:16:12Z" level=error msg="unable to freeze"
F........time="2021-02-03T23:16:13Z" level=info msg="frozen after 11 retries"
.................time="2021-02-03T23:16:13Z" level=info msg="frozen after 617 retries"
...time="2021-02-03T23:16:13Z" level=info msg="frozen after 12 retries"
........time="2021-02-03T23:16:13Z" level=error msg="unable to freeze"
.F.......................time="2021-02-03T23:16:14Z" level=error msg="unable to freeze"
F....................................time="2021-02-03T23:16:15Z" level=error msg="unable to freeze"
F..time="2021-02-03T23:16:15Z" level=error msg="unable to freeze"
F.....time="2021-02-03T23:16:15Z" level=error msg="unable to freeze"
.F...time="2021-02-03T23:16:15Z" level=info msg="frozen after 3 retries"
...time="2021-02-03T23:16:15Z" level=error msg="unable to freeze"
F...........time="2021-02-03T23:16:15Z" level=error msg="unable to freeze"
F...........
 === exec times total:  158

 === pause/resume times total:  287, failed: 14
 === frozen after retries:	   20
 === max retries:		  721
 === unable to freeze:	    14

container exec works

@kolyshkin
Contributor

OK, I played a bit and observed that:

  • adding in a thaw helps
  • increasing the sleep between writing THAWED and FROZEN helps
  • adding a sleep between successive FROZEN writes does not help

I recommend the following patch:

@@ -38,13 +39,17 @@ func (s *FreezerGroup) Set(path string, cgroup *configs.Cgroup) error {
                // (either via fork/clone or by writing new PIDs to
                // cgroup.procs).
                //
-               // The number of retries below is chosen to have a decent
-               // chance to succeed even in the worst case scenario (runc
-               // pause/unpause with parallel runc exec).
+               // The numbers below are chosen to have a decent chance to
+               // succeed even in the worst case scenario (runc pause/unpause
+               // with parallel runc exec).
                //
                // Adding any amount of sleep in between retries did not
                // increase the chances of successful freeze.
                for i := 0; i < 1000; i++ {
+                       if i % 50 == 49 {
+                               _ = fscommon.WriteFile(path, "freezer.state", string(configs.Thawed))
+                               time.Sleep(10 * time.Millisecond)
+                       }
                        if err := fscommon.WriteFile(path, "freezer.state", string(configs.Frozen)); err != nil {
                                return err
                        }

Here are the results from the above test script running for 120s

before:

 === pause/resume times total: 3198, failed: 80

 === exec times total: 1809
 === frozen after retries:	  244
 === max retries:		  992

after:

 === exec times total: 1638

 === pause/resume times total: 2780, failed: 0
 === frozen after retries:	  298
 === max retries:		  149

As you can see, it slows things down a bit, but now it always succeeds and there's plenty of room left.
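
For context, with that hunk applied the retry loop in (*FreezerGroup).Set would look roughly like this (reconstructed from the two diffs above, so details may differ from the final version in #2791):

	for i := 0; i < 1000; i++ {
		if i%50 == 49 {
			// Occasional thaw plus a short sleep lets tasks stuck
			// mid-freeze make progress before the next attempt.
			_ = fscommon.WriteFile(path, "freezer.state", string(configs.Thawed))
			time.Sleep(10 * time.Millisecond)
		}
		if err := fscommon.WriteFile(path, "freezer.state", string(configs.Frozen)); err != nil {
			return err
		}
		state, err := fscommon.ReadFile(path, "freezer.state")
		if err != nil {
			return err
		}
		switch strings.TrimSpace(state) {
		case "FREEZING":
			continue
		case string(configs.Frozen):
			return nil
		default:
			// should never happen
			return fmt.Errorf("unexpected freezer state %q", state)
		}
	}
	return errors.New("unable to freeze")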

@kolyshkin
Contributor

@Vanient can you check if the patch from #2785 (comment) fixes things for you?

Member

cyphar commented Feb 4, 2021

@kolyshkin Yeah I think that patch makes sense -- I think the main case we're hitting is processes being in an unfreezeable state rather than processes joining the cgroup.

Contributor Author

Vanient commented Feb 4, 2021

@Vanient can you check if the patch from #2785 (comment) fixes things for you?

@kolyshkin Yeah, this is what I meant; good testing work. Your patch is better, could you open a new PR?

@kolyshkin
Contributor

Closing this one in favor of #2791

kolyshkin closed this on Feb 4, 2021