Fix mem leak in newHcsTask() #2354
Conversation
newHcsTask() starts a goroutine for Hyper-V containers so that hcsTask and hcsExec resources are cleaned up as expected. But when a container exits before the UVM, which is the normal case, this goroutine leaks because it is stuck in a blocking wait on the UVM. This commit ensures that we also wait for container exit so that we do not leak the goroutine. Signed-off-by: Kirtana Ashok <[email protected]>
// WaitForUvmOrContainerExit waits for the container `c` or its UVM
// to exit. This is used to clean up hcs task and exec resources by
// the caller.
func (uvm *UtilityVM) WaitForUvmOrContainerExit(ctx context.Context, c cow.Container) (err error) {
	select {
	case <-c.WaitChannel():
		return c.WaitError()
	case <-uvm.hcsSystem.WaitChannel():
		logrus.WithField(logfields.UVMID, uvm.id).Debug("uvm exited, waiting for output processing to complete")
		var outputErr error
		if uvm.outputProcessingDone != nil {
			select {
			case <-uvm.outputProcessingDone:
			case <-ctx.Done():
				outputErr = fmt.Errorf("failed to wait on uvm output processing: %w", ctx.Err())
			}
		}
		return errors.Join(uvm.hcsSystem.WaitError(), outputErr)
	case <-ctx.Done():
		return ctx.Err()
	}
}
It feels weird to couple the two together, since this duplicates a lot of the logic in WaitCtx, and this is a very specialized case of waiting on two different channels with a context timeout.
Maybe you can add Done() and WaitError() functions to the uVM (similar to what we do for containers) that abstract away the wait logic. That way waitForHostOrContainerExit can select on them, and we can avoid the current differentiation in the logic between a timeout that happens while waiting on the compute system and one that happens during output processing.
//
// internal\uvm\types.go
//
type UtilityVM struct {
	// ...
	waitErr error
	waitCh  chan struct{}
	exitErr error
	exitCh  chan struct{}
	// ...
}

//
// internal\uvm\start.go
//
func (uvm *UtilityVM) Start(ctx context.Context) (err error) {
	// ...
	// Start waiting on the utility VM.
	go func() {
		// The original context may time out or propagate a cancellation;
		// derive from it without cancellation so that it does not affect
		// the background wait goroutine.
		cCtx := context.WithoutCancel(ctx)
		uvm.waitErr = uvm.hcsSystem.WaitCtx(cCtx)
		if uvm.waitErr == nil {
			uvm.exitErr = uvm.hcsSystem.ExitError()
		} else {
			uvm.exitErr = uvm.waitErr
		}
		close(uvm.exitCh)
		if uvm.outputProcessingDone != nil {
			logrus.WithField(logfields.UVMID, uvm.id).Debug("uvm exited, waiting for output processing to complete")
			<-uvm.outputProcessingDone
		}
		close(uvm.waitCh)
	}()
	// ...
}

//
// internal\uvm\wait.go
//
func (uvm *UtilityVM) Done() <-chan struct{} { return uvm.waitCh }

func (uvm *UtilityVM) WaitError() error { return uvm.waitErr }

func (uvm *UtilityVM) WaitCtx(ctx context.Context) (err error) {
	select {
	case <-uvm.waitCh:
		err = uvm.waitErr
	case <-ctx.Done():
		err = fmt.Errorf("failed to wait on uvm: %w", ctx.Err())
	}
	return err
}

//
// cmd/containerd-shim-runhcs-v1/task_hcs.go
//
func (ht *hcsTask) waitForHostOrContainerExit() {
	// ...
	var err error
	select {
	case <-ht.host.Done():
		err = ht.host.WaitError()
	case <-ht.c.WaitChannel():
		err = ht.c.WaitError()
	}
	// ...
}
This is the long-term fix we have in mind. There are callers of uvm.Wait() and uvm.WaitCtx() that might be expecting outputProcessingDone (that is, I/O completion along with hcsSystem termination), and we need to account for them as well before removing that from uvm.WaitCtx(). Additionally, hcsshim lifecycle management could be looked into and optimized to make things easier. That needs additional validation and careful consideration. Since there is urgency to fix this leak, I wanted to lean toward the quickest, least risky fix to ensure no additional regressions, and revisit a cleaner long-term fix later.
Can we also add (or update) at least two functional tests: one where a container started in the UVM exits cleanly, and one where the UVM crashes? I think we could use something like https://github.com/uber-go/goleak to verify that we are not leaking any goroutines.
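goleak hooks into testing.T and fails a test when unexpected goroutines survive it; as a rough illustration of the same idea using only the standard library, the sketch below compares goroutine counts before and after each scenario. `waitForHostOrContainerExit` here is a hypothetical stand-in for the shim's wait goroutine, not the real function:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// waitForHostOrContainerExit is a stand-in for the shim's cleanup
// goroutine: it returns when either the container or the UVM exits.
func waitForHostOrContainerExit(containerCh, uvmCh <-chan struct{}) {
	select {
	case <-containerCh:
	case <-uvmCh:
	}
}

// checkNoLeak is a crude stdlib approximation of goleak.VerifyNone:
// it runs a scenario and reports whether the goroutine count returns
// to its baseline within a grace period.
func checkNoLeak(scenario func()) bool {
	before := runtime.NumGoroutine()
	scenario()
	deadline := time.Now().Add(time.Second)
	for time.Now().Before(deadline) {
		if runtime.NumGoroutine() <= before {
			return true
		}
		time.Sleep(10 * time.Millisecond)
	}
	return false
}

func main() {
	// Scenario 1: the container exits cleanly before the UVM.
	ok1 := checkNoLeak(func() {
		containerCh, uvmCh := make(chan struct{}), make(chan struct{})
		go waitForHostOrContainerExit(containerCh, uvmCh)
		close(containerCh)
	})
	// Scenario 2: the UVM crashes while the container is still running.
	ok2 := checkNoLeak(func() {
		containerCh, uvmCh := make(chan struct{}), make(chan struct{})
		go waitForHostOrContainerExit(containerCh, uvmCh)
		close(uvmCh)
	})
	fmt.Println(ok1, ok2)
}
```

In real tests, goleak's stack-based check is preferable to raw goroutine counts, since counts can be perturbed by runtime goroutines.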
We will be revisiting this leak with the longer-term fix, as this affects only Hyper-V WCOW/LCOW and it is not currently being used upstream.
Testing/Validation:
Validated that when the container exits before the UVM, the wait goroutine returns and cleans up resources.
Note: This is an interim fix. The cleaner approach needs more code changes in well-known code paths and more validation to ensure we are not breaking any existing flows. Since the timeline to get this fix in is sensitive, we are proposing this fix and will revisit it later.