
Fix mem leak in newHcsTask() #2354

Closed
Conversation


@kiashok kiashok commented Jan 15, 2025

newHcsTask() starts a goroutine for Hyper-V isolated containers so that hcsTask and hcsExec resources are cleaned up as expected. But in the normal case, where the container exits before the UVM, this goroutine is leaked because it sits in a blocking wait on the UVM. This commit makes the goroutine wait on container exit as well, so the routine is not leaked.
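
To make the leak pattern concrete, here is a small, self-contained Go illustration (not hcsshim code; all names are invented for the example): a cleanup goroutine that blocks only on UVM exit is leaked when the container exits first, while one that selects on both exit channels always terminates.

package main

import (
	"fmt"
	"runtime"
	"time"
)

// cleanupOnUVMExitOnly mimics the old behaviour: the cleanup goroutine blocks
// only on UVM exit, so when the container exits first and the UVM keeps
// running, the goroutine is never released.
func cleanupOnUVMExitOnly(uvmExited <-chan struct{}) {
	go func() {
		<-uvmExited
		// resource cleanup would happen here
	}()
}

// cleanupOnEitherExit mirrors the fix: wait on whichever of the two exits
// first, so the goroutine always terminates.
func cleanupOnEitherExit(uvmExited, containerExited <-chan struct{}) {
	go func() {
		select {
		case <-uvmExited:
		case <-containerExited:
		}
		// resource cleanup would happen here
	}()
}

func main() {
	uvmExited := make(chan struct{})       // never closed: the UVM keeps running
	containerExited := make(chan struct{}) // closed when the container exits

	cleanupOnUVMExitOnly(uvmExited)
	cleanupOnEitherExit(uvmExited, containerExited)

	close(containerExited) // the container exits before the UVM
	time.Sleep(100 * time.Millisecond)

	// Expect 2: main plus the goroutine leaked by cleanupOnUVMExitOnly.
	fmt.Println("goroutines still running:", runtime.NumGoroutine())
}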

Testing/Validation:

  • Validated via debugger that we are terminating the goroutine and also freeing
    up resources.
  • Tested with multiple container recreates on the same UVM.

Note: This is an interim fix. The cleaner approach requires more code changes in well-known code paths and more validation to ensure we are not breaking any existing flows. Since the timeline to get this fix in is tight, we are proposing this fix and will revisit it later.

@kiashok kiashok requested a review from a team as a code owner January 15, 2025 23:32
newHcsTask() starts a goroutine for hyperv containers to enable hcsTask and hcsExec resource cleanup as expected. But when containers exit before the UVM in the normal case, this goroutine is leaked because it sits in a blocking wait on the UVM. This commit ensures that we wait for container exit as well so that we do not leak the goroutine.

Signed-off-by: Kirtana Ashok <[email protected]>
Comment on lines +16 to +37
// WaitForUvmOrContainerExit waits for the container `c` or its UVM
// to exit. This is used to clean up hcs task and exec resources by
// the caller.
func (uvm *UtilityVM) WaitForUvmOrContainerExit(ctx context.Context, c cow.Container) (err error) {
	select {
	case <-c.WaitChannel():
		return c.WaitError()
	case <-uvm.hcsSystem.WaitChannel():
		logrus.WithField(logfields.UVMID, uvm.id).Debug("uvm exited, waiting for output processing to complete")
		var outputErr error
		if uvm.outputProcessingDone != nil {
			select {
			case <-uvm.outputProcessingDone:
			case <-ctx.Done():
				outputErr = fmt.Errorf("failed to wait on uvm output processing: %w", ctx.Err())
			}
		}
		return errors.Join(uvm.hcsSystem.WaitError(), outputErr)
	case <-ctx.Done():
		return ctx.Err()
	}
}
Contributor

It feels weird to couple the two together, since this duplicates a lot of the logic that's in WaitCtx, and this is a very specialized case of waiting on two different channels with a context timeout.

Maybe you can add Done() and WaitError() functions to the uVM (similar to what we do for containers) that abstract away the wait logic.

That way waitForHostOrContainerExit can select on them, and we can avoid the current differentiation in the logic depending on whether the timeout happens during the wait on the compute system or during output processing.

//
// internal\uvm\types.go
//

type UtilityVM struct {
    // ...

	waitErr error
	waitCh  chan struct{}

	exitErr error
	exitCh  chan struct{}

    // ...
}

//
// internal\uvm\start.go
//

func (uvm *UtilityVM) Start(ctx context.Context) (err error) {
    // ...

	// Start waiting on the utility VM.
	go func() {
		// the original context may have a timeout or propagate a cancellation;
		// copy it so that cancellation does not affect the background wait goroutine
		cCtx := context.WithoutCancel(ctx)
		uvm.waitErr = uvm.hcsSystem.WaitCtx(cCtx)

		if uvm.waitErr == nil {
			uvm.exitErr = uvm.hcsSystem.ExitError()
		} else {
			uvm.exitErr = uvm.waitErr
		}
		close(uvm.exitCh)

		if uvm.outputProcessingDone != nil {
			logrus.WithField(logfields.UVMID, uvm.id).Debug("uvm exited, waiting for output processing to complete")
			<-uvm.outputProcessingDone
		}
		close(uvm.waitCh)
	}()

    // ...
}

//
// internal\uvm\wait.go
//

func (uvm *UtilityVM) Done() <-chan struct{} { return uvm.waitCh }

func (uvm *UtilityVM) WaitError() error { return uvm.waitErr }

func (uvm *UtilityVM) WaitCtx(ctx context.Context) (err error) {
	select {
	case <-uvm.waitCh:
		err = uvm.waitErr
	case <-ctx.Done():
		err = fmt.Errorf("failed to wait on uvm: %w", ctx.Err())
	}
	return err
}

//
// cmd/containerd-shim-runhcs-v1/task_hcs.go
//

func (ht *hcsTask) waitForHostOrContainerExit() {
	// ...

	var err error
	select {
	case <-ht.host.Done():
		err = ht.host.WaitError()
	case <-ht.c.WaitChannel():
		err = ht.c.WaitError()
	}

	// ...
}

Contributor Author

This is the long-term fix I have in mind. There are callers of uvm.Wait() and uvm.WaitCtx() that might be expecting outputProcessingDone (that is, for IO to complete along with hcsSystem termination), and we need to account for them before removing that wait from uvm.WaitCtx(). Additionally, hcsshim lifecycle management could be looked at and optimized to make things easier. This needs additional validation and careful consideration. Since there is urgency to fix this leak, I wanted to lean towards the quicker, least risky fix to avoid additional regressions and revisit a cleaner long-term fix later.
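
For illustration, the kind of existing caller being worried about here might look roughly like the following (hypothetical snippet; it assumes, as described above, that uvm.WaitCtx() today returns only after output processing has completed as well):

// Hypothetical caller that depends on the current WaitCtx contract.
func teardown(ctx context.Context, vm *uvm.UtilityVM) error {
	// Today this is expected to return only once the compute system has
	// exited and output/log relaying has drained, so the caller can release
	// I/O resources right away. Dropping the output-processing wait from
	// WaitCtx would silently change that behaviour for callers like this.
	if err := vm.WaitCtx(ctx); err != nil {
		return err
	}
	// ... close log relays, release remaining resources ...
	return nil
}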


ambarve commented Jan 16, 2025

Can we also add (or update) at least two functional tests: one that starts a container in the UVM and exits it cleanly, and one that crashes the UVM? I think we could use something like https://github.com/uber-go/goleak to verify whether we are leaking any goroutines.
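
A rough sketch of that idea (assuming goleak is added as a test dependency; the package name and placement are hypothetical):

package functional

import (
	"testing"

	"go.uber.org/goleak"
)

// TestMain fails the package's tests if any unexpected goroutines are still
// running when they finish. Alternatively, goleak.VerifyNone(t) can be
// deferred inside the two container/UVM exit tests suggested above for a
// more targeted check.
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}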


kiashok commented Jan 22, 2025

We will be revisiting this leak with the longer-term fix, as it affects only Hyper-V WCOW/LCOW, which is not being used upstream currently.

@kiashok kiashok closed this Jan 22, 2025