Fix race between OnDemandHousekeeping and housekeepingTick #2755

Merged · 4 commits · Dec 30, 2020
7 changes: 6 additions & 1 deletion manager/container.go
@@ -138,7 +138,10 @@ func (cd *containerData) allowErrorLogging() bool {
 // periodic housekeeping to reset. This should be used sparingly, as calling OnDemandHousekeeping frequently
 // can have serious performance costs.
 func (cd *containerData) OnDemandHousekeeping(maxAge time.Duration) {
-	if cd.clock.Since(cd.statsLastUpdatedTime) > maxAge {
+	cd.lock.Lock()
+	timeSinceStatsLastUpdate := cd.clock.Since(cd.statsLastUpdatedTime)
+	cd.lock.Unlock()
+	if timeSinceStatsLastUpdate > maxAge {
 		housekeepingFinishedChan := make(chan struct{})
 		cd.onDemandChan <- housekeepingFinishedChan
 		select {
@@ -555,6 +558,8 @@ func (cd *containerData) housekeepingTick(timer <-chan time.Time, longHousekeepi
 		klog.V(3).Infof("[%s] Housekeeping took %s", cd.info.Name, duration)
 	}
 	cd.notifyOnDemand()
+	cd.lock.Lock()
+	defer cd.lock.Unlock()
 	cd.statsLastUpdatedTime = cd.clock.Now()
 	return true
 }

Inline review thread on the added cd.lock.Lock() line:

Collaborator:
I believe the call above this line is responsible for synchronization with containerData.OnDemandHousekeeping(). The lock you are trying to introduce looks redundant to me. Are you able to provide a test case that helps the race manifest?

Contributor (author):
I don't have a test case, but we were using cadvisor and our race detector tripped:

WARNING: DATA RACE
Read at 0x00c0001e7a50 by goroutine 30:
  github.com/google/cadvisor/manager.(*containerData).OnDemandHousekeeping()
      vendor/github.com/google/cadvisor/manager/container.go:141 +0x9b
  github.com/google/cadvisor/manager.(*manager).getRequestedContainers.func1()
      vendor/github.com/google/cadvisor/manager/manager.go:741 +0x64

Previous write at 0x00c0001e7a50 by goroutine 130:
  github.com/google/cadvisor/manager.(*containerData).housekeepingTick()
      vendor/github.com/google/cadvisor/manager/container.go:558 +0x34c
  github.com/google/cadvisor/manager.(*containerData).housekeeping()
      vendor/github.com/google/cadvisor/manager/container.go:494 +0x304

Collaborator:
Thank you 🙇. I think we should use a single synchronization mechanism in this case. I wonder whether moving the notifyOnDemand() call a bit lower and applying a similar change to OnDemandHousekeeping() would solve the issue.

Contributor (author):
Thanks for looking. I'm not quite sure I follow what you mean by a single synchronization mechanism. OnDemandHousekeeping is an external API, so anyone can call it, and it can potentially race with housekeepingTick when both try to access statsLastUpdatedTime. Access to this variable should be guarded by some mutual exclusion, right?

Contributor (author):
@iwankgb Can you take a look at this again? Thanks!

Collaborator:
@pkagrawal I don't think I will be able to look at this issue again before the coming weekend.
31 changes: 31 additions & 0 deletions manager/container_test.go
@@ -319,3 +319,34 @@ func TestOnDemandHousekeepingReturnsAfterStopped(t *testing.T) {

mockHandler.AssertExpectations(t)
}

func TestOnDemandHousekeepingRace(t *testing.T) {
statsList := itest.GenerateRandomStats(1, 4, 1*time.Second)
stats := statsList[0]

cd, mockHandler, _, _ := newTestContainerData(t)
mockHandler.On("GetStats").Return(stats, nil)

wg := sync.WaitGroup{}
wg.Add(1002)

go func() {
time.Sleep(10 * time.Millisecond)
err := cd.Start()
assert.NoError(t, err)
wg.Done()
}()

go func() {
t.Log("starting on demand goroutine")
for i := 0; i < 1000; i++ {
go func() {
time.Sleep(1 * time.Microsecond)
cd.OnDemandHousekeeping(0 * time.Millisecond)
wg.Done()
}()
}
wg.Done()
}()
wg.Wait()
}