Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address data race condition in WeightedChannel #3963

Closed
wants to merge 1 commit into from

Conversation

mindaugasrukas
Copy link
Contributor

What changed?
Introduced thread-safety measures for the "WeightedChannel" struct's "weight" variable using a "sync.RWMutex".

Why?
Tests are flaky and sometimes they fail due to data race condition:

WARNING: DATA RACE
Write at 0x00c002a70480 by goroutine 474:
  go.temporal.io/server/common/tasks.(*WeightedChannel[...]).SetWeight()
      /temporal/common/tasks/weighted_channel.go:59 +0x187
  go.temporal.io/server/common/tasks.(*InterleavedWeightedRoundRobinScheduler[...]).updateChannelWeightLocked()
      /temporal/common/tasks/interleaved_weighted_round_robin.go:385 +0xf6
  go.temporal.io/server/common/tasks.(*InterleavedWeightedRoundRobinScheduler[...]).dispatchTasksWithWeight()
      /temporal/common/tasks/interleaved_weighted_round_robin.go:394 +0x1dd
  go.temporal.io/server/common/tasks.(*InterleavedWeightedRoundRobinScheduler[...]).eventLoop()
      /temporal/common/tasks/interleaved_weighted_round_robin.go:263 +0x324
  go.temporal.io/server/common/tasks.(*InterleavedWeightedRoundRobinScheduler[...]).Start.func1()
      /temporal/common/tasks/interleaved_weighted_round_robin.go:195 +0x47
 
Previous read at 0x00c002a70480 by goroutine 473:
  go.temporal.io/server/common/tasks.(*WeightedChannel[...]).Weight()
      /temporal/common/tasks/weighted_channel.go:55 +0x44
  go.temporal.io/server/common/tasks.(*interleavedWeightedRoundRobinSchedulerSuite).TestUpdateWeight()
      /temporal/common/tasks/interleaved_weighted_round_robin_test.go:389 +0xc64
  runtime.call16()
      /usr/local/go/src/runtime/asm_amd64.s:724 +0x48
  reflect.Value.Call()
      /usr/local/go/src/reflect/value.go:368 +0xc7
  github.com/stretchr/testify/suite.Run.func1()
      /go/pkg/mod/github.com/stretchr/[email protected]/suite/suite.go:175 +0x6e6
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1446 +0x216
  testing.(*T).Run.func1()
      /usr/local/go/src/testing/testing.go:1493 +0x47
 
Goroutine 474 (running) created at:
  go.temporal.io/server/common/tasks.(*InterleavedWeightedRoundRobinScheduler[...]).Start()
      /temporal/common/tasks/interleaved_weighted_round_robin.go:195 +0x15d
  go.temporal.io/server/common/tasks.(*interleavedWeightedRoundRobinSchedulerSuite).TestUpdateWeight()
      /temporal/common/tasks/interleaved_weighted_round_robin_test.go:345 +0xbc
  runtime.call16()
      /usr/local/go/src/runtime/asm_amd64.s:724 +0x48
  reflect.Value.Call()
      /usr/local/go/src/reflect/value.go:368 +0xc7
  github.com/stretchr/testify/suite.Run.func1()
      /go/pkg/mod/github.com/stretchr/[email protected]/suite/suite.go:175 +0x6e6
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1446 +0x216
  testing.(*T).Run.func1()
      /usr/local/go/src/testing/testing.go:1493 +0x47
 
Goroutine 473 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:1493 +0x75d
  github.com/stretchr/testify/suite.runTests()
      /go/pkg/mod/github.com/stretchr/[email protected]/suite/suite.go:220 +0x198
  github.com/stretchr/testify/suite.Run()
      /go/pkg/mod/github.com/stretchr/[email protected]/suite/suite.go:193 +0x9ae
  go.temporal.io/server/common/tasks.TestInterleavedWeightedRoundRobinSchedulerSuite()
      /temporal/common/tasks/interleaved_weighted_round_robin_test.go:68 +0x44
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1446 +0x216
  testing.(*T).Run.func1()
      /usr/local/go/src/testing/testing.go:1493 +0x47
==================
2023-02-15T01:35:18.742Z	info	interleaved weighted round robin task scheduler stopped	{"logging-call-at": "interleaved_weighted_round_robin.go:215"}
FAIL: TestInterleavedWeightedRoundRobinSchedulerSuite (0.77s)
    --- FAIL: TestInterleavedWeightedRoundRobinSchedulerSuite/TestUpdateWeight (0.00s)
        interleaved_weighted_round_robin_test.go:391:
            	Error Trace:	/temporal/common/tasks/interleaved_weighted_round_robin_test.go:391
            	Error:      	Not equal:
            	            	expected: []int{8, 8, 8, 8, 5, 8, 5, 8, 5, 8, 5, 8, 5, 1, 1}
            	            	actual  : []int{5, 5, 5, 3, 5, 3, 2, 5, 3, 2, 1}
 
            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -1,16 +1,12 @@
            	            	-([]int) (len=15) {
            	            	- (int) 8,
            	            	- (int) 8,
            	            	- (int) 8,
            	            	- (int) 8,
            	            	+([]int) (len=11) {
            	            	  (int) 5,
            	            	- (int) 8,
            	            	  (int) 5,
            	            	- (int) 8,
            	            	  (int) 5,
            	            	- (int) 8,
            	            	+ (int) 3,
            	            	  (int) 5,
            	            	- (int) 8,
            	            	+ (int) 3,
            	            	+ (int) 2,
            	            	  (int) 5,
            	            	- (int) 1,
            	            	+ (int) 3,
            	            	+ (int) 2,
            	            	  (int) 1
            	Test:       	TestInterleavedWeightedRoundRobinSchedulerSuite/TestUpdateWeight
        testing.go:1319: race detected during execution of test
    testing.go:1319: race detected during execution of test
FAIL

How did you test it?
CI

% go test --race -count=10 -run TestInterleavedWeightedRoundRobinSchedulerSuite ./common/tasks/...  
ok      go.temporal.io/server/common/tasks      5.184s

Potential risks
Unknown

Is hotfix candidate?
Revert

@mindaugasrukas mindaugasrukas requested a review from a team as a code owner February 16, 2023 05:25
Copy link
Member

@yycptt yycptt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you see this issue with #3961? I can't repro the data race with that fix.

Also that data race is because we want to inspect the value in the test. I don't think the actual code have this issue, so I am against simply adding a lock as the fix.

@mindaugasrukas
Copy link
Contributor Author

Closing in favor of #3961

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants