-
Notifications
You must be signed in to change notification settings - Fork 18k
runtime: let sysmon sleep in netpoll if possible #51651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Currently, the scheduler picks up next to-run g in the following delicate order: 1 locked g. 2 trace reader g. 3 GC backgroud mark worker g. 4 global runq if schedtick%61 == 0. 5 local runq including timers. 6 findrunnable loop. 6.1 finalizer g. 6.2 local runq including timers. 6.3 global runq, no schedtick checked. 6.4 non-blocking netpoll(0) if netpollWaiters > 0 and no other M polling. 6.5 steal any timers and runnable g from other P. spinning state. 6.6 GC idle mark worker g. 6.7 recheck all other P's local runq, idle GC marker and timers. non-spinning state, P is idle. 6.8 blocking netpoll(delay) until next earliest timer or IO ready. if any other M already polling, no blocking netpoll. M parked. If found a runnable g in any step, all steps following it will be skipped. Suppose there are many runnable gs, scheduler would almost find a g before trying findrunnable, so netpoll left untouched. Normal Ms have no opportunity to unpark netpoll waiters g even though IO readied. The special M sysmon polls network if not polled for more than 10ms, which is the only opportunity to ready any netpoll waiters g into the global runq. This causes a false positive timeout problem. I did a simple test to demostrate this problem. Dialing a localhost TCP listening port costs about 10ms. This is obviously not credible. See file runtime/sysmon_test.go for detail test code. Normal Ms will sleep in netpoll waiting for timer and IO ready if no g found at last. The special M sysmon simply sleeps 20us~10ms and does a non-blocking netpoll if needed. This CL let sysmon sleep in netpoll, do a blocking netpoll like what normal Ms do. So netpoll waiters g will be readied into runq as soon as IO readied. With this change, the above mentioned test shows that dialing to localhost TCP listening port costs mostly less than 1ms. Its much more reasonable than 10ms. Detailed test results for the old and new versions are as follows: $ ../../bin/go test -v -run TestSysmonReadyNetpollWaitersASAP === RUN TestSysmonReadyNetpollWaitersASAP sysmon_test.go:105: dialed 85 times within 1.000029211s sysmon_test.go:106: timeBucket count percent sysmon_test.go:108: [ 0, 1)ms 1 1.18% sysmon_test.go:108: [ 1, 2)ms 0 0.00% sysmon_test.go:108: [ 2, 3)ms 0 0.00% sysmon_test.go:108: [ 3, 4)ms 0 0.00% sysmon_test.go:108: [ 4, 5)ms 0 0.00% sysmon_test.go:108: [ 5, 6)ms 0 0.00% sysmon_test.go:108: [ 6, 7)ms 0 0.00% sysmon_test.go:108: [ 7, 8)ms 0 0.00% sysmon_test.go:108: [ 8, 9)ms 1 1.18% sysmon_test.go:108: [ 9,10)ms 2 2.35% sysmon_test.go:108: [10,11)ms 32 37.65% sysmon_test.go:108: [11,12)ms 49 57.65% --- FAIL: TestSysmonReadyNetpollWaitersASAP (1.11s) $ ../../bin/go test -v -run TestSysmonReadyNetpollWaitersASAP === RUN TestSysmonReadyNetpollWaitersASAP sysmon_test.go:105: dialed 2369 times within 1.000021368s sysmon_test.go:106: timeBucket count percent sysmon_test.go:108: [ 0, 1)ms 2209 93.25% sysmon_test.go:108: [ 1, 2)ms 13 0.55% sysmon_test.go:108: [ 2, 3)ms 33 1.39% sysmon_test.go:108: [ 3, 4)ms 68 2.87% sysmon_test.go:108: [ 4, 5)ms 8 0.34% sysmon_test.go:108: [ 5, 6)ms 9 0.38% sysmon_test.go:108: [ 6, 7)ms 6 0.25% sysmon_test.go:108: [ 7, 8)ms 4 0.17% sysmon_test.go:108: [ 8, 9)ms 7 0.30% sysmon_test.go:108: [ 9,10)ms 4 0.17% sysmon_test.go:108: [10,11)ms 5 0.21% sysmon_test.go:108: [11,12)ms 3 0.13% --- PASS: TestSysmonReadyNetpollWaitersASAP (1.11s)
This PR (HEAD: d38fc4e) has been imported to Gerrit for code review. Please visit https://go-review.googlesource.com/c/go/+/392314 to see it. Tip: You can toggle comments from me using the |
Message from Michael Pratt: Patch Set 1: Run-TryBot+1 Trust+1 (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/392314. |
Message from Gopher Robot: Patch Set 1: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/392314. |
Message from Gopher Robot: Patch Set 1: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/392314. |
Message from Gopher Robot: Patch Set 1: TryBot-Result-1 (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/392314. |
Message from Michael Pratt: Patch Set 1: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/392314. |
skip sysmon test on wasm. high system load may starve sysmon thead, skip test.
This PR (HEAD: 38aaf26) has been imported to Gerrit for code review. Please visit https://go-review.googlesource.com/c/go/+/392314 to see it. Tip: You can toggle comments from me using the |
Message from hzzb: Patch Set 2: (3 comments) Please don’t reply on this GitHub thread. Visit golang.org/cl/392314. |
Message from Michael Pratt: Patch Set 2: Run-TryBot+1 Trust+1 Please don’t reply on this GitHub thread. Visit golang.org/cl/392314. |
Message from Gopher Robot: Patch Set 2: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/392314. |
Message from Gopher Robot: Patch Set 2: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/392314. |
Message from Gopher Robot: Patch Set 2: TryBot-Result-1 (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/392314. |
android emulator may be slow. last time bucket looks misleading.
This PR (HEAD: 0cd8c92) has been imported to Gerrit for code review. Please visit https://go-review.googlesource.com/c/go/+/392314 to see it. Tip: You can toggle comments from me using the |
Message from hzzb: Patch Set 2: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/392314. |
Message from hzzb: Patch Set 3: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/392314. |
Message from hzzb: Patch Set 3: (4 comments) Please don’t reply on this GitHub thread. Visit golang.org/cl/392314. |
Message from Michael Pratt: Patch Set 3: Run-TryBot+1 Please don’t reply on this GitHub thread. Visit golang.org/cl/392314. |
Message from Gopher Robot: Patch Set 3: (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/392314. |
Message from Gopher Robot: Patch Set 3: TryBot-Result+1 (1 comment) Please don’t reply on this GitHub thread. Visit golang.org/cl/392314. |
Currently, the scheduler picks up next to-run g in the following
delicate order:
1 locked g.
2 trace reader g.
3 GC backgroud mark worker g.
4 global runq if schedtick%61 == 0.
5 local runq including timers.
6 findrunnable loop.
6.1 finalizer g.
6.2 local runq including timers.
6.3 global runq, no schedtick checked.
6.4 non-blocking netpoll(0) if netpollWaiters > 0 and no other M polling.
6.5 steal any timers and runnable g from other P. spinning state.
6.6 GC idle mark worker g.
6.7 recheck all other P's local runq, idle GC marker and timers.
non-spinning state, P is idle.
6.8 blocking netpoll(delay) until next earliest timer or IO ready.
if any other M already polling, no blocking netpoll. M parked.
If found a runnable g in any step, all steps following it will be skipped.
Suppose there are many runnable gs, scheduler would almost find a g before
trying findrunnable, so netpoll left untouched. Normal Ms have no opportunity
to unpark netpoll waiters g even though IO readied. The special M sysmon polls
network if not polled for more than 10ms, which is the only opportunity to
ready any netpoll waiters g into the global runq. This causes a false positive
timeout problem.
I did a simple test to demostrate this problem. Dialing a localhost TCP
listening port costs about 10ms. This is obviously not credible. See file
runtime/sysmon_test.go for detail test code.
Normal Ms will sleep in netpoll waiting for timer and IO ready if no g
found at last. The special M sysmon simply sleeps 20us~10ms and does
a non-blocking netpoll if needed. This CL let sysmon sleep in netpoll,
do a blocking netpoll like what normal Ms do. So netpoll waiters g will
be readied into runq as soon as IO readied.
With this change, the above mentioned test shows that dialing to localhost
TCP listening port costs mostly less than 1ms. Its much more reasonable than
10ms. Detailed test results for the old and new versions are as follows:
$ ../../bin/go test -v -run TestSysmonReadyNetpollWaitersASAP
=== RUN TestSysmonReadyNetpollWaitersASAP
sysmon_test.go:105: dialed 85 times within 1.000029211s
sysmon_test.go:106: timeBucket count percent
sysmon_test.go:108: [ 0, 1)ms 1 1.18%
sysmon_test.go:108: [ 1, 2)ms 0 0.00%
sysmon_test.go:108: [ 2, 3)ms 0 0.00%
sysmon_test.go:108: [ 3, 4)ms 0 0.00%
sysmon_test.go:108: [ 4, 5)ms 0 0.00%
sysmon_test.go:108: [ 5, 6)ms 0 0.00%
sysmon_test.go:108: [ 6, 7)ms 0 0.00%
sysmon_test.go:108: [ 7, 8)ms 0 0.00%
sysmon_test.go:108: [ 8, 9)ms 1 1.18%
sysmon_test.go:108: [ 9,10)ms 2 2.35%
sysmon_test.go:108: [10,11)ms 32 37.65%
sysmon_test.go:108: [11,12)ms 49 57.65%
--- FAIL: TestSysmonReadyNetpollWaitersASAP (1.11s)
$ ../../bin/go test -v -run TestSysmonReadyNetpollWaitersASAP
=== RUN TestSysmonReadyNetpollWaitersASAP
sysmon_test.go:105: dialed 2369 times within 1.000021368s
sysmon_test.go:106: timeBucket count percent
sysmon_test.go:108: [ 0, 1)ms 2209 93.25%
sysmon_test.go:108: [ 1, 2)ms 13 0.55%
sysmon_test.go:108: [ 2, 3)ms 33 1.39%
sysmon_test.go:108: [ 3, 4)ms 68 2.87%
sysmon_test.go:108: [ 4, 5)ms 8 0.34%
sysmon_test.go:108: [ 5, 6)ms 9 0.38%
sysmon_test.go:108: [ 6, 7)ms 6 0.25%
sysmon_test.go:108: [ 7, 8)ms 4 0.17%
sysmon_test.go:108: [ 8, 9)ms 7 0.30%
sysmon_test.go:108: [ 9,10)ms 4 0.17%
sysmon_test.go:108: [10,11)ms 5 0.21%
sysmon_test.go:108: [11,12)ms 3 0.13%
--- PASS: TestSysmonReadyNetpollWaitersASAP (1.11s)