-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve capture sleep scheduling on Windows #1288
Conversation
Currently in draft, the code is missing comments in some places. And needs careful testing, as it affects the core of the program. |
This is actually not true, we did use common reference point. Just that point was set before snapshot() call, and not after. So it did fine in limiting the rate, but had problems with the frame timings within that rate, It was especially obvious when we had mouse movement with dynamic background, and delaying the release of duplication surface made the issue worse. |
Should be ready for wider testing. |
Performance looks great on my end on an Arc A770. With frame pacing enabled in Moonlight, integral frame rates of the stream capture rate are smooth (30/60/120 FPS with a 120 Hz display, or 30/60 FPS with a 60 Hz display). The only configuration that didn't perform well was with the display at 144 Hz, since the refresh rate of my client can't evenly divide that. The solution to that is adjusting the host display mode, so that's something to tackle as a separate problem. @psyke83 @HakanFly @FrogTheFrog When you have a chance, please give this build a try to check pacing performance. |
I retrieve the artifact. Is activating the frame pacing option mandatory for testing the PR ? |
Frame pacing is not mandatory, but if you're looking for perfectly smooth (or as close as possible) video, you won't get it without frame pacing enabled. It shouldn't microstutter any worse than before with frame pacing off. |
I can confirm that it's very smooth with my RX 7900XTX. (60 or 120 FPS @4k) However, frame pacing leads to an increase in the average frame queue delay (I get spikes at 21ms) in HEVC. Then I use VRR/Freesync on the client PC. I don't feel the need to activate frame pacing or v-sync. But I'm not the most sensitive in this domain. |
As far as I can tell, improvement to capture frame rate is expected. Our previous algorithm was good at limiting capture fps, but more than it was absolutely required in some situations. |
My experience is a little different :/. My test setup:
I tested 2 cases:
I have also applied this hack to both cases and whenever they start going bonkers, it immediately fixes it withing a couple of frames and I would not know the frametime is going bonkers unless I look at the graph. |
@ns6089 please consider adding the hack even if it's under some optional flag or something. I have been living with option 1+hack for months now and it just feels so good :'( |
Does your hack work with disabled frame limiter? |
You mean game is running at 120FPS, while streaming 60FPS? |
Yes. Desktop at 120, moonlight at 60, and no rtss frame limiter. |
Will need ~20 mins to test it properly. |
Also
Is it exactly 120Hz? If capture logic tries to match your display, and external frame limiter doesn't, of course it's going to cause problems. And I don't think we can solve it nicely (outside of some config option), because sunshine has no information about external frame limiters. |
According to NVidia setting, yes. It's not that decimal Hz stuff that you see on some monitors. |
I think NVIDIA Control Panel does rounding. Look into Advanced display setting in Windows settings. |
With CP 2077 benchmark, I have 3,7% performance loss on average FPS. (AMD UWP HAGS modded drivers) |
That "min 9.1ms" shouldn't happen at all without the "hack" when moonlight is requesting 60fps and network is stable. Are you sure you're testing it correctly? |
Also make sure that scheduling priority is reverted to realtime https://github.com/LizardByte/Sunshine/blob/nightly/src/platform/windows/display_base.cpp#L517 |
Also I will add capture frame interval stats collection to log tomorrow (european time). Testing it on steam deck introduces too many variables not directly related to capture. |
So a reliable indicator is to install RTSS on the client side and measure frametime with RT graph and min/max? |
I'm positive that I'm doing it correctly, maybe SteamDeck does some stuff as you've mentioned.
Oh wow, testing with realtime only and just wow. Need to confirm with more testing, but no hack is necessary and that new commit of yours no longers does anything bad both in capped and uncapped modes. |
A surprise, but not unwelcome one. I have no explanation for this.
That will be game rendering frame times, capture is operating on different rate and with various delays. It needs to be done in sunshine code. |
@ns6089 so realtime+hack does seem to cause very tiny (like 2-3 pixel) bumps now and then, but could be a viable solution until the realtime bug is fixed for nvidia? |
That hack still makes no sense to me to be honest. I made very sure that in no point of time capture frame interval is lower than requested frame interval, because this can easily lead to network buffer overflows and dropped packets. If you want to limit average capture frame intervals across some window, then you need to add said window that will track the past. |
I still have high hopes for the AMD Direct Capture Path ^^'. |
I have a guess. Back when were looking at the RDNA3 performance issue in GPUView, I remember seeing 2 chunks of DWM work happening per frame, and we guessed that one of those was a redundant copy. Despite MSDN still claiming that holding As for the performance increase my guess is: when the frame hasn't been released, DWM has to copy each new frame to an internal staging texture because it's unable to directly copy to ours. As a result, there's an extra copy from the staging texture to our texture when we next call Since DWM should now almost always own the texture when each frame is composited, it can just immediately return it to us without a redundant copy when we call |
I'm afraid I can relate. My theory now is that the capture/sleep/whatever "window" starts to shift a little due to nvidia lagging behind and the hack simply makes it sleep less to compensate. This will result in sleeping too little some times and Then This is my only explanation with many assumptions and too little knowledge. |
The root of your issue is that frame limiters work by design by averaging frame times. They try to follow unexpectedly short frame times with reactively long, and unexpectedly long with reactively short. Now remember that we can't have neither "unexpectedly short", nor "reactively short" (because unpaced UDP), unless we add averaging too. I think what your hack actually did - it made requested frame rate behave higher than it actually was, like pacing for 70fps instead of expected 60fps. |
Initial testing looks quite good. Testing host with 5700x and RX 6600 at 1440p, 60fps capped host with vsync disabled to 1440p 60fps client, and it feels very smooth in all titles tested. Unlike @HakanFly, I didn't see any significant reduction in performance overhead on my system. CP2077: Baseline is 98.03fps (no capture), nightly is 93.03fps (~94.9%), and this PR is 92.57fps (94.4%). I'm testing with the standard 23.8.1 driver, so the lack of HAGS and lower capture resolution might explain why there's no significant overhead difference on my system. |
Then I think it should be safe to merge? |
I approve. Even though I have issues it's unrelated to the changes and is an overall improvement. |
No need of HAGS with this stats ^^'
I think this is clearly a green flag on AMD's side. |
@ns6089 The hack can be replaced with the "unsupported FPS" setting in the Moonlight (as you have suggested in Discord), however I had to specify 61FPS via command line :/ |
AMF doesn't do it and MSDN is unclear about it.
With the latest changes it still LGTM. Maybe a slight improvement, but still need 61fps thingy for gpu-intensive games until the realtime bug is fixed in driver. |
Description
This adds common synchronization point for consecutive frames, previously we calculated sleep duration on frame-to-frame basis, and this quickly accumulated errors because sleep function always overshoots (up to 1.5ms normally, 0.5ms in average).
Also removes the delay in releasing duplication surface, this dropped frames when the surface was not available during flip. And removes DwmFlush, because this hack is no longer necessary.
Related #1203, #1168
Type of Change
.github/...
)Checklist
Branch Updates
LizardByte requires that branches be up-to-date before merging. This means that after any PR is merged, this branch
must be updated before it can be merged. You must also
Allow edits from maintainers.