Improve capture sleep scheduling on Windows #1288

Merged: 8 commits, merged Aug 27, 2023

Conversation

@ns6089 (Contributor) commented May 16, 2023

Description

This adds a common synchronization point for consecutive frames. Previously we calculated the sleep duration on a frame-to-frame basis, which quickly accumulated errors because the sleep function always overshoots (up to 1.5 ms normally, 0.5 ms on average).

It also removes the delay in releasing the duplication surface, which dropped frames when the surface was not available during a flip, and removes the DwmFlush call, because that hack is no longer necessary.

Related #1203, #1168
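
To illustrate the scheduling idea, here is a minimal sketch (illustrative names only, not the actual Sunshine code) in which every frame's deadline is derived from one shared reference point, so a single sleep overshoot shrinks the next sleep instead of accumulating across frames:

```cpp
// Minimal sketch, not Sunshine's implementation: pace against one common
// reference point instead of the previous frame's wake-up time.
#include <chrono>
#include <cstdint>
#include <thread>

void capture_loop(std::chrono::nanoseconds frame_interval) {
  using clock = std::chrono::steady_clock;
  const auto reference = clock::now();  // common synchronization point
  for (std::uint64_t frame = 1;; ++frame) {
    // Absolute deadline: any overshoot on this frame is absorbed by the
    // next iteration rather than carried forward.
    std::this_thread::sleep_until(reference + frame * frame_interval);
    // ... acquire and encode the next frame here ...
  }
}
```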

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Dependency update (updates to dependencies)
  • Documentation update (changes to documentation)
  • Repository update (changes to repository files, e.g. .github/...)

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated the in code docstring/documentation-blocks for new or existing methods/components

Branch Updates

LizardByte requires that branches be up-to-date before merging. This means that after any PR is merged, this branch
must be updated before it can be merged. You must also
Allow edits from maintainers.

  • I want maintainers to keep my branch updated

@ns6089 (Contributor, author) commented May 16, 2023

Currently in draft; the code is missing comments in some places and needs careful testing, as it affects the core of the program.

@ns6089 (Contributor, author) commented May 16, 2023

previously we calculated sleep duration on frame-to-frame basis

This is actually not true; we did use a common reference point. That point was just set before the snapshot() call, not after it. So it did fine at limiting the rate, but had problems with the frame timings within that rate. It was especially obvious when we had mouse movement over a dynamic background, and delaying the release of the duplication surface made the issue worse.

@ns6089 ns6089 assigned ns6089 and unassigned ns6089 Jul 11, 2023
@ns6089 ns6089 self-assigned this Aug 26, 2023
@ns6089 ns6089 marked this pull request as ready for review August 26, 2023 11:26
@ns6089 (Contributor, author) commented Aug 26, 2023

Should be ready for wider testing.
It works really well for me on Windows 10 with SDR and NVIDIA, but I have no idea about other configs.
Reverting the NVIDIA GPU scheduling priority back to realtime before testing possibly makes sense too, since it's going to be the default (once I finish the new config page and D3D VRAM reservation).

@cgutman (Collaborator) commented Aug 26, 2023

Performance looks great on my end on an Arc A770. With frame pacing enabled in Moonlight, stream capture rates that evenly divide the display refresh rate are smooth (30/60/120 FPS with a 120 Hz display, or 30/60 FPS with a 60 Hz display). The only configuration that didn't perform well was with the display at 144 Hz, since my client's refresh rate can't evenly divide that. The solution to that is adjusting the host display mode, so that's something to tackle as a separate problem.
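
For readers wondering what "evenly divide" means here, a trivial illustration (the function name is just for this example): a stream rate paces cleanly only when the host refresh rate is an integer multiple of it, e.g. 120 / 60 = 2, but 144 / 60 = 2.4.

```cpp
// Illustrative only: 120 % 60 == 0, so 60 FPS paces cleanly on a 120 Hz
// display, while 144 % 60 != 0 means judder at 144 Hz.
bool divides_evenly(int host_refresh_hz, int stream_fps) {
  return stream_fps > 0 && host_refresh_hz % stream_fps == 0;
}
```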

@psyke83 @HakanFly @FrogTheFrog When you have a chance, please give this build a try to check pacing performance.

@HakanFly commented:

I retrieved the artifact. Is activating the frame pacing option mandatory for testing the PR?

@cgutman (Collaborator) commented Aug 26, 2023

Frame pacing is not mandatory, but if you're looking for perfectly smooth (or as close as possible) video, you won't get it without frame pacing enabled. It shouldn't microstutter any worse than before with frame pacing off.

@HakanFly commented:

I can confirm that it's very smooth with my RX 7900XTX. (60 or 120 FPS @4k)
Is it a placebo or does it seem to have a positive impact on performance? I've been using Spider-man remaster for months to test streaming performance with different drivers etc. My minimum FPS has never been so high (~90 vs 105).

However, frame pacing leads to an increase in the average frame queue delay (I get spikes at 21 ms) with HEVC.
It's much more contained with HEVC Main10 SDR.
But I think that's another subject.

Also, I use VRR/FreeSync on the client PC, so I don't feel the need to activate frame pacing or v-sync. But I'm not the most sensitive in this domain.

@ns6089 (Contributor, author) commented Aug 26, 2023

Is it a placebo or does it seem to have a positive impact on performance? I've been using Spider-man remaster for months to test streaming performance with different drivers etc. My minimum FPS has never been so high (~90 vs 105).

As far as I can tell, an improvement in capture frame rate is expected. Our previous algorithm was good at limiting capture fps, but it limited more than was absolutely required in some situations.

@FrogTheFrog (Collaborator) commented Aug 26, 2023

My experience is a little different :/.

My test setup:

  • RTX 3080Ti
  • Desktop framerate of 120Hz with games limited to 60FPS via RTSS
  • Ran the same route in a vehicle in Cyberpunk

I tested 2 cases:

  1. Without e21b571 it is smooth most of the time until Cyberpunk loads something, and then the frametime goes bonkers for about 5-10 seconds. Sometimes the frametime starts going bonkers immediately when the game loads.
  2. With e21b571 the reported capture rate is 59.9999 fps. The frametime becomes unstable more often than without this commit. If we compare how the frametime looks without and with the commit when it goes bonkers, without it, it looks more or less like noise, whereas with the commit it looks like hacksaw teeth:
    [screenshot: frametime graph]

I have also applied this hack to both cases, and whenever they start going bonkers, it immediately fixes it within a couple of frames; I would not know the frametime was going bonkers unless I looked at the graph.

@FrogTheFrog (Collaborator) commented:

@ns6089 please consider adding the hack even if it's under some optional flag or something. I have been living with option 1+hack for months now and it just feels so good :'(

@ns6089 (Contributor, author) commented Aug 26, 2023

@ns6089 please consider adding the hack even if it's under some optional flag or something. I have been living with option 1+hack for months now and it just feels so good :'(

Does your hack work with disabled frame limiter?

@FrogTheFrog (Collaborator) commented:

@ns6089 please consider adding the hack even if it's under some optional flag or something. I have been living with option 1+hack for months now and it just feels so good :'(

Does your hack work with disabled frame limiter?

You mean game is running at 120FPS, while streaming 60FPS?
I sadly can't stream higher :/

@ns6089 (Contributor, author) commented Aug 26, 2023

You mean game is running at 120FPS, while streaming 60FPS?

Yes. Desktop at 120, moonlight at 60, and no rtss frame limiter.

@FrogTheFrog (Collaborator) commented:

You mean game is running at 120FPS, while streaming 60FPS?

Yes. Desktop at 120, moonlight at 60, and no rtss frame limiter.

Will need ~20 mins to test it properly.

@ns6089 (Contributor, author) commented Aug 26, 2023

Also

Desktop framerate of 120Hz

Is it exactly 120 Hz? If the capture logic tries to match your display and the external frame limiter doesn't, of course it's going to cause problems. And I don't think we can solve it nicely (outside of some config option), because Sunshine has no information about external frame limiters.

@FrogTheFrog (Collaborator) commented:

Also

Desktop framerate of 120Hz

Is it exactly 120 Hz? If the capture logic tries to match your display and the external frame limiter doesn't, of course it's going to cause problems. And I don't think we can solve it nicely (outside of some config option), because Sunshine has no information about external frame limiters.

According to the NVIDIA settings, yes. It's not the fractional-Hz stuff you see on some monitors.

@ns6089 (Contributor, author) commented Aug 26, 2023

I think the NVIDIA Control Panel rounds the value. Check the Advanced display settings in the Windows Settings app.
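
For anyone who wants to check this programmatically, here is a hedged sketch (generic Win32 code, not part of this PR) that reads the refresh rate as a numerator/denominator pair, which exposes fractional rates (e.g. 59.94 or 119.88 Hz) that control panels may round:

```cpp
// Sketch: print the exact refresh rate of each active display path.
// Link against user32; error handling kept minimal for brevity.
#include <windows.h>
#include <cstdio>
#include <vector>

int main() {
  UINT32 path_count = 0, mode_count = 0;
  if (GetDisplayConfigBufferSizes(QDC_ONLY_ACTIVE_PATHS, &path_count, &mode_count) != ERROR_SUCCESS)
    return 1;
  std::vector<DISPLAYCONFIG_PATH_INFO> paths(path_count);
  std::vector<DISPLAYCONFIG_MODE_INFO> modes(mode_count);
  if (QueryDisplayConfig(QDC_ONLY_ACTIVE_PATHS, &path_count, paths.data(),
                         &mode_count, modes.data(), nullptr) != ERROR_SUCCESS)
    return 1;
  for (UINT32 i = 0; i < path_count; ++i) {
    const DISPLAYCONFIG_RATIONAL &rr = paths[i].targetInfo.refreshRate;
    if (rr.Denominator != 0)
      std::printf("Path %u: %.4f Hz (%u/%u)\n", i,
                  double(rr.Numerator) / rr.Denominator, rr.Numerator, rr.Denominator);
  }
  return 0;
}
```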

@HakanFly commented:

With the CP2077 benchmark, I have a 3.7% performance loss in average FPS (AMD UWP HAGS modded drivers).
It's usually 5-7%.

@FrogTheFrog (Collaborator) commented Aug 26, 2023

You mean game is running at 120FPS, while streaming 60FPS?

Yes. Desktop at 120, moonlight at 60, and no rtss frame limiter.

It seems I could not get it to run at 120 FPS, so will ~90 FPS do?

If yes, then here are some frametimes without the hack when they are stable-ish:
[screenshots: frametime graphs]

There are times when the amplitude is bigger, but the frequency is more or less the same.

Here are frametimes with the hack:
[screenshots: frametime graphs]

Most of the time it's smooth, or this pulsating pattern ^

So yeah, the hack does something.

@FrogTheFrog (Collaborator) commented:

I think the NVIDIA Control Panel rounds the value. Check the Advanced display settings in the Windows Settings app.

Seems to be 120 Hz:
[screenshot: display settings]

@ns6089 (Contributor, author) commented Aug 26, 2023

That "min 9.1 ms" shouldn't happen at all without the "hack" when Moonlight is requesting 60 fps and the network is stable. Are you sure you're testing it correctly?

@ns6089 (Contributor, author) commented Aug 26, 2023

Also make sure that the scheduling priority is reverted to realtime: https://github.com/LizardByte/Sunshine/blob/nightly/src/platform/windows/display_base.cpp#L517
Your instabilities may come from there.
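
For context, the GPU scheduling priority referred to here can be set roughly like this; this is a hedged sketch using the public D3DKMT interface, not a copy of the linked display_base.cpp code, and the realtime class typically requires elevated privileges:

```cpp
// Sketch only: request realtime GPU scheduling priority for the current
// process. Requires d3dkmthk.h (Windows SDK/WDK) and linking against gdi32;
// depending on the toolchain, NTSTATUS may need <winternl.h> first.
#include <windows.h>
#include <winternl.h>
#include <d3dkmthk.h>

bool set_gpu_scheduling_realtime() {
  const NTSTATUS status = D3DKMTSetProcessSchedulingPriorityClass(
    GetCurrentProcess(), D3DKMT_SCHEDULINGPRIORITYCLASS_REALTIME);
  return status == 0;  // 0 == STATUS_SUCCESS
}
```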

@ns6089 (Contributor, author) commented Aug 26, 2023

Also, I will add capture frame interval stats collection to the log tomorrow (European time). Testing it on a Steam Deck introduces too many variables not directly related to capture.

@HakanFly commented:

So a reliable indicator is to install RTSS on the client side and measure frametime with RT graph and min/max?

@FrogTheFrog (Collaborator) commented:

That "min 9.1 ms" shouldn't happen at all without the "hack" when Moonlight is requesting 60 fps and the network is stable. Are you sure you're testing it correctly?

I'm positive that I'm doing it correctly; maybe the Steam Deck does some stuff, as you've mentioned.

Also make sure that the scheduling priority is reverted to realtime: https://github.com/LizardByte/Sunshine/blob/nightly/src/platform/windows/display_base.cpp#L517
Your instabilities may come from there.

Oh wow, testing with realtime only, and just wow. I need to confirm with more testing, but no hack is necessary, and that new commit of yours no longer does anything bad in either capped or uncapped mode.

@ns6089 (Contributor, author) commented Aug 26, 2023

With the CP2077 benchmark, I have a 3.7% performance loss in average FPS (AMD UWP HAGS modded drivers).
It's usually 5-7%.

A surprise, but not unwelcome one. I have no explanation for this.

So a reliable indicator is to install RTSS on the client side and measure frametime with RT graph and min/max?

Those will be the game's rendering frame times; capture operates at a different rate and with various delays. It needs to be measured in Sunshine's code.

@FrogTheFrog (Collaborator) commented:

@ns6089 so realtime+hack does seem to cause very tiny (like 2-3 pixel) bumps now and then, but could be a viable solution until the realtime bug is fixed for nvidia?

@ns6089 (Contributor, author) commented Aug 26, 2023

@ns6089 so realtime+hack does seem to cause very tiny (like 2-3 pixel) bumps now and then, but could be a viable solution until the realtime bug is fixed for nvidia?

That hack still makes no sense to me to be honest. I made very sure that at no point in time is the capture frame interval lower than the requested frame interval, because that can easily lead to network buffer overflows and dropped packets. If you want to limit the average capture frame interval across some window, then you need to add said window to track the past.
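
To make the "window that tracks the past" idea concrete, here is a minimal sketch (illustrative names, not a proposal for this PR): it averages the most recent capture intervals and only shortens the next interval when that average has drifted above the requested interval, with a hard floor so intervals never get short enough to burst packets.

```cpp
// Minimal sketch, not Sunshine code: average the last few capture intervals
// and allow limited catch-up only when that average exceeds the requested
// interval, never dropping below a hard floor.
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <deque>

class windowed_pacer {
public:
  using clock = std::chrono::steady_clock;

  windowed_pacer(std::chrono::nanoseconds requested, std::size_t window = 60)
      : requested_(requested), window_(window) {}

  // Record the current capture time and return the earliest allowed start of
  // the next capture.
  clock::time_point next_deadline(clock::time_point now) {
    timestamps_.push_back(now);
    if (timestamps_.size() > window_)
      timestamps_.pop_front();
    auto interval = requested_;
    if (timestamps_.size() >= 2) {
      const auto span = std::chrono::duration_cast<std::chrono::nanoseconds>(
        timestamps_.back() - timestamps_.front());
      const auto average = span / static_cast<long long>(timestamps_.size() - 1);
      if (average > requested_) {
        // Recent frames ran long: permit a slightly shorter next interval,
        // but never less than half the requested interval.
        const auto deficit = average - requested_;
        interval = std::max(requested_ - deficit, requested_ / 2);
      }
    }
    return now + interval;
  }

private:
  std::chrono::nanoseconds requested_;
  std::size_t window_;
  std::deque<clock::time_point> timestamps_;
};
```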

@HakanFly commented Aug 26, 2023

A surprise, but not unwelcome one. I have no explanation for this.

I still have high hopes for the AMD Direct Capture Path ^^'.
I'm too dependent on HAGS being enabled, and the latest UWP driver version no longer supports it.

@cgutman (Collaborator) commented Aug 26, 2023

A surprise, but not unwelcome one. I have no explanation for this.

I have a guess. Back when we were looking at the RDNA3 performance issue in GPUView, I remember seeing 2 chunks of DWM work happening per frame, and we guessed that one of those was a redundant copy.

Despite MSDN still claiming that holding ReleaseFrame() until the last possible moment is the most performant option, I suspect this changed (perhaps when IDXGIOutput5::DuplicateOutput1() came along). The docs also reference dirty region tracking as a significant reason for delaying ReleaseFrame(). However, modern versions of Windows don't even report individual dirty rects anymore, so I suspect they no longer track them and always perform the full copy. The tracking overhead was probably worse than just doing the full copy every time.

As for the performance increase, my guess is this: when the frame hasn't been released, DWM has to copy each new frame to an internal staging texture because it's unable to copy directly to ours. As a result, there's an extra copy from the staging texture to our texture when we next call AcquireNextFrame(). That's the second block of DWM work in GPUView.

Since DWM should now almost always own the texture when each frame is composited, it can just immediately return it to us without a redundant copy when we call AcquireNextFrame(), because the copy already happened during composition. A copy on its own wouldn't normally be very expensive, but it turns out to be quite slow because it requires a GPU process context switch. These context switches have significant performance penalties, particularly with HAGS disabled.
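
For illustration, here is a hedged sketch of the acquire/copy/release pattern being described (generic Desktop Duplication usage, not the PR's actual code): copy the acquired frame into our own texture and call ReleaseFrame() immediately, rather than holding the surface until just before the next AcquireNextFrame().

```cpp
// Sketch of releasing the duplication surface right after copying it.
#include <dxgi1_2.h>
#include <d3d11.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

HRESULT capture_one_frame(IDXGIOutputDuplication *dup,
                          ID3D11DeviceContext *ctx,
                          ID3D11Texture2D *our_copy) {
  DXGI_OUTDUPL_FRAME_INFO info{};
  ComPtr<IDXGIResource> resource;
  HRESULT hr = dup->AcquireNextFrame(100 /* ms timeout */, &info, &resource);
  if (FAILED(hr))
    return hr;  // DXGI_ERROR_WAIT_TIMEOUT simply means no new frame yet

  ComPtr<ID3D11Texture2D> desktop_tex;
  hr = resource.As(&desktop_tex);
  if (SUCCEEDED(hr))
    ctx->CopyResource(our_copy, desktop_tex.Get());  // keep our own copy

  dup->ReleaseFrame();  // release immediately, not just before the next acquire
  return hr;
}
```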

@FrogTheFrog (Collaborator) commented:

That hack still makes no sense to me to be honest.

I'm afraid I can relate.

My theory now is that the capture/sleep/whatever "window" starts to shift a little because NVIDIA lags behind, and the hack simply makes it sleep less to compensate. This sometimes results in sleeping too little and status = snapshot(pull_free_image_cb, img_out, 0ms, *cursor); returning a bad status.

Then status = snapshot(pull_free_image_cb, img_out, 1000ms, *cursor); would be called to do a proper sleep and realign the "window".

This is my only explanation with many assumptions and too little knowledge.

@ns6089 (Contributor, author) commented Aug 26, 2023

The root of your issue is that frame limiters, by design, work by averaging frame times. They follow unexpectedly short frame times with reactively long ones, and unexpectedly long ones with reactively short ones. Now remember that we can have neither "unexpectedly short" nor "reactively short" (because of unpaced UDP), unless we add averaging too. I think what your hack actually did was make the requested frame rate behave higher than it actually was, like pacing for 70 fps instead of the expected 60 fps.

@psyke83 (Collaborator) commented Aug 26, 2023

Initial testing looks quite good. The test host is a 5700X with an RX 6600 at 1440p, capped at 60 fps with vsync disabled, streaming to a 1440p 60 fps client, and it feels very smooth in all titles tested.

Unlike @HakanFly, I didn't see any significant reduction in performance overhead on my system. CP2077: Baseline is 98.03fps (no capture), nightly is 93.03fps (~94.9%), and this PR is 92.57fps (94.4%). I'm testing with the standard 23.8.1 driver, so the lack of HAGS and lower capture resolution might explain why there's no significant overhead difference on my system.

@ns6089 (Contributor, author) commented Aug 27, 2023

Then I think it should be safe to merge?
Additional functionality can be added at a later date. For example, I still believe frame pacing should be done per encoder, not in capture.

@FrogTheFrog (Collaborator) commented:

I approve. Even though I have issues, they're unrelated to these changes, and this is an overall improvement.

@HakanFly commented:

Unlike @HakanFly, I didn't see any significant reduction in performance overhead on my system. CP2077: Baseline is 98.03fps (no capture), nightly is 93.03fps (~94.9%), and this PR is 92.57fps (94.4%). I'm testing with the standard 23.8.1 driver, so the lack of HAGS and lower capture resolution might explain why there's no significant overhead difference on my system.

No need for HAGS with those stats ^^'

Then I think it should be safe to merge?

I think this is clearly a green light on AMD's side.

@FrogTheFrog (Collaborator) commented:

@ns6089 The hack can be replaced with the "unsupported FPS" setting in Moonlight (as you suggested on Discord); however, I had to specify 61 FPS via the command line :/

@FrogTheFrog (Collaborator) commented:

With the latest changes it still LGTM. Maybe a slight improvement, but I still need the 61 fps thingy for GPU-intensive games until the realtime bug is fixed in the driver.
