Remove manual blend stack spilling and rely on scratch memory instead #77
Conversation
piet-gpu/shader/kernel4.comp (Outdated)

barrier();
return sh_clip_alloc;
}
#define MAX_BLEND_STACK 256
The old code didn't have a limit on the stack size, however large it grew. It may be OK, but there are few (no?) other hard limits in piet-gpu.
We can compute the worst case stack depth during encode; can that information be used here somehow? Perhaps a specialization constant, but I'd like to avoid dynamic shader source to make it easier to do CPU fallback.
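For illustration only, here is a minimal sketch of what the specialization-constant route might look like, assuming Vulkan-style GLSL and the existing CHUNK constant; exposing MAX_BLEND_STACK through constant_id is hypothetical, not current piet-gpu code:

// Hypothetical: let the CPU driver set the stack depth from the worst-case
// clip depth computed during encode, instead of hard-coding it in the shader.
layout(constant_id = 0) const uint MAX_BLEND_STACK = 256;

// Per-thread blend stack sized by the specialization constant; CHUNK is the
// existing per-thread pixel chunk size.
uint blend_stack[MAX_BLEND_STACK][CHUNK];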
It's a rather unfortunate limitation of this approach, and no, I don't think we can change the allocation size of the scratch buffer without dynamic shader source in a portable way. Even then, 256 is quite a conservative value, so I hope it will never need to be raised.
piet-gpu/shader/kernel4.comp (Outdated)

@@ -201,15 +201,16 @@ void main() {
break;
case Cmd_BeginClip:
for (uint k = 0; k < CHUNK; k++) {
blend_stack[blend_sp][k] = packsRGB(vec4(rgb[k], clamp(abs(area[k]), 0.0, 1.0)));
// Safeguard against overflow, but do not guarantee correct behavior.
This may be OK to do here, but I'll note that piet-gpu doesn't have many other runtime checks or guards like this. Perhaps this check should go in the CPU driver, where the user can get a proper error about the limit.
I added an additional check in render_ctx.
Force-pushed from fce1d80 to 85e1f54.
I think this is worth doing. I also see some improvement, though it's a little hard to measure on this system (I need to figure out a better way to lock clock speed).
The question of hard limits is a tricky one. The old code can certainly fail too, especially as memory is not currently reclaimed at the end of the workgroup. There are, as usual, some mitigations that can be employed. Probably one of my favorites is to monitor the maximum clip depth and set a specialization constant in the fine raster shader based on it.
A much more sophisticated approach would be to break down highly complex scenes into layers, with intermediate results saved to textures. There's a lot of complexity and subtlety to that, especially when you go from ordinary Porter-Duff "over" to arbitrary blend modes; the former are associative, which means a subscene can be rendered independently and composited in, but arbitrary blend modes (when "isolated" is false) require the underlayers to be present in the buffer when drawing begins. Implementing all that is likely not necessary, as such scenes are pretty far outside what's reasonable. (I do think fairly deeply nested clips are reasonable, as one can imagine a UI in which containers clip their children.)
But non-associative blend modes aside, rendering layers to textures might ultimately be a useful strategy. I probably wouldn't implement it just for this, but there are other reasons to implement the mechanism: one is that intermediate textures are necessary for blur, and another is that a layer-to-texture mechanism might speed up certain things like scrolling (with integer pixel offsets).
Overall I think it's worth merging this code now because it's a simplification of the code, and that's good.
@@ -204,6 +205,9 @@ impl RenderContext for PietGpuRenderContext {
self.elements.push(Element::BeginClip(Clip {
bbox: Default::default(),
}));
if self.clip_stack.len() >= MAX_BLEND_STACK {
panic!("Maximum clip/blend stack size {} exceeded", MAX_BLEND_STACK);
I'm not convinced this should be a panic. For one, it is actually pretty likely that the clip stack depth in fine rasterization is significantly less than the depth in the scene graph, because of optimizations (all empty or all alpha = 1.0 tiles). But I can see the rationale of catching it here, so the GPU doesn't need to handle overflow cases.
piet-gpu/shader/kernel4.comp (Outdated)

for (uint k = 0; k < CHUNK; k++) {
blend_stack[blend_slot][k] = packsRGB(vec4(rgb[k], clamp(abs(area[k]), 0.0, 1.0)));
// We reject any inputs that might overflow in render_ctx.rs.
Indentation seems goofy here.
Please compare 22507de, which simplifies kernel4 without imposing hard limits on clip depth. It does so by computing the maximum depth of the clip stack in coarse.comp and allocating scratch space before running kernel4. In my tests, the performance is ~unchanged even when doubling the stack entry size to support RGBA output. Without pre-allocation, RGBA support doubled k4 time on my Intel GPU.
Ah, I'll take a look at that shortly; I didn't realize it was also addressing the same issue. Let's hold off on merging this until I've had a chance to look.
With #75 merged, I believe we can use this work as a potential optimization of the clip stack. I suggest we maintain a stack of N (MAX_BLEND_STACK in this PR) clip states in thread-local memory (blend_stack in this PR). Then, add two new commands to ptcl.rs: BeginClipLocal and EndClipLocal. They behave like BeginClip/EndClip but use the thread-local stack instead of global memory. Finally, change coarse.comp to use BeginClipLocal/EndClipLocal for the topmost N clip operations, and BeginClip/EndClip for the remaining ones. It may be worth keeping the thread-local RGBA colors in a vec4 to avoid the sRGB conversions. It may also be worth adding specialized BeginClipLocal0, BeginClipLocal1... commands for low values of N.
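To make the suggestion concrete, here is a rough, hypothetical sketch of the state the fine-raster side could hold; MAX_LOCAL_CLIPS, local_clip and local_sp are illustrative names, not part of the current ptcl or kernel4:

// Hypothetical thread-local fast path for the topmost N clip states.
#define MAX_LOCAL_CLIPS 4

// Unpacked RGBA kept in registers, avoiding the packsRGB/unpacksRGB round
// trip used by the global-memory BeginClip/EndClip path.
vec4 local_clip[MAX_LOCAL_CLIPS][CHUNK];
uint local_sp = 0;

// Cmd_BeginClipLocal would push onto local_clip and Cmd_EndClipLocal would
// pop and blend, while the existing Cmd_BeginClip/Cmd_EndClip continue to
// handle clips nested deeper than MAX_LOCAL_CLIPS through global memory.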
Those certainly sound like interesting optimizations to explore. Though given the potential complexity (and the fact that it's probably low impact), I'm not going to prioritize this right now.
OK, so while I'm on this, writing down some thoughts: the "fast" path could use shared memory instead of private memory, because it can reliably avoid global memory traffic. (Raph mentioned that "dirty entries at the end of workgroup can simply be discarded rather than written all the way through", but that probably wouldn't happen on current-gen hardware because all the faster caches are write-through.) But shared memory is also pretty scarce, so we can only have around 1–3 entries in it. And at the end of the day, shared memory on most GPUs is just L0/L1 cache without the adaptive replacement capabilities, so I'm not sure it's even worth doing. So the more naive private memory approach is probably still better.
Force-pushed from b967956 to 0347a77.
Rebased. Sample scene output looks OK, but I mostly forgot how the RGBA blending works, so I'd appreciate someone double-checking this. Edit: I also halved the stack limit to 128 to compensate for the increase in data stored. Given the context in #83, the direction is to use a fixed limit, which hopefully is fine (at least for typical vector documents that are not too creative). I'm still unsure regarding #77 (comment): I chose panic mostly due to laziness (and the lack of proper logging integration in piet-gpu), and I mostly think this is fine until someone actually trips the limit. But maybe we should make it a bit more future-proof.
Force-pushed from 0347a77 to 56dee09.
Actually, hold on a bit: I'm getting a TDR on this for whatever reason.
v2: Add a panic when the nested blend depth exceeds the limit.
v3: Rebase and partially remove code introduced in 22507de.
Force-pushed from 56dee09 to 7a2dc37.
It was a leftover piece; the TDR is now fixed.
Great, let's get this in, and thanks for updating it. After it's merged would be a good time to redo performance measurement on Pixel 4.
I do still have the concern that the panic on encoding when the depth is exceeded is too conservative, but it does prevent things from going badly wrong and I think can be revisited later.
Hmm, looks like I broke something, and I now see glitches in the test scene. I'll try to come up with a fix soon, but we can also revert this if necessary. Edit: it actually works fine on Linux but not Windows (Radeon 5700 XT). Maybe it's because I'm running an old driver on Windows (for compatibility reasons), which I have previously seen corrupt codegen.
I'm not seeing glitches in the test scene (also Windows, Radeon 5700 XT). It's quite plausible there's something wrong, since it's not hard to trigger nondeterministic behavior, but I'd like to get this tracked down. Any hints about how to repro would be helpful, especially as I plan on rebasing new work (gradients right now) on top of this.
At the time of testing I was using the 20.5.1 HWS driver. I tested with it again and the issue was still there; then I upgraded the driver and the corruption was gone. So I think downgrading the driver is enough to reproduce the issue, which is why it didn't reproduce on your end, and at least I can blame the driver if I don't find the root cause ;)

The corruption looked like random green-ish artifacts changing position every frame.

I captured a Radeon Graphics Profiler profile if you want to look at the shader assembly without reinstalling the driver. (Open it, go to the Events tab, select "Event Timing", choose the kernel4 shader, which is the bottommost invocation, then go to "Instruction Timing" to view the ISA.) I tried to compare code between this and a newer driver version, but unfortunately codegen has changed too much and it wasn't feasible to analyze the difference.
Split the blend stack into register and memory segments. Do blending in registers up to that size, then spill to memory if needed.

This version may regress performance on Pixel 4, as it uses common memory for the blend stack rather than keeping that memory read-only in fine rasterization and using a separate buffer for the blend stack. This needs investigation. It's possible we'll want to have a single common memory as a config option, as it pools allocations and decreases the probability of failure.

Also a flaw in this version: there is no checking of memory overflow.

For understanding code history: this commit largely reverts #77, but there were some intervening changes to blending, and this commit also implements the split so that some of the stack is in registers.

Closes #156
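As a rough illustration of the register/memory split described in that commit message: BLEND_STACK_SPLIT, blend_spill_ix, the helper names and the flat memory[] indexing below are assumptions for this sketch, not the actual implementation (memory[] is assumed to be the common uint buffer declared elsewhere).

// Hypothetical split blend stack: the first BLEND_STACK_SPLIT entries stay
// in registers, deeper entries spill to the common memory buffer.
#define BLEND_STACK_SPLIT 4

uint blend_stack[BLEND_STACK_SPLIT][CHUNK];
uint blend_spill_ix; // base offset of this thread's spill area in memory[]

void store_blend(uint sp, uint k, uint value) {
    if (sp < BLEND_STACK_SPLIT) {
        blend_stack[sp][k] = value;
    } else {
        memory[blend_spill_ix + (sp - BLEND_STACK_SPLIT) * CHUNK + k] = value;
    }
}

uint load_blend(uint sp, uint k) {
    if (sp < BLEND_STACK_SPLIT) {
        return blend_stack[sp][k];
    }
    return memory[blend_spill_ix + (sp - BLEND_STACK_SPLIT) * CHUNK + k];
}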
This replaces the old register-based blend stack (plus manual allocation for spill) with one that relies on the shader compiler to spill the variables to private/scratch memory. The spilling has been reliable in my testing, and I think it's probably OK to rely on it because:
It's worth noting that while the scratch memory is backed by global memory, it also goes through the L1 cache, making most of the loads as fast as shared memory.
The benefits of this approach are:
The memory usage due to overallocation might be a concern, but I think it's probably acceptable. My Radeon 5700 XT has 40 CUs and can execute up to 40 wave64s per CU. Therefore, the peak memory usage due to the spilling is:
40 CUs * 40 wave64s/CU * 64 (lanes/wave64) * MAX_BLEND_STACK (= 256 entries/lane) * 4 (bytes/entry) = 104.9 MB

The required memory is proportional to the compute capacity of the GPU, so this should be equally negligible on lower-end GPUs. If this is still a concern, then lowering MAX_BLEND_STACK is also an option.

I'm seeing around a 15% improvement with the default test case (clip_test + alpha_test + things in render_scene). The new code itself is actually a little more expensive, I think, but the extra occupancy (and other improvements) compensates for that.
One alternative to explore in the future is to use shared memory to store a small part of the array. This avoids writes to global memory/L2 cache and can give more consistent cache behavior. But currently I feel that the simple logic in this PR would be good enough.
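As a very rough sketch of that shared-memory variant, under assumed values: the workgroup size and SHARED_BLEND_DEPTH below are illustrative, not derived from the actual kernel4 configuration.

// Hypothetical: keep the top SHARED_BLEND_DEPTH blend-stack entries in
// shared memory, one slot per thread per chunk pixel; anything deeper
// would still go to private/scratch memory as in this PR.
#define SHARED_BLEND_DEPTH 2
#define WG_THREADS 256 // illustrative workgroup size

shared uint sh_blend_stack[SHARED_BLEND_DEPTH][WG_THREADS * CHUNK];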