Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory for blend state #156

Closed
raphlinus opened this issue Mar 2, 2022 · 0 comments · Fixed by #173
Closed

Memory for blend state #156

raphlinus opened this issue Mar 2, 2022 · 0 comments · Fixed by #173
Labels
architecture Changes to the way it's put together

Comments

@raphlinus
Copy link
Contributor

In this issue is the design that will hopefully resolve issues with the blend state, mostly #83 but it's also come up in #155 and possibly other issues.

There are several parts. I'll work backward from fine rasterization, as that may be a clearer presentation.

In fine rasterization, the blend stack is split into two segments. The first segment is a small array in vector registers (ie function scope). The exact size of that array is a tunable parameter, small enough it doesn't cause a spill to scratch, large enough to accommodate "most" blending. Also note, this parameter interacts with CHUNK (the number of pixels rendered by a single thread). My gut feeling is that reducing CHUNK to 4 and an array size of 4 will be close to the sweet spot.

The second segment is a region of memory in a separate read/write blend stack binding. This way, the memory (ptcl) binding can be read-only, which appears to be important on at least some hardware (Pixel 4, see #83). The offset of that region is written into the ptcl by coarse rasterization.

The logic to select between the two is a simple clip_depth < BLEND_STACK_SIZE predicate. I've considered more complex logic, something like a stack window, but I think keeping it simple is a win.

In coarse rasterization, we reinstate the logic by Elias to compute maximum blend stack depth. However, we subtract off the small array size and only allocate if it's above that. Note that this lifts a hard limit on blend depth, but at the same time introduces a new requirement for memory allocation. Obviously the blend stack buffer needs to be large enough to hold all allocations. We'll have another architecture issue coming up soon with a more detailed plan for that. For the time being, it's come up with a number for the size of that buffer, cross fingers and hope it's big enough.

One more point, addressing approaches considered and rejected. It's disappointing that memory used by the blend stack can't be recovered after a workgroup does its rendering. I thought about a malloc/free approach, but there are a number of problems. One is that free requires release semantics and malloc requires acquire, if memory is to be reused, and this barrier is not available on current Metal. Another is that while it's very likely that the number of concurrently resident workgroups is much smaller than the total, it is very difficult to quantify that in a reliable, portable way. Thus, a conservative or worst-case estimate for the required allocation may not be significantly smaller than adding up the total allocation, and the difference in complexity is significant.

@raphlinus raphlinus added the architecture Changes to the way it's put together label Mar 2, 2022
raphlinus added a commit that referenced this issue May 19, 2022
Split the blend stack into register and memory segments. Do blending in registers up to that size, then spill to memory if needed.

This version may regress performance on Pixel 4, as it uses common memory for the blend stack, rather than keeping that memory read-only in fine rasterization, and using a separate buffer for blend stack. This needs investigation. It's possible we'll want to have single common memory as a config option, as it pools allocations and decreases the probability of failure.

Also a flaw in this version: there is no checking of memory overflow.

For understanding code history: this commit largely reverts #77, but there were some intervening changes to blending, and this commit also implements the split so some of the stack is in registers.

Closes #156
@raphlinus raphlinus mentioned this issue May 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
architecture Changes to the way it's put together
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant