This issue describes the design that will hopefully resolve the problems with blend state, mostly #83, but it has also come up in #155 and possibly other issues.
There are several parts. I'll work backward from fine rasterization, as that may be a clearer presentation.
In fine rasterization, the blend stack is split into two segments. The first segment is a small array in vector registers (i.e. function scope). The exact size of that array is a tunable parameter: small enough that it doesn't cause a spill to scratch, large enough to accommodate "most" blending. Also note that this parameter interacts with CHUNK (the number of pixels rendered by a single thread). My gut feeling is that reducing CHUNK to 4 with an array size of 4 will be close to the sweet spot.
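A minimal GLSL sketch of what the register-resident segment might look like; the constant names, sizes, and layout here are illustrative assumptions, not the actual kernel source:

```glsl
#version 450
layout(local_size_x = 16, local_size_y = 16) in;

// Illustrative tunables: small enough to stay in registers, large enough
// for "most" blending. 4 x 4 reflects the suggested sweet spot above.
#define CHUNK 4              // pixels rendered by a single thread
#define BLEND_STACK_SIZE 4   // clip/blend levels kept in registers

void main() {
    // First segment of the blend stack: a function-scope array, one packed
    // RGBA8 word per stack level per pixel of the chunk.
    uint blend_stack[BLEND_STACK_SIZE][CHUNK];
    vec4 rgba[CHUNK];
    // ... fine rasterization loop pushes/pops clip state through blend_stack ...
}
```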
The second segment is a region of memory in a separate read/write blend stack binding. This way, the memory (ptcl) binding can be read-only, which appears to be important on at least some hardware (Pixel 4, see #83). The offset of that region is written into the ptcl by coarse rasterization.
The logic to select between the two is a simple `clip_depth < BLEND_STACK_SIZE` predicate. I've considered more complex logic, something like a stack window, but I think keeping it simple is a win.
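As a sketch of how the selection might read in the fine rasterization kernel; the binding indices, command encoding, and names here are assumptions rather than the real piet-gpu code:

```glsl
#version 450
layout(local_size_x = 16, local_size_y = 16) in;

#define CHUNK 4
#define BLEND_STACK_SIZE 4

// The ptcl can stay read-only; the spill area is the only read/write binding.
layout(set = 0, binding = 0) readonly buffer Ptcl { uint ptcl[]; };
layout(set = 0, binding = 1) buffer BlendSpill { uint blend_spill[]; };

void main() {
    uint blend_stack[BLEND_STACK_SIZE][CHUNK];
    // Base of this tile's spill region; in the real design it is read from
    // the ptcl, where coarse rasterization wrote it. Hard-coded here.
    uint blend_spill_base = 0;
    uint clip_depth = 0;

    // Push one clip/blend level: shallow levels stay in registers, deeper
    // levels spill to the read/write buffer.
    for (uint k = 0; k < CHUNK; k++) {
        uint packed_rgba = 0xff000000u; // placeholder packed color for pixel k
        if (clip_depth < BLEND_STACK_SIZE) {
            blend_stack[clip_depth][k] = packed_rgba;
        } else {
            uint ix = blend_spill_base + (clip_depth - BLEND_STACK_SIZE) * CHUNK + k;
            blend_spill[ix] = packed_rgba;
        }
    }
    clip_depth++;
    // Pop mirrors this: decrement clip_depth first, then use the same
    // predicate to decide whether to read blend_stack or blend_spill.
}
```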
In coarse rasterization, we reinstate the logic by Elias to compute the maximum blend stack depth. However, we subtract off the small array size and only allocate if the depth exceeds that. Note that this lifts the hard limit on blend depth, but at the same time introduces a new requirement for memory allocation. Obviously the blend stack buffer needs to be large enough to hold all allocations. We'll have another architecture issue coming up soon with a more detailed plan for that. For the time being, we'll come up with a number for the size of that buffer, cross our fingers, and hope it's big enough.
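The allocation in coarse rasterization might look roughly like the following. The bump allocator, tile-size constants, and the way the offset is encoded into the ptcl are all placeholders for whatever the memory plan ends up providing:

```glsl
#version 450
layout(local_size_x = 256) in;

#define BLEND_STACK_SIZE 4
// Assumed tile dimensions; the real constants live in the shared config.
#define TILE_WIDTH 16
#define TILE_HEIGHT 16

layout(set = 0, binding = 0) buffer Ptcl { uint ptcl[]; };
// Bump allocator over the blend spill buffer (a hypothetical binding).
layout(set = 0, binding = 1) buffer BlendAlloc { uint blend_alloc; };

void main() {
    // Per-tile maximum blend depth, computed as in Elias's logic (elided).
    uint max_blend_depth = 0;

    if (max_blend_depth > BLEND_STACK_SIZE) {
        // Only the levels that don't fit in registers need backing memory:
        // one word per pixel of the tile per extra stack level.
        uint spill_words = (max_blend_depth - BLEND_STACK_SIZE)
            * TILE_WIDTH * TILE_HEIGHT;
        uint blend_spill_base = atomicAdd(blend_alloc, spill_words);
        // Write the offset into this tile's ptcl segment so fine
        // rasterization knows where its spill region starts (encoding elided).
        ptcl[gl_GlobalInvocationID.x] = blend_spill_base;
    }
}
```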
One more point, addressing approaches considered and rejected. It's disappointing that memory used by the blend stack can't be recovered after a workgroup finishes its rendering. I thought about a malloc/free approach, but there are a number of problems. One is that, if memory is to be reused, free requires release semantics and malloc requires acquire, and that barrier is not available on current Metal. Another is that while the number of concurrently resident workgroups is very likely much smaller than the total, it is very difficult to quantify that in a reliable, portable way. Thus, a conservative or worst-case estimate of the required allocation may not be significantly smaller than adding up the total allocations, and the difference in complexity is significant.
Split the blend stack into register and memory segments. Do blending in registers up to that size, then spill to memory if needed.
This version may regress performance on Pixel 4, as it uses common memory for the blend stack rather than keeping that memory read-only in fine rasterization and using a separate buffer for the blend stack. This needs investigation. It's possible we'll want to offer a single common memory as a config option, as it pools allocations and decreases the probability of failure.
Another flaw in this version: there is no checking for memory overflow.
For understanding code history: this commit largely reverts #77, but there were some intervening changes to blending, and this commit also implements the split so some of the stack is in registers.
Closes #156