Add debug utilities for Vulkan #90993

darksylinc · 2024-04-21T23:21:31Z

Features:

Debug-only tracking of objects by type. See get_driver_allocs_by_object_type et al.
Debug-only Breadcrumb info for debugging GPU crashes and device lost
Performance report per frame from get_perf_report
Some VMA calls had to be modified in order to insert the necessary memory callbacks

Functionality marked as "debug-only" is only available in debug or dev builds.

Misc fixes:

Shaders are more aggressively unloaded
Early break optimization in RenderingDevice::uniform_set_create

============================

The work was performed by collaboration of TheForge and Google. I am merely splitting it up into smaller PRs and cleaning it up.

PR comments

This is the first PR from the TheForge changes. This one is the most important one because:

It sets the foundation for the rest of the upcoming changes.
99% of these changes should be harmless. They are not compiled in non-debug and non-dev builds.
They're quite verbose in terms of lines of code, so getting this out of the way really reduces the mental load.

Update

Thanks everyone for the help! Most of the issues have been addressed. Almost all CI passes now. Windows now compiles.

3 outstanding things remain:

~~Turning breadcrumbs into a single uint32_t. I want the enum to still be defined in rendering_device_commons.h and document it better.~~ Done.
~~CI is complaining about draw_list_begin changing its signature (this is correct; although I'm not sure if the error is triggering because the bindings are wrong?). Help & Tips on this are appreciated.~~
I've been talking with Dario, and it appears TheForge added RenderingDevice::shader_destroy_modules, which must be called by hand to free (unused) Shader modules after creating the PSO. We both agree it seems like adding a shotgun to the API just to save a few kilobytes of RAM (maybe a few megabytes?). I will remove it from this PR and ask TheForge for further info (perhaps it's a lot more RAM than we think?). It may also be a better idea to destroy shader modules after each PSO creation (unless they're shared), instead of doing it at a higher level. Done for now. We'll revisit this later.

DarioSamo · 2024-04-21T23:51:23Z

servers/rendering/rendering_device_commons.h

+	enum BreadcrumbMarker {
+		NONE = 0,
+		// Environment
+		REFLECTION_PROBES,
+		SKY_PASS,
+		// Light mapping
+		LIGHTMAPPER_PASS,
+		// Shadows
+		SHADOW_PASS_DIRECTIONAL,
+		SHADOW_PASS_CUBE,
+		// Geometry passes
+		OPAQUE_PASS,
+		ALPHA_PASS,
+		TRANSPARENT_PASS,
+		// Screen effects
+		POST_PROCESSING_PASS,
+		BLIT_PASS,
+		UI_PASS,
+		// Other
+		DEBUG_PASS
+	};
+


I feel the responsibility of RenderingDevice should stay as abstract as possible in this case and not have knowledge of how the rendering pipeline works. Perhaps we can keep the breadcrumbs as a typedef of just uint32_t and the renderers (Forward+, Mobile, or future ones) can assign the semantic meaning to the numbers?

Breadcrumbs don't really add knowledge.

They're just tagging an ID. The idea is that if the app crashes we look at the report and see "what was being executed?" oh breadcrumb says ID = 2, which corresponds to... let me see... SKY_PASS.

It's up to the rendering pipelines to actually use those IDs.

In fact some parts of the code do the following:

breadcrumb = LIGHTMAPPER_PASS | (pass_id << 16)

In other words, it doesn't give knowledge about the upper levels; it's just a way of standardizing some common tags that are more likely to be used.

I totally agree with both of you.

Dario's comment is more about a separation of responsibilities. In the RenderingDevice API, we should just use uint32_t values, and then in the Renderer itself we should use an enum to standardize the common tags.

But wouldn't that mean that:

Mobile and Forward+ would have their own copies of the enum, which can get out of sync. The main issue is that this can cause silly mistakes when debugging that can waste hours (e.g. the user got id = 7, the dev looked up what that means, but saw the wrong version).

Users who want to use the RenderingDevice directly have no information about what that id is for, and even if they do, they would not adhere to any convention.

Btw I looked again and it turns out only 16 bits of each breadcrumb argument are used:

uint32_t breadcrumb = (p_phase << 16) | (p_breadcrumb_data & ((1 << 16) - 1));

Perhaps we could change this argument signature to...?

// Original:
void draw_list_begin(RID p_framebuffer, InitialAction p_initial_color_action, FinalAction p_final_color_action, InitialAction p_initial_depth_action, FinalAction p_final_depth_action, const Vector<Color> &p_clear_color_values, float p_clear_depth, uint32_t p_clear_stencil, const Rect2 &p_region, RDD::BreadcrumbMarker p_phase, uint32_t p_breadcrumb_data) {}

// My proposal:
enum BreadcrumbMarker {
	// Stays the same.
};

uint32_t breadcrumb(BreadcrumbMarker p_phase, uint16_t p_data) {
	return (p_phase << 16) | (p_data & 0xFFFF);
}

void draw_list_begin(RID p_framebuffer, InitialAction p_initial_color_action, FinalAction p_final_color_action, InitialAction p_initial_depth_action, FinalAction p_final_depth_action, const Vector<Color> &p_clear_color_values, float p_clear_depth, uint32_t p_clear_stencil, const Rect2 &p_region, uint32_t p_breadcrumb_data) {}

This way the BreadcrumbMarker enum is still visible to everyone, but it is made clear the underlying datatype is an arbitrary uint32_t.

Looking back on this, I like your suggested approach. I really don't love the RDD having high-level concepts like POST_PROCESSING_PASS or SHADOW_PASS_CUBE, but at the same time, to make this feature useful we want the debug utility to print out the human-readable name for the pass and not force users to look it up. You are right that looking up an ID would be a huge annoyance and significantly reduce the value of breadcrumbs.
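To make the packing discussed in this thread concrete, here is a minimal sketch of how a 32-bit breadcrumb could be built and later decoded into a human-readable name. It follows the `(p_phase << 16) | data` layout from the proposal. The enum values are copied from the diff above; the fixed `uint32_t` underlying type and the helper names (`make_breadcrumb`, `breadcrumb_phase`, `breadcrumb_data`, `breadcrumb_phase_name`, `describe_breadcrumb`) are assumptions made for illustration and are not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Enum values copied from the diff above. The explicit underlying type is an
// assumption of this sketch so that any 16-bit value can be cast back safely.
enum BreadcrumbMarker : uint32_t {
	NONE = 0,
	REFLECTION_PROBES,
	SKY_PASS,
	LIGHTMAPPER_PASS,
	SHADOW_PASS_DIRECTIONAL,
	SHADOW_PASS_CUBE,
	OPAQUE_PASS,
	ALPHA_PASS,
	TRANSPARENT_PASS,
	POST_PROCESSING_PASS,
	BLIT_PASS,
	UI_PASS,
	DEBUG_PASS
};

// Hypothetical helper: marker in the upper 16 bits, caller data in the lower 16.
inline uint32_t make_breadcrumb(BreadcrumbMarker p_phase, uint16_t p_data) {
	assert(static_cast<uint32_t>(p_phase) <= 0xFFFFu); // Marker must fit in 16 bits.
	return (static_cast<uint32_t>(p_phase) << 16) | static_cast<uint32_t>(p_data);
}

// Hypothetical helper: recovers the marker from a packed breadcrumb.
inline BreadcrumbMarker breadcrumb_phase(uint32_t p_breadcrumb) {
	return static_cast<BreadcrumbMarker>(p_breadcrumb >> 16);
}

// Hypothetical helper: recovers the caller data from a packed breadcrumb.
inline uint16_t breadcrumb_data(uint32_t p_breadcrumb) {
	return static_cast<uint16_t>(p_breadcrumb & 0xFFFFu);
}

// Hypothetical helper: maps a marker to its name; unknown values are reported
// as "UNKNOWN" so renderer-specific IDs do not crash the report.
inline const char *breadcrumb_phase_name(BreadcrumbMarker p_phase) {
	static const char *const names[] = {
		"NONE", "REFLECTION_PROBES", "SKY_PASS", "LIGHTMAPPER_PASS",
		"SHADOW_PASS_DIRECTIONAL", "SHADOW_PASS_CUBE", "OPAQUE_PASS",
		"ALPHA_PASS", "TRANSPARENT_PASS", "POST_PROCESSING_PASS",
		"BLIT_PASS", "UI_PASS", "DEBUG_PASS"
	};
	const uint32_t index = static_cast<uint32_t>(p_phase);
	return index < (sizeof(names) / sizeof(names[0])) ? names[index] : "UNKNOWN";
}

// Hypothetical helper: formats a breadcrumb as "<PHASE_NAME>:<data>".
inline std::string describe_breadcrumb(uint32_t p_breadcrumb) {
	return std::string(breadcrumb_phase_name(breadcrumb_phase(p_breadcrumb))) +
			":" + std::to_string(breadcrumb_data(p_breadcrumb));
}
```

With this layout, a crash report carrying a single `uint32_t` can still be printed with the pass name, while the RenderingDevice API itself only sees an opaque integer.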

DarioSamo · 2024-04-21T23:53:02Z

servers/rendering/renderer_rd/forward_mobile/render_forward_mobile.cpp

-		PipelineCacheRD *pipeline = nullptr;
-
-		pipeline = &shader->pipelines[cull_variant][primitive][shader_version];
+		PipelineCacheRD *pipeline = &shader->pipelines[cull_variant][primitive][shader_version];


Just as a heads up, we will have a merge conflict on this case as this code will be replaced in #90400. Seeing as it's for aesthetic reasons I'd forego this change and the removal of the comment below.

Ugh. This is exactly why I included it (so someone could notice it, it wasn't an aesthetic change but more like bait).

There are more upcoming changes that will affect RenderForwardMobile::_render_list_template, since there is an optimization that uses Spec Constants as a key when sorting the draw calls, for better draw call batching.

I will probably have to ask for your help when the time comes.

darksylinc · 2024-04-27T21:34:47Z

I've made an update. Look at the "3 outstanding things remain:" in the top post where I ask for a bit of help 😄

@DarioSamo

The work was performed by collaboration of TheForge and Google. I am merely splitting it up into smaller PRs and cleaning it up.

This is the most "risky" PR so far because the previous ones have been miscellaneous stuff aimed at either [improving debugging](godotengine#90993) (e.g. device lost), [improving the Android experience](godotengine#96439) (add Swappy for better Frame Pacing + Pre-Transformed Swapchains for slightly better performance), or harmless [ASTC improvements](godotengine#96045) (better performance by simply toggling a feature when available).

However this PR contains larger modifications aimed at improving performance or reducing memory fragmentation. With greater modifications come greater risks of bugs or breakage.

Changes introduced by this PR:

TBDR GPUs (e.g. most of Android + iOS + M1 Apple) support rendering to Render Targets that are not backed by actual GPU memory (everything stays in cache). This works as long as the load action isn't `LOAD` and the store action is `DONT_CARE`. This saves VRAM (it also makes it painfully obvious when a mistake introduces a performance regression). It is particularly useful when doing MSAA and keeping the raw MSAA content is not necessary.

Some GPUs get faster when the sampler settings are hard-coded into the GLSL shaders (instead of being dynamically bound at runtime). This required changes to the GLSL shaders, PSO creation routines, Descriptor creation routines, and Descriptor binding routines.

- `bool immutable_samplers_enabled = true`
  Setting it to false enforces the old behavior. Useful for debugging bugs and regressions. Immutable samplers require that the samplers stay... immutable, hence this boolean is useful if the promise gets broken. We might want to turn this into a `GLOBAL_DEF` setting.

Instead of creating dozens/hundreds/thousands of `VkDescriptorSet` every frame that need to be freed individually when they are no longer needed, they all get freed at once by resetting the whole pool. Once the whole pool is no longer in use by the GPU, it gets reset and its memory recycled. Descriptor sets that are created to be kept around for longer or forever (i.e. not created and freed within the same frame) **must not** use linear pools. There may be more than one pool per frame. How many pools per frame Godot ends up with depends on its capacity, and that is controlled by `rendering/rendering_device/vulkan/max_descriptors_per_pool`.

- **Possible improvement for later:** It should be possible for Godot to adapt to how many descriptors per pool are needed on a per-key basis (i.e. grow their capacity like `std::vector` does) after rendering a few frames; which would be better than the current solution of having a single global value for all pools (`max_descriptors_per_pool`) that the user needs to tweak.
- `bool linear_descriptor_pools_enabled = true`
  Setting it to false enforces the old behavior. Useful for debugging bugs and regressions. Setting it to false is required when working around driver bugs (e.g. Adreno 730).

A ridiculous optimization. Ridiculous because the original code should've done this in the first place. Previously Godot was doing the following:

1. Create a command buffer **pool**. One per frame.
2. Create multiple command buffers from the pool in point 1.
3. Call `vkBeginCommandBuffer` on the cmd buffer in point 2. This resets the cmd buffer because Godot requests the `VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT` flag.
4. Add commands to the cmd buffers from point 2.
5. Submit those commands.
6. On frame N + 2, recycle the buffer pool and cmd buffers from pt 1 & 2, and repeat from step 3.

The problem here is that step 3 resets each command buffer individually. Initially Godot used to have 1 cmd buffer per pool, thus the impact was very low. But not anymore (especially with Adreno workarounds to force splitting compute dispatches into a new cmd buffer, more on this later). However Godot keeps around a very low amount of command buffers per frame.

The recommended method is to reset the whole pool, to reset all cmd buffers at once. Hence the new steps would be:

1. Create a command buffer **pool**. One per frame.
2. Create multiple command buffers from the pool in point 1.
3. Call `vkBeginCommandBuffer` on the cmd buffer in point 2, which is already reset/empty (see step 6).
4. Add commands to the cmd buffers from point 2.
5. Submit those commands.
6. On frame N + 2, recycle the buffer pool and cmd buffers from pt 1 & 2, call `vkResetCommandPool` and repeat from step 3.

**Possible issues:** @DarioSamo added `transfer_worker` which creates a command buffer pool:

```cpp
transfer_worker->command_pool = driver->command_pool_create(transfer_queue_family, RDD::COMMAND_BUFFER_TYPE_PRIMARY);
```

As expected, validation was complaining that command buffers were being reused without being reset (that's good, we now know Validation Layers will warn us of wrong use). I fixed it by adding:

```cpp
void RenderingDevice::_wait_for_transfer_worker(TransferWorker *p_transfer_worker) {
	driver->fence_wait(p_transfer_worker->command_fence);
	driver->command_pool_reset(p_transfer_worker->command_pool); // ! New line !
```

**Secondary cmd buffers are subject to the same issue but I didn't alter them. I talked about this with Dario and he is aware of it.** Secondary cmd buffers are currently disabled due to other issues (it's disabled on master).

- `bool RenderingDeviceCommons::command_pool_reset_enabled`
  Setting it to false enforces the old behavior. Useful for debugging bugs and regressions. There's no other reason for this boolean. Possibly once it becomes well tested, the boolean could be removed entirely.

Adds `command_bind_render_uniform_sets` and `add_draw_list_bind_uniform_sets` (+ compute variants). It performs the same as `add_draw_list_bind_uniform_set` (notice singular vs plural), but on multiple consecutive uniform sets, thus reducing graph and draw call overhead.

- `bool descriptor_set_batching = true;`
  Setting it to false enforces the old behavior. Useful for debugging bugs and regressions. There's no other reason for this boolean. Possibly once it becomes well tested, the boolean could be removed entirely.

Godot currently does the following:

1. Fill the entire cmd buffer with commands.
2. `submit()`
   - Wait with a semaphore for the swapchain.
   - Trigger a semaphore to indicate when we're done (so the swapchain can submit).
3. `present()`

The optimization opportunity here is that 95% of Godot's rendering is done offscreen. Then a fullscreen pass copies everything to the swapchain. Godot practically never renders directly to the swapchain.

The problem with this is that the GPU has to wait for the swapchain to be released **to start anything**, when we could start *much earlier*. Only the final blit pass must wait for the swapchain.

TheForge changed it to the following (more complicated, I'm simplifying the idea):

1. Fill the entire cmd buffer with commands.
2. In `screen_prepare_for_drawing` do `submit()`
   - There are no semaphore waits for the swapchain.
   - Trigger a semaphore to indicate when we're done.
3. Fill a new cmd buffer that only does the final blit to the swapchain.
4. `submit()`
   - Wait with a semaphore for the submit() from step 2.
   - Wait with a semaphore for the swapchain (so the swapchain can submit).
   - Trigger a semaphore to indicate when we're done (so the swapchain can submit).
5. `present()`

Dario discovered this problem independently while working on a different platform.

**However TheForge's solution had to be rewritten from scratch:** The complexity to achieve the solution was high and quite difficult to maintain with the way Godot works now (after Übershaders PR). But on the other hand, re-implementing the solution became much simpler because Dario already had to do something similar: To fix an Adreno 730 driver bug, he had to implement splitting command buffers. **This is exactly what we need!** Thus it was re-written using this existing functionality for a new purpose.

To achieve this, I added a new argument, `bool p_split_cmd_buffer`, to `RenderingDeviceGraph::add_draw_list_begin`, which is only set to true by `RenderingDevice::draw_list_begin_for_screen`. The graph will split the draw list into its own command buffer.

- `bool split_swapchain_into_its_own_cmd_buffer = true;`
  Setting it to false enforces the old behavior. This might be necessary for consoles which follow an alternate solution to the same problem. If not, then we should consider removing it.

PR godotengine#90993 added `shader_destroy_modules()` but it was not actually in use. This PR adds several places where `shader_destroy_modules()` is called after initialization to free up memory of SPIR-V structures that are no longer needed.
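The difference between the old and new command buffer flows described above (each buffer implicitly reset on begin versus one `vkResetCommandPool` per recycled pool) can be illustrated with a toy model. This is only a sketch: `ToyCommandBuffer`, `ToyCommandPool`, `run_frames_old` and `run_frames_new` are hypothetical stand-ins that make no Vulkan calls, and counting reset operations says nothing about how expensive each one is on a real driver.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a command buffer; it only stores dummy commands.
struct ToyCommandBuffer {
	std::vector<int> commands;
};

// Hypothetical stand-in for a command pool owning several command buffers.
struct ToyCommandPool {
	std::vector<ToyCommandBuffer> buffers;
	size_t reset_calls = 0; // Number of reset operations of either kind.

	explicit ToyCommandPool(size_t p_buffer_count) : buffers(p_buffer_count) {}

	// Old flow: beginning a buffer implicitly resets that single buffer.
	void begin_with_implicit_reset(size_t p_index) {
		buffers[p_index].commands.clear();
		reset_calls++;
	}

	// New flow: beginning a buffer expects it to be empty already.
	void begin(size_t p_index) {
		assert(buffers[p_index].commands.empty());
	}

	// New flow: one reset for every buffer in the pool (the role that
	// vkResetCommandPool plays in the description above).
	void reset_pool() {
		for (ToyCommandBuffer &buffer : buffers) {
			buffer.commands.clear();
		}
		reset_calls++;
	}
};

// Old flow: one implicit reset per buffer per frame.
size_t run_frames_old(size_t p_buffer_count, size_t p_frames) {
	ToyCommandPool pool(p_buffer_count);
	for (size_t frame = 0; frame < p_frames; frame++) {
		for (size_t i = 0; i < p_buffer_count; i++) {
			pool.begin_with_implicit_reset(i);
			pool.buffers[i].commands.push_back(int(frame)); // Record a dummy command.
		}
	}
	return pool.reset_calls;
}

// New flow: one pool reset each time the pool is recycled.
size_t run_frames_new(size_t p_buffer_count, size_t p_frames) {
	ToyCommandPool pool(p_buffer_count);
	for (size_t frame = 0; frame < p_frames; frame++) {
		if (frame > 0) {
			pool.reset_pool(); // The pool is being reused, so reset it once.
		}
		for (size_t i = 0; i < p_buffer_count; i++) {
			pool.begin(i);
			pool.buffers[i].commands.push_back(int(frame)); // Record a dummy command.
		}
	}
	return pool.reset_calls;
}
```

In this model, 4 buffers over 10 frames cost 40 individual resets in the old flow and 9 pool resets in the new one, which is the shape of the saving the description argues for.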

PR godotengine#90993 needed to get rid of VMA_MEMORY_USAGE_AUTO_PREFER_HOST because we no longer used vmaCreateBuffer, so that we could specify the allocation callbacks. This however resulted in the wrong memory pool being chosen, causing a significant performance slowdown. Indicate additional preferred flags to help VMA select the proper pool. Fixes godotengine#101905

@DarioSamo

The work was performed by collaboration of TheForge and Google. I am merely splitting it up into smaller PRs and cleaning it up. This is the most "risky" PR so far because the previous ones have been miscellaneous stuff aimed at either [improve debugging](godotengine#90993) (e.g. device lost), [improve Android experience](godotengine#96439) (add Swappy for better Frame Pacing + Pre-Transformed Swapchains for slightly better performance), or harmless [ASTC improvements](godotengine#96045) (better performance by simply toggling a feature when available). However this PR contains larger modifications aimed at improving performance or reducing memory fragmentation. With greater modifications, come greater risks of bugs or breakage. Changes introduced by this PR: TBDR GPUs (e.g. most of Android + iOS + M1 Apple) support rendering to Render Targets that are not backed by actual GPU memory (everything stays in cache). This works as long as load action isn't `LOAD`, and store action must be `DONT_CARE`. This saves VRAM (it also makes painfully obvious when a mistake introduces a performance regression). Of particular usefulness is when doing MSAA and keeping the raw MSAA content is not necessary. Some GPUs get faster when the sampler settings are hard-coded into the GLSL shaders (instead of being dynamically bound at runtime). This required changes to the GLSL shaders, PSO creation routines, Descriptor creation routines, and Descriptor binding routines. - `bool immutable_samplers_enabled = true` Setting it to false enforces the old behavior. Useful for debugging bugs and regressions. Immutable samplers requires that the samplers stay... immutable, hence this boolean is useful if the promise gets broken. We might want to turn this into a `GLOBAL_DEF` setting. Instead of creating dozen/hundreds/thousands of `VkDescriptorSet` every frame that need to be freed individually when they are no longer needed, they all get freed at once by resetting the whole pool. 
Once the whole pool is no longer in use by the GPU, it gets reset and its memory recycled. Descriptor sets that are created to be kept around for longer or forever (i.e. not created and freed within the same frame) **must not** use linear pools. There may be more than one pool per frame. How many pools per frame Godot ends up with depends on its capacity, and that is controlled by `rendering/rendering_device/vulkan/max_descriptors_per_pool`. - **Possible improvement for later:** It should be possible for Godot to adapt to how many descriptors per pool are needed on a per-key basis (i.e. grow their capacity like `std::vector` does) after rendering a few frames; which would be better than the current solution of having a single global value for all pools (`max_descriptors_per_pool`) that the user needs to tweak. - `bool linear_descriptor_pools_enabled = true` Setting it to false enforces the old behavior. Useful for debugging bugs and regressions. Setting it to false is required when workarounding driver bugs (e.g. Adreno 730). A ridiculous optimization. Ridiculous because the original code should've done this in the first place. Previously Godot was doing the following: 1. Create a command buffer **pool**. One per frame. 2. Create multiple command buffers from the pool in point 1. 3. Call `vkBeginCommandBuffer` on the cmd buffer in point 2. This resets the cmd buffer because Godot requests the `VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT` flag. 4. Add commands to the cmd buffers from point 2. 5. Submit those commands. 6. On frame N + 2, recycle the buffer pool and cmd buffers from pt 1 & 2, and repeat from step 3. The problem here is that step 3 resets each command buffer individually. Initially Godot used to have 1 cmd buffer per pool, thus the impact is very low. But not anymore (specially with Adreno workarounds to force splitting compute dispatches into a new cmd buffer, more on this later). 
However Godot keeps around a very low amount of command buffers per frame. The recommended method is to reset the whole pool, to reset all cmd buffers at once. Hence the new steps would be: 1. Create a command buffer **pool**. One per frame. 2. Create multiple command buffers from the pool in point 1. 3. Call `vkBeginCommandBuffer` on the cmd buffer in point 2, which is already reset/empty (see step 6). 4. Add commands to the cmd buffers from point 2. 5. Submit those commands. 6. On frame N + 2, recycle the buffer pool and cmd buffers from pt 1 & 2, call `vkResetCommandPool` and repeat from step 3. **Possible issues:** @DarioSamo added `transfer_worker` which creates a command buffer pool: ```cpp transfer_worker->command_pool = driver->command_pool_create(transfer_queue_family, RDD::COMMAND_BUFFER_TYPE_PRIMARY); ``` As expected, validation was complaining that command buffers were being reused without being reset (that's good, we now know Validation Layers will warn us of wrong use). I fixed it by adding: ```cpp void RenderingDevice::_wait_for_transfer_worker(TransferWorker *p_transfer_worker) { driver->fence_wait(p_transfer_worker->command_fence); driver->command_pool_reset(p_transfer_worker->command_pool); // ! New line ! ``` **Secondary cmd buffers are subject to the same issue but I didn't alter them. I talked this with Dario and he is aware of this.** Secondary cmd buffers are currently disabled due to other issues (it's disabled on master). - `bool RenderingDeviceCommons::command_pool_reset_enabled` Setting it to false enforces the old behavior. Useful for debugging bugs and regressions. There's no other reason for this boolean. Possibly once it becomes well tested, the boolean could be removed entirely. Adds `command_bind_render_uniform_sets` and `add_draw_list_bind_uniform_sets` (+ compute variants). 
It performs the same as `add_draw_list_bind_uniform_set` (notice singular vs plural), but on multiple consecutive uniform sets, thus reducing graph and draw call overhead. - `bool descriptor_set_batching = true;` Setting it to false enforces the old behavior. Useful for debugging bugs and regressions. There's no other reason for this boolean. Possibly once it becomes well tested, the boolean could be removed entirely. Godot currently does the following: 1. Fill the entire cmd buffer with commands. 2. `submit()` - Wait with a semaphore for the swapchain. - Trigger a semaphore to indicate when we're done (so the swapchain can submit). 3. `present()` The optimization opportunity here is that 95% of Godot's rendering is done offscreen. Then a fullscreen pass copies everything to the swapchain. Godot doesn't practically render directly to the swapchain. The problem with this is that the GPU has to wait for the swapchain to be released **to start anything**, when we could start *much earlier*. Only the final blit pass must wait for the swapchain. TheForge changed it to the following (more complicated, I'm simplifying the idea): 1. Fill the entire cmd buffer with commands. 2. In `screen_prepare_for_drawing` do `submit()` - There are no semaphore waits for the swapchain. - Trigger a semaphore to indicate when we're done. 3. Fill a new cmd buffer that only does the final blit to the swapchain. 4. `submit()` - Wait with a semaphore for the submit() from step 2. - Wait with a semaphore for the swapchain (so the swapchain can submit). - Trigger a semaphore to indicate when we're done (so the swapchain can submit). 5. `present()` Dario discovered this problem independently while working on a different platform. **However TheForge's solution had to be rewritten from scratch:** The complexity to achieve the solution was high and quite difficult to maintain with the way Godot works now (after Übershaders PR). 
But on the other hand, re-implementing the solution became much simpler because Dario already had to do something similar: To fix an Adreno 730 driver bug, he had to implement splitting command buffers. **This is exactly what we need!**. Thus it was re-written using this existing functionality for a new purpose. To achieve this, I added a new argument, `bool p_split_cmd_buffer`, to `RenderingDeviceGraph::add_draw_list_begin`, which is only set to true by `RenderingDevice::draw_list_begin_for_screen`. The graph will split the draw list into its own command buffer. - `bool split_swapchain_into_its_own_cmd_buffer = true;` Setting it to false enforces the old behavior. This might be necessary for consoles which follow an alternate solution to the same problem. If not, then we should consider removing it. PR godotengine#90993 added `shader_destroy_modules()` but it was not actually in use. This PR adds several places where `shader_destroy_modules()` is called after initialization to free up memory of SPIR-V structures that are no longer needed.

PR godotengine#90993 needed to get rid of VMA_MEMORY_USAGE_AUTO_PREFER_HOST because we no longer used vmaCreateBuffer so we could specify the allocation callbacks. This however resulted in the wrong memory pool being chosen, causing signficant performance slowdown. Indicate additional preferred flags to help VMA select the proper pool. Fixes godotengine#101905

@DarioSamo

The work was performed by collaboration of TheForge and Google. I am merely splitting it up into smaller PRs and cleaning it up. This is the most "risky" PR so far because the previous ones have been miscellaneous stuff aimed at either [improve debugging](godotengine#90993) (e.g. device lost), [improve Android experience](godotengine#96439) (add Swappy for better Frame Pacing + Pre-Transformed Swapchains for slightly better performance), or harmless [ASTC improvements](godotengine#96045) (better performance by simply toggling a feature when available). However this PR contains larger modifications aimed at improving performance or reducing memory fragmentation. With greater modifications, come greater risks of bugs or breakage. Changes introduced by this PR: TBDR GPUs (e.g. most of Android + iOS + M1 Apple) support rendering to Render Targets that are not backed by actual GPU memory (everything stays in cache). This works as long as load action isn't `LOAD`, and store action must be `DONT_CARE`. This saves VRAM (it also makes painfully obvious when a mistake introduces a performance regression). Of particular usefulness is when doing MSAA and keeping the raw MSAA content is not necessary. Some GPUs get faster when the sampler settings are hard-coded into the GLSL shaders (instead of being dynamically bound at runtime). This required changes to the GLSL shaders, PSO creation routines, Descriptor creation routines, and Descriptor binding routines. - `bool immutable_samplers_enabled = true` Setting it to false enforces the old behavior. Useful for debugging bugs and regressions. Immutable samplers requires that the samplers stay... immutable, hence this boolean is useful if the promise gets broken. We might want to turn this into a `GLOBAL_DEF` setting. Instead of creating dozen/hundreds/thousands of `VkDescriptorSet` every frame that need to be freed individually when they are no longer needed, they all get freed at once by resetting the whole pool. 
Once the whole pool is no longer in use by the GPU, it gets reset and its memory recycled. Descriptor sets that are created to be kept around for longer or forever (i.e. not created and freed within the same frame) **must not** use linear pools. There may be more than one pool per frame. How many pools per frame Godot ends up with depends on its capacity, and that is controlled by `rendering/rendering_device/vulkan/max_descriptors_per_pool`. - **Possible improvement for later:** It should be possible for Godot to adapt to how many descriptors per pool are needed on a per-key basis (i.e. grow their capacity like `std::vector` does) after rendering a few frames; which would be better than the current solution of having a single global value for all pools (`max_descriptors_per_pool`) that the user needs to tweak. - `bool linear_descriptor_pools_enabled = true` Setting it to false enforces the old behavior. Useful for debugging bugs and regressions. Setting it to false is required when workarounding driver bugs (e.g. Adreno 730). A ridiculous optimization. Ridiculous because the original code should've done this in the first place. Previously Godot was doing the following: 1. Create a command buffer **pool**. One per frame. 2. Create multiple command buffers from the pool in point 1. 3. Call `vkBeginCommandBuffer` on the cmd buffer in point 2. This resets the cmd buffer because Godot requests the `VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT` flag. 4. Add commands to the cmd buffers from point 2. 5. Submit those commands. 6. On frame N + 2, recycle the buffer pool and cmd buffers from pt 1 & 2, and repeat from step 3. The problem here is that step 3 resets each command buffer individually. Initially Godot used to have 1 cmd buffer per pool, thus the impact is very low. But not anymore (specially with Adreno workarounds to force splitting compute dispatches into a new cmd buffer, more on this later). 
However Godot keeps around a very low amount of command buffers per frame. The recommended method is to reset the whole pool, to reset all cmd buffers at once. Hence the new steps would be: 1. Create a command buffer **pool**. One per frame. 2. Create multiple command buffers from the pool in point 1. 3. Call `vkBeginCommandBuffer` on the cmd buffer in point 2, which is already reset/empty (see step 6). 4. Add commands to the cmd buffers from point 2. 5. Submit those commands. 6. On frame N + 2, recycle the buffer pool and cmd buffers from pt 1 & 2, call `vkResetCommandPool` and repeat from step 3. **Possible issues:** @DarioSamo added `transfer_worker` which creates a command buffer pool: ```cpp transfer_worker->command_pool = driver->command_pool_create(transfer_queue_family, RDD::COMMAND_BUFFER_TYPE_PRIMARY); ``` As expected, validation was complaining that command buffers were being reused without being reset (that's good, we now know Validation Layers will warn us of wrong use). I fixed it by adding: ```cpp void RenderingDevice::_wait_for_transfer_worker(TransferWorker *p_transfer_worker) { driver->fence_wait(p_transfer_worker->command_fence); driver->command_pool_reset(p_transfer_worker->command_pool); // ! New line ! ``` **Secondary cmd buffers are subject to the same issue but I didn't alter them. I talked this with Dario and he is aware of this.** Secondary cmd buffers are currently disabled due to other issues (it's disabled on master). - `bool RenderingDeviceCommons::command_pool_reset_enabled` Setting it to false enforces the old behavior. Useful for debugging bugs and regressions. There's no other reason for this boolean. Possibly once it becomes well tested, the boolean could be removed entirely. Adds `command_bind_render_uniform_sets` and `add_draw_list_bind_uniform_sets` (+ compute variants). 
It performs the same as `add_draw_list_bind_uniform_set` (notice singular vs plural), but on multiple consecutive uniform sets, thus reducing graph and draw call overhead. - `bool descriptor_set_batching = true;` Setting it to false enforces the old behavior. Useful for debugging bugs and regressions. There's no other reason for this boolean. Possibly once it becomes well tested, the boolean could be removed entirely. Godot currently does the following: 1. Fill the entire cmd buffer with commands. 2. `submit()` - Wait with a semaphore for the swapchain. - Trigger a semaphore to indicate when we're done (so the swapchain can submit). 3. `present()` The optimization opportunity here is that 95% of Godot's rendering is done offscreen. Then a fullscreen pass copies everything to the swapchain. Godot doesn't practically render directly to the swapchain. The problem with this is that the GPU has to wait for the swapchain to be released **to start anything**, when we could start *much earlier*. Only the final blit pass must wait for the swapchain. TheForge changed it to the following (more complicated, I'm simplifying the idea): 1. Fill the entire cmd buffer with commands. 2. In `screen_prepare_for_drawing` do `submit()` - There are no semaphore waits for the swapchain. - Trigger a semaphore to indicate when we're done. 3. Fill a new cmd buffer that only does the final blit to the swapchain. 4. `submit()` - Wait with a semaphore for the submit() from step 2. - Wait with a semaphore for the swapchain (so the swapchain can submit). - Trigger a semaphore to indicate when we're done (so the swapchain can submit). 5. `present()` Dario discovered this problem independently while working on a different platform. **However TheForge's solution had to be rewritten from scratch:** The complexity to achieve the solution was high and quite difficult to maintain with the way Godot works now (after Übershaders PR). 
But on the other hand, re-implementing the solution became much simpler because Dario already had to do something similar: to fix an Adreno 730 driver bug, he had to implement splitting command buffers. **This is exactly what we need!** Thus it was re-written using this existing functionality for a new purpose.

To achieve this, I added a new argument, `bool p_split_cmd_buffer`, to `RenderingDeviceGraph::add_draw_list_begin`, which is only set to true by `RenderingDevice::draw_list_begin_for_screen`. The graph will split the draw list into its own command buffer.

- `bool split_swapchain_into_its_own_cmd_buffer = true;` Setting it to false enforces the old behavior. This might be necessary for consoles which follow an alternate solution to the same problem. If not, then we should consider removing it.

PR godotengine#90993 added `shader_destroy_modules()` but it was not actually in use. This PR adds several places where `shader_destroy_modules()` is called after initialization to free up memory of SPIR-V structures that are no longer needed.
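The command pool recycling steps described earlier (one pool per frame, record, submit, then reset the whole pool before reuse) can be modeled without a GPU. The sketch below is a minimal, hypothetical state model: `ModelCommandPool`, `CmdState`, `record_and_submit` and `recycle` are illustrative names that do not exist in Godot or Vulkan. It assumes a pool created without `VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT`, so a buffer has to be back in its initial state before it can be recorded again.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration; none of these names are Godot or Vulkan API.
enum class CmdState { Initial, Recording, Executable, Pending };

struct ModelCommandPool {
	std::vector<CmdState> buffers; // One state per command buffer allocated from the pool.

	explicit ModelCommandPool(std::size_t p_count) : buffers(p_count, CmdState::Initial) {}

	// Stands in for vkBeginCommandBuffer: only a reset/empty buffer may start recording.
	bool begin(std::size_t p_index) {
		if (buffers[p_index] != CmdState::Initial) {
			return false; // Validation layers would report this reuse.
		}
		buffers[p_index] = CmdState::Recording;
		return true;
	}

	// Stands in for vkEndCommandBuffer followed by a queue submission.
	void end_and_submit(std::size_t p_index) {
		assert(buffers[p_index] == CmdState::Recording);
		buffers[p_index] = CmdState::Pending;
	}

	// Stands in for waiting on the frame's fence: pending work has finished executing.
	void wait_for_gpu() {
		for (CmdState &state : buffers) {
			if (state == CmdState::Pending) {
				state = CmdState::Executable;
			}
		}
	}

	// Stands in for vkResetCommandPool: every buffer returns to the initial state at once.
	void reset() {
		for (CmdState &state : buffers) {
			assert(state != CmdState::Pending); // The pool must not be in use by the GPU.
			state = CmdState::Initial;
		}
	}
};

// Steps 3 to 5: record and submit every buffer. Returns false if any buffer
// was reused without having been reset first.
bool record_and_submit(ModelCommandPool &p_pool) {
	bool all_ok = true;
	for (std::size_t i = 0; i < p_pool.buffers.size(); i++) {
		if (!p_pool.begin(i)) {
			all_ok = false;
			continue;
		}
		p_pool.end_and_submit(i);
	}
	return all_ok;
}

// Step 6: once the GPU is done with the frame, recycle the whole pool in one call.
void recycle(ModelCommandPool &p_pool) {
	p_pool.wait_for_gpu();
	p_pool.reset();
}
```

In this model, calling `record_and_submit` twice on the same pool without `recycle` in between returns false the second time, which corresponds to the validation complaint about the `transfer_worker` pool described above.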

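To see why moving the swapchain blit into its own submission helps, the two submission orders described above can be compared with a deliberately simplified timeline. This is only a sketch: `FrameTimings`, `finish_single_submit` and `finish_split_submit` are hypothetical names, the timings are made-up inputs rather than measurements, and the model assumes the GPU is idle at time 0 and that a semaphore wait blocks the whole submission.

```cpp
#include <algorithm>

// Hypothetical timings in milliseconds, measured from the moment the frame is submitted.
struct FrameTimings {
	double offscreen_ms; // GPU time for everything rendered offscreen.
	double blit_ms; // GPU time for the final blit to the swapchain.
	double swapchain_ready_ms; // Time at which the swapchain image is released.
};

// Old behavior: one submission that waits for the swapchain before any work starts.
double finish_single_submit(const FrameTimings &p_t) {
	const double start = std::max(0.0, p_t.swapchain_ready_ms);
	return start + p_t.offscreen_ms + p_t.blit_ms;
}

// Split behavior: offscreen work starts immediately. Only the blit waits for both
// the offscreen submission and the swapchain.
double finish_split_submit(const FrameTimings &p_t) {
	const double offscreen_done = p_t.offscreen_ms;
	return std::max(offscreen_done, p_t.swapchain_ready_ms) + p_t.blit_ms;
}
```

With 10 ms of offscreen work, a 1 ms blit and a swapchain image that is released after 4 ms, the single submission finishes at 15 ms while the split submission finishes at 11 ms. When the swapchain is available immediately, both orders finish at the same time.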
PR godotengine#90993 had to drop `VMA_MEMORY_USAGE_AUTO_PREFER_HOST` because buffers were no longer created with `vmaCreateBuffer`, a change made so that the allocation callbacks could be specified. However, this resulted in the wrong memory pool being chosen, causing a significant performance slowdown. Indicate additional preferred flags to help VMA select the proper pool. Fixes godotengine#101905
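As a rough illustration of how preferred flags can steer memory type selection, the sketch below ranks candidate memory types by how many preferred property bits they are missing. It is a simplified model and not VMA's actual code: `pick_memory_type` and `count_bits` are hypothetical helpers, the flag constants are local stand-ins that mirror the values of the corresponding `VK_MEMORY_PROPERTY_*` bits, and which flags the fix really adds is not shown here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Local stand-ins mirroring VK_MEMORY_PROPERTY_* bit values (illustration only).
constexpr uint32_t DEVICE_LOCAL = 0x1;
constexpr uint32_t HOST_VISIBLE = 0x2;
constexpr uint32_t HOST_COHERENT = 0x4;
constexpr uint32_t HOST_CACHED = 0x8;

// Counts the number of set bits in a flag mask.
int count_bits(uint32_t p_value) {
	int count = 0;
	while (p_value != 0) {
		count += static_cast<int>(p_value & 1u);
		p_value >>= 1;
	}
	return count;
}

// Returns the index of the first memory type that has every required flag and
// misses the fewest preferred flags, or -1 if no type satisfies the requirements.
int pick_memory_type(const std::vector<uint32_t> &p_types, uint32_t p_required, uint32_t p_preferred) {
	int best_index = -1;
	int best_cost = 33; // Higher than any possible cost for a 32-bit mask.
	for (std::size_t i = 0; i < p_types.size(); i++) {
		if ((p_types[i] & p_required) != p_required) {
			continue; // Missing a required property.
		}
		const int cost = count_bits(p_preferred & ~p_types[i]);
		if (cost < best_cost) {
			best_cost = cost;
			best_index = static_cast<int>(i);
		}
	}
	return best_index;
}
```

For a hypothetical device exposing the types `DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT`, `HOST_VISIBLE | HOST_COHERENT` and `HOST_VISIBLE | HOST_COHERENT | HOST_CACHED`, requiring only `HOST_VISIBLE` selects the first type, while additionally preferring `HOST_CACHED` selects the third. Without a usage hint or preferred flags, the allocator has nothing to distinguish the candidates beyond their order.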

darksylinc requested review from a team as code owners April 21, 2024 23:21

darksylinc force-pushed the matias-TheForge branch from 3cb8926 to f0c3cf7 Compare April 21, 2024 23:48

DarioSamo reviewed Apr 21, 2024

darksylinc force-pushed the matias-TheForge branch 3 times, most recently from 36bb0c3 to e3cc35b Compare April 22, 2024 02:25

AThousandShips added enhancement feature proposal topic:rendering labels Apr 22, 2024

AThousandShips added this to the 4.x milestone Apr 22, 2024

bruvzg reviewed Apr 22, 2024

drivers/vulkan/rendering_device_driver_vulkan.cpp (outdated)

AThousandShips reviewed Apr 22, 2024

darksylinc force-pushed the matias-TheForge branch 5 times, most recently from d71ead1 to a7aa977 Compare April 27, 2024 20:25

darksylinc force-pushed the matias-TheForge branch 3 times, most recently from 7e15b93 to 23c4298 Compare April 28, 2024 16:13

AThousandShips reviewed Apr 28, 2024

darksylinc force-pushed the matias-TheForge branch from 23c4298 to 41c9fb4 Compare April 28, 2024 18:41

clayjohn mentioned this pull request Dec 11, 2024

RenderingDevice - texture functions are no longer allowed in background threads due to thread guards. #99750

Closed

Spartan322 mentioned this pull request Dec 12, 2024

[4.3] Add Jolt Physics as an alternative 3D physics engine Redot-Engine/redot-engine#896

Open

akien-mga mentioned this pull request Dec 17, 2024

Only allow valid types in Decal, Light3D projector, PointLight2D texture and CSGMesh3D mesh #88349

Merged

matheusmdx mentioned this pull request Jan 23, 2025

Stuttering issue when using Godot 4.4 beta 1 #101905

Closed

darksylinc mentioned this pull request Jan 24, 2025

Vulkan: Fix performance regression introduced in buffer creation #101972

Merged

clayjohn mentioned this pull request Jan 28, 2025

Godot 4.4 doesn't allocate shared GPU memory correctly #101850

Closed

This was referenced Jan 29, 2025

janiskirsteins/tvos export jkirsteins/godot#1

Draft

janiskirsteins/fix ios joypad init jkirsteins/godot#2

Closed

clayjohn mentioned this pull request Feb 14, 2025

Restore using VMA to create buffers and images #102830

Merged

jkirsteins mentioned this pull request Feb 20, 2025

[proof-of-concept] closed jkirsteins/godot#3

Closed
