False-positive READ-AFTER-WRITE hazard and incorrect resource label #8291
Comments
@StefanPoelloth thanks for the report. The "resource" field is a new feature. Previously, submit time validation did not have access to resource handles. Now, when a race condition is detected in some memory location, it has access to the resource associated with that memory region. An API dump would be helpful, can you upload it to the LunarG sharing portal https://share.lunarg.com/ ? Any chance that the application uses Vulkan memory aliasing (the same memory object bound to multiple resources)? |
@artem-lunarg Yes, we use VMA for allocations. I've just created an account, but it tells me "You are not setup to access File Share.". |
@StefanPoelloth sorry for the confusion, it should be done in a different way. Could you provide your email address? It will be used to create an invitation for the upload. |
@artem-lunarg needs to create a folder for sharing files and then invite you (via email) to the folder. |
The invitation is sent. |
Thanks, I received the API dump. |
@StefanPoelloth Is it possible to create a GFXReconstruct capture (available as the Frame Capture layer in vkconfig), so I can debug the actual Vulkan commands? Assuming it's not too sensitive for sharing. |
@artem-lunarg I've uploaded a GFXReconstruct capture. I had to disable page_guard though; with page_guard the process just froze before displaying an image. Hope that's fine. |
Thanks, I can run the capture. Unfortunately I cannot reproduce the issue (no sync validation errors). I tried both the latest VVL code and the commit mentioned in the issue description. Maybe there is some difference compared to running the actual app. If I have no luck, I will use the API dump for investigation; that's also very helpful. |
@artem-lunarg I've uploaded a longer capture with the validation errors logged. The problem is a bit difficult to reproduce, but I made sure that the error happened at least 3 times while recording the capture file. |
@StefanPoelloth from the attached log file it looks like GPU-Assisted validation is enabled ( |
@artem-lunarg I couldn't reproduce it without GPU-assisted validation. I have always run it with GPU-AV until now. |
Okay, then it might be a poor interaction between SyncVal and GPU-AV. GPU-assisted validation instruments shaders and adds new descriptor sets. These additional descriptor sets are not accounted for by SyncVal, and it can misinterpret resources used by those sets. The timeline semaphore feature used by GPU-AV is also not supported by SyncVal yet (support is planned later this year). It's possible that the entire SyncVal message is a false positive, not only that the "resource" field is wrong. Sorry for this issue, but currently there are not many guarantees about how GPU-AV interacts with SyncVal. Most of the effort goes into providing a solid baseline for each of them separately. Hopefully we can improve the interaction in the future. |
@artem-lunarg I was able to reproduce it with
validation layer: Validation Error: [ SYNC-HAZARD-READ-AFTER-WRITE ] Object 0: handle = 0x1e508c5f450, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0xe4d96472 | vkQueueSubmit(): Hazard READ_AFTER_WRITE for entry 0, VkCommandBuffer 0x1e512ffd2d0[], Submitted access info (submitted_usage: SYNC_COPY_TRANSFER_READ, command: vkCmdCopyBuffer, seq_no: 2, reset_no: 9456, resource: VkBuffer 0x6eea0c0000000185[GpuArray MatrixStore _primaryArray]). Access info (prior_usage: SYNC_COPY_TRANSFER_WRITE, write_barriers: SYNC_VERTEX_SHADER_SHADER_STORAGE_READ|SYNC_COMPUTE_SHADER_SHADER_STORAGE_READ, queue: VkQueue 0x1e508c5f450[], submit: 18912, batch: 0, batch_tag: 714184, command: vkCmdCopyBuffer, command_buffer: VkCommandBuffer 0x1e513088590[], seq_no: 5, reset_no: 9452, resource: VkBuffer 0xad2b50000000316[PrimaryCullPass occludeds]).
|
@StefanPoelloth Thanks for the confirmation. I will continue to look into this. If that's a bug it would be nice to fix it for the new SDK. |
@StefanPoelloth So far I can't reproduce the issue. I will spend some time analyzing the code, but if I have no luck it might be stuck for a while until we can get a good repro case. The "resource" field is a new feature, but hazard detection should not be affected by it (you can ignore the "resource" part of the message). It still might be a good idea to check whether the reported race condition actually happens. The message says that after the initial write to a buffer, a barrier was set that allows VERTEX/COMPUTE shaders to READ, but the last access (submitted usage) was a COPY READ (the source parameter of vkCmdCopyBuffer), and the COPY stage is not protected by the barrier. There is a chance that |
Can confirm using the API dump that initial write (prior_usage: SYNC_COPY_TRANSFER_WRITE, seq_no: 5, reset_no: 8) was into |
@artem-lunarg Yes, the We have done extensive analysis and I'm pretty confident that all barriers (for buffer 0x17758085a10[GpuArray MatrixStore _primaryArray]) are correctly placed.
validation layer: Validation Error: [ SYNC-HAZARD-READ-AFTER-WRITE ] Object 0: handle = 0x17755332de0, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0xe4d96472 | vkQueueSubmit(): Hazard READ_AFTER_WRITE for entry 0, VkCommandBuffer 0x17757fb6170[Frame 7], Submitted access info (submitted_usage: SYNC_COPY_TRANSFER_READ, command: vkCmdCopyBuffer, seq_no: 2, reset_no: 8, resource: VkBuffer 0x17758085a10[GpuArray MatrixStore _primaryArray]). Access info (prior_usage: SYNC_COPY_TRANSFER_WRITE, write_barriers: SYNC_VERTEX_SHADER_SHADER_STORAGE_READ|SYNC_COMPUTE_SHADER_SHADER_STORAGE_READ, queue: VkQueue 0x17755332de0[], submit: 16, batch: 0, batch_tag: 326, command: vkCmdCopyBuffer, command_buffer: VkCommandBuffer 0x177580720d0[Frame 6], seq_no: 5, reset_no: 4, resource: VkBuffer 0x17758098eb0[PrimaryCullPass occludedCount]).
My analysis shows that the buffer My API dump: dump6.zip My complete dump analysis for 0x17758085a10 for the 7 frames:
barrier format is: srcStageMask/srcAccessMask -> dstStageMask/dstAccessMask
frame 1:
frame 2:
frame 3:
frame 4:
frame 5:
frame 6:
frame 7:
I think the GPU-AV tag should be removed since it's happening without GPU-AV. |
@StefanPoelloth Thanks for the details! I have also analyzed the barriers based on the original dump and they look correct to me. So there could be two problems here: a false-positive error and incorrect labeling of the resource. Additional documentation. Critical sequence from the original dump for buffer 6EEA0C0000000185 that syncval complains about:
|
@StefanPoelloth could you clarify a few points: |
@artem-lunarg Some context:
The last point results in the following updates (a random run, not related to any capture):
b) I can reliably reproduce it in the first few (~20) frames on every application run. Usually it happens between frames 3 and 8. EDIT:
|
thanks! |
I wrote a test that simulates how @StefanPoelloth if you have the opportunity to check the latest VVL code (a new SDK will also be released soon), it would be interesting to see if it changes something. In the latest code, syncval validation of descriptor accesses is disabled by default because it can produce false positives (the old behavior can be enabled with setting Pseudo code I use for investigation, for documentation purposes: |
@artem-lunarg I tested with a6d3fc5 yesterday and with edcf314 today, and I can confirm it's happening with "submit time validation" enabled and "shader access heuristic" disabled. It's not happening with "submit time validation" disabled and "shader access heuristic" enabled. I've tried removing rendering code step by step, down to removing all rendering passes (draw, dispatch and everything related) and leaving only the command buffer creation, the buffer updates and EndCommandBuffer. When I do that I get the following assertion: Callstack:
> VkLayer_khronos_validation.dll!std::vector>::operator[](const unsigned __int64 _Pos) Line 1899 C++
VkLayer_khronos_validation.dll!CommandBufferAccessContext::GetHandleRecord(unsigned int handle_index) Line 334 C++
VkLayer_khronos_validation.dll!operator<<(std::basic_ostream> & out, const ResourceUsageRecord::FormatterState & formatter) Line 1204 C++
VkLayer_khronos_validation.dll!QueueBatchContext::FormatUsage(ResourceUsageTagEx tag_ex) Line 464 C++
VkLayer_khronos_validation.dll!SyncValidationInfo::FormatHazard(const HazardResult & hazard) Line 1255 C++
VkLayer_khronos_validation.dll!ReplayState::DetectFirstUseHazard(const sparse_container::range & first_use_range) Line 1235 C++
VkLayer_khronos_validation.dll!ReplayState::ValidateFirstUse() Line 1255 C++
VkLayer_khronos_validation.dll!QueueBatchContext::ValidateSubmit(const VkSubmitInfo2 & submit, unsigned __int64 submit_index, unsigned int batch_index, std::vector> & current_label_stack, const ErrorObject & error_obj) Line 534 C++
VkLayer_khronos_validation.dll!SyncValidator::ValidateQueueSubmit(VkQueue_T * queue, unsigned int submitCount, const VkSubmitInfo2 * pSubmits, VkFence_T * fence, const ErrorObject & error_obj) Line 3079 C++
VkLayer_khronos_validation.dll!SyncValidator::PreCallValidateQueueSubmit(VkQueue_T * queue, unsigned int submitCount, const VkSubmitInfo * pSubmits, VkFence_T * fence, const ErrorObject & error_obj) Line 3043 C++
VkLayer_khronos_validation.dll!vulkan_layer_chassis::QueueSubmit(VkQueue_T * queue, unsigned int submitCount, const VkSubmitInfo * pSubmits, VkFence_T * fence) Line 1490 C++
vulkan-1.dll!00007ff9f708e01a() Unknown
[Managed to Native Transition]
Silk.NET.Vulkan.dll!Silk.NET.Vulkan.Vk.QueueSubmit(Silk.NET.Vulkan.Queue queue, uint submitCount, Silk.NET.Vulkan.SubmitInfo pSubmits, Silk.NET.Vulkan.Fence fence) Line 10635 C#
FS.VulkanRenderer.dll!FS.VulkanRenderer.VulkanRenderer.Draw(FS.VulkanRenderer.Scene.RenderScene renderScene, FS.VulkanRenderer.FrameInfo frameInfo, FS.VulkanRenderer.ImGuiImpl.Snapshot? imGuiSnapshot) Line 224 C#
RenderDemo.dll!RenderDemo.Program.OnFrame(double delta) Line 334 C#
Silk.NET.Windowing.Common.dll!Silk.NET.Windowing.Internals.ViewImplementationBase.DoRender() Unknown
Silk.NET.Windowing.Common.dll!Silk.NET.Windowing.WindowExtensions.Run.AnonymousMethod__0() Unknown
Silk.NET.Windowing.Common.dll!Silk.NET.Windowing.Internals.ViewImplementationBase.Run(System.Action onFrame) Unknown
Silk.NET.Windowing.Glfw.dll!Silk.NET.Windowing.Glfw.GlfwWindow.Run(System.Action onFrame) Unknown
Silk.NET.Windowing.Common.dll!Silk.NET.Windowing.WindowExtensions.Run(Silk.NET.Windowing.IView view) Unknown
RenderDemo.dll!RenderDemo.Program.Main() Line 225 C#
[Native to Managed Transition]
hostpolicy.dll!00007ff9545d2b76() Unknown
hostpolicy.dll!00007ff9545d2e5c() Unknown
hostpolicy.dll!00007ff9545d379a() Unknown
hostfxr.dll!00007ff95462b5c9() Unknown
hostfxr.dll!00007ff95462e066() Unknown
hostfxr.dll!00007ff9546302ec() Unknown
hostfxr.dll!00007ff95462e644() Unknown
hostfxr.dll!00007ff9546285a0() Unknown
RenderDemo.exe!00007ff7d84cf998() Unknown
RenderDemo.exe!00007ff7d84cfda6() Unknown
RenderDemo.exe!00007ff7d84d12e8() Unknown
kernel32.dll!00007ffa49ca7344() Unknown
ntdll.dll!00007ffa49efcc91() Unknown
|
@StefanPoelloth Thanks for the confirmation, that's very helpful. Yes, this entire issue is related to "submit time validation", so it makes sense to always test with this option enabled. It's valuable information that it happens with "shader accesses" disabled; that simplifies the testing scenarios we might need to consider. About the assertion: I suspected this scenario and was going to disable "resource" reporting for this SDK release, and this is one more good confirmation. P.S. I won't be available for the next few weeks, so there might not be much progress here and it probably won't be fixed for this SDK, but it's something we will definitely target for the SDK after that. |
KhronosGroup#8291 This can be related to command buffer lifetime management. Usage records can reference handles from the command buffers that were Reset.
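To illustrate the lifetime problem described in that commit note, here is a minimal sketch of how a usage record holding an index into a command buffer's handle list can go stale after a reset. All type and function names below (`HandleRecord`, `UsageRecord`, `CommandBufferModel`, `Lookup`) are hypothetical and are not the actual VVL classes; this is a simplified model under the assumption that handles are stored per command buffer and cleared on reset, not the fix that was applied.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; these are NOT the real validation layer types.
struct HandleRecord {
    std::uint64_t handle;
    std::string name;
};

struct UsageRecord {
    std::uint32_t reset_no;      // reset count of the command buffer at record time
    std::uint32_t handle_index;  // index into the command buffer's handle list
};

class CommandBufferModel {
  public:
    // Remember the resource touched by a command and return a record that refers to it.
    UsageRecord RecordAccess(std::uint64_t handle, std::string name) {
        handles_.push_back({handle, std::move(name)});
        return {reset_no_, static_cast<std::uint32_t>(handles_.size() - 1)};
    }

    // Resetting drops all handles; records created earlier now hold stale indices.
    void Reset() {
        handles_.clear();
        ++reset_no_;
    }

    // Guarded lookup: refuse records from an earlier reset generation or out of range.
    std::optional<HandleRecord> Lookup(const UsageRecord& usage) const {
        if (usage.reset_no != reset_no_) return std::nullopt;
        if (usage.handle_index >= handles_.size()) return std::nullopt;
        return handles_[usage.handle_index];
    }

  private:
    std::vector<HandleRecord> handles_;
    std::uint32_t reset_no_ = 0;
};
```

In this model, an unguarded `handles_[usage.handle_index]` after `Reset()` would either index past the end of the vector or, once new commands have been recorded, return a handle that belongs to a different resource. That is one plausible way to get both an out-of-bounds assertion and a mislabeled "resource" field, but it remains an assumption about the cause.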
@StefanPoelloth do you run syncval alone or together with Core validation enabled (core checkbox)? Any Core validation errors? Core validation errors can put syncval into an inconsistent state, and it can then produce false positives. The recommendation is to get a run free of core validation errors before enabling syncval. Syncval should not crash in this scenario though. |
@artem-lunarg Core validation is disabled, everything except "Synchronization" and "Submit time validation" is disabled. |
Thanks, "Handle Wrapping" is good to have enabled though, it's a less tested path when it is disabled. |
@artem-lunarg I've reduced the number of API calls drastically and uploaded a new API dump to the share. We're working on repro code that I can share, but unfortunately we've had no luck yet. |
@StefanPoelloth thanks for putting effort into this. I have some ideas about what could be wrong with the label and am trying to come up with a repro case. The reported error itself is trickier. Both the API dump and the error message suggest that there was a barrier before the copy, so READs should be protected from previous WRITEs, but somehow it did not work. |
@StefanPoelloth if this is something that can be hacked quickly (I know it can be tricky in some engines, in which case please ignore this): could you run the program without present operations (no QueuePresent, no AcquireNextImage, and QueueSubmit adjusted not to wait on the semaphore from Acquire and not to signal the semaphore for present)? I wonder if this scenario also reproduces the error (and the incorrect label). So far I have tested without presentation. |
@artem-lunarg I gave it a quick try and it doesn't seem to happen without present/semaphores. |
@artem-lunarg I've uploaded a sample project that reproduces the issue; check out the included readme. |
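For readers following the error message format, here is a simplified model of how the `write_barriers` field relates to a READ_AFTER_WRITE report: a read is flagged when the accesses made safe by barriers after the last write do not include the reading access. The enum values and function names are hypothetical and cover only the four accesses named in the message; the real layer tracks many more stage/access pairs and this is not its implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, reduced set of accesses taken from the names in the error message.
enum Access : std::uint32_t {
    kCopyTransferRead         = 1u << 0,
    kCopyTransferWrite        = 1u << 1,
    kVertexShaderStorageRead  = 1u << 2,
    kComputeShaderStorageRead = 1u << 3,
};

struct WriteState {
    bool has_write = false;
    std::uint32_t write_usage = 0;     // access that performed the last write
    std::uint32_t write_barriers = 0;  // accesses protected by barriers since that write
};

// A new write replaces the previous one and clears the accumulated barriers.
void RecordWrite(WriteState& state, Access usage) {
    state.has_write = true;
    state.write_usage = usage;
    state.write_barriers = 0;
}

// A barrier only helps if its source scope covers the last write.
void ApplyBarrier(WriteState& state, std::uint32_t src_accesses, std::uint32_t dst_accesses) {
    if (state.has_write && (state.write_usage & src_accesses) != 0) {
        state.write_barriers |= dst_accesses;
    }
}

// The read is a hazard when no barrier since the last write covers it.
bool IsReadAfterWriteHazard(const WriteState& state, Access read_usage) {
    return state.has_write && (state.write_barriers & read_usage) == 0;
}
```

With a copy write followed by a barrier whose destination is only the vertex and compute storage reads, a later copy read is reported in this model, which matches the literal content of the message. The open question in the thread is why the layer saw that state when the API dump shows a barrier that should have covered the copy.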
@StefanPoelloth Thank you, I can see the assert from the out-of-bounds access. One question, since I'm not a .NET user: is it possible to continue debugging VVL code after the assert is hit? When I press the Retry button in the assert window, that usually goes into the debugger for a C++ project, but here it terminates the app. |
@artem-lunarg I'm assuming you use Visual Studio: Right click on the RenderDemo project and select Properties. This should create the following file: RenderDemo/Properties/launchSettings.json It didn't hit an assert for me with a debug build of 9195994 🤔 |
Thanks, debugging works now. I'm using a debug build. The assert I get is exactly what we need, it's the |
In the latest code we disabled resource reporting; that's probably the reason why there is no assert and only the validation error. |
@artem-lunarg By changing the code very slightly, I was able to produce a WAW instead of a RAW. I've uploaded the repro code for this as well. |
Thank you @StefanPoelloth. We successfully reproduced the scenario within our testing framework with a compact unit test. I’ll keep posting updates here when we have fixes. |
@StefanPoelloth the fix has landed. If it does not fix the issue in your program, please reopen this ticket. |
@artem-lunarg I can confirm this fixes the RAW and WAW false positives. Thanks |
Environment:
Describe the Issue
I can provide an api dump privately if needed.
The message is invalid/incorrect, it specifies 2 different buffers 0x6eea0c0000000185 and 0x7d8104000000018d which are both alive (not destroyed).
The prior_usage specifies that buffer 0x7d8104000000018d was used with vkCmdCopyBuffer, which is wrong.
Using the API dump I made sure that no vkCmdCopyBuffer was called for buffer 0x7d8104000000018d (it's a mapped buffer, updated through the mapped pointer and used in CmdDispatchIndirect).
Expected behavior
Either a correct sync hazard message or no message.
Valid Usage ID
validation layer: Validation Error: [ SYNC-HAZARD-READ-AFTER-WRITE ] Object 0: handle = 0x2d4ecb7c6c0, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0xe4d96472 | vkQueueSubmit(): Hazard READ_AFTER_WRITE for entry 0, VkCommandBuffer 0x2d4ef7fd510[], Submitted access info (submitted_usage: SYNC_COPY_TRANSFER_READ, command: vkCmdCopyBuffer, seq_no: 2, reset_no: 10, resource: VkBuffer 0x6eea0c0000000185[GpuArray MatrixStore _primaryArray]). Access info (prior_usage: SYNC_COPY_TRANSFER_WRITE, write_barriers: SYNC_VERTEX_SHADER_SHADER_STORAGE_READ|SYNC_COMPUTE_SHADER_SHADER_STORAGE_READ, queue: VkQueue 0x2d4ecb7c6c0[], submit: 22, batch: 0, batch_tag: 571, command: vkCmdCopyBuffer, command_buffer: VkCommandBuffer 0x2d4ef775860[], seq_no: 5, reset_no: 8, resource: VkBuffer 0x7d8104000000018d[PrimaryCullPass occludedCount]).