False-positive READ-AFTER-WRITE hazard and incorrect resource label #8291

Closed
StefanPoelloth opened this issue Jul 15, 2024 · 41 comments · Fixed by #8331

@StefanPoelloth

Environment:

  • OS: Windows 10 22H2
  • GPU and driver version: 2080Ti 556.12
  • SDK or header version if building from repo: d0a37c6 (and 1.3.283)
  • Options enabled (synchronization, best practices, etc.): synchronization, queue submit validation

Describe the Issue

I can provide an api dump privately if needed.
The message is incorrect: it names two different buffers, 0x6eea0c0000000185 and 0x7d8104000000018d, both of which are alive (not destroyed).
The prior_usage says that buffer 0x7d8104000000018d was used with vkCmdCopyBuffer, which is wrong.
Using the API dump, I made sure that vkCmdCopyBuffer was never called for buffer 0x7d8104000000018d (it's a mapped buffer, updated through the mapped pointer and used in CmdDispatchIndirect).

Expected behavior

Either a correct sync hazard message or no message.

Valid Usage ID

validation layer: Validation Error: [ SYNC-HAZARD-READ-AFTER-WRITE ] Object 0: handle = 0x2d4ecb7c6c0, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0xe4d96472 | vkQueueSubmit(): Hazard READ_AFTER_WRITE for entry 0, VkCommandBuffer 0x2d4ef7fd510[], Submitted access info (submitted_usage: SYNC_COPY_TRANSFER_READ, command: vkCmdCopyBuffer, seq_no: 2, reset_no: 10, resource: VkBuffer 0x6eea0c0000000185[GpuArray MatrixStore _primaryArray]). Access info (prior_usage: SYNC_COPY_TRANSFER_WRITE, write_barriers: SYNC_VERTEX_SHADER_SHADER_STORAGE_READ|SYNC_COMPUTE_SHADER_SHADER_STORAGE_READ, queue: VkQueue 0x2d4ecb7c6c0[], submit: 22, batch: 0, batch_tag: 571, command: vkCmdCopyBuffer, command_buffer: VkCommandBuffer 0x2d4ef775860[], seq_no: 5, reset_no: 8, resource: VkBuffer 0x7d8104000000018d[PrimaryCullPass occludedCount]).
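
To compare the two sides of a message like the one above, the relevant fields can be pulled out with a few regular expressions. This is a minimal sketch: the field names (`submitted_usage`, `prior_usage`, `resource`) come from the message text, while the regexes and the `summarize_hazard` helper are illustrative assumptions and not a format guaranteed by the validation layers.

```python
import re

# Field names are taken from the message text; the patterns themselves are
# an assumption made for this sketch, not a documented message format.
HAZARD_RE = re.compile(r"Hazard (\w+) for entry")
USAGE_RE = re.compile(r"(submitted_usage|prior_usage): (\w+)")
RESOURCE_RE = re.compile(r"resource: VkBuffer (0x[0-9a-fA-F]+)\[([^\]]*)\]")


def summarize_hazard(message):
    """Extract the hazard type, both usages, and both buffer handles."""
    hazard = HAZARD_RE.search(message)
    usages = dict(USAGE_RE.findall(message))
    # Submitted access appears first in the message, prior access second.
    resources = RESOURCE_RE.findall(message)
    return {
        "hazard": hazard.group(1) if hazard else None,
        "submitted_usage": usages.get("submitted_usage"),
        "prior_usage": usages.get("prior_usage"),
        "resources": resources,
        "same_resource": len({handle for handle, _ in resources}) == 1,
    }


# Shortened copy of the reported message (only the fields the sketch reads).
sample = (
    "vkQueueSubmit(): Hazard READ_AFTER_WRITE for entry 0, "
    "Submitted access info (submitted_usage: SYNC_COPY_TRANSFER_READ, "
    "resource: VkBuffer 0x6eea0c0000000185[GpuArray MatrixStore _primaryArray]). "
    "Access info (prior_usage: SYNC_COPY_TRANSFER_WRITE, "
    "resource: VkBuffer 0x7d8104000000018d[PrimaryCullPass occludedCount])."
)
summary = summarize_hazard(sample)
print(summary["hazard"], summary["same_resource"])
```

A `same_resource` value of `False` flags exactly the inconsistency described in this report: the two accesses of one hazard are attributed to different buffers.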

@artem-lunarg
Contributor

@StefanPoelloth thanks for the report. The "resource" field is a new feature. Previously, submit-time validation did not have access to resource handles. Now, when a race condition is detected at some memory location, it has access to the resource associated with that memory region. An API dump would be helpful; can you upload it to the LunarG sharing portal https://share.lunarg.com/ ?

Any chance that the application uses Vulkan memory aliasing (the same memory object bound to multiple resources)?

@artem-lunarg artem-lunarg added the Synchronization Synchronization Validation Object Issue label Jul 15, 2024
@StefanPoelloth
Author

@artem-lunarg Yes, we use VMA for allocations. I've just created an account, but it tells me "You are not setup to access File Share.".

@artem-lunarg
Contributor

@StefanPoelloth sorry for the confusion, it has to be done in a different way. Could you provide your email address? It will be used to create an invitation for the upload.

@KarenGhavam-lunarG
Contributor

Yes, we use VMA for allocations. I've just created an account, but it tells me "You are not setup to access File Share.".

@artem-lunarg needs to create a folder for sharing files and then invite you (via email) to the folder.

@artem-lunarg
Contributor

The invitation is sent.

@artem-lunarg
Contributor

Thanks, I received the API dump.

@artem-lunarg
Contributor

@StefanPoelloth Is it possible to create a GFXReconstruct capture (available as the Frame Capture layer in vkconfig), so I can debug the actual Vulkan commands? Assuming it's not too sensitive to share.

@StefanPoelloth
Author

@artem-lunarg I've uploaded a GFXReconstruct capture. I had to disable page_guard though; with page_guard the process just froze before displaying an image. Hope that's fine.

@artem-lunarg
Contributor

Thanks, I can run the capture. Unfortunately I cannot reproduce the issue (no sync validation errors). I tried both the latest VVL code and the commit mentioned in the issue description. Maybe there is some difference compared to running the actual app. If I have no luck, I will use the API dump for investigation; that's also very helpful.

@StefanPoelloth
Author

@artem-lunarg I've uploaded a longer capture with the validation errors logged. The problem is a bit difficult to reproduce, but I made sure that the error happened at least 3 times while recording the capture file.

@artem-lunarg
Contributor

@StefanPoelloth from the attached log file it looks like GPU-Assisted validation is enabled (validation layer: Validation Warning: [ WARNING-GPU-Assisted-Validation ]). Usually the recommendation is to run Synchronization Validation separately; in theory the two should work together, but in practice that combination is not well tested. Is the synchronization validation error reproducible if only the Synchronization preset is enabled?

@artem-lunarg artem-lunarg self-assigned this Jul 16, 2024
@StefanPoelloth
Author

StefanPoelloth commented Jul 16, 2024

@artem-lunarg I couldn't reproduce it without GPU-assisted validation. Until now I always ran it with GPU-AV.

@artem-lunarg
Contributor

artem-lunarg commented Jul 16, 2024

Okay, then it might be a poor interaction between SyncVal and GPU-AV. GPU-assisted validation instruments shaders and adds new descriptor sets. These additional descriptor sets are not accounted for by SyncVal, so it can misinterpret resources used by those sets. The timeline semaphore feature used by GPU-AV is also not supported by SyncVal yet (support is planned for later this year). It's possible that the entire SyncVal message is a false positive, not only that the "resource" field is wrong. Sorry for this issue, but currently there are not many guarantees about how GPU-AV interacts with SyncVal. Most of the effort goes into providing a solid baseline for each of them separately. Hopefully we can improve the interaction in the future.

@artem-lunarg artem-lunarg changed the title Incorrect SYNC-HAZARD-READ-AFTER-WRITE message SyncVal false-positive when GPU-AV is enabled Jul 16, 2024
@spencer-lunarg spencer-lunarg added the GPU-AV GPU Assisted Validation label Jul 16, 2024
@StefanPoelloth
Author

@artem-lunarg I was able to reproduce it with GPU_BASED_NONE, it just took much longer. I used these settings:
vk_layer_settings.txt

validation layer: Validation Error: [ SYNC-HAZARD-READ-AFTER-WRITE ] Object 0: handle = 0x1e508c5f450, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0xe4d96472 | vkQueueSubmit(): Hazard READ_AFTER_WRITE for entry 0, VkCommandBuffer 0x1e512ffd2d0[], Submitted access info (submitted_usage: SYNC_COPY_TRANSFER_READ, command: vkCmdCopyBuffer, seq_no: 2, reset_no: 9456, resource: VkBuffer 0x6eea0c0000000185[GpuArray MatrixStore _primaryArray]). Access info (prior_usage: SYNC_COPY_TRANSFER_WRITE, write_barriers: SYNC_VERTEX_SHADER_SHADER_STORAGE_READ|SYNC_COMPUTE_SHADER_SHADER_STORAGE_READ, queue: VkQueue 0x1e508c5f450[], submit: 18912, batch: 0, batch_tag: 714184, command: vkCmdCopyBuffer, command_buffer: VkCommandBuffer 0x1e513088590[], seq_no: 5, reset_no: 9452, resource: VkBuffer 0xad2b50000000316[PrimaryCullPass occludeds]).

@artem-lunarg
Contributor

@StefanPoelloth Thanks for the confirmation. I will continue to look into this. If that's a bug it would be nice to fix it for the new SDK.

@artem-lunarg
Contributor

artem-lunarg commented Jul 17, 2024

@StefanPoelloth So far I can't reproduce the issue. I will spend some time analyzing the code, but if I have no luck it might be stuck for a while until we can get a good repro case.

The "resource" field is a new feature, but hazard detection should not be affected by it (you can ignore the "resource" part of the message). It still might be a good idea to check whether the reported race condition actually happens. The message says that after the initial write to a buffer, a barrier was set that allows VERTEX/COMPUTE shaders to READ, but the last access (submitted usage) was a COPY READ (the source parameter of vkCmdCopyBuffer), and the COPY stage is not protected by that barrier.

There is a chance that the GpuArray MatrixStore _primaryArray name is detected properly, which would be a hint that the hazard is about this buffer. seq_no can give some insight too. It is the index of the command in the command buffer. It indexes only commands that perform memory accesses (e.g. CmdCopyBuffer/CmdDraw/CmdDispatch but not CmdSetViewport). Because seq_no is quite small in the reported messages, it should be possible to manually find the pair of commands that create the race condition.

@artem-lunarg
Contributor

artem-lunarg commented Jul 17, 2024

Can confirm using the API dump that the initial write (prior_usage: SYNC_COPY_TRANSFER_WRITE, seq_no: 5, reset_no: 8) was into 0x6eea0c0000000185[GpuArray MatrixStore _primaryArray], and not into 0xad2b50000000316[PrimaryCullPass occludeds]. Line 1133339 in the dump file.

@StefanPoelloth
Author

StefanPoelloth commented Jul 17, 2024

@artem-lunarg Yes, the 0xad2b50000000316[PrimaryCullPass occludeds] is only written from the compute shader.

We have done extensive analysis and I'm pretty confident that all barriers (for buffer 0x17758085a10[GpuArray MatrixStore _primaryArray]) are correctly placed.
I've created a short API dump with 7 frames, where frame 7 gives me this sync hazard:

validation layer: Validation Error: [ SYNC-HAZARD-READ-AFTER-WRITE ] Object 0: handle = 0x17755332de0, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0xe4d96472 | vkQueueSubmit(): Hazard READ_AFTER_WRITE for entry 0, VkCommandBuffer 0x17757fb6170[Frame 7], Submitted access info (submitted_usage: SYNC_COPY_TRANSFER_READ, command: vkCmdCopyBuffer, seq_no: 2, reset_no: 8, resource: VkBuffer 0x17758085a10[GpuArray MatrixStore _primaryArray]). Access info (prior_usage: SYNC_COPY_TRANSFER_WRITE, write_barriers: SYNC_VERTEX_SHADER_SHADER_STORAGE_READ|SYNC_COMPUTE_SHADER_SHADER_STORAGE_READ, queue: VkQueue 0x17755332de0[], submit: 16, batch: 0, batch_tag: 326, command: vkCmdCopyBuffer, command_buffer: VkCommandBuffer 0x177580720d0[Frame 6], seq_no: 5, reset_no: 4, resource: VkBuffer 0x17758098eb0[PrimaryCullPass occludedCount]).

My analysis shows that the buffer 0x17758085a10[GpuArray MatrixStore _primaryArray] is written to in frame 7 (and not written in frame 6). The write is protected with a barrier before: vertex|compute|copy/transferRead|storageRead -> copy/transferWrite and two barriers after copy/transferWrite -> compute/storageRead and copy/transferWrite -> vertex/storageRead.
My conclusion is that all barriers are correct and the validation error is a false positive.

My API dump: dump6.zip

My complete dump analysis for 0x17758085a10 for the 7 frames:

barrier format is: srcStageMask/srcAccessMask -> dstStageMask/dstAccessMask

frame 1:

  • create/bind/debugname 17758085a10
  • barrier: all commands/none -> copy/transferWrite
  • CopyBuffer write: staging -> 17758085a10
  • barrier: copy/transferWrite -> copy/transferRead
  • CopyBuffer read: 17758085a10 -> secondary matrix
  • barrier: copy/transferRead -> copy/transferWrite
  • CopyBuffer write: staging -> 17758085a10
  • barrier: copy/transferWrite -> compute/storageRead
  • Dispatch storage read PrimaryCullPass
  • barrier: copy/transferWrite -> vertex/storageRead
  • BeginRendering storage read PrimaryVisibilityPass
  • DispatchIndirect storage read SecondaryCullPass
  • BeginRendering storage read SecondaryVisibilityPass
  • Dispatch storage read AttributePass

frame 2:

  • barrier: copy/transferWrite -> copy/transferRead
  • CopyBuffer read: 17758085a10 -> secondary matrix
  • Dispatch storage read PrimaryCullPass
  • BeginRendering storage read PrimaryVisibilityPass
  • DispatchIndirect storage read SecondaryCullPass
  • BeginRendering storage read SecondaryVisibilityPass
  • Dispatch storage read AttributePass

frame 3:

  • CopyBuffer read: 17758085a10 -> secondary matrix
  • barrier: vertex|compute|copy/transferRead|storageRead -> copy/transferWrite
  • CopyBuffer write: staging -> 17758085a10
  • barrier: copy/transferWrite -> compute/storageRead
  • Dispatch storage read PrimaryCullPass
  • barrier: copy/transferWrite -> vertex/storageRead
  • BeginRendering storage read PrimaryVisibilityPass
  • DispatchIndirect storage read SecondaryCullPass
  • BeginRendering storage read SecondaryVisibilityPass
  • Dispatch storage read AttributePass

frame 4:

  • barrier: copy/transferWrite -> copy/transferRead
  • CopyBuffer read: 17758085a10 -> secondary matrix
  • barrier: vertex|compute|copy/transferRead|storageRead -> copy/transferWrite
  • CopyBuffer write: staging -> 17758085a10
  • barrier: copy/transferWrite -> compute/storageRead
  • Dispatch storage read PrimaryCullPass
  • barrier: copy/transferWrite -> vertex/storageRead
  • BeginRendering storage read PrimaryVisibilityPass
  • DispatchIndirect storage read SecondaryCullPass
  • BeginRendering storage read SecondaryVisibilityPass
  • Dispatch storage read AttributePass

frame 5:

  • barrier: copy/transferWrite -> copy/transferRead
  • CopyBuffer read: 17758085a10 -> secondary matrix
  • barrier: vertex|compute|copy/transferRead|storageRead -> copy/transferWrite
  • CopyBuffer write: staging -> 17758085a10
  • barrier: copy/transferWrite -> compute/storageRead
  • Dispatch storage read PrimaryCullPass
  • barrier: copy/transferWrite -> vertex/storageRead
  • BeginRendering storage read PrimaryVisibilityPass
  • DispatchIndirect storage read SecondaryCullPass
  • BeginRendering storage read SecondaryVisibilityPass
  • Dispatch storage read AttributePass

frame 6:

  • barrier: copy/transferWrite -> copy/transferRead
  • CopyBuffer read: 17758085a10 -> secondary matrix
  • Dispatch storage read PrimaryCullPass
  • BeginRendering storage read PrimaryVisibilityPass
  • DispatchIndirect storage read SecondaryCullPass
  • BeginRendering storage read SecondaryVisibilityPass
  • Dispatch storage read AttributePass

frame 7:

  • CopyBuffer read: 17758085a10 -> secondary matrix
  • barrier: vertex|compute|copy/transferRead|storageRead -> copy/transferWrite
  • CopyBuffer write: staging -> 17758085a10
  • barrier: copy/transferWrite -> compute/storageRead
  • Dispatch storage read PrimaryCullPass
  • barrier: copy/transferWrite -> vertex/storageRead
  • BeginRendering storage read PrimaryVisibilityPass
  • DispatchIndirect storage read SecondaryCullPass
  • BeginRendering storage read SecondaryVisibilityPass
  • Dispatch storage read AttributePass
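
The per-frame listing above (frames 2 to 7) appears to follow one rule: the `copy/transferWrite -> copy/transferRead` barrier is recorded only when the previous frame wrote the buffer, and the write with its surrounding barriers is recorded only when the current frame updates it. The sketch below restates that rule; `frame_commands` is a hypothetical helper, the shader reads are collapsed into one entry (so the interleaving of the two post-write barriers with the passes is not preserved), and the case "no update now, no write last frame" does not occur in the dump and is an extrapolation.

```python
def frame_commands(updates_now, wrote_last_frame):
    """Transfer-related commands recorded for one frame (frame 2 onward)."""
    cmds = []
    if wrote_last_frame:
        # Make last frame's copy write visible to this frame's copy read.
        cmds.append("barrier: copy/transferWrite -> copy/transferRead")
    cmds.append("CopyBuffer read: primary -> secondary matrix")
    if updates_now:
        cmds.append("barrier: vertex|compute|copy/transferRead|storageRead -> copy/transferWrite")
        cmds.append("CopyBuffer write: staging -> primary")
        cmds.append("barrier: copy/transferWrite -> compute/storageRead")
        cmds.append("barrier: copy/transferWrite -> vertex/storageRead")
    # All cull, visibility and attribute passes, collapsed into one entry.
    cmds.append("shader storage reads")
    return cmds


# Update schedule read off the listing: which frames write the buffer.
updates = {1: True, 2: False, 3: True, 4: True, 5: True, 6: False, 7: True}
for frame in range(2, 8):
    cmds = frame_commands(updates[frame], updates[frame - 1])
    print(f"frame {frame}: first command = {cmds[0]}")
```

Frames 3 and 7 are the two frames that start with the copy read and no preceding barrier, because the frame before them did not write the buffer.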

I think the GPU-AV tag should be removed since it's happening without GPU-AV.

@artem-lunarg
Contributor

artem-lunarg commented Jul 17, 2024

@StefanPoelloth Thanks for the details! I have also analyzed the barriers based on the original dump, and they look correct to me. So there could be two problems here: a false-positive error and incorrect labeling of the resource.

For documentation purposes, here is the critical sequence from the original dump for buffer 6EEA0C0000000185 that syncval complains about:

Begin: cmdbuf 000002D4EF775860
CmdCopyBuffer: dstBuffer [6EEA0C0000000185] // prior access

buffer barrier [6EEA0C0000000185]: 
    srcStageMask  : VK_PIPELINE_STAGE_2_COPY_BIT
    srcAccessMask : VK_ACCESS_2_TRANSFER_WRITE_BIT
    dstStageMask  : VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT
    dstAccessMask : VK_ACCESS_2_SHADER_STORAGE_READ_BIT

buffer barrier [6EEA0C0000000185]: 
    srcStageMask  : VK_PIPELINE_STAGE_2_COPY_BIT
    srcAccessMask : VK_ACCESS_2_TRANSFER_WRITE_BIT
    dstStageMask  : VK_PIPELINE_STAGE_2_VERTEX_SHADER_BIT
    dstAccessMask : VK_ACCESS_2_SHADER_STORAGE_READ_BIT

End: cmdbuf 000002D4EF775860
QueueSubmit: cmdbuf 000002D4EF775860

Begin: cmdbuf 000002D4EF7FD510
buffer barrier [6EEA0C0000000185]: 
    srcStageMask  : VK_PIPELINE_STAGE_2_COPY_BIT
    srcAccessMask : VK_ACCESS_2_TRANSFER_WRITE_BIT
    dstStageMask  : VK_PIPELINE_STAGE_2_COPY_BIT
    dstAccessMask : VK_ACCESS_2_TRANSFER_READ_BIT
    
CmdCopyBuffer: srcBuffer [6EEA0C0000000185] // last submitted access
End: cmdbuf 000002D4EF7FD510
QueueSubmit: cmdbuf 000002D4EF7FD510
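
The expectation behind this sequence can be checked with a toy model: remember the last write to the buffer and the set of (stage, access) pairs that barriers have made it visible to, then test each read against that set. This is a deliberately simplified sketch and not the algorithm syncval uses; it ignores memory ranges, execution dependency chains, queues and semaphores, and all names in it (`BufferState`, `replay`) are hypothetical.

```python
class BufferState:
    """Toy per-buffer state: the last write and who is allowed to see it."""

    def __init__(self):
        self.last_write = None       # (stage, access) of the most recent write
        self.write_barriers = set()  # (stage, access) pairs allowed to see it

    def write(self, stage, access):
        self.last_write = (stage, access)
        self.write_barriers = set()

    def barrier(self, src, dst):
        # The barrier only applies if its source scope covers the last write.
        if self.last_write == src:
            self.write_barriers.add(dst)

    def read(self, stage, access):
        """Return 'READ_AFTER_WRITE' if no barrier covers this read, else None."""
        if self.last_write is not None and (stage, access) not in self.write_barriers:
            return "READ_AFTER_WRITE"
        return None


COPY_WRITE = ("COPY", "TRANSFER_WRITE")
COPY_READ = ("COPY", "TRANSFER_READ")
COMPUTE_READ = ("COMPUTE_SHADER", "SHADER_STORAGE_READ")
VERTEX_READ = ("VERTEX_SHADER", "SHADER_STORAGE_READ")


def replay(with_copy_barrier):
    buf = BufferState()
    # First command buffer: write, then expose it to compute and vertex reads.
    buf.write(*COPY_WRITE)
    buf.barrier(COPY_WRITE, COMPUTE_READ)
    buf.barrier(COPY_WRITE, VERTEX_READ)
    # Second command buffer on the same queue: barrier, then the copy read.
    if with_copy_barrier:
        buf.barrier(COPY_WRITE, COPY_READ)
    return buf.read(*COPY_READ)


print(replay(with_copy_barrier=True))   # sequence as recorded in the dump
print(replay(with_copy_barrier=False))  # same sequence without the last barrier
```

In this model the recorded sequence is hazard-free, and a READ_AFTER_WRITE with only the vertex and compute storage reads in `write_barriers` appears only when the `COPY -> COPY` barrier is missing, which is the state the reported message describes.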

@artem-lunarg artem-lunarg removed the GPU-AV GPU Assisted Validation label Jul 17, 2024
@artem-lunarg artem-lunarg changed the title SyncVal false-positive when GPU-AV is enabled False-positive READ-AFTER-WRITE hazard and incorrect resource label Jul 17, 2024
@artem-lunarg
Contributor

@StefanPoelloth could you clarify a few points:
a) You mentioned it is sometimes hard to reproduce. Is the behavior non-deterministic, so that it sometimes runs without errors?
b) In the cases where the error is detected, does the application need to run for some time before it happens, or is it usually in the first frames? (In the provided messages the errors were reported pretty early, so I'm wondering if there is such a correlation.)

@StefanPoelloth
Author

StefanPoelloth commented Jul 18, 2024

@artem-lunarg
a) With GPU-AV it happened almost always in the first few frames. But when I disabled GPU-AV, it was hard to reproduce. However, I've tweaked my test case so that it happens reliably in the first, let's say, ~20 frames. It also happens very often: in a random run it happened on frames 4, 20, 28, 34, 42, 47... I get around 125 errors in 1000 frames.
The buffers in prior_usage keep changing too, but it doesn't look random. Up until now I have only seen buffers, and sometimes images, that are used in the compute pass right after the update.

Some context:

  • It doesn't happen when I update the _primaryArray every frame.
  • It happened pretty often with GPU-AV when occasionally updating the _primaryArray.
  • It was hard to reproduce when occasionally updating _primaryArray without GPU-AV.
  • And it's easy to reproduce when I update _primaryArray frequently but not always. I use if (_time % 0.1f < 0.05f) to move objects, which eventually results in the barriers and buffer copies.

The last point results in the following updates (a random run, not related to any capture):

  • update on frame 0
  • update on frame 1
  • update on frame 3
  • Frame 4: SYNC-HAZARD-READ-AFTER-WRITE

b) I can reliably reproduce it in the first few (~20) frames on every application run. Usually it happens between frame 3 and 8.

EDIT:
I did some testing on when to update the _primaryArray:

  • It doesn't happen when I update the _primaryArray every frame.
  • It happens once when updating with if (_frame % 2 == 0).
  • With if (_frame % 3 == 0) it happens every 3rd frame.

@artem-lunarg
Contributor

thanks!

@artem-lunarg
Contributor

artem-lunarg commented Jul 18, 2024

I wrote a test that simulates how 0x6eea0c0000000185 is updated (it includes buffer copies, and barriers are generated based on whether the buffer was updated, to match the pattern from the dump). So far I can't catch the issue. Interestingly, I get exactly the same error as reported in this issue when I remove the barrier that protects 0x6eea0c0000000185 when it is used as a copy source (in the attached pseudo code it is the first barrier in most frames). Still, in the API dump that barrier is properly generated when necessary.

@StefanPoelloth if you have the opportunity to check the latest VVL code (a new SDK will also be released soon), it would be interesting to see whether it changes anything. In the latest code, syncval validation of descriptor accesses is disabled by default because it can produce false positives (the old behavior can be enabled by setting the VK_KHRONOS_VALIDATION_SYNCVAL_SHADER_ACCESSES_HEURISTIC environment variable to 1). My impression is that this issue is not related to shader accesses, but it would still be good to get confirmation.
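
For a scripted test run, the environment variable named above can be set on a copy of the environment before launching the application. This is only a sketch: the variable name and the value 1 are quoted from the comment, while the helper name and the launch command in the last comment are hypothetical.

```python
import os


def env_with_shader_access_heuristic(base_env=None):
    """Return a copy of the environment with the old descriptor-access checks enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Name and value as given in the comment above.
    env["VK_KHRONOS_VALIDATION_SYNCVAL_SHADER_ACCESSES_HEURISTIC"] = "1"
    return env


env = env_with_shader_access_heuristic({"PATH": "/usr/bin"})
print(env["VK_KHRONOS_VALIDATION_SYNCVAL_SHADER_ACCESSES_HEURISTIC"])
# Hypothetical launch: subprocess.run(["RenderDemo.exe"], env=env_with_shader_access_heuristic())
```

Working on a copy keeps the setting local to the launched process, so runs with and without the heuristic can be compared from one script.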

Pseudo code I use for investigation for documentation purposes:
buffer-frames.txt

@StefanPoelloth
Author

@artem-lunarg I tested with a6d3fc5 yesterday and with edcf314 today, and I can confirm it happens with "submit time validation" enabled and "shader access heuristic" disabled. It does not happen with "submit time validation" disabled and "shader access heuristic" enabled.

I tried removing rendering code step by step. If I remove all rendering passes (draw, dispatch and everything related) and leave only the command buffer creation, the buffer updates and EndCommandBuffer, I get the following assertion:

image

Callstack:

> VkLayer_khronos_validation.dll!std::vector>::operator[](const unsigned __int64 _Pos) Line 1899 C++
VkLayer_khronos_validation.dll!CommandBufferAccessContext::GetHandleRecord(unsigned int handle_index) Line 334 C++
VkLayer_khronos_validation.dll!operator<<(std::basic_ostream> & out, const ResourceUsageRecord::FormatterState & formatter) Line 1204 C++
VkLayer_khronos_validation.dll!QueueBatchContext::FormatUsage(ResourceUsageTagEx tag_ex) Line 464 C++
VkLayer_khronos_validation.dll!SyncValidationInfo::FormatHazard(const HazardResult & hazard) Line 1255 C++
VkLayer_khronos_validation.dll!ReplayState::DetectFirstUseHazard(const sparse_container::range & first_use_range) Line 1235 C++
VkLayer_khronos_validation.dll!ReplayState::ValidateFirstUse() Line 1255 C++
VkLayer_khronos_validation.dll!QueueBatchContext::ValidateSubmit(const VkSubmitInfo2 & submit, unsigned __int64 submit_index, unsigned int batch_index, std::vector> & current_label_stack, const ErrorObject & error_obj) Line 534 C++
VkLayer_khronos_validation.dll!SyncValidator::ValidateQueueSubmit(VkQueue_T * queue, unsigned int submitCount, const VkSubmitInfo2 * pSubmits, VkFence_T * fence, const ErrorObject & error_obj) Line 3079 C++
VkLayer_khronos_validation.dll!SyncValidator::PreCallValidateQueueSubmit(VkQueue_T * queue, unsigned int submitCount, const VkSubmitInfo * pSubmits, VkFence_T * fence, const ErrorObject & error_obj) Line 3043 C++
VkLayer_khronos_validation.dll!vulkan_layer_chassis::QueueSubmit(VkQueue_T * queue, unsigned int submitCount, const VkSubmitInfo * pSubmits, VkFence_T * fence) Line 1490 C++
vulkan-1.dll!00007ff9f708e01a() Unknown
[Managed to Native Transition]
Silk.NET.Vulkan.dll!Silk.NET.Vulkan.Vk.QueueSubmit(Silk.NET.Vulkan.Queue queue, uint submitCount, Silk.NET.Vulkan.SubmitInfo pSubmits, Silk.NET.Vulkan.Fence fence) Line 10635 C#
FS.VulkanRenderer.dll!FS.VulkanRenderer.VulkanRenderer.Draw(FS.VulkanRenderer.Scene.RenderScene renderScene, FS.VulkanRenderer.FrameInfo frameInfo, FS.VulkanRenderer.ImGuiImpl.Snapshot? imGuiSnapshot) Line 224 C#
RenderDemo.dll!RenderDemo.Program.OnFrame(double delta) Line 334 C#
Silk.NET.Windowing.Common.dll!Silk.NET.Windowing.Internals.ViewImplementationBase.DoRender() Unknown
Silk.NET.Windowing.Common.dll!Silk.NET.Windowing.WindowExtensions.Run.AnonymousMethod__0() Unknown
Silk.NET.Windowing.Common.dll!Silk.NET.Windowing.Internals.ViewImplementationBase.Run(System.Action onFrame) Unknown
Silk.NET.Windowing.Glfw.dll!Silk.NET.Windowing.Glfw.GlfwWindow.Run(System.Action onFrame) Unknown
Silk.NET.Windowing.Common.dll!Silk.NET.Windowing.WindowExtensions.Run(Silk.NET.Windowing.IView view) Unknown
RenderDemo.dll!RenderDemo.Program.Main() Line 225 C#
[Native to Managed Transition]
hostpolicy.dll!00007ff9545d2b76() Unknown
hostpolicy.dll!00007ff9545d2e5c() Unknown
hostpolicy.dll!00007ff9545d379a() Unknown
hostfxr.dll!00007ff95462b5c9() Unknown
hostfxr.dll!00007ff95462e066() Unknown
hostfxr.dll!00007ff9546302ec() Unknown
hostfxr.dll!00007ff95462e644() Unknown
hostfxr.dll!00007ff9546285a0() Unknown
RenderDemo.exe!00007ff7d84cf998() Unknown
RenderDemo.exe!00007ff7d84cfda6() Unknown
RenderDemo.exe!00007ff7d84d12e8() Unknown
kernel32.dll!00007ffa49ca7344() Unknown
ntdll.dll!00007ffa49efcc91() Unknown

@artem-lunarg
Contributor

@StefanPoelloth Thanks for the confirmation, that's very helpful. Yes, this entire issue is related to "submit time validation", so it makes sense to always test with this option enabled. It's valuable to know that it happens with "shader accesses" disabled, since that simplifies the testing scenarios we might need to consider.

About the assertion: I suspected this scenario and was going to disable "resource" reporting for this SDK release, and this is one more good confirmation.

p.s. I won't be available for the next few weeks, so there might not be much progress here and it probably won't be fixed for this SDK, but it is something we will definitely target for the SDK after that.

artem-lunarg added a commit to artem-lunarg/Vulkan-ValidationLayers that referenced this issue Jul 19, 2024
KhronosGroup#8291
This can be related to command buffer lifetime management. Usage
records can reference handles from the command buffers that were Reset.
@artem-lunarg
Contributor

artem-lunarg commented Jul 19, 2024

@StefanPoelloth do you run syncval alone or together with Core validation enabled (the core checkbox)? Any Core validation errors?

Core validation errors can put syncval into an inconsistent state, and it can then produce false positives. The recommendation is to get a run free of core validation errors before enabling syncval. Syncval should not crash in this scenario though.

@StefanPoelloth
Author

@artem-lunarg Core validation is disabled, everything except "Synchronization" and "Submit time validation" is disabled.

@artem-lunarg
Contributor

Thanks, "Handle Wrapping" is good to have enabled though; the path with it disabled is less tested.

artem-lunarg added a commit that referenced this issue Jul 19, 2024
#8291
This can be related to command buffer lifetime management. Usage
records can reference handles from the command buffers that were Reset.
spencer-lunarg pushed a commit that referenced this issue Jul 19, 2024
#8291
This can be related to command buffer lifetime management. Usage
records can reference handles from the command buffers that were Reset.
@StefanPoelloth
Author

@artem-lunarg I've reduced the number of API calls drastically and uploaded a new API dump to the share. We're working on repro code that I can share, but unfortunately we have had no luck yet.

@artem-lunarg
Contributor

@StefanPoelloth thanks for putting effort into this. I have some ideas about what could be wrong with the label and am trying to come up with a repro case. The reported error itself is trickier. Both the API dump and the error message suggest that there was a barrier before the copy, so READs should be protected from previous WRITEs, but somehow it did not work.

@artem-lunarg
Contributor

@StefanPoelloth if this is something that can be quickly hacked (I know in some engines it can be tricky; if so, please ignore): does the error (and the incorrect label) also reproduce when the program runs without present operations (no QueuePresent, no AcquireNextImage, and QueueSubmit adjusted not to wait on the semaphore from Acquire and not to signal the semaphore for present)? So far I have tested without presentation.

@StefanPoelloth
Author

StefanPoelloth commented Jul 22, 2024

@artem-lunarg I gave it a quick try and it doesn't seem to happen without present/semaphores.

@StefanPoelloth
Author

StefanPoelloth commented Jul 23, 2024

@artem-lunarg I've uploaded a sample project that reproduces the issue; check out the included readme.

@artem-lunarg
Contributor

@StefanPoelloth Thank you, I can see the assert from the out-of-bounds access. One question, since I'm not a .NET user: is it possible to continue debugging VVL code after the assert is hit? When I press the Retry button in the assert window, that usually goes into the debugger for a C++ project, but here it terminates the app.

@StefanPoelloth
Author

StefanPoelloth commented Jul 23, 2024

@artem-lunarg I'm assuming you use Visual Studio:

Right-click on the RenderDemo project and select Properties.
Select "Debug" on the left side and click "Open debug launch profiles UI".
Scroll down a bit and check "Enable native code debugging".

This should create the following file: RenderDemo/Properties/launchSettings.json
launchSettings.json

It didn't hit an assert for me with a debug build of 9195994 🤔

@artem-lunarg
Contributor

Thanks, debugging works now. I'm using a debug build. The assert I get is exactly what we need: it's the GetHandleRecord call where the array index is out of bounds, and in debug builds the MSVC std::vector implementation triggers an assert.
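
The commit message referenced earlier in this thread suggests that usage records may reference handles from command buffers that were reset. The toy sketch below illustrates how such a stale index could, in principle, produce both symptoms seen here, an out-of-bounds lookup and a wrong resource label. It illustrates that hypothesis only and is not the VVL code; `ToyCommandBuffer` and `label_for` are hypothetical names.

```python
class ToyCommandBuffer:
    """Toy stand-in for a command buffer that stores resource handles by index."""

    def __init__(self):
        self.handles = []

    def add_handle(self, name):
        self.handles.append(name)
        return len(self.handles) - 1  # index kept by the usage record

    def reset(self):
        self.handles.clear()


def label_for(cb, handle_index):
    """Look up a label, reporting out-of-range indices instead of raising."""
    if handle_index >= len(cb.handles):
        return "<out of bounds>"
    return cb.handles[handle_index]


cb = ToyCommandBuffer()
cb.add_handle("staging")
stale_index = cb.add_handle("_primaryArray")  # usage record keeps index 1

cb.reset()                                    # the record outlives the reset
print(label_for(cb, stale_index))             # nothing recorded yet: out of bounds

cb.add_handle("staging")
cb.add_handle("occludedCount")                # re-recorded with other resources
print(label_for(cb, stale_index))             # valid index, but another buffer
```

Which of the two outcomes occurs depends only on how many handles the command buffer holds at lookup time, which would be consistent with the label changing between runs.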

@artem-lunarg
Contributor

It didn't hit an assert for me with a debug build of

In the latest code we disabled resource reporting; that's probably why there is no assert, only the validation error.

@StefanPoelloth
Author

@artem-lunarg By changing the code very slightly, I was able to produce a WAW instead of a RAW. I've uploaded the repro code for this as well.

@artem-lunarg
Contributor

Thank you @StefanPoelloth. We successfully reproduced the scenario within our testing framework with a compact unit test. I'll keep posting updates here when we have fixes.

@artem-lunarg
Contributor

@StefanPoelloth the fix has landed. If it does not fix the issue in your program, please reopen this ticket.

@StefanPoelloth
Author

@artem-lunarg I can confirm this fixes the RAW and WAW false positives. Thanks
