
Fix replay staging buffer binding when the capture and replay devices are not the same. #1115

Closed

Conversation

jacobv-nvidia (Contributor)

While testing the replay of trimmed capture files on devices other than the original capture device, I encountered an issue in the rebind allocator where allocations would fail, causing the replay to abort.

The initial loading allocations would fail, and, when running the replay with the --validate option, I would receive messages such as the following:

VUID-vkMapMemory-memory-00682(ERROR / SPEC): msgNum: -330527817 - Validation Error: [ VUID-vkMapMemory-memory-00682 ] Object 0: handle = 0x4b07580000000ca2, type = VK_OBJECT_TYPE_DEVICE_MEMORY; | MessageID = 0xec4c8bb7 | Mapping Memory without VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT set: VkDeviceMemory 0x4b07580000000ca2[]. The Vulkan spec states: memory must have been created with a memory type that reports VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT (https://vulkan.lunarg.com/doc/view/1.3.239.0/windows/1.3-extensions/vkspec.html#VUID-vkMapMemory-memory-00682)
    Objects: 1
	[0] 0x4b07580000000ca2, type: 8, name: NULL

The issue seems to be in the rebind allocator's "Direct" binding functions, which are meant to bind memory without memory translation, e.g. for replay staging buffers. However, these functions were using the capture device's memory types instead of the replay device's.

This patch should fix the issue.
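To make the failure mode concrete, here is a minimal sketch of the selection logic involved. The identifiers capture_memory_properties_ and is_direct_allocation come from the code under discussion; replay_memory_properties and the standalone helper below are illustrative assumptions, not the actual patch:

```cpp
#include <vulkan/vulkan.h>

// Illustrative sketch only: for a "direct" bind (e.g. a replay staging
// buffer), the allocation lives on the replay device, so its property
// flags must be looked up in the replay device's memory properties rather
// than the capture device's.
VkMemoryPropertyFlags SelectBindingPropertyFlags(
    const VkPhysicalDeviceMemoryProperties& capture_memory_properties,
    const VkPhysicalDeviceMemoryProperties& replay_memory_properties,
    uint32_t                                memory_type_index,
    bool                                    is_direct_allocation)
{
    // Before the fix, the capture device's table was consulted in both
    // cases.  For a direct bind that can leave the staging allocation in a
    // memory type that is not host-visible on the replay device, which is
    // what later produces the VUID-vkMapMemory-memory-00682 error above.
    const VkPhysicalDeviceMemoryProperties& props =
        is_direct_allocation ? replay_memory_properties : capture_memory_properties;

    return props.memoryTypes[memory_type_index].propertyFlags;
}
```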

jacobv-nvidia and others added 2 commits April 23, 2023 14:46
This commit fixes an issue with erroneous memory type selection when
binding buffer memory without memory translation, e.g. in cases of
binding replay staging buffers. Currently, the code retrieves the
capture device's memory types, even though we want to bind the
memory directly from the replay device.

This bug led to various failures when mapping memory later in the
replay, and was particularly disruptive when attempting to replay a
trimmed capture on a machine with a GPU from a different vendor than
the capture system.
@ci-tester-lunarg

Author jacobv-nvidia not on autobuild list. Waiting for curator authorization before starting CI build.

@CLAassistant commented May 4, 2023

CLA assistant check
All committers have signed the CLA.

@ci-tester-lunarg

CI gfxreconstruct build queued with queue ID 10524.

@ci-tester-lunarg

CI gfxreconstruct build # 2769 running.

@andrew-lunarg (Contributor) left a comment

Looks like a very useful fix, so thanks for the PR!

One point: it might be cleaner to extend BindBufferMemory and BindImageMemory with the extra bool parameter and give it a default value of true. Then the two new functions with slightly vague names wouldn't be needed.
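A rough sketch of that suggestion, heavily abbreviated (the real methods take more parameters, and translate_memory_type is an illustrative name):

```cpp
#include <vulkan/vulkan.h>

// Sketch of the suggested signature change.  A defaulted trailing parameter
// keeps existing call sites unchanged, while the replay staging-buffer path
// can pass false to bind without memory-type translation.
class VulkanRebindAllocator
{
  public:
    VkResult BindBufferMemory(VkBuffer       buffer,
                              VkDeviceMemory memory,
                              VkDeviceSize   memory_offset,
                              bool           translate_memory_type = true);

    VkResult BindImageMemory(VkImage        image,
                             VkDeviceMemory memory,
                             VkDeviceSize   memory_offset,
                             bool           translate_memory_type = true);
};
```

As the reply below notes, if these methods override the more general VulkanResourceAllocator interface, the extra parameter would also have to surface on the parent declarations.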

@jacobv-nvidia (Contributor, Author) commented May 4, 2023

@andrew-lunarg Yes, I was on the fence about that. I was trying to avoid cluttering the parent interface of the more general class VulkanResourceAllocator. Do you think it's worth it?

@ci-tester-lunarg

CI gfxreconstruct build # 2769 passed.

@locke-lunarg self-requested a review May 26, 2023 19:16
@bradgrantham-lunarg (Contributor) left a comment

Thank you for this PR! I have proposed a small change to the calling sequence in a comment.

@@ -409,7 +410,9 @@ VkResult VulkanRebindAllocator::BindBufferMemory(VkBuffer buffer,
        create_info.flags = 0;
        create_info.usage = GetBufferMemoryUsage(
            resource_alloc_info->usage,
            capture_memory_properties_.memoryTypes[memory_alloc_info->original_index].propertyFlags,
            is_direct_allocation
@bradgrantham-lunarg (Contributor)

I apologize it's taken me so long to look at this. I think this function doesn't need to care whether the access is direct or not. I propose elevating the condition into the callers, changing the bool parameter to a const VkPhysicalDeviceMemoryProperties&, and then passing in the desired memory properties inside the direct and non-direct functions. Also, there's no need to name the sub-functions Helper; it's okay for them to be differentiated by parameters.

@jacobv-nvidia, may I make that change against your PR using GitHub's edit mechanism inside the PR viewer? An example is at https://github.com/bradgrantham-lunarg/gfxreconstruct/tree/brad-tweak-1115 that works against our local CI, but I'd like to run it on a known failure case if you have one.
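For clarity, a rough sketch of the calling sequence being proposed (the abbreviated signatures and the name BindBufferMemoryDirect are illustrative; the actual change is in the linked branch):

```cpp
#include <vulkan/vulkan.h>

// Sketch of the proposed shape: the shared overload no longer knows whether
// the access is direct; each caller passes the memory properties it wants
// consulted.
class VulkanRebindAllocator
{
  public:
    // Non-direct path: translate using the capture device's memory types.
    VkResult BindBufferMemory(VkBuffer buffer, VkDeviceMemory memory, VkDeviceSize offset)
    {
        return BindBufferMemory(buffer, memory, offset, capture_memory_properties_);
    }

    // Direct path (e.g. replay staging buffers): use the replay device's types.
    VkResult BindBufferMemoryDirect(VkBuffer buffer, VkDeviceMemory memory, VkDeviceSize offset)
    {
        return BindBufferMemory(buffer, memory, offset, replay_memory_properties_);
    }

  private:
    // Shared implementation, differentiated by its parameter list rather
    // than by a "Helper" suffix.
    VkResult BindBufferMemory(VkBuffer                                buffer,
                              VkDeviceMemory                          memory,
                              VkDeviceSize                            offset,
                              const VkPhysicalDeviceMemoryProperties& memory_properties)
    {
        // The real implementation would consult memory_properties when
        // computing the allocation usage, then perform the rebind.
        (void)buffer; (void)memory; (void)offset; (void)memory_properties;
        return VK_SUCCESS;
    }

    VkPhysicalDeviceMemoryProperties capture_memory_properties_{};
    VkPhysicalDeviceMemoryProperties replay_memory_properties_{};
};
```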

@jacobv-nvidia (Contributor, Author)

Sure, go ahead. I agree, your new change seems like a much less hacky implementation of the fix.

@bradgrantham-lunarg (Contributor)

I tried to take a trimmed capture of vkcube from llvmpipe and replay it on NVIDIA with --validate, but got no validation error for vkMapMemory from SDK 1.3.246's binaries. @jacobv-nvidia, what's a good test case for this?

@jacobv-nvidia (Contributor, Author)

> I tried to take a trimmed capture of vkcube from llvmpipe and replay it on NVIDIA with --validate, but got no validation error for vkMapMemory from SDK 1.3.246's binaries. @jacobv-nvidia, what's a good test case for this?

I have just successfully tested it with vkcube, so that should work as a good test case, but you do need to:

  1. use the vkcube option --use_staging
  2. avoid including the first frame as part of the trim.

For example: python gfxrecon-capture-vulkan.py -o cube.gfxr --capture-frames 2-16 vkcube --use_staging

I haven't really tested on llvmpipe, so can't speak too much on that part. The issue was discovered and tested using an NVIDIA trace on an AMD card, but I suppose it's possible that the relevant memory type indices just happen to match up between llvmpipe and NVIDIA.
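To reproduce the original failure, the replay is then run with validation enabled, for example: gfxrecon-replay --validate cube.gfxr (the exact trimmed capture file name may differ depending on how the capture script names frame-range captures).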

@bradgrantham-lunarg (Contributor)

Merged as #1173.

We encountered this issue this week in our internal CI, and this fix solved our problem, so I prioritized it.

Thank you for finding this, @jacobv-nvidia!

@jacobv-nvidia (Contributor, Author)

@bradgrantham-lunarg Apologies for the delay in responding; I thought I had replied to your earlier comment on your changes. Anyway, I appreciate the merge.

@jacobv-nvidia deleted the fix-staging-buffer-memory-types branch March 18, 2024 15:43