Unload drivers with no physical devices + Refactor instance level objects to not use icd_index #1471

charles-lunarg
Collaborator

The loader did not unload ICDs that reported zero physical devices, which
could cause premature memory exhaustion in some circumstances, such as 32-bit
applications. While the loader's policy has been to keep everything open for
the duration of the instance, these ICDs don't meaningfully participate in
anything because they have no VkPhysicalDevices.

This change adds a check after any vkEnumeratePhysicalDevices call where
pPhysicalDevices is not NULL: every loader_icd_term that reported zero physical
devices has its vkDestroyInstance called and is removed from the
loader_instance's icd_term linked list.
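
The unlink step described above can be sketched roughly as follows. This is a minimal illustration, not the loader's code: `struct icd_node`, `push_driver`, and `prune_empty_drivers` are hypothetical names, and the real `loader_icd_term` carries far more state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the loader's per-driver record. */
struct icd_node {
    struct icd_node *next;
    uint32_t physical_device_count;                   /* count reported by the driver */
    void (*destroy_instance)(struct icd_node *self);  /* stand-in for the driver's vkDestroyInstance */
};

/* Prepends a heap-allocated node to the list; returns NULL on allocation failure. */
static struct icd_node *push_driver(struct icd_node **head, uint32_t count) {
    assert(head != NULL);
    struct icd_node *node = calloc(1, sizeof *node);
    if (node == NULL) {
        return NULL;
    }
    node->physical_device_count = count;
    node->next = *head;
    *head = node;
    return node;
}

/* Tears down and unlinks every node that reported zero physical devices.
 * Returns the number of nodes removed. */
static size_t prune_empty_drivers(struct icd_node **head) {
    assert(head != NULL);
    size_t removed = 0;
    struct icd_node **link = head; /* the pointer that currently refers to the node under inspection */
    while (*link != NULL) {
        struct icd_node *cur = *link;
        if (cur->physical_device_count == 0) {
            *link = cur->next; /* unlink first so the list never refers to a dead node */
            if (cur->destroy_instance != NULL) {
                cur->destroy_instance(cur);
            }
            free(cur);
            removed++;
        } else {
            link = &cur->next;
        }
    }
    return removed;
}
```

Walking with a pointer-to-pointer avoids special-casing removal of the list head.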

Previously, per-ICD instance-level objects were accessed by using the ICD's
index into an array that was allocated with the object. This worked while the
indexes were static, but with the recent change to remove unused ICDs that is
no longer the case.

This commit replaces the array per object with object arrays, one for each
type (surface, debug messenger, and debug report) per ICD. That flips where the
index comes from: the instance stores an array indicating which indices are
used and which are free.

Whenever an instance-level object is created, the loader checks whether a free
index is available and reuses it if so. Otherwise it resizes its own store as
well as each ICD's array for that object.
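
A rough sketch of the "reuse a free index, otherwise grow" bookkeeping, under stated assumptions: `struct slot_store`, `slot_acquire`, and `slot_release` are hypothetical names, and the growth step of 32 is borrowed from the "resize after 32 elements" remark in a later commit message. The matching growth of each ICD's per-type array is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SLOT_GROWTH 32u /* assumed growth step */

/* Hypothetical instance-side record of which object indices are in use. */
struct slot_store {
    bool *used;
    uint32_t capacity;
};

/* Returns a free index, growing the store when none is free.
 * Returns UINT32_MAX if the store cannot be grown. */
static uint32_t slot_acquire(struct slot_store *store) {
    assert(store != NULL);
    for (uint32_t i = 0; i < store->capacity; i++) {
        if (!store->used[i]) {
            store->used[i] = true; /* reuse a previously released index */
            return i;
        }
    }
    uint32_t new_capacity = store->capacity + SLOT_GROWTH;
    bool *grown = realloc(store->used, new_capacity * sizeof *grown);
    if (grown == NULL) {
        return UINT32_MAX;
    }
    /* Mark the newly added tail as free. A real implementation would also
     * grow every per-driver array to new_capacity at this point. */
    memset(grown + store->capacity, 0, SLOT_GROWTH * sizeof *grown);
    uint32_t index = store->capacity;
    store->used = grown;
    store->capacity = new_capacity;
    store->used[index] = true;
    return index;
}

/* Marks an index as free so that a later slot_acquire can hand it out again. */
static void slot_release(struct slot_store *store, uint32_t index) {
    assert(store != NULL && index < store->capacity);
    store->used[index] = false;
}
```

Because indices are handed out by the instance rather than derived from an ICD's position, removing an ICD from the list does not invalidate them.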

@ci-tester-lunarg

CI Vulkan-Loader build queued with queue ID 167885.

@ci-tester-lunarg

CI Vulkan-Loader build # 2526 running.

@charles-lunarg charles-lunarg linked an issue Apr 15, 2024 that may be closed by this pull request
@ci-tester-lunarg

CI Vulkan-Loader build # 2526 passed.

Adds a simple executable that times how long vkEnumerateInstanceExtensionProperties
takes over and over to see how well the ICD preloading functions.
@charles-lunarg charles-lunarg force-pushed the unload_drivers_with_no_physical_devices branch from 58a8668 to 434898f April 24, 2024 21:43
@ci-tester-lunarg

CI Vulkan-Loader build queued with queue ID 173426.

@ci-tester-lunarg

CI Vulkan-Loader build # 2536 running.

@ci-tester-lunarg

CI Vulkan-Loader build queued with queue ID 173438.

@ci-tester-lunarg

CI Vulkan-Loader build # 2537 running.

The index has to be cross-referenced against the loader's internal index of
each ICD to be useful. Instead, we should print the lib_name in the error
message, making it clearer which driver the error is coming from.

Drivers resize after 32 elements, so to test that path we need to loop
over instance-level handle creation (surface, debug messenger, debug
report).

The driver unloading tests needed to create a debug report callback, so
that functionality was added to the test framework: VulkanFunctions gains
a new init function, and InstWrapper calls it when creating an instance.

Modify how test_icd_version_7 operates so that by default the functions
are exported, which is the 'assumed' codepath. This results in a bit of
duplication between versions 6 and 7, but it was kept so as to not modify
every test. This also clarifies how a test should enable querying of
the functions through vkGetInstanceProcAddr versus exporting those
functions (the two were combined before).
@charles-lunarg charles-lunarg force-pushed the unload_drivers_with_no_physical_devices branch from 2633b3f to dafddb6 April 24, 2024 21:49
@ci-tester-lunarg

CI Vulkan-Loader build queued with queue ID 173450.

@ci-tester-lunarg

CI Vulkan-Loader build # 2538 running.

@ci-tester-lunarg

CI Vulkan-Loader build # 2538 failed.

Necessary for any functions called across dll boundaries on 32 bit windows.
@ci-tester-lunarg

CI Vulkan-Loader build queued with queue ID 174024.

@ci-tester-lunarg

CI Vulkan-Loader build # 2539 running.

@ci-tester-lunarg

CI Vulkan-Loader build # 2539 passed.

@charles-lunarg charles-lunarg merged commit 8cd9956 into KhronosGroup:main Apr 29, 2024
43 checks passed
@charles-lunarg charles-lunarg deleted the unload_drivers_with_no_physical_devices branch April 29, 2024 16:53
@charles-lunarg
Collaborator Author

@ivyl Just a notice that the revised version has been merged. The original version had critical defects that required reverting it. The current version is more complicated (and thus more liable to be buggy) but has significantly better testing to go alongside it.

@Kangz
Contributor

Kangz commented May 6, 2024

Note that this PR still causes crashes on some systems (I bisected the issue below to this PR). On a fairly vanilla Debian testing install on a Lenovo ThinkPad with an Intel GPU, I'm getting the following crash:

Thread 1 "DawnInfo" received signal SIGSEGV, Segmentation fault.
___pthread_mutex_lock (mutex=0x200) at ./nptl/pthread_mutex_lock.c:80
80	./nptl/pthread_mutex_lock.c: No such file or directory.
(gdb) bt
#0  ___pthread_mutex_lock (mutex=0x200) at ./nptl/pthread_mutex_lock.c:80
#1  0x00007fffe6c61bdd in ?? () from /usr/lib/x86_64-linux-gnu/libvulkan_radeon.so
#2  0x00007fffe6baecfb in ?? () from /usr/lib/x86_64-linux-gnu/libvulkan_radeon.so
#3  0x00007ffff5b21d2c in loader_icd_destroy (ptr_inst=0x179c00174000, icd_term=0x179c0012fc00, pAllocator=0x179c001753d8) at ../../third_party/vulkan-deps/vulkan-loader/src/loader/loader.c:1376
#4  0x00007ffff5b3027c in unload_drivers_without_physical_devices (inst=0x179c00174000) at ../../third_party/vulkan-deps/vulkan-loader/src/loader/loader.c:6482
#5  0x00007ffff5b3a9c3 in vkEnumeratePhysicalDevices (instance=0x179c00174000, pPhysicalDeviceCount=0x7fffffffd144, pPhysicalDevices=0x179c00074480)
    at ../../third_party/vulkan-deps/vulkan-loader/src/loader/trampoline.c:911

Is the flow for the unloading of drivers in that path the same flow as on vkDestroyInstance?

@charles-lunarg
Collaborator Author

Yes. Because the drivers are being unloaded during EnumPhysDevs, the loader does the expected vkDestroyInstance teardown for that driver. That doesn't mean this teardown is bug-free, so I will investigate.

@charles-lunarg
Collaborator Author

Found the problem: the code was doing 2 bad things.

  1. Trying to destroy objects after calling vkDestroyInstance
  2. Passing a pAllocator with NULL for every member. That's just... silly. And my bad.
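
Both points run against Vulkan's rules: child objects must be destroyed before the instance is, and a non-NULL pAllocator must carry valid pfnAllocation, pfnReallocation, and pfnFree callbacks. As a minimal sketch of a guard for the second point (this is not the fix that went into the loader), the hypothetical `struct alloc_callbacks` below mirrors only the mandatory members of VkAllocationCallbacks, and `usable_allocator` and the `demo_*` callbacks are illustrative names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical mirror of the mandatory VkAllocationCallbacks members. */
struct alloc_callbacks {
    void *user_data;
    void *(*alloc_fn)(void *user_data, size_t size, size_t alignment);
    void *(*realloc_fn)(void *user_data, void *original, size_t size, size_t alignment);
    void (*free_fn)(void *user_data, void *memory);
};

/* Returns cb only when every mandatory callback is set. Otherwise returns
 * NULL, which in Vulkan means "use the implementation's default allocator". */
static const struct alloc_callbacks *usable_allocator(const struct alloc_callbacks *cb) {
    if (cb == NULL || cb->alloc_fn == NULL || cb->realloc_fn == NULL || cb->free_fn == NULL) {
        return NULL;
    }
    return cb;
}

/* Trivial malloc-backed callbacks for the demo; they ignore the alignment request. */
static void *demo_alloc(void *user_data, size_t size, size_t alignment) {
    (void)user_data;
    (void)alignment;
    return malloc(size);
}

static void *demo_realloc(void *user_data, void *original, size_t size, size_t alignment) {
    (void)user_data;
    (void)alignment;
    return realloc(original, size);
}

static void demo_free(void *user_data, void *memory) {
    (void)user_data;
    free(memory);
}

/* Usage: an all-NULL struct is rejected, a fully populated one passes through. */
static void allocator_guard_demo(void) {
    struct alloc_callbacks empty = {0};
    struct alloc_callbacks full = {NULL, demo_alloc, demo_realloc, demo_free};
    assert(usable_allocator(&empty) == NULL);
    assert(usable_allocator(&full) == &full);
}
```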

@Kangz
Contributor

Kangz commented May 6, 2024

Thank you for the quick investigation! I can try a fixup commit when you have one to confirm if it fixes the issue!

@charles-lunarg
Collaborator Author

Yes, PR is up. Would be great if you could try it. #1481

@y-novikov
Contributor

Even after the fix above, ANGLE tests still have a problem on Linux Intel:
https://chromium-review.googlesource.com/c/angle/angle/+/5525404
https://ci.chromium.org/ui/p/angle/builders/try/linux-test/20294/overview
https://chromium-swarm.appspot.com/task?id=6970cff759810c10
[ RUN ] ContextLostSkipValidationTest.LostNoErrorGetProgram/ES3_Vulkan
angle::PrintStackBacktrace() at crash_handler_posix.cpp:498
angle::Handler(int) at crash_handler_posix.cpp:659
killpg at ??:?
explicit_bzero at strcmp-sse2-unaligned.S:31
unload_drivers_without_physical_devices at loader.c:6530
vkEnumeratePhysicalDevices at trampoline.c:?
rx::vk::Renderer::initialize(rx::vk::Context*, rx::vk::GlobalOps*, angle::vk::ICD, unsigned int, unsigned int, rx::vk::UseValidationLayers, char const*, char const*, angle::NativeWindowSystem, angle::FeatureOverrides const&) at vk_renderer.cpp:1956
rx::DisplayVk::initialize(egl::Display*) at DisplayVk.cpp:177
rx::DisplayVkXcb::initialize(egl::Display*) at DisplayVkXcb.cpp:65
isError at Error.inc:82
isError at Error.inc:82
EGL_Initialize at ??:?
EGLWindow::initializeDisplay(OSWindow*, angle::Library*, angle::GLESDriverType, EGLPlatformParameters const&) at EGLWindow.cpp:301
ANGLETestBase::ANGLETestSetUp() at ANGLETest.cpp:723
SetUp at ANGLETest.h:645
testing::Test::Run() at gtest.cc:2704
testing::TestInfo::Run() at gtest.cc:2888
testing::TestSuite::Run() at gtest.cc:3040
testing::internal::UnitTestImpl::RunAllTests() at gtest.cc:5898
testing::UnitTest::Run() at gtest.cc:5464
RUN_ALL_TESTS at gtest.h:2492
main at angle_end2end_tests_main.cpp:75
__libc_start_main at libc-start.c:344
_start at ??:?

@y-novikov
Contributor

I was able to repro with angle_end2end_tests --gtest_filter=VulkanExternalImageTest.TextureFormatCompatChromiumMutableNoStorageFd/ES3_Vulkan on Ubuntu 18.04, Intel UHD 630 GPU, Mesa 20.0.8.

Here is a better stack:
#0 __strcmp_sse2_unaligned () at ../sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S:31
#1 0x00007ffff0d23163 in unload_drivers_without_physical_devices () at ../../third_party/vulkan-deps/vulkan-loader/src/loader/loader.c:6530
#2 0x00007ffff0d27d5b in vkEnumeratePhysicalDevices () at ../../third_party/vulkan-deps/vulkan-loader/src/loader/trampoline.c:912
#3 0x0000555556e92eda in EnumeratePhysicalDevices () at ../../src/tests/test_utils/VulkanHelper.cpp:32
#4 initialize () at ../../src/tests/test_utils/VulkanHelper.cpp:238
#5 0x0000555556e5ac63 in RunTextureFormatCompatChromiumTest<angle::(anonymous namespace)::OpaqueFdTraits> () at ../../src/tests/gl_tests/VulkanExternalImageTest.cpp:519
#6 0x0000555556e5c03d in TestBody () at ../../src/tests/gl_tests/VulkanExternalImageTest.cpp:630
#7 0x0000555556ec5d0d in HandleExceptionsInMethodIfSupported<testing::Test, void> () at ../../third_party/googletest/src/googletest/src/gtest.cc:5158
#8 Run () at ../../third_party/googletest/src/googletest/src/gtest.cc:2706
#9 0x0000555556ec6c0c in Run () at ../../third_party/googletest/src/googletest/src/gtest.cc:2885
#10 0x0000555556ec75b7 in Run () at ../../third_party/googletest/src/googletest/src/gtest.cc:3039
#11 0x0000555556ed6337 in RunAllTests () at ../../third_party/googletest/src/googletest/src/gtest.cc:5897
#12 0x0000555556ed5c05 in HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> () at ../../third_party/googletest/src/googletest/src/gtest.cc:5158
#13 Run () at ../../third_party/googletest/src/googletest/src/gtest.cc:5464
#14 0x0000555556ea42fa in RUN_ALL_TESTS () at ../../third_party/googletest/src/googletest/include/gtest/gtest.h:2492
#15 run () at ../../src/tests/test_utils/runner/TestSuite.cpp:1660
#16 0x0000555556e4d233 in main () at ../../src/tests/angle_end2end_tests_main.cpp:75

This is the line where it crashes:
https://chromium.googlesource.com/external/github.com/KhronosGroup/Vulkan-Loader.git/+/e69a59a96b241038f24a0e425445d001ea099b2c/loader/loader.c#6530

I attach VK_LOADER_DEBUG log:
log.txt

It looks suspicious to me that there are 2 lines of:
INFO | DRIVER: Removing driver /usr/lib/x86_64-linux-gnu/libvulkan_radeon.so due to not having any physical devices

Does the loader remove the driver twice from the list?

@charles-lunarg
Collaborator Author

> It looks suspicious to me that there are 2 lines of:
> INFO | DRIVER: Removing driver /usr/lib/x86_64-linux-gnu/libvulkan_radeon.so due to not having any physical devices
>
> Does the loader remove the driver twice from the list?

No, that appears to be from multiple instances being created - as in the line appears in separate calls to EnumeratePhysicalDevices. That said, it could be that there are multiple createInstance calls which cause the preloaded ICDs array to be partially filled, leading to nullptr dereferences.

@charles-lunarg
Collaborator Author

Ah, so I've identified the 'bug' and why it doesn't crash locally on my machine: strcmp is UB when the passed-in pointers are NULL. UB includes "not crashing", and apparently the glibc on Ubuntu 18 behaves differently.

Thanks for reporting, I'll have a fix up shortly.
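
For illustration, a NULL-tolerant comparison could look like the sketch below. It is not the change that landed in the follow-up PR; `names_equal` and `find_name` are hypothetical helpers, and treating a NULL name as "no match" is a design choice made here for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Compares two names without ever handing NULL to strcmp.
 * A NULL on either side counts as "no match". */
static bool names_equal(const char *a, const char *b) {
    if (a == NULL || b == NULL) {
        return false;
    }
    return strcmp(a, b) == 0;
}

/* Searches an array that may contain NULL entries, such as a partially
 * filled list of library names. Returns the index of the first match,
 * or count when nothing matches. */
static size_t find_name(const char *const *names, size_t count, const char *wanted) {
    assert(names != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (names_equal(names[i], wanted)) {
            return i;
        }
    }
    return count;
}
```

With this shape, a partially filled array yields "not found" for its empty entries instead of relying on whatever a particular libc's strcmp happens to do with NULL.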

@charles-lunarg
Collaborator Author

PR is up - would be swell if you can double check but I'm going to merge when it passes CI regardless.
#1484

@y-novikov
Contributor

That worked!
https://chromium-review.googlesource.com/c/angle/angle/+/5525919 passed with that PR.
Thanks a lot for the quick fix!

Development

Successfully merging this pull request may close these issues.

Crash because of unload_drivers_without_physical_devices()