-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race condition in CUDA, ROCm, and TensorRT EP GetKernelRegistry() implementations. #10200
Fix race condition in CUDA, ROCm, and TensorRT EP GetKernelRegistry() implementations. #10200
Conversation
… implementations.
@@ -1208,841 +1209,841 @@ KernelCreateInfo BuildKernelCreateInfo<void>() { | |||
|
|||
static Status RegisterRocmKernels(KernelRegistry& kernel_registry) { | |||
static const BuildKernelCreateInfoFn function_table[] = { | |||
BuildKernelCreateInfo<void>, //default entry to avoid the list become empty after ops-reducing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lots of formatting changes in this file, can ignore whitespace when viewing diff
@@ -0,0 +1,46 @@ | |||
// Copyright (c) Microsoft Corporation. All rights reserved. | |||
// Licensed under the MIT License. | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or you may follow CPU EP's implementation. That one is thread-safe.
return registry; | ||
} | ||
|
||
static SharedPtrThreadSafeWrapper<KernelRegistry> s_kernel_registry{&CreateCudaKernelRegistry}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest not having global initializers. If you change it to function local, things will be much simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Function local static would be nice, but it needs to be reset by another function. @RyanUnderhill knows more about why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, in shared providers function local statics whose destructors depend on core code will be destroyed too late. They'll crash as the core code it depends on is already uninitialized. So we have to make them explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even it is function local, you still can call Reset() manually. Function local delays initiation to run time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you mean doing something like:
static std::shared_ptr<KernelRegistry>& CudaKernelRegistry() {
static std::shared_ptr<KernelRegistry> registry = ...
return registry;
}
std::shared_ptr<KernelRegistry> CUDAExecutionProvider::GetKernelRegistry() const {
return CudaKernelRegistry();
}
void Shutdown_DeleteRegistry() {
CudaKernelRegistry().reset();
}
But that has different behavior when calls to GetKernelRegistry() and Shutdown_DeleteRegistry() are interleaved. E.g., GetKernelRegistry() will return an empty shared_ptr after Shutdown_DeleteRegistry() is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Craigacp We can match the same behavior as C# for Java as well. Which part won't be a singleton like C#? C# does not allow the user to configure it on instantiation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer not to remove useful functionality like configuring a threadpool and setting the logging level. Switching it over to a singleton will mean the environment is constructed at an unpredictable time (as it's based on class initialization) and there's no potential for a user to close it to free up resources while keeping the JVM alive. A shutdown hook will clean up the environment most of the time (apart from when the JVM gets SIGKILLed or similar) and we'd need that for the singleton anyway. Allowing user controlled construction avoids the first problem while giving more flexibility.
What kind of things live inside the environment object? Is it worth allowing users to close it explicitly if they are finished with ORT, under the assumption that it throws some kind of exception if they try to make another one in the same process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've filed #10670 which switches the environment over so that only one can be created in the lifetime of the JVM (unless the user starts messing with class loaders), and its closed by a JVM shutdown hook. I left the close method in place for compatibility though it is now a no-op. If we want the environment to be closeable to release resources then I'll need to do some more refactoring as the current close idiom will be very confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In terms of resources the most important one that the Environment contains is the threadpool and that too only if you create the env with the global threadpool option. See here. Besides this there are global registrations for various schemas that can't be undone.
The OrtEnv obj exposed in the C API wraps the internal Environment class. Each time you request an OrtEnv object the refcount is incremented for the same instance in the process. We do provide a Release method that decrements the refcount and finally gets rid of the instance allowing another instance to be created in the same process. The Release method calls the OrtEnv destructor which unloads the shared libs.
The C# wrapper doesn't provide any of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used to have a similar refcount in the Java side and it would hand out the same Java reference to the native OrtEnv, but it would call the destructor when the count gets to zero and allow users to build a fresh environment after that, which is the source of the trouble. It doesn't look like there are too many resources held by it, so it's probably ok without an explicit close.
onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.cc
Outdated
Show resolved
Hide resolved
thread = std::thread{load_model_thread_fn}; | ||
} | ||
|
||
for (auto& thread : threads) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am concerned that in the event of the test failure, we won't get a good diagnostics since we are going to destroy unjoined and undetached threads. And that would call std::terminate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to catch exceptions in the thread function before, but it turns out that wasn't the only way it was failing. For example, sometimes debug assertions from the standard library would be hit which also stop the process. I didn't make any further attempt to fail gracefully. Do you have any suggestions?
This reverts commit 0a68c6a.
How is it going? I hope it will be in the upcoming release. |
…nel_registry_thread_safety
This reverts commit cc12013.
Description
Make GetKernelRegistry() kernel registry initialization thread-safe.
Motivation and Context
Fix #10179