Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition in CUDA, ROCm, and TensorRT EP GetKernelRegistry() implementations. #10200

Merged
merged 7 commits into from
Mar 2, 2022

Conversation

edgchen1
Copy link
Contributor

@edgchen1 edgchen1 commented Jan 5, 2022

Description
Make GetKernelRegistry() kernel registry initialization thread-safe.

Motivation and Context
Fix #10179

@@ -1208,841 +1209,841 @@ KernelCreateInfo BuildKernelCreateInfo<void>() {

static Status RegisterRocmKernels(KernelRegistry& kernel_registry) {
static const BuildKernelCreateInfoFn function_table[] = {
BuildKernelCreateInfo<void>, //default entry to avoid the list become empty after ops-reducing
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lots of formatting changes in this file, can ignore whitespace when viewing diff

@@ -0,0 +1,46 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or you may follow CPU EP's implementation. That one is thread-safe.

return registry;
}

static SharedPtrThreadSafeWrapper<KernelRegistry> s_kernel_registry{&CreateCudaKernelRegistry};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest not having global initializers. If you change it to function local, things will be much simpler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function local static would be nice, but it needs to be reset by another function. @RyanUnderhill knows more about why.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, in shared providers function local statics whose destructors depend on core code will be destroyed too late. They'll crash as the core code it depends on is already uninitialized. So we have to make them explicit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even it is function local, you still can call Reset() manually. Function local delays initiation to run time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you mean doing something like:

static std::shared_ptr<KernelRegistry>& CudaKernelRegistry() {
  static std::shared_ptr<KernelRegistry> registry = ...
  return registry;
}

std::shared_ptr<KernelRegistry> CUDAExecutionProvider::GetKernelRegistry() const {
  return CudaKernelRegistry();
}

void Shutdown_DeleteRegistry() {
  CudaKernelRegistry().reset();
}

But that has different behavior when calls to GetKernelRegistry() and Shutdown_DeleteRegistry() are interleaved. E.g., GetKernelRegistry() will return an empty shared_ptr after Shutdown_DeleteRegistry() is called.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Craigacp We can match the same behavior as C# for Java as well. Which part won't be a singleton like C#? C# does not allow the user to configure it on instantiation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer not to remove useful functionality like configuring a threadpool and setting the logging level. Switching it over to a singleton will mean the environment is constructed at an unpredictable time (as it's based on class initialization) and there's no potential for a user to close it to free up resources while keeping the JVM alive. A shutdown hook will clean up the environment most of the time (apart from when the JVM gets SIGKILLed or similar) and we'd need that for the singleton anyway. Allowing user controlled construction avoids the first problem while giving more flexibility.

What kind of things live inside the environment object? Is it worth allowing users to close it explicitly if they are finished with ORT, under the assumption that it throws some kind of exception if they try to make another one in the same process?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've filed #10670 which switches the environment over so that only one can be created in the lifetime of the JVM (unless the user starts messing with class loaders), and its closed by a JVM shutdown hook. I left the close method in place for compatibility though it is now a no-op. If we want the environment to be closeable to release resources then I'll need to do some more refactoring as the current close idiom will be very confusing.

Copy link
Contributor

@pranavsharma pranavsharma Feb 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of resources the most important one that the Environment contains is the threadpool and that too only if you create the env with the global threadpool option. See here. Besides this there are global registrations for various schemas that can't be undone.

The OrtEnv obj exposed in the C API wraps the internal Environment class. Each time you request an OrtEnv object the refcount is incremented for the same instance in the process. We do provide a Release method that decrements the refcount and finally gets rid of the instance allowing another instance to be created in the same process. The Release method calls the OrtEnv destructor which unloads the shared libs.

The C# wrapper doesn't provide any of this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used to have a similar refcount in the Java side and it would hand out the same Java reference to the native OrtEnv, but it would call the destructor when the count gets to zero and allow users to build a fresh environment after that, which is the source of the trouble. It doesn't look like there are too many resources held by it, so it's probably ok without an explicit close.

thread = std::thread{load_model_thread_fn};
}

for (auto& thread : threads) {
Copy link
Member

@yuslepukhin yuslepukhin Jan 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am concerned that in the event of the test failure, we won't get a good diagnostics since we are going to destroy unjoined and undetached threads. And that would call std::terminate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to catch exceptions in the thread function before, but it turns out that wasn't the only way it was failing. For example, sometimes debug assertions from the standard library would be hit which also stop the process. I didn't make any further attempt to fail gracefully. Do you have any suggestions?

snnn
snnn previously approved these changes Jan 7, 2022
pranavsharma
pranavsharma previously approved these changes Jan 7, 2022
@edgchen1 edgchen1 dismissed stale reviews from pranavsharma and snnn via cc12013 January 8, 2022 00:02
@snnn snnn self-requested a review January 12, 2022 22:11
@snnn
Copy link
Member

snnn commented Feb 23, 2022

How is it going? I hope it will be in the upcoming release.

@edgchen1 edgchen1 merged commit d07a237 into master Mar 2, 2022
@edgchen1 edgchen1 deleted the edgchen1/ep_get_kernel_registry_thread_safety branch March 2, 2022 01:54
chilo-ms added a commit that referenced this pull request Mar 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multiple ONNX models with individual threads using CUDA EP result in random crashes
6 participants