Fix race condition in CUDA, ROCm, and TensorRT EP GetKernelRegistry() implementations. #10200

edgchen1 · 2022-01-05T19:05:26Z

Description
Make GetKernelRegistry() kernel registry initialization thread-safe.

Motivation and Context
Fix #10179

edgchen1 · 2022-01-05T19:06:26Z

onnxruntime/core/providers/rocm/rocm_execution_provider.cc

@@ -1208,841 +1209,841 @@ KernelCreateInfo BuildKernelCreateInfo<void>() {

 static Status RegisterRocmKernels(KernelRegistry& kernel_registry) {
  static const BuildKernelCreateInfoFn function_table[] = {
-    BuildKernelCreateInfo<void>,  //default entry to avoid the list become empty after ops-reducing


There are lots of formatting changes in this file; you can ignore whitespace when viewing the diff.

snnn · 2022-01-05T20:17:40Z

onnxruntime/core/common/shared_ptr_thread_safe_wrapper.h

@@ -0,0 +1,46 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
+


Or you may follow CPU EP's implementation. That one is thread-safe.

snnn · 2022-01-05T20:18:38Z

onnxruntime/core/providers/cuda/cuda_execution_provider.cc

+  return registry;
+}
+
+static SharedPtrThreadSafeWrapper<KernelRegistry> s_kernel_registry{&CreateCudaKernelRegistry};


I would suggest not having global initializers. If you change it to function local, things will be much simpler.

edgchen1 · Jan 5, 2022

A function-local static would be nice, but it needs to be reset by another function. @RyanUnderhill knows more about why.

RyanUnderhill · Jan 5, 2022

Yep, in shared providers, function-local statics whose destructors depend on core code will be destroyed too late. They'll crash because the core code they depend on has already been torn down. So we have to make their destruction explicit.

snnn · Jan 5, 2022

Even if it is function-local, you can still call Reset() manually. A function-local static only delays initialization to run time.

edgchen1 · Jan 6, 2022

I guess you mean doing something like:

    static std::shared_ptr<KernelRegistry>& CudaKernelRegistry() {
      static std::shared_ptr<KernelRegistry> registry = ...;
      return registry;
    }

    std::shared_ptr<KernelRegistry> CUDAExecutionProvider::GetKernelRegistry() const {
      return CudaKernelRegistry();
    }

    void Shutdown_DeleteRegistry() {
      CudaKernelRegistry().reset();
    }

But that behaves differently when calls to GetKernelRegistry() and Shutdown_DeleteRegistry() are interleaved. E.g., GetKernelRegistry() will return an empty shared_ptr after Shutdown_DeleteRegistry() is called.
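To make the interleaving concern concrete, here is a minimal sketch contrasting the two approaches. `KernelRegistry` is an empty stand-in, and `LazySharedPtr` is a hypothetical mutex-guarded wrapper that re-creates its value on demand; the real `SharedPtrThreadSafeWrapper` header is not shown in this thread, so its behavior here is an assumption.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical stand-in for onnxruntime's KernelRegistry.
struct KernelRegistry {};

// Approach 1: function-local static. Initialization is thread-safe since
// C++11, but it runs only once, so after reset() the pointer stays empty.
// Note: calling reset() while another thread copies the pointer would be a
// data race; this sketch does not guard against that.
std::shared_ptr<KernelRegistry>& LocalStaticRegistry() {
  static std::shared_ptr<KernelRegistry> registry =
      std::make_shared<KernelRegistry>();
  return registry;
}

// Approach 2: hypothetical wrapper (an assumption, not the PR's actual
// class). A mutex guards both access and reset, and Get() re-creates the
// value whenever the pointer is empty.
template <typename T>
class LazySharedPtr {
 public:
  using CreateFn = std::shared_ptr<T> (*)();

  explicit LazySharedPtr(CreateFn create) : create_(create) {}

  std::shared_ptr<T> Get() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!ptr_) ptr_ = create_();
    return ptr_;
  }

  void Reset() {
    std::lock_guard<std::mutex> lock(mutex_);
    ptr_.reset();
  }

 private:
  CreateFn create_;
  std::mutex mutex_;
  std::shared_ptr<T> ptr_;
};

// Factory passed to the wrapper; hypothetical name.
std::shared_ptr<KernelRegistry> CreateRegistry() {
  return std::make_shared<KernelRegistry>();
}
```

With the function-local static, a caller that arrives after the shutdown function gets an empty pointer; with the wrapper as sketched, the same caller gets a freshly created registry. Which behavior is preferable depends on whether anything may legitimately ask for the registry after shutdown.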

pranavsharma · Feb 24, 2022

@Craigacp We can match the same behavior as C# for Java as well. Which part won't be a singleton like C#? C# does not allow the user to configure it on instantiation.

Craigacp · Feb 24, 2022

I'd prefer not to remove useful functionality like configuring a threadpool and setting the logging level. Switching it over to a singleton will mean the environment is constructed at an unpredictable time (as it's based on class initialization) and there's no way for a user to close it to free up resources while keeping the JVM alive. A shutdown hook will clean up the environment most of the time (apart from when the JVM gets SIGKILLed or similar), and we'd need that for the singleton anyway. Allowing user-controlled construction avoids the first problem while giving more flexibility.

What kind of things live inside the environment object? Is it worth allowing users to close it explicitly if they are finished with ORT, under the assumption that it throws some kind of exception if they try to make another one in the same process?

Craigacp · Feb 25, 2022

I've filed #10670, which switches the environment over so that only one can be created in the lifetime of the JVM (unless the user starts messing with class loaders), and it is closed by a JVM shutdown hook. I left the close method in place for compatibility, though it is now a no-op. If we want the environment to be closeable to release resources, then I'll need to do some more refactoring, as the current close idiom will be very confusing.

pranavsharma · Feb 25, 2022

In terms of resources, the most important one that the Environment contains is the threadpool, and only if you create the env with the global threadpool option. Besides this, there are global registrations for various schemas that can't be undone.

The OrtEnv object exposed in the C API wraps the internal Environment class. Each time you request an OrtEnv object, the refcount is incremented for the same instance in the process. We do provide a Release method that decrements the refcount and, once it reaches zero, gets rid of the instance, allowing another instance to be created in the same process. The Release method calls the OrtEnv destructor, which unloads the shared libs.

The C# wrapper doesn't provide any of this.
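The refcounting scheme described above can be illustrated with a small sketch. This is not the actual OrtEnv code: `Environment`, `AcquireEnv`, `ReleaseEnv`, and the `generation` field are hypothetical names, and the real implementation may differ in how it locks and what it tears down.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the internal Environment class.
struct Environment {
  int generation = 0;  // which instance this is, for illustration only
};

namespace {
std::mutex g_mutex;                  // guards all state below
std::unique_ptr<Environment> g_env;  // the single process-wide instance
int g_ref_count = 0;                 // number of outstanding handles
int g_generation = 0;                // counts how many instances were created
}  // namespace

// Returns the shared instance, creating it on the first request.
Environment* AcquireEnv() {
  std::lock_guard<std::mutex> lock(g_mutex);
  if (g_ref_count == 0) {
    g_env = std::make_unique<Environment>();
    g_env->generation = ++g_generation;
  }
  ++g_ref_count;
  return g_env.get();
}

// Drops one reference; the instance is destroyed when the count hits zero,
// after which a later AcquireEnv() builds a fresh one.
void ReleaseEnv() {
  std::lock_guard<std::mutex> lock(g_mutex);
  if (g_ref_count > 0 && --g_ref_count == 0) {
    g_env.reset();
  }
}
```

The last release is what makes a second instance possible within one process, which is the behavior a wrapper without an explicit close gives up.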

Craigacp · Feb 25, 2022

I used to have a similar refcount on the Java side, and it would hand out the same Java reference to the native OrtEnv, but it would call the destructor when the count got to zero and allow users to build a fresh environment after that, which is the source of the trouble. It doesn't look like there are too many resources held by it, so it's probably ok without an explicit close.

yuslepukhin · 2022-01-07T18:20:51Z

onnxruntime/test/shared_lib/test_inference.cc

+      thread = std::thread{load_model_thread_fn};
+    }
+
+    for (auto& thread : threads) {


I am concerned that in the event of a test failure, we won't get good diagnostics, since we would destroy threads that are neither joined nor detached, and that calls std::terminate.

edgchen1 · Jan 7, 2022

I tried to catch exceptions in the thread function before, but it turns out that wasn't the only way it was failing. For example, sometimes debug assertions from the standard library would be hit which also stop the process. I didn't make any further attempt to fail gracefully. Do you have any suggestions?
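One possible way to address the unjoined-thread concern is sketched below: an RAII guard joins every started thread on scope exit, and the first exception thrown by a worker is captured for the main thread to report. `JoinGuard` and `RunConcurrently` are hypothetical helpers, not part of the PR's test. As noted above, this does nothing for failures that abort the process directly, such as standard library debug assertions.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <mutex>
#include <stdexcept>
#include <thread>
#include <vector>

// Joins every joinable thread on scope exit, so an early return or an
// exception in the test body never destroys a joinable std::thread
// (which would call std::terminate).
class JoinGuard {
 public:
  explicit JoinGuard(std::vector<std::thread>& threads) : threads_(threads) {}
  ~JoinGuard() {
    for (auto& t : threads_) {
      if (t.joinable()) t.join();
    }
  }
  JoinGuard(const JoinGuard&) = delete;
  JoinGuard& operator=(const JoinGuard&) = delete;

 private:
  std::vector<std::thread>& threads_;
};

// Runs fn on num_threads threads and returns the first exception that any
// of them threw, or a null exception_ptr if all of them succeeded.
std::exception_ptr RunConcurrently(int num_threads,
                                   const std::function<void()>& fn) {
  std::mutex mutex;
  std::exception_ptr first_error;
  std::vector<std::thread> threads;
  threads.reserve(num_threads);
  {
    JoinGuard guard(threads);
    for (int i = 0; i < num_threads; ++i) {
      threads.emplace_back([&] {
        try {
          fn();
        } catch (...) {
          std::lock_guard<std::mutex> lock(mutex);
          if (!first_error) first_error = std::current_exception();
        }
      });
    }
  }  // guard joins all threads here, before first_error is read
  return first_error;
}
```

A test could then rethrow or inspect the returned `std::exception_ptr` on the main thread and fail with a normal assertion message.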

snnn · 2022-02-23T19:48:13Z

How is it going? I hope it will be in the upcoming release.

Fix race condition in CUDA, ROCm, and TensorRT EP GetKernelRegistry() implementations.

c73f198

edgchen1 requested review from pranavsharma and RyanUnderhill January 5, 2022 19:05

edgchen1 commented Jan 5, 2022

snnn reviewed Jan 5, 2022

onnxruntime/core/providers/cuda/cuda_execution_provider.cc (outdated, resolved)

snnn reviewed Jan 5, 2022

onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.cc (outdated, resolved)

edgchen1 added 2 commits January 6, 2022 16:07

Use function-local static variable instead.

0a68c6a

Merge remote-tracking branch 'origin/master' into edgchen1/ep_get_kernel_registry_thread_safety

794f3d3

yuslepukhin reviewed Jan 7, 2022

snnn previously approved these changes Jan 7, 2022

pranavsharma reviewed Jan 7, 2022

onnxruntime/core/providers/cuda/cuda_execution_provider.cc (resolved)

pranavsharma previously approved these changes Jan 7, 2022

Revert "Use function-local static variable instead."

cc12013

This reverts commit 0a68c6a.

edgchen1 dismissed stale reviews from pranavsharma and snnn via cc12013 January 8, 2022 00:02

snnn self-requested a review January 12, 2022 22:11

Merge branch 'master' into edgchen1/ep_get_kernel_registry_thread_safety

5b8db3f

Craigacp mentioned this pull request Feb 25, 2022

[java] Changes OrtEnvironment so it can't be closed by users #10670

Merged

edgchen1 added 2 commits March 1, 2022 11:35

Merge remote-tracking branch 'origin/master' into edgchen1/ep_get_kernel_registry_thread_safety

03c63af

Revert "Revert "Use function-local static variable instead.""

97e4c85

This reverts commit cc12013.

snnn approved these changes Mar 2, 2022

edgchen1 merged commit d07a237 into master Mar 2, 2022

edgchen1 deleted the edgchen1/ep_get_kernel_registry_thread_safety branch March 2, 2022 01:54

Craigacp mentioned this pull request Mar 8, 2022

Revert "[java] Changes OrtEnvironment so it can't be closed by users" #10808

Closed

chilo-ms added a commit that referenced this pull request Mar 8, 2022

Revert "Fix race condition in CUDA, ROCm, and TensorRT EP GetKernelRegistry() implementations. (#10200)"

04e91fb

This reverts commit d07a237.

edgchen1 mentioned this pull request Feb 23, 2023

[C#] Allow passing various options when creating singleton Environment object. #14723

Merged
