Inconsistent inference timing on CPU #10270
Comments
Is your intent to use only one thread for computation while your CPU has 8 cores? You can tune intra_op_num_threads (e.g. 4) and inter_op_num_threads to see whether there are fewer spikes.
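For example, a minimal sketch of tuning those knobs via the C++ SessionOptions (the thread counts here are only illustrative starting points):

```cpp
#include <onnxruntime_cxx_api.h>

Ort::SessionOptions so;
so.SetIntraOpNumThreads(4);           // threads used to parallelize work inside a single operator
so.SetInterOpNumThreads(1);           // threads used to run independent operators concurrently
so.SetExecutionMode(ORT_SEQUENTIAL);  // run nodes one after another (default)
```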
@tianleiwu the intention is to run in a single thread on a single core. I ran benchmarks using the options you suggested but I still see these inference time spikes. I believe this is due to memory allocations that occur during the forward inference call. Is there a way to avoid non-deterministic operations (such as memory allocations) during inference?
@jamjambles, you can tune session options to see whether they help. There are some optimizations (like Reshape, etc.) in the master branch that could reduce memory allocation, so you can try a nightly build. If the problem still exists, please share your model and test script so that someone can help debug the issue.
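For reference, graph optimizations can be requested explicitly on the session options (a hedged sketch; whether the allocation-reducing optimizations mentioned above apply depends on the model and the build):

```cpp
Ort::SessionOptions so;
so.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);  // enable all available graph-level optimizations
// so.SetOptimizedModelFilePath(ORT_TSTR("optimized_model.onnx"));     // optionally dump the optimized graph to inspect it
```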
@tianleiwu thanks for getting back to me! Looking at the API docs I think the following options are related to memory allocation:
Are you able to tell me how to set these in order to minimise the number of memory allocations? I tried the nightly build and the spikes are still occurring.
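For context, a minimal sketch of the kind of memory-related toggles meant here, assuming the C++ SessionOptions API (the exact option list above did not survive formatting, so these are only the usual suspects):

```cpp
Ort::SessionOptions so;
so.EnableCpuMemArena();   // or DisableCpuMemArena(): use (or skip) the CPU memory arena for activations
so.EnableMemPattern();    // or DisableMemPattern(): pre-plan allocations from observed memory-usage patterns
```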
This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Hi @tianleiwu!
Also, @jamjambles, have you managed to solve this instability issue? I have opened a similar issue about allocations when running with the C++ API: #17758
@SolomidHero, to reduce memory allocation, you can try the following:

- Tune the arena settings (see the C++ API: https://onnxruntime.ai/docs/api/c/struct_ort_1_1_arena_cfg.html), e.g. set arena_extend_strategy to 1 (kSameAsRequested). That avoids the arena using more memory than needed. If you want to avoid allocations happening at all, you can set a large initial arena buffer.
- If you have multiple sessions, see https://fs-eire.github.io/onnxruntime/docs/performance/tune-performance/memory.html. You can share the arena among sessions to reduce memory.
- You can also try mimalloc to see whether it helps.
- If your input/output tensors are large, use I/O Binding (https://fs-eire.github.io/onnxruntime/docs/performance/tune-performance/iobinding.html); that saves the extra memory otherwise needed and avoids data copies, thus reducing latency (a sketch follows below).

For the instability:

- https://fs-eire.github.io/onnxruntime/docs/performance/tune-performance/troubleshooting.html also mentions settings you can try when latency has high variance, as in this issue.
- If your machine has NUMA (usually machines with more than 32 cores), try setting thread affinity to pin to some CPU cores.
- You can also build from source with a flag.
- If your application mixes PyTorch, NumPy and ONNX Runtime, try to move everything (like preprocessing and postprocessing) into ONNX Runtime custom ops, because each of them has its own thread pool and memory management, and thread switches introduce extra latency.
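A minimal sketch of the I/O Binding suggestion in C++, assuming a model with one input named "x" and one output named "y" (names, shapes and the model path are placeholders):

```cpp
#include <onnxruntime_cxx_api.h>
#include <array>
#include <vector>

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "iobinding_sketch");
Ort::SessionOptions so;
Ort::Session session(env, ORT_TSTR("model.onnx"), so);

Ort::MemoryInfo cpu_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);

// Allocate the input buffer once and reuse it on every call.
std::vector<float> input_data(64, 0.0f);
std::array<int64_t, 2> input_shape{1, 64};
Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
    cpu_info, input_data.data(), input_data.size(), input_shape.data(), input_shape.size());

Ort::IoBinding binding(session);
binding.BindInput("x", input_tensor);
binding.BindOutput("y", cpu_info);   // let ORT place the output on CPU and reuse the buffer across runs

Ort::RunOptions run_options;
session.Run(run_options, binding);   // per-call: no tensor re-creation or extra copies
std::vector<Ort::Value> outputs = binding.GetOutputValues();
```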
@tianleiwu Thank you for such a fast response!
Can you confirm that the options (1-4) you described are also helpful for a sequential setup to fix these instabilities?
EnableCpuMemArena();
EnableMemPattern();
@SolomidHero, I think "session.dynamic_block_base" and "session.intra_op.allow_spinning" have an impact on OS thread scheduling, so you can give them a try (see the sketch after this comment). For multiple sessions, I recommend trying one process per session, and using affinity to pin each process to some CPU cores. Without this, your thread might be moved by the OS to another core, which might cause latency spikes.

Latency spikes have many causes; you will need to monitor system-level CPU, memory and process status:

- Sometimes overheating causes the OS to throttle CPU/GPU speed (but that has an impact over a longer period, so it is likely not the cause here).
- If OS-level memory usage is high, memory paging might cause latency spikes.
- Sometimes a thread is suspended by the OS for some reason; check the system event logs to see whether anything happened at that time.
- You may add some logging to see whether the input shape is large (if your input has dynamic shape), or whether there was no input for some period so the session became "cold" and a new request takes longer.

If no cause is found using the above approaches, I think investigating the root cause might need some advanced profiling tools.
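A hedged sketch of trying those two config entries through the C++ SessionOptions (the values are illustrative starting points, not recommendations):

```cpp
Ort::SessionOptions so;
so.AddConfigEntry("session.intra_op.allow_spinning", "0");  // stop intra-op threads from busy-waiting between runs
so.AddConfigEntry("session.dynamic_block_base", "4");       // change how intra-op work is partitioned into blocks
```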
@tianleiwu I am quite confused about the existence of the thread pool. From your words it seems that in the setup "intra_op=1, inter_op=1 and sequential execution" there must still be some new threads that can be managed by these options. But from @snnn's answer in #11347 I assume no thread is created.

Some analysis

I also tried to profile a setup of three models running inference sequentially, with the calls equally spaced in time. Say I wanted to run at 50 Hz on a single thread. Here are my results:
env = Ort::Env(ORT_LOGGING_LEVEL_WARNING, "common_env");
OrtThreadingOptions* tp_options = nullptr;
Ort::GetApi().CreateThreadingOptions(&tp_options);
Ort::GetApi().SetGlobalIntraOpNumThreads(tp_options, 1);
Ort::GetApi().SetGlobalInterOpNumThreads(tp_options, 1);
Ort::GetApi().SetGlobalSpinControl(tp_options, 0); // disabled, since spinning gives high CPU usage and we don't need it
env = Ort::Env(tp_options, ORT_LOGGING_LEVEL_WARNING, "global_env");
Ort::GetApi().ReleaseThreadingOptions(tp_options);

See also the distribution difference. I just wanted to share this result. For me it is clear that (if @snnn is correct about no new thread being created in the "intra_op=1, inter_op=1, sequential execution" setup) the only difference is that (1) runs without any new threads, while (2) uses a single worker thread for processing the models and spends a lot of time synchronizing with the main thread.

p.s. I used
@SolomidHero, for the global thread pool, it is right that no thread pool is created since the number of threads is set to 1, and each node is executed directly in the main thread. If you are interested in the details of the thread pool, see the following code (all at commit 7417fd4):

- onnxruntime/onnxruntime/core/util/thread_utils.cc, lines 87 to 89
- onnxruntime/onnxruntime/core/session/inference_session.h, lines 556 to 566
- onnxruntime/onnxruntime/core/session/environment.cc, lines 203 to 215
- onnxruntime/onnxruntime/core/session/inference_session.cc, lines 364 to 406
- onnxruntime/include/onnxruntime/core/platform/threadpool.h, lines 228 to 235
I was reading the code, but I also tried to get an improvement empirically. These are the session options I used:

SetInterOpNumThreads(1);
SetIntraOpNumThreads(1);
EnableCpuMemArena(); // DisableCpuMemArena();
EnableMemPattern(); // DisableMemPattern();
AddConfigEntry("session.use_env_allocators", "1");
env = std::make_unique<Ort::Env>(ORT_LOGGING_LEVEL_WARNING, "common_env");
memoryInfo = std::make_unique<Ort::MemoryInfo>(Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeCPU));
auto& api = Ort::GetApi();
const char* keys[] = {
"max_mem",
"arena_extend_strategy",
"initial_chunk_size_bytes",
"max_dead_bytes_per_chunk",
"initial_growth_chunk_size_bytes"
};
const size_t values[] = {
    0,   // max_mem: 0 means no limit
    1,   // arena_extend_strategy: kNextPowerOfTwo==0, kSameAsRequested==1 (I tried both 0 and 1)
    0,   // initial_chunk_size_bytes (also tried 300 * 1024 * 1024 and 1024 * 1024)
    0,   // max_dead_bytes_per_chunk
    0    // initial_growth_chunk_size_bytes (also tried 1024 * 256)
};
OrtArenaCfg* arena_cfg;
auto status = api.CreateArenaCfgV2(keys, values, sizeof(values) / sizeof(size_t), &arena_cfg);
env->CreateAndRegisterAllocator(*memoryInfo, arena_cfg);

In the comments are the values and options that I tried. Is it possible that the shared allocator (and its env options) is not used when there is a sequential setup? I am still reading the code to find answers 😔
I log the inference time myself. When I call the EnableProfiling function, I can reproduce the issue; but when I disable profiling, it is fine. Hope the information is helpful for you.
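For reference, a minimal sketch of toggling profiling via the C++ SessionOptions (the file prefix is a placeholder; the poster's API/language isn't stated above):

```cpp
Ort::SessionOptions so;
so.EnableProfiling(ORT_TSTR("ort_profile"));  // writes a JSON trace per session; the bookkeeping can itself add latency
// so.DisableProfiling();                     // or simply omit EnableProfiling for latency-sensitive runs
```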
Describe the bug
I am deploying a model using ONNX Runtime on CPU, in an environment with a hard real-time budget of 20 ms. The average inference time of the model is ~2 ms. I have observed that the model inference time sometimes spikes to well over 20 ms, which is not acceptable in the deployment environment. More generally, there is quite a lot of variance in timing between inference calls, which is also a concern.
System information
Additional context
These are the session options used to initialise the ORT inference session
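(The exact snippet did not survive formatting. As a purely hypothetical reconstruction of a single-threaded CPU setup of this kind, for illustration only:)

```cpp
#include <onnxruntime_cxx_api.h>

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "realtime_example");
Ort::SessionOptions so;
so.SetIntraOpNumThreads(1);
so.SetInterOpNumThreads(1);
so.SetExecutionMode(ORT_SEQUENTIAL);
Ort::Session session(env, ORT_TSTR("model.onnx"), so);  // placeholder model path
```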
Below is an example of inference times per inference call. This shows the inconsistent inference timing and spikes:


Here is a trace using the profiling tool from onnxruntime; again we can see where the inference locks up, causing a spike:
I am looking to learn whether onnxruntime is appropriate for such a use-case. Are there any other options or configurations to ensure stability in inference time?