Inconsistent inference timing on CPU #10270

jamjambles · 2022-01-13T07:01:07Z

Describe the bug
I am deploying a model using ONNXRuntime on CPU. It is in an environment with a hard real-time budget of 20ms. The average inference time of the model is ~2ms. I have observed that sometimes the model inference time spikes to >> 20ms which is not acceptable in the deployment environment. More generally there is quite a lot of variance between inference calls in terms of timing which is also a concern.

System information

Mac OSX 10.15.7
Processor: 2.4 GHz 8-Core Intel Core i9
ONNX Runtime version: 1.10.2
Python version: 3.8.12

Additional context
These are the session options used to initialise the ORT inference session

    sess_options = onnxruntime.SessionOptions()
    sess_options.intra_op_num_threads = 1
    sess_options.inter_op_num_threads = 1
    sess_options.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
    sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL

Below is an example of inference times per inference call. This shows the inconsistent inference timing and spikes:

Here is a trace using the profiling tool from onnxruntime, again we can see where the inference locks-up causing a spike:

I am looking to learn whether onnxruntime is appropriate for such a use-case. Are there any other options or configurations to ensure stability in inference time?

The text was updated successfully, but these errors were encountered:

tianleiwu · 2022-01-13T19:01:41Z

Is your intend to use only one thread for computation while your CPU has 8 cores? You can tune intra_op_num_threads (like 4) and inter_op_num_threads to see whether there is less spike.

jamjambles · 2022-01-19T22:19:57Z

@tianleiwu the intention is to run in a single thread on a single core. I ran benchmarking using the options you suggested but I still see these inference time spikes occurring.

I believe this is due to memory allocs which occur during the forward inference call. Is there a way to avoid non deterministic operations (such as memory allocs) during inference ?

tianleiwu · 2022-01-19T23:22:06Z

@jamjambles, you can tune session options to see whether it could help:
http://www.xavierdupre.fr/app/onnxruntime/helpsphinx/api_summary.html#sessionoptions

There are some optimization (like Reshape etc) in master branch that could reduce memory allocation. You can try nightly build
https://test.pypi.org/project/ort-nightly/ to see whether it could improve it.

If the problem still exists, please share your model and test script so that someone could help debugging the issue.

jamjambles · 2022-01-20T00:04:48Z

@tianleiwu thanks for getting back to me!

Looking at the API docs I think the following options are related to memory allocation:

sess_options.use_deterministic_compute
sess_options.enable_mem_reuse
sess_options.enable_cpu_mem_arena
sess_options.enable_mem_pattern

Are you able to tell me how to set these in order to minimise the number of memory allocations?

I tried the nightly build and the spikes are still occurring.

stale · 2022-04-16T07:54:49Z

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

SolomidHero · 2023-10-05T14:11:37Z

Hi @tianleiwu!
I would like to know answers to these questions too

Are you able to tell me how to set these in order to minimise the number of memory allocations?

Also, @jamjambles, have you managed to solve this issue about such instability?

I have opened similar issue about allocations when running with c++ api #17758

tianleiwu · 2023-10-05T15:15:56Z

@SolomidHero, to reduce memory allocation, you can try set arena settings (see C++ API https://onnxruntime.ai/docs/api/c/struct_ort_1_1_arena_cfg.html), like arena_extend_strategy to be 1 (kSameAsRequested). It could avoid Arena using more memory than needed. If you want to avoid allocation happens, you can set a large initial arena buffer.

See https://fs-eire.github.io/onnxruntime/docs/performance/tune-performance/memory.html if you have multiple sessions. You can share Arena among multiple sessions to reduce memory. You can also try mimalloc to see whether it can help.

If your input/output tensor is large, please use I/O Binding (https://fs-eire.github.io/onnxruntime/docs/performance/tune-performance/iobinding.html), that could save extra memory needed and avoid data copy (thus reduce latency).

For instability:

In https://fs-eire.github.io/onnxruntime/docs/performance/tune-performance/troubleshooting.html, it also mentioned that when latency has high variance like this issue, you can try settings like sess_options.add_session_config_entry('session.dynamic_block_base', '4')
Or try change enable/disable spinning:
sess_options.add_session_config_entry("session.intra_op.allow_spinning", "1")

If your machine has NUMA (usually in machine with > 32 cores), try set thread affinity to pin to some CPU cores:
https://fs-eire.github.io/onnxruntime/docs/performance/tune-performance/threading.html#set-intra-op-thread-affinity.

You can also build from source with a flag --use_lock_free_queue to see whether it could help.

If your application has mixed use of PyTorch, Numpy and OnnxRuntime, try to move everything (like preprocessing and postprocessing) to OnnxRuntime custom op. It is because each of them has their own thread pool and memory management. Thread switch will introduce extra latency.

SolomidHero · 2023-10-05T15:58:38Z

@tianleiwu Thank you for such fast response!
When such options are useful for multi-threaded environment I am trying to resolve instabilities for sequential inference.
Say base config is same as for topic starter:

sess_options.intra_op_num_threads = 1
sess_options.inter_op_num_threads = 1
sess_options.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
# and CPU as only execution provider

Can you confirm that following options (1-4) you described are helpful for sequential setup too to fix instabilities?

arena_extend_strategy for excessive amount of memory arena so no additional allocation needed
I personally use:

EnableCpuMemArena();
EnableMemPattern();

"sharing allocator(s) between sessions" - don't think it could really impact additional allocs/deallocs when several sessions are defined in isolation. I.e. I guess it doesn't affect instabilities, but memory size only. So it is good technique to only reduce size, right?
sequential -> no thread pool -> no impact for "session.dynamic_block_base" and "session.intra_op.allow_spinning" on instabilities in sequential setup, right?
From this issue Possible to run inference purely sequentially and not have onnx pin a thread #11347 I assume doing intra_op=1, inter_op=1 and sequential does not pin additional thread, what is great I think. But execution time variance is still high though as in issue opening message.
io binding used successfully, it is really useful thing :) But seems like instabilities still there

tianleiwu · 2023-10-05T16:58:48Z

@SolomidHero, I think "session.dynamic_block_base" and "session.intra_op.allow_spinning" has impact on OS thread scheduling, and you can have a try.

For multiple sessions, I recommend to try using one process per session, and use affinity to pin each process to some CPU cores. Without this, your thread might be moved by OS to another core, that might cause latency spiking.

Latency spiking has many causes, you will need monitor system level CPU, memory and process status;

Sometime, overheating could cause OS throttling CPU/GPU speed (but that will have impact on longer time so it is likely the cause).

If OS level memory usage is high, memory paging might cause latency spiking.

Sometime, thread is suspended by OS for some reason. check system event logs to see anything happens at the time.

You may add some logging to see whether the input shape is large if your input has dynamic shape. Or whether there is no input for some period so session becomes "cold", then new request will take longer time.

If no causes found using the above ways, I think investigating the root cause might need some advanced profiling tools.

SolomidHero · 2023-10-05T17:43:14Z

@tianleiwu I am quite confused about existence of thread pool. From your words it is assumed that in setup "intra_op=1, inter_op=1 and sequential execution" - there must be some new threads that can be managed by some options. But from @snnn answer in #11347 I assume no thread created

Some analysis

I also tried to profile setup of three sequential model running inference in a equally distant time calls. Say I wanted to run in 50Hz on single thread. Here is my results:

isolated - same env for all sessions - no global thread pool

env = Ort::Env(ORT_LOGGING_LEVEL_WARNING, "common_env");

global - same env for all sessions - env with global thread pool

OrtThreadingOptions* tp_options = nullptr;
Ort::GetApi().CreateThreadingOptions(&tp_options);
Ort::GetApi().SetGlobalIntraOpNumThreads(tp_options, 1);
Ort::GetApi().SetGlobalInterOpNumThreads(tp_options, 1);
Ort::GetApi().SetGlobalSpinControl(tp_options, 0); // i disabled it since it gives high cpu usage, and we don't need such
env = Ort::Env(tp_options, ORT_LOGGING_LEVEL_WARNING, "global_env");
Ort::GetApi().ReleaseThreadingOptions(tp_options);

See also distribution difference:

I just wanted to share such result. For me it is clear that (if @snnn is correct on "no new thread in intra_op=1, inter_op=1 and sequential execution setup") only difference is (1) is without any new threads and (2) uses single thread for processing models and spends a lot time on synchronizing with main thread.
I also don't know if there any other difference. For example, in behaviour of memory management. Can you shed light on this diff? Can you explain why in isolated there might be two-modal kde plot?

p.s. I used onnxruntime==1.14.1 on macos

tianleiwu · 2023-10-06T00:18:11Z

@SolomidHero, For the global thread pool, it is right that there is no thread pool created since the setting of number of threads is 1. And node is executed directly in the main thread.

If you are interested in the detail of thread pool see the below code:

onnxruntime/onnxruntime/core/util/thread_utils.cc

Lines 87 to 89 in 7417fd4

    
           if (options.thread_pool_size <= 1) { 
        
             return nullptr; 
        
           }

onnxruntime/onnxruntime/core/session/inference_session.h

Lines 556 to 566 in 7417fd4

    
           onnxruntime::concurrency::ThreadPool* GetIntraOpThreadPoolToUse() const { 
        
             if (session_options_.use_per_session_threads) { 
        
               if (external_intra_op_thread_pool_) { 
        
                 return external_intra_op_thread_pool_; 
        
               } else { 
        
                 return thread_pool_.get(); 
        
               } 
        
             } else { 
        
               return intra_op_thread_pool_from_env_; 
        
             } 
        
           }

onnxruntime/onnxruntime/core/session/environment.cc

Lines 203 to 215 in 7417fd4

    
           if (create_global_thread_pools) { 
        
             create_global_thread_pools_ = true; 
        
             OrtThreadPoolParams to = tp_options->intra_op_thread_pool_params; 
        
             if (to.name == nullptr) { 
        
               to.name = ORT_TSTR("intra-op"); 
        
             } 
        
             intra_op_thread_pool_ = concurrency::CreateThreadPool(&Env::Default(), to, concurrency::ThreadPoolType::INTRA_OP); 
        
             to = tp_options->inter_op_thread_pool_params; 
        
             if (to.name == nullptr) { 
        
               to.name = ORT_TSTR("inter-op"); 
        
             } 
        
             inter_op_thread_pool_ = concurrency::CreateThreadPool(&Env::Default(), to, concurrency::ThreadPoolType::INTER_OP); 
        
           }

onnxruntime/onnxruntime/core/session/inference_session.cc

Lines 364 to 406 in 7417fd4

    
           use_per_session_threads_ = session_options.use_per_session_threads; 
        
           force_spinning_stop_between_runs_ = session_options_.config_options.GetConfigOrDefault(kOrtSessionOptionsConfigForceSpinningStop, "0") == "1"; 
        
           if (use_per_session_threads_) { 
        
             LOGS(*session_logger_, INFO) << "Creating and using per session threadpools since use_per_session_threads_ is true"; 
        
             { 
        
               if (!external_intra_op_thread_pool_) { 
        
                 bool allow_intra_op_spinning = 
        
                     session_options_.config_options.GetConfigOrDefault(kOrtSessionOptionsConfigAllowIntraOpSpinning, "1") == "1"; 
        
                 OrtThreadPoolParams to = session_options_.intra_op_param; 
        
                 std::basic_stringstream<ORTCHAR_T> ss; 
        
                 if (to.name) { 
        
                   ss << to.name << ORT_TSTR("-"); 
        
                 } 
        
                 ss << ORT_TSTR("session-") << session_id_ << ORT_TSTR("-intra-op"); 
        
                 thread_pool_name_ = ss.str(); 
        
                 to.name = thread_pool_name_.c_str(); 
        
                 to.set_denormal_as_zero = set_denormal_as_zero; 
        
                 // If the thread pool can use all the processors, then 
        
                 // we set affinity of each thread to each processor. 
        
                 to.allow_spinning = allow_intra_op_spinning; 
        
                 to.dynamic_block_base_ = std::stoi(session_options_.config_options.GetConfigOrDefault(kOrtSessionOptionsConfigDynamicBlockBase, "0")); 
        
                 LOGS(*session_logger_, INFO) << "Dynamic block base set to " << to.dynamic_block_base_; 
        
                 // Set custom threading functions 
        
                 to.custom_create_thread_fn = session_options_.custom_create_thread_fn; 
        
                 to.custom_thread_creation_options = session_options.custom_thread_creation_options; 
        
                 to.custom_join_thread_fn = session_options_.custom_join_thread_fn; 
        
                 if (session_options_.config_options.TryGetConfigEntry(kOrtSessionOptionsConfigIntraOpThreadAffinities, to.affinity_str)) { 
        
                   ORT_ENFORCE(!to.affinity_str.empty(), "Affinity string must not be empty"); 
        
                 } 
        
                 to.auto_set_affinity = to.thread_pool_size == 0 && 
        
                                        session_options_.execution_mode == ExecutionMode::ORT_SEQUENTIAL && 
        
                                        to.affinity_str.empty(); 
        
                 if (to.custom_create_thread_fn) { 
        
                   ORT_ENFORCE(to.custom_join_thread_fn, "custom join thread function not set for intra op thread pool"); 
        
                 } 
        
                 thread_pool_ = 
        
                     concurrency::CreateThreadPool(&Env::Default(), to, concurrency::ThreadPoolType::INTRA_OP); 
        
               } 
        
             }

onnxruntime/include/onnxruntime/core/platform/threadpool.h

Lines 228 to 235 in 7417fd4

    
           static void Schedule(ThreadPool* tp, 
        
                                std::function<void()> fn) { 
        
             if (tp) { 
        
               tp->Schedule(fn); 
        
             } else { 
        
               fn(); 
        
             } 
        
           }

SolomidHero · 2023-10-10T13:45:57Z

I was reading code, but also tried to empirically get improvement.
So I tried different session options and options for allocator:

SetInterOpNumThreads(1);
SetIntraOpNumThreads(1);
EnableCpuMemArena(); // DisableCpuMemArena();
EnableMemPattern(); // DisableMemPattern();

AddConfigEntry("session.use_env_allocators", "1");

env = std::make_unique<Ort::Env>(std::move(Ort::Env(ORT_LOGGING_LEVEL_WARNING, "common_env")));
memoryInfo = std::make_unique<Ort::MemoryInfo>(std::move(Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeCPU)));
auto& api = Ort::GetApi();

const char* keys[] = {
    "max_mem",
    "arena_extend_strategy",
    "initial_chunk_size_bytes",
    "max_dead_bytes_per_chunk",
    "initial_growth_chunk_size_bytes"
};
const size_t values[] = {
    0,
    1, // kNextPowerOfTwo==0, kSameAsRequested==1 // i tried both 0 and 1
    0, // 300 * 1024 * 1024, // 1024 * 1024,
    // 0, // 0,
    // 0, // 1024 * 256,
};
OrtArenaCfg* arena_cfg;
auto status = api.CreateArenaCfgV2(keys, values, sizeof(values) / sizeof(size_t), &arena_cfg);
env->CreateAndRegisterAllocator(*memoryInfo, arena_cfg);

In the comments are values and options that i tried.
So i created 3 sessions of different models on same env and memoryInfo. And I couldn't get different results! My program alway was consuming 201-204MB of data and processing time in loop was always about the same mean and std (so statically insignificant).
So even MemPattern and CpuMemArena has no impact

Is it possible that shared allocator might be not used and its env options are not used too when there is sequential setup? I still read code to find answers 😔

jiangfeizi · 2024-11-27T01:57:10Z

I log the inference time by myself. When I call enableprofiling func, i can reproduce the issue. But when i disableprofiling it is good. Hope the information is helpful for you.

tianleiwu added the type:performance label Jan 13, 2022

stale bot added the stale issues that have not been addressed in a while; categorized by a bot label Apr 16, 2022

sophies927 added core runtime issues related to core runtime and removed type:performance labels Aug 12, 2022

stale bot removed the stale issues that have not been addressed in a while; categorized by a bot label Aug 12, 2022

SolomidHero mentioned this issue Oct 10, 2023

Segfault on session creation with custom MockedOrtAllocator in MlasSgemmCopyPackB #17867

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent inference timing on CPU #10270

Inconsistent inference timing on CPU #10270

jamjambles commented Jan 13, 2022

tianleiwu commented Jan 13, 2022

jamjambles commented Jan 19, 2022

tianleiwu commented Jan 19, 2022

jamjambles commented Jan 20, 2022

stale bot commented Apr 16, 2022

SolomidHero commented Oct 5, 2023

tianleiwu commented Oct 5, 2023 •

edited

Loading

SolomidHero commented Oct 5, 2023

tianleiwu commented Oct 5, 2023 •

edited

Loading

SolomidHero commented Oct 5, 2023 •

edited

Loading

tianleiwu commented Oct 6, 2023

SolomidHero commented Oct 10, 2023 •

edited

Loading

jiangfeizi commented Nov 27, 2024 •

edited

Loading

Inconsistent inference timing on CPU #10270

Inconsistent inference timing on CPU #10270

Comments

jamjambles commented Jan 13, 2022

tianleiwu commented Jan 13, 2022

jamjambles commented Jan 19, 2022

tianleiwu commented Jan 19, 2022

jamjambles commented Jan 20, 2022

stale bot commented Apr 16, 2022

SolomidHero commented Oct 5, 2023

tianleiwu commented Oct 5, 2023 • edited Loading

SolomidHero commented Oct 5, 2023

tianleiwu commented Oct 5, 2023 • edited Loading

SolomidHero commented Oct 5, 2023 • edited Loading

Some analysis

tianleiwu commented Oct 6, 2023

SolomidHero commented Oct 10, 2023 • edited Loading

jiangfeizi commented Nov 27, 2024 • edited Loading

tianleiwu commented Oct 5, 2023 •

edited

Loading

tianleiwu commented Oct 5, 2023 •

edited

Loading

SolomidHero commented Oct 5, 2023 •

edited

Loading

SolomidHero commented Oct 10, 2023 •

edited

Loading

jiangfeizi commented Nov 27, 2024 •

edited

Loading