
Why is the performance of the mkl_conv in the torch test worse than that of the benchdnn test? #2672

Open
Serenagirl opened this issue Feb 12, 2025 · 4 comments
Labels: platform:cpu-aarch64 (Codeowner: @oneapi-src/onednn-cpu-aarch64), question


Serenagirl commented Feb 12, 2025

I tested aten::mkldnn_convolution in PyTorch:

import torch
import torch.nn.functional as F
from torch.profiler import profile, record_function, ProfilerActivity

input_tensor = torch.randn(1, 64, 170, 256)  # [batch_size, in_channels, height, width]
weight_tensor = torch.randn(64, 64, 3, 3)    # [out_channels, in_channels, kernel_height, kernel_width]
bias_tensor = torch.randn(64)               # [out_channels]

stride = (1, 1)
padding = (1, 1)
dilation = (1, 1)
groups = 1

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("aten::mkldnn_convolution"):
        output = F.conv2d(input_tensor, weight_tensor, bias_tensor, stride, padding, dilation, groups)
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

and got a total time of 144 ms, but when I tested the same convolution in benchdnn:
OMP_NUM_THREADS=1 taskset -c 0 ./benchdnn --conv --mode=P --dt=f32 --dir=FWD_B --mb=1 ic64ih170iw256oc64oh168ow254kh3kw3sh1sw1dh1dw1ph1pw1

total perf: min(ms):349.276 avg(ms):349.385
total: 3.54s; fill: 0.05s (1%);

Why is there such a difference?
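For reference, a rough back-of-the-envelope FLOP count for this shape (my own arithmetic, counting 2 ops per multiply-accumulate and including padded taps, which is why it lands slightly above the 3.125 Gops benchdnn itself reports) puts the two timings at very different throughputs:

```python
# Approximate FLOP count for conv: mb1, ic64, oc64, oh168, ow254, kh3, kw3
oc, oh, ow, ic, kh, kw = 64, 168, 254, 64, 3, 3
flops = 2 * oc * oh * ow * ic * kh * kw  # multiply + add per filter tap
print(f"{flops / 1e9:.3f} GFLOP")                      # ~3.146 GFLOP
print(f"PyTorch : {flops / 0.144 / 1e9:.2f} GFLOP/s")  # ~21.8 at 144 ms
print(f"benchdnn: {flops / 0.349 / 1e9:.2f} GFLOP/s")  # ~9.0  at 349 ms
```

So the PyTorch run sustains roughly 2.4x the throughput of the benchdnn run for the same problem, which suggests the two are not exercising the same kernel or the same thread configuration.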


rupakroyintel commented Feb 12, 2025

@Serenagirl Thanks for reaching out.
Can you share the following information:

  • oneDNN Verbose log for the PyTorch code execution
  • Platform information

Also, is there any particular reason for setting OMP_NUM_THREADS=1 for the benchdnn test?

Here are some best practices for Configuring oneDNN for Benchmarking.
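For the requested verbose log, one way to capture it (a sketch using the real ONEDNN_VERBOSE environment variable; the variable should be set before oneDNN is first loaded, i.e. before `import torch` in the profiling script):

```python
# Enable oneDNN verbose output for a PyTorch run. Setting the variable
# in the environment before torch (and hence oneDNN) is imported ensures
# the library sees it; alternatively run: ONEDNN_VERBOSE=1 python script.py
import os

os.environ["ONEDNN_VERBOSE"] = "1"
# import torch  # the profiling script above would follow from here
print(os.environ["ONEDNN_VERBOSE"])  # 1
```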

@rupakroyintel commented:

@Serenagirl Can you please share the details?


Serenagirl commented Feb 22, 2025

> @Serenagirl Can you please share the details?

Sorry for the late reply. I wanted to test single-core performance. My platform is aarch64, and the CPU's theoretical single-core peak is 92.8 GFlops; the 8.95025 GFlops reported below looks far too low.
ONEDNN_VERBOSE=1 numactl -C 4 ./benchdnn --conv --mode=P --dt=f32 --dir=FWD_B --mb=1 ic64ih170iw256oc64oh168ow254kh3kw3sh1sw1dh1dw1ph1pw1
onednn_verbose,info,oneDNN v3.4.0 (commit N/A)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:1
onednn_verbose,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:a::f0 dst_f32::blocked:a::f0,,,64,0.00805664
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd::f0 dst_f32::blocked:abcd::f0,,,64x64x3x3,0.00610352
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd::f0 dst_f32::blocked:abcd::f0,,,1x64x170x256,0.415039
onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,forward_training,src_f32:a:blocked:abcd::f0 wei_f32:a:blocked:abcd::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:abcd::f0,,alg:convolution_direct,mb1_ic64oc64_ih170oh168kh3sh1dh1ph1_iw256ow254kw3sw1dw1pw1,332.818
onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,forward_training,src_f32:a:blocked:abcd::f0 wei_f32:a:blocked:abcd::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:abcd::f0,,alg:convolution_direct,mb1_ic64oc64_ih170oh168kh3sh1dh1ph1_iw256ow254kw3sw1dw1pw1,349.191
onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,forward_training,src_f32:a:blocked:abcd::f0 wei_f32:a:blocked:abcd::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:abcd::f0,,alg:convolution_direct,mb1_ic64oc64_ih170oh168kh3sh1dh1ph1_iw256ow254kw3sw1dw1pw1,349.313
onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,forward_training,src_f32:a:blocked:abcd::f0 wei_f32:a:blocked:abcd::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:abcd::f0,,alg:convolution_direct,mb1_ic64oc64_ih170oh168kh3sh1dh1ph1_iw256ow254kw3sw1dw1pw1,349.234
onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,forward_training,src_f32:a:blocked:abcd::f0 wei_f32:a:blocked:abcd::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:abcd::f0,,alg:convolution_direct,mb1_ic64oc64_ih170oh168kh3sh1dh1ph1_iw256ow254kw3sw1dw1pw1,349.122
onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,forward_training,src_f32:a:blocked:abcd::f0 wei_f32:a:blocked:abcd::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:abcd::f0,,alg:convolution_direct,mb1_ic64oc64_ih170oh168kh3sh1dh1ph1_iw256ow254kw3sw1dw1pw1,349.039
onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,forward_training,src_f32:a:blocked:abcd::f0 wei_f32:a:blocked:abcd::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:abcd::f0,,alg:convolution_direct,mb1_ic64oc64_ih170oh168kh3sh1dh1ph1_iw256ow254kw3sw1dw1pw1,349.128
onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,forward_training,src_f32:a:blocked:abcd::f0 wei_f32:a:blocked:abcd::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:abcd::f0,,alg:convolution_direct,mb1_ic64oc64_ih170oh168kh3sh1dh1ph1_iw256ow254kw3sw1dw1pw1,349.423
onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,forward_training,src_f32:a:blocked:abcd::f0 wei_f32:a:blocked:abcd::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:abcd::f0,,alg:convolution_direct,mb1_ic64oc64_ih170oh168kh3sh1dh1ph1_iw256ow254kw3sw1dw1pw1,349.124
onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,forward_training,src_f32:a:blocked:abcd::f0 wei_f32:a:blocked:abcd::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:abcd::f0,,alg:convolution_direct,mb1_ic64oc64_ih170oh168kh3sh1dh1ph1_iw256ow254kw3sw1dw1pw1,349.035
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,gemm:ref,,--mode=P --conv mb1ic64ih170iw256oc64oh168ow254kh3kw3ph1pw1dh1dw1,3.12541,0.381104,349.051,8.95403,349.198,8.95025
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):349.051 avg(ms):349.198
total: 3.54s; fill: 0.05s (1%);
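Putting the two numbers quoted above side by side (my own arithmetic, taking the 0Gflops column from the benchdnn perf line and the stated theoretical peak):

```python
# Efficiency estimate from the figures quoted in this comment.
achieved = 8.95025  # GFLOP/s: 0Gflops column of the benchdnn perf line
peak = 92.8         # GFLOP/s: stated theoretical single-core peak
print(f"{achieved / peak:.1%} of peak")  # ~9.6%, consistent with the
                                         # gemm:ref kernel in the log
```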

rupakroyintel added the platform:cpu-aarch64 (Codeowner: @oneapi-src/onednn-cpu-aarch64) label on Feb 23, 2025
@rupakroyintel commented:

@oneapi-src/onednn-cpu-aarch64 It looks like a reference implementation is being used instead of an optimized one on the aarch64 platform for the given shape. Can you please look into this performance issue?
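The dispatched kernel can be read straight off the verbose lines: the sixth comma-separated field of each `exec` record names the implementation, and here it is `gemm:ref`, the reference path, rather than an optimized aarch64 kernel. A minimal sketch of extracting it (the truncated line below is illustrative):

```python
# The 6th CSV field of an onednn_verbose exec line names the dispatched
# implementation; "gemm:ref" indicates the slow reference GEMM path.
line = ("onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,"
        "forward_training,src_f32:a:blocked:abcd::f0")
impl = line.split(",")[5]
print(impl)  # gemm:ref
```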
