
Why is the performance of the mkl_conv in the torch test worse than that of the benchdnn test? #2672

Open
Serenagirl opened this issue Feb 12, 2025 · 4 comments
Labels: platform:cpu-aarch64 (Codeowner: @oneapi-src/onednn-cpu-aarch64), question


Serenagirl commented Feb 12, 2025

I tested aten::mkldnn_convolution in PyTorch:

import torch
import torch.nn.functional as F
from torch.profiler import profile, record_function, ProfilerActivity

input_tensor = torch.randn(1, 64, 170, 256)  # [batch_size, in_channels, height, width]
weight_tensor = torch.randn(64, 64, 3, 3)    # [out_channels, in_channels, kernel_height, kernel_width]
bias_tensor = torch.randn(64)               # [out_channels]

stride = (1, 1)
padding = (1, 1)
dilation = (1, 1)
groups = 1

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("aten::mkldnn_convolution"):
        output = F.conv2d(input_tensor, weight_tensor, bias_tensor, stride, padding, dilation, groups)
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

and got a total time of 144 ms, but when I tested the same convolution in benchdnn:
OMP_NUM_THREADS=1 taskset -c 0 ./benchdnn --conv --mode=P --dt=f32 --dir=FWD_B --mb=1 ic64ih170iw256oc64oh168ow254kh3kw3sh1sw1dh1dw1ph1pw1

total perf: min(ms):349.276 avg(ms):349.385
total: 3.54s; fill: 0.05s (1%);

Why is there such a difference?
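For reference, a rough back-of-the-envelope FLOP count for this shape (my own arithmetic, counting 2 ops per multiply-accumulate and including padded taps, which is why it lands slightly above the 3.125 Gops benchdnn itself reports) puts the two timings at very different throughputs:

```python
# Approximate FLOP count for conv: mb1, ic64, oc64, oh168, ow254, kh3, kw3
oc, oh, ow, ic, kh, kw = 64, 168, 254, 64, 3, 3
flops = 2 * oc * oh * ow * ic * kh * kw  # multiply + add per filter tap
print(f"{flops / 1e9:.3f} GFLOP")                      # ~3.146 GFLOP
print(f"PyTorch : {flops / 0.144 / 1e9:.2f} GFLOP/s")  # ~21.8 at 144 ms
print(f"benchdnn: {flops / 0.349 / 1e9:.2f} GFLOP/s")  # ~9.0  at 349 ms
```

So the PyTorch run sustains roughly 2.4x the throughput of the benchdnn run for the same problem, which suggests the two are not exercising the same kernel or the same thread configuration.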


rupakroyintel commented Feb 12, 2025

@Serenagirl Thanks for reaching out.
Can you share the following information:

  • oneDNN Verbose log for the PyTorch code execution
  • Platform information

Also, is there any particular reason for setting OMP_NUM_THREADS=1 for the benchdnn test?

Here are some best practices for Configuring oneDNN for Benchmarking.
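For the requested verbose log, one way to capture it (a sketch using the real ONEDNN_VERBOSE environment variable; the variable should be set before oneDNN is first loaded, i.e. before `import torch` in the profiling script):

```python
# Enable oneDNN verbose output for a PyTorch run. Setting the variable
# in the environment before torch (and hence oneDNN) is imported ensures
# the library sees it; alternatively run: ONEDNN_VERBOSE=1 python script.py
import os

os.environ["ONEDNN_VERBOSE"] = "1"
# import torch  # the profiling script above would follow from here
print(os.environ["ONEDNN_VERBOSE"])  # 1
```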

@rupakroyintel commented:

@Serenagirl Can you please share the details?


Serenagirl commented Feb 22, 2025

> @Serenagirl Can you please share the details?

Sorry for the late reply. I wanted to test single-core performance. My platform is aarch64, and the CPU's theoretical single-core peak is 92.8 GFlops; the 8.95025 GFlops reported below looks far too low.
ONEDNN_VERBOSE=1 numactl -C 4 ./benchdnn --conv --mode=P --dt=f32 --dir=FWD_B --mb=1 ic64ih170iw256oc64oh168ow254kh3kw3sh1sw1dh1dw1ph1pw1
onednn_verbose,info,oneDNN v3.4.0 (commit N/A)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:1
onednn_verbose,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:a::f0 dst_f32::blocked:a::f0,,,64,0.00805664
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd::f0 dst_f32::blocked:abcd::f0,,,64x64x3x3,0.00610352
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abcd::f0 dst_f32::blocked:abcd::f0,,,1x64x170x256,0.415039
onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,forward_training,src_f32:a:blocked:abcd::f0 wei_f32:a:blocked:abcd::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:abcd::f0,,alg:convolution_direct,mb1_ic64oc64_ih170oh168kh3sh1dh1ph1_iw256ow254kw3sw1dw1pw1,332.818
onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,forward_training,src_f32:a:blocked:abcd::f0 wei_f32:a:blocked:abcd::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:abcd::f0,,alg:convolution_direct,mb1_ic64oc64_ih170oh168kh3sh1dh1ph1_iw256ow254kw3sw1dw1pw1,349.191
onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,forward_training,src_f32:a:blocked:abcd::f0 wei_f32:a:blocked:abcd::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:abcd::f0,,alg:convolution_direct,mb1_ic64oc64_ih170oh168kh3sh1dh1ph1_iw256ow254kw3sw1dw1pw1,349.313
onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,forward_training,src_f32:a:blocked:abcd::f0 wei_f32:a:blocked:abcd::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:abcd::f0,,alg:convolution_direct,mb1_ic64oc64_ih170oh168kh3sh1dh1ph1_iw256ow254kw3sw1dw1pw1,349.234
onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,forward_training,src_f32:a:blocked:abcd::f0 wei_f32:a:blocked:abcd::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:abcd::f0,,alg:convolution_direct,mb1_ic64oc64_ih170oh168kh3sh1dh1ph1_iw256ow254kw3sw1dw1pw1,349.122
onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,forward_training,src_f32:a:blocked:abcd::f0 wei_f32:a:blocked:abcd::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:abcd::f0,,alg:convolution_direct,mb1_ic64oc64_ih170oh168kh3sh1dh1ph1_iw256ow254kw3sw1dw1pw1,349.039
onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,forward_training,src_f32:a:blocked:abcd::f0 wei_f32:a:blocked:abcd::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:abcd::f0,,alg:convolution_direct,mb1_ic64oc64_ih170oh168kh3sh1dh1ph1_iw256ow254kw3sw1dw1pw1,349.128
onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,forward_training,src_f32:a:blocked:abcd::f0 wei_f32:a:blocked:abcd::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:abcd::f0,,alg:convolution_direct,mb1_ic64oc64_ih170oh168kh3sh1dh1ph1_iw256ow254kw3sw1dw1pw1,349.423
onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,forward_training,src_f32:a:blocked:abcd::f0 wei_f32:a:blocked:abcd::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:abcd::f0,,alg:convolution_direct,mb1_ic64oc64_ih170oh168kh3sh1dh1ph1_iw256ow254kw3sw1dw1pw1,349.124
onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,forward_training,src_f32:a:blocked:abcd::f0 wei_f32:a:blocked:abcd::f0 bia_f32:a:blocked:a::f0 dst_f32:a:blocked:abcd::f0,,alg:convolution_direct,mb1_ic64oc64_ih170oh168kh3sh1dh1ph1_iw256ow254kw3sw1dw1pw1,349.035
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,gemm:ref,,--mode=P --conv mb1ic64ih170iw256oc64oh168ow254kh3kw3ph1pw1dh1dw1,3.12541,0.381104,349.051,8.95403,349.198,8.95025
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):349.051 avg(ms):349.198
total: 3.54s; fill: 0.05s (1%);
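Putting the two numbers quoted above side by side (my own arithmetic, taking the 0Gflops column from the benchdnn perf line and the stated theoretical peak):

```python
# Efficiency estimate from the figures quoted in this comment.
achieved = 8.95025  # GFLOP/s: 0Gflops column of the benchdnn perf line
peak = 92.8         # GFLOP/s: stated theoretical single-core peak
print(f"{achieved / peak:.1%} of peak")  # ~9.6%, consistent with the
                                         # gemm:ref kernel in the log
```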

rupakroyintel added the platform:cpu-aarch64 (Codeowner: @oneapi-src/onednn-cpu-aarch64) label on Feb 23, 2025
@rupakroyintel commented:

@oneapi-src/onednn-cpu-aarch64 It looks like a reference implementation is being used instead of an optimized one on the aarch64 platform for the given shape. Can you please look into this performance issue?
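The dispatched kernel can be read straight off the verbose lines: the sixth comma-separated field of each `exec` record names the implementation, and here it is `gemm:ref`, the reference path, rather than an optimized aarch64 kernel. A minimal sketch of extracting it (the truncated line below is illustrative):

```python
# The 6th CSV field of an onednn_verbose exec line names the dispatched
# implementation; "gemm:ref" indicates the slow reference GEMM path.
line = ("onednn_verbose,primitive,exec,cpu,convolution,gemm:ref,"
        "forward_training,src_f32:a:blocked:abcd::f0")
impl = line.split(",")[5]
print(impl)  # gemm:ref
```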
