
why the performance gap #2734

Open
Serenagirl opened this issue Feb 22, 2025 · 4 comments
Assignees
Labels
platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 question

Comments

Serenagirl commented Feb 22, 2025

I ran
ONEDNN_VERBOSE=dispatch numactl -C 4 ./benchdnn --conv --mode=P --dt=f32 --dir=FWD_B --alg=auto --mb=1 ic64ih2560iw1440oc3oh2560ow1440kh9kw9sh1sw1dh1dw1ph4pw4
on aarch64 with SVE 256.
Single-core peak: 2.9 GHz * (256 bits / 32 bits) * 2 (FMA) * 2 = 92.8 GFLOP/s.
The conv ic64ih2560iw1440oc3oh2560ow1440kh9kw9sh1sw1dh1dw1ph4pw4 needs 2*9*9*64*3*2560*1440 = 114.66 GFLOP.
114.66 GFLOP / 92.8 GFLOP/s ≈ 1.24 s,
but with oneDNN 3.4 I get 60 s or more, and the performance is not good. Is there a problem with my arguments?
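For reference, the back-of-the-envelope estimate above can be reproduced with a short script. The factor of two FMA units per core is the assumption baked into the 92.8 GFLOP/s formula, not something reported by oneDNN:

```python
# Roofline estimate for the conv in the issue, using the numbers above.
freq_hz = 2.9e9            # core clock
simd_lanes = 256 // 32     # SVE 256-bit vector / 32-bit f32
flops_per_fma = 2          # multiply + add
fma_units = 2              # assumed: two FMA pipes per core

peak_flops = freq_hz * simd_lanes * flops_per_fma * fma_units  # ~92.8 GFLOP/s

# Conv work: 2 * KH * KW * IC * OC * OH * OW
conv_flops = 2 * 9 * 9 * 64 * 3 * 2560 * 1440                  # ~114.66 GFLOP

ideal_s = conv_flops / peak_flops
print(f"peak: {peak_flops / 1e9:.1f} GFLOP/s")
print(f"work: {conv_flops / 1e9:.2f} GFLOP")
print(f"ideal single-core time: {ideal_s:.2f} s")
```

So even a perfectly efficient single-core run should take about 1.24 s, which makes the measured times below look two orders of magnitude too slow.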

```
ONEDNN_VERBOSE=dispatch numactl -C 4 ./benchdnn --conv --mode=P --dt=f32 --dir=FWD_B --alg=auto --mb=1 ic64ih2560iw1440oc3oh2560ow1440kh9kw9sh1sw1dh1dw1ph4pw4
onednn_verbose,info,oneDNN v3.4.0 (commit N/A)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:1
onednn_verbose,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,create:dispatch,convolution,scratchpad memory limit exceeded,src/cpu/gemm_convolution_utils.cpp:2119
onednn_verbose,primitive,create:dispatch,convolution,scratchpad memory limit exceeded,src/cpu/gemm_convolution_utils.cpp:2119
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,ref:any,,--mode=P --conv --alg=auto mb1ic64ih2560iw1440oc3oh2560ow1440kh9kw9ph4pw4dh1dw1,113.999,0.631104,201247,0.566465,201412,0.565998
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):201247 avg(ms):201412
total: 1217.63s; fill: 4.34s (0%);
```
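Plugging the perf line above into the peak estimate from the question quantifies how far off the ref:any fallback is (the 113.999 GFLOP and 201247 ms are from the benchdnn log; the 92.8 GFLOP/s peak is the questioner's own estimate):

```python
# Efficiency of the measured run, from the benchdnn perf line above.
gops = 113.999          # problem size reported by benchdnn, GFLOP
min_ms = 201247         # best measured time, ms
peak_gflops = 92.8      # single-core peak estimate from the question

achieved = gops / (min_ms / 1e3)  # achieved GFLOP/s
print(f"achieved: {achieved:.3f} GFLOP/s")
print(f"efficiency: {100 * achieved / peak_gflops:.2f}% of peak")
```

That is well under 1% of peak, which is consistent with the verbose log showing the unoptimized ref:any implementation being picked after the GEMM convolution was rejected for exceeding the scratchpad memory limit.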

In addition, when using ACL I cannot find any operator that gets an optimized kernel, and fp16 precision is reported as unsupported even though my CPU hardware supports it.

```
(python38) [@localhost benchdnn]$ ONEDNN_VERBOSE=dispatch numactl -C 4 ./benchdnn --conv --mode=P --dt=f16 --dir=FWD_B --alg=winograd --mb=1 ic64ih1280iw720oc3oh1280ow720kh3kw3sh1sw1dh1dw1ph1pw1
onednn_verbose,info,oneDNN v3.4.0 (commit N/A)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:1
onednn_verbose,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:901: Unsupported type. Could not find a kernel
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:901: Unsupported type. Could not find a kernel

(python38) [@localhost benchdnn]$ ONEDNN_VERBOSE=dispatch numactl -C 4 ./benchdnn --conv --mode=P --dt=f32 --dir=FWD_B --alg=winograd --mb=1 ic64ih1280iw720oc3oh1280ow720kh3kw3sh1sw1dh1dw1ph1pw1
onednn_verbose,info,oneDNN v3.4.0 (commit N/A)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:1
onednn_verbose,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:848: We could not find an optimized kernel for F32 input
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:848: We could not find an optimized kernel
```

rcao8 commented Feb 24, 2025

It looks like the code path is ref:any. @oneapi-src/onednn-cpu-aarch64, is it possible to optimize this code path on the ARM platform?

@rcao8 rcao8 self-assigned this Feb 24, 2025
@shu1chen shu1chen added the platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 label Feb 24, 2025
@Ryo-not-rio (Contributor)

It looks like the f32 memory format used in this problem isn't currently supported in ComputeLibrary, and we currently don't support f16 convs with an f32 bias.


Serenagirl commented Feb 26, 2025

> It looks like the f32 memory format used in this problem isn't currently supported in ComputeLibrary, and we currently don't support f16 convs with an f32 bias.

```
(python38) ONEDNN_VERBOSE=dispatch numactl -C 4 ./benchdnn --conv --mode=P --dt=f16:f16:f16 --dir=FWD_B --alg=auto --mb=1 ic64ih1280iw720oc64oh1280ow720kh3kw3sh1sw1dh1dw1ph1pw1
0:SKIPPED (DATA_TYPE_NOT_SUPPORTED) __REPRO: --mode=P --conv --dt=f16:f16:f16 --alg=auto mb1ic64ih1280iw720oc64oh1280ow720kh3kw3ph1pw1dh1dw1
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,,,--mode=P --conv --dt=f16:f16:f16 --alg=auto mb1ic64ih1280iw720oc64oh1280ow720kh3kw3ph1pw1dh1dw1,67.7022,0,0,0,0,0
tests:1 passed:0 skipped:1 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0 avg(ms):0
total: 0.01s; fill: 0.00s (0%);
ONEDNN_VERBOSE=dispatch numactl -C 4 ./benchdnn --conv --mode=P --dt=f16:f16:f16 --dir=FWD_B --alg=auto --mb=1 ic64ih1280iw720oc64oh1280ow720kh3kw3sh1sw1dh1dw1ph1pw1
onednn_verbose,info,oneDNN v3.4.0 (commit N/A)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:1
onednn_verbose,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:901: Unsupported type. Could not find a kernel
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:901: Unsupported type. Could not find a kernel
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:901: Unsupported type. Could not find a kernel
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:901: Unsupported type. Could not find a kernel
```

I explicitly specified the data type of the bias, but it is still not supported. The upper run is plain oneDNN, and the lower one uses ACL.
However, the documentation says this configuration is supported.

[Image: table of supported data type combinations from the oneDNN convolution documentation]
https://oneapi-src.github.io/oneDNN/v3.3/dev_guide_convolution.html

@vpirogov vpirogov assigned Ryo-not-rio and unassigned rcao8 Feb 26, 2025
@Ryo-not-rio (Contributor)

@Serenagirl could you please try your command ./benchdnn --conv --mode=P --dt=f32 --dir=FWD_B --alg=auto --mb=1 ic64ih2560iw1440oc3oh2560ow1440kh9kw9sh1sw1dh1dw1ph4pw4 on the latest oneDNN? On my end I'm seeing numbers about 50x faster than yours with one thread.
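As a rough sanity check, if the reported ~50x speedup is measured against the 201 s run logged earlier in this thread (an assumption; the exact measurements were not posted), the result would still be a few times slower than the single-core roofline from the original question:

```python
# Hypothetical: apply the ~50x speedup to the 201 s measured run.
measured_s = 201.247    # min time from the benchdnn log above
speedup = 50.0          # assumed, from the comment above
ideal_s = 1.24          # single-core roofline from the question

improved_s = measured_s / speedup
print(f"~{improved_s:.1f} s, i.e. ~{improved_s / ideal_s:.1f}x the ideal time")
```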

5 participants