
why the performance gap #2734

Open
Serenagirl opened this issue Feb 22, 2025 · 4 comments
Assignees
Labels
platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 question

Comments

Serenagirl commented Feb 22, 2025

I ran
ONEDNN_VERBOSE=dispatch numactl -C 4 ./benchdnn --conv --mode=P --dt=f32 --dir=FWD_B --alg=auto --mb=1 ic64ih2560iw1440oc3oh2560ow1440kh9kw9sh1sw1dh1dw1ph4pw4
on aarch64 with SVE 256.
Single-core peak: 2.9 GHz * (256 bits / 32 bits) * 2 (FMA) * 2 = 92.8 GFLOP/s.
The conv ic64ih2560iw1440oc3oh2560ow1440kh9kw9sh1sw1dh1dw1ph4pw4 needs 2*9*9*64*3*2560*1440 = 114.66 GFLOP.
114.66 GFLOP / 92.8 GFLOP/s ≈ 1.24 s,
but with oneDNN 3.4 I get 60 s or more, and the performance is not good. Is there a problem with my arguments?
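For reference, the back-of-the-envelope estimate above can be reproduced with a short script. The factor of two FMA units per core is the assumption baked into the 92.8 GFLOP/s formula, not something reported by oneDNN:

```python
# Roofline estimate for the conv in the issue, using the numbers above.
freq_hz = 2.9e9            # core clock
simd_lanes = 256 // 32     # SVE 256-bit vector / 32-bit f32
flops_per_fma = 2          # multiply + add
fma_units = 2              # assumed: two FMA pipes per core

peak_flops = freq_hz * simd_lanes * flops_per_fma * fma_units  # ~92.8 GFLOP/s

# Conv work: 2 * KH * KW * IC * OC * OH * OW
conv_flops = 2 * 9 * 9 * 64 * 3 * 2560 * 1440                  # ~114.66 GFLOP

ideal_s = conv_flops / peak_flops
print(f"peak: {peak_flops / 1e9:.1f} GFLOP/s")
print(f"work: {conv_flops / 1e9:.2f} GFLOP")
print(f"ideal single-core time: {ideal_s:.2f} s")
```

So even a perfectly efficient single-core run should take about 1.24 s, which makes the measured times below look two orders of magnitude too slow.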

```
ONEDNN_VERBOSE=dispatch numactl -C 4 ./benchdnn --conv --mode=P --dt=f32 --dir=FWD_B --alg=auto --mb=1 ic64ih2560iw1440oc3oh2560ow1440kh9kw9sh1sw1dh1dw1ph4pw4
onednn_verbose,info,oneDNN v3.4.0 (commit N/A)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:1
onednn_verbose,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,create:dispatch,convolution,scratchpad memory limit exceeded,src/cpu/gemm_convolution_utils.cpp:2119
onednn_verbose,primitive,create:dispatch,convolution,scratchpad memory limit exceeded,src/cpu/gemm_convolution_utils.cpp:2119
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,ref:any,,--mode=P --conv --alg=auto mb1ic64ih2560iw1440oc3oh2560ow1440kh9kw9ph4pw4dh1dw1,113.999,0.631104,201247,0.566465,201412,0.565998
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):201247 avg(ms):201412
total: 1217.63s; fill: 4.34s (0%);
```
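Plugging the perf line above into the peak estimate from the question quantifies how far off the ref:any fallback is (the 113.999 GFLOP and 201247 ms are from the benchdnn log; the 92.8 GFLOP/s peak is the questioner's own estimate):

```python
# Efficiency of the measured run, from the benchdnn perf line above.
gops = 113.999          # problem size reported by benchdnn, GFLOP
min_ms = 201247         # best measured time, ms
peak_gflops = 92.8      # single-core peak estimate from the question

achieved = gops / (min_ms / 1e3)  # achieved GFLOP/s
print(f"achieved: {achieved:.3f} GFLOP/s")
print(f"efficiency: {100 * achieved / peak_gflops:.2f}% of peak")
```

That is well under 1% of peak, which is consistent with the verbose log showing the unoptimized ref:any implementation being picked after the GEMM convolution was rejected for exceeding the scratchpad memory limit.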

In addition, when using ACL I cannot find any operator that gets an optimized kernel, and fp16 precision is reported as unsupported even though my CPU hardware supports it.

```
(python38) [@localhost benchdnn]$ ONEDNN_VERBOSE=dispatch numactl -C 4 ./benchdnn --conv --mode=P --dt=f16 --dir=FWD_B --alg=winograd --mb=1 ic64ih1280iw720oc3oh1280ow720kh3kw3sh1sw1dh1dw1ph1pw1
onednn_verbose,info,oneDNN v3.4.0 (commit N/A)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:1
onednn_verbose,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:901: Unsupported type. Could not find a kernel
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:901: Unsupported type. Could not find a kernel

(python38) [@localhost benchdnn]$ ONEDNN_VERBOSE=dispatch numactl -C 4 ./benchdnn --conv --mode=P --dt=f32 --dir=FWD_B --alg=winograd --mb=1 ic64ih1280iw720oc3oh1280ow720kh3kw3sh1sw1dh1dw1ph1pw1
onednn_verbose,info,oneDNN v3.4.0 (commit N/A)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:1
onednn_verbose,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:848: We could not find an optimized kernel for F32 input
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:848: We could not find an optimized kernel
```

rcao8 commented Feb 24, 2025

It looks like the code path is ref:any. @oneapi-src/onednn-cpu-aarch64, is it possible to optimize this code path on the ARM platform?

@rcao8 rcao8 self-assigned this Feb 24, 2025
@shu1chen shu1chen added the platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 label Feb 24, 2025
@Ryo-not-rio (Contributor)

It looks like the f32 memory format used in this problem isn't currently supported in ComputeLibrary, and we currently don't support f16 convs with an f32 bias.


Serenagirl commented Feb 26, 2025

> It looks like the f32 memory format used in this problem isn't currently supported in ComputeLibrary, and we currently don't support f16 convs with an f32 bias.

```
(python38) ONEDNN_VERBOSE=dispatch numactl -C 4 ./benchdnn --conv --mode=P --dt=f16:f16:f16 --dir=FWD_B --alg=auto --mb=1 ic64ih1280iw720oc64oh1280ow720kh3kw3sh1sw1dh1dw1ph1pw1
0:SKIPPED (DATA_TYPE_NOT_SUPPORTED) __REPRO: --mode=P --conv --dt=f16:f16:f16 --alg=auto mb1ic64ih1280iw720oc64oh1280ow720kh3kw3ph1pw1dh1dw1
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%+ctime%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,,,--mode=P --conv --dt=f16:f16:f16 --alg=auto mb1ic64ih1280iw720oc64oh1280ow720kh3kw3ph1pw1dh1dw1,67.7022,0,0,0,0,0
tests:1 passed:0 skipped:1 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total perf: min(ms):0 avg(ms):0
total: 0.01s; fill: 0.00s (0%);
ONEDNN_VERBOSE=dispatch numactl -C 4 ./benchdnn --conv --mode=P --dt=f16:f16:f16 --dir=FWD_B --alg=auto --mb=1 ic64ih1280iw720oc64oh1280ow720kh3kw3sh1sw1dh1dw1ph1pw1
onednn_verbose,info,oneDNN v3.4.0 (commit N/A)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:1
onednn_verbose,info,cpu,isa:AArch64 SVE (256 bits)
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:901: Unsupported type. Could not find a kernel
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:901: Unsupported type. Could not find a kernel
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:901: Unsupported type. Could not find a kernel
onednn_verbose,cpu,acl,unsupported: in has_opt_impl src/cpu/operators/internal/CpuGemmAssemblyDispatch.cpp:901: Unsupported type. Could not find a kernel
```

I explicitly specified the data type of the bias, but it is still not supported. The upper run is plain oneDNN, and the lower one uses ACL.
However, the documentation says this configuration is supported.

[Image: table of supported data type combinations from the oneDNN convolution documentation]
https://oneapi-src.github.io/oneDNN/v3.3/dev_guide_convolution.html

@vpirogov vpirogov assigned Ryo-not-rio and unassigned rcao8 Feb 26, 2025
@Ryo-not-rio (Contributor)

@Serenagirl could you please try your command ./benchdnn --conv --mode=P --dt=f32 --dir=FWD_B --alg=auto --mb=1 ic64ih2560iw1440oc3oh2560ow1440kh9kw9sh1sw1dh1dw1ph4pw4 on the latest oneDNN? On my end I'm seeing numbers about 50x faster than yours with one thread.
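As a rough sanity check, if the reported ~50x speedup is measured against the 201 s run logged earlier in this thread (an assumption; the exact measurements were not posted), the result would still be a few times slower than the single-core roofline from the original question:

```python
# Hypothetical: apply the ~50x speedup to the 201 s measured run.
measured_s = 201.247    # min time from the benchdnn log above
speedup = 50.0          # assumed, from the comment above
ideal_s = 1.24          # single-core roofline from the question

improved_s = measured_s / speedup
print(f"~{improved_s:.1f} s, i.e. ~{improved_s / ideal_s:.1f}x the ideal time")
```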

5 participants