intelblas_gemm clean patch #19

listenlink · 2017-02-13T06:44:42Z

This patch cleans up the hack code in tuning and adds an external profile of intelblas_gemm (image) for the float32 dtype. Please review, thanks.

v2: by zhigang, fix some warnings and remove half-related code.
v3: by lixiang, modify json file.
v4: by junkai, optimize gemm image kernel and force isaac to run gemm image kernel.
v5: by junkai, change json file to force isaac to run gemm image kernel.
Change-Id: Ieab41924476bfc001f7026fbea3b5ea5e56eb00b

gongzg · 2017-02-15T06:01:37Z

@ptillet ping for review. Thanks.

ptillet · 2017-02-15T06:13:53Z

I've looked at it and it looks good to me. I'll try it tomorrow to make sure there is no regression on AMD and NVidia HW.

ptillet · 2017-02-15T22:25:57Z

This looks great. Very good performance improvements! Thanks :)

I only have one minor question: is it possible to remove the OpenCL warnings? Successful BLAS calls are not expected to write anything on cout or cerr.

listenlink · 2017-02-16T07:35:24Z

@ptillet, I just removed the OpenCL build warnings; please check again, thanks.

ptillet · 2017-02-16T19:04:00Z

The warnings are gone, cool.

However, I've tested the PR on another machine and have the following segmentation fault:

BENCH	M	N	K	AT	BT	ISAAC
Deep	1760	16	1760	N	N	0.17	
Deep	1760	32	1760	N	N	0.31	
Deep	1760	64	1760	N	N	0.35	
Deep	1760	128	1760	N	N	0.40	
Deep	1760	7000	1760	N	N	0.49	
Deep	2048	16	2048	N	N	0.19	
Deep	2048	32	2048	N	N	0.33	
Deep	2048	64	2048	N	N	0.38	
Deep	2048	128	2048	N	N	0.36	
Deep	2048	7000	2048	N	N	0.48	
Deep	2560	16	2560	N	N	0.21	
Deep	2560	32	2560	N	N	0.37	
Deep	2560	64	2560	N	N	0.36	
Deep	2560	128	2560	N	N	0.38	
Deep	2560	7000	2560	N	N	0.47	
Deep	1760	16	1760	T	N	0.11	
Deep	1760	32	1760	T	N	0.22	

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff7551629 in isaac::driver::dispatch::f_impl<&isaac::driver::dispatch::clinit, _cl_mem* (*)(_cl_context*, unsigned long, _cl_image_format const*, _cl_image_desc const*, void*, int*), _cl_context*, unsigned long, _cl_image_format const*, _cl_image_desc const*, void*, int*> (lib_h=@0x7ffff7dd9920: 0x617ad0, 
    cache=@0x7ffff7dd9a38: 0x0, name=0x7ffff75abe37 "clCreateImage") at /tmp/isaac/include/isaac/driver/dispatch.h:74
#2  0x00007ffff754f4cd in isaac::driver::dispatch::clCreateImage (a=0x620010, b=1, c=0x7fffffffc830, d=0x7fffffffc910, e=0x0, f=0x7fffffffc7ac)
    at /tmp/isaac/lib/driver/dispatch.cpp:177
#3  0x00007ffff74fce02 in isaac::templates::intelblas_gemm_image::enqueue (this=0xcad900, queue=..., program=..., suffix="14", control=...)
    at /tmp/isaac/lib/jit/generation/gemm.cpp:829

I'm looking into it...

ptillet · 2017-02-16T19:41:45Z

It was a configuration problem with my machine (conflicts with the Altera OpenCL SDK).
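For readers following the backtrace: frame #0 sits at address 0x0 and the dispatch frame shows `cache=... 0x0` for `"clCreateImage"`, which appears consistent with calling a function pointer that was never resolved. A minimal sketch of a guarded lazy lookup is below. It is not isaac's actual `f_impl`; the `Resolver` type and `resolveOrThrow` name are hypothetical, and the resolver stands in for whatever symbol lookup (for example `dlsym`) a real loader would use.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical resolver: maps a symbol name to its address, or returns
// nullptr when the loaded library does not export that symbol.
using Resolver = std::function<void*(const char*)>;

// Resolve `name` once, remember the address in `cache`, and refuse to hand
// back a null pointer. Without the second check, calling the result would
// jump to address 0 and crash with SIGSEGV.
template <typename FnPtr>
FnPtr resolveOrThrow(const Resolver& resolve, void*& cache, const char* name) {
  assert(name != nullptr);
  if (cache == nullptr)
    cache = resolve(name);  // first use: look the symbol up
  if (cache == nullptr)
    throw std::runtime_error(std::string("symbol not found: ") + name);
  // Converting void* to a function pointer is conditionally supported in
  // C++; it is the usual idiom on POSIX platforms.
  return reinterpret_cast<FnPtr>(cache);
}
```

With a check like this, a library that lacks the requested entry point produces a named error instead of a jump to a null address, which makes configuration conflicts easier to diagnose.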

When running [convert_blocked1d_to_slice0](https://github.com/triton-lang/triton/blob/0ba5f0c3cd029d5c3d1f01b9bf29dac32c27345e/test/Conversion/tritongpu_to_llvm.mlir#L924) Triton ends up computing a rank of a matrix with 0 columns during linear layout lowering, which trips up f2reduce, and causes undefined behavior, detectable through [UBSAN](https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html). Fix this by returning the rank (0) early in these cases, without calling f2reduce.

<details><summary>Stack trace</summary>
<p>

```
third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30: runtime error: shift exponent 18446744073709551615 is too large for 64-bit type 'unsigned long long'
    #0 0x556ee2fea3be in inplace_rref_small third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30
    #1 0x556ee2fea3be in f2reduce::inplace_rref_strided(unsigned long*, unsigned long, unsigned long, unsigned long) third_party/triton/third_party/f2reduce/f2reduce.cpp:470:9
    #2 0x556ee2ea70da in getMatrixRank third_party/triton/lib/Tools/LinearLayout.cpp:125:3
    #3 0x556ee2ea70da in mlir::triton::LinearLayout::checkInvariants(bool) third_party/triton/lib/Tools/LinearLayout.cpp:299:7
    #4 0x556ee2ea656d in mlir::triton::LinearLayout::tryCreate(llvm::MapVector<mlir::StringAttr, std::__u::vector<std::__u::vector<int, std::__u::allocator<int>>, std::__u::allocator<std::__u::vector<int, std::__u::allocator<int>>>>, llvm::DenseMap<mlir::StringAttr, unsigned int, llvm::DenseMapInfo<mlir::StringAttr, void>, llvm::detail::DenseMapPair<mlir::StringAttr, unsigned int>>, llvm::SmallVector<std::__u::pair<mlir::StringAttr, std::__u::vector<std::__u::vector<int, std::__u::allocator<int>>, std::__u::allocator<std::__u::vector<int, std::__u::allocator<int>>>>>, 0u>>, llvm::ArrayRef<std::__u::pair<mlir::StringAttr, int>>, bool) third_party/triton/lib/Tools/LinearLayout.cpp:190:41
    #5 0x556ee2eb2150 in mlir::triton::LinearLayout::divideRight(mlir::triton::LinearLayout const&) third_party/triton/lib/Tools/LinearLayout.cpp:654:51
    #6 0x556ee2ee1c39 in mlir::cvtNeedsSharedMemory(mlir::RankedTensorType, mlir::RankedTensorType) third_party/triton/lib/Analysis/Utility.cpp:652:14
    #7 0x556ee2cf38fd in mlir::triton::getRepShapeForCvtLayout(mlir::triton::gpu::ConvertLayoutOp) third_party/triton/lib/Analysis/Allocation.cpp:66:8
    #8 0x556ee2cf3efa in mlir::triton::getScratchConfigForCvtLayout(mlir::triton::gpu::ConvertLayoutOp, unsigned int&, unsigned int&) third_party/triton/lib/Analysis/Allocation.cpp:95:19
    #9 0x556ee2cf6057 in mlir::triton::AllocationAnalysis::getScratchValueSize(mlir::Operation*) third_party/triton/lib/Analysis/Allocation.cpp:272:24
    #10 0x556ee2cf5499 in operator() third_party/triton/lib/Analysis/Allocation.cpp:343:7
    #11 0x556ee2cf5499 in void llvm::function_ref<void (mlir::Operation*)>::callback_fn<mlir::triton::AllocationAnalysis::getValuesAndSizes()::'lambda'(mlir::Operation*)>(long, mlir::Operation*) third_party/llvm/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:45:12
    #12 0x556edeeee7a9 in operator() third_party/llvm/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:68:12
    #13 0x556edeeee7a9 in void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:174:5
    #14 0x556edeeee87c in void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:182:9
    #15 0x556ee2cf49e7 in walk<(mlir::WalkOrder)0, mlir::ForwardIterator, (lambda at third_party/triton/lib/Analysis/Allocation.cpp:341:42), mlir::Operation *, void> third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:313:10
    #16 0x556ee2cf49e7 in walk<(mlir::WalkOrder)0, mlir::ForwardIterator, (lambda at third_party/triton/lib/Analysis/Allocation.cpp:341:42), void> third_party/llvm/llvm-project/mlir/include/mlir/IR/Operation.h:794:12
    #17 0x556ee2cf49e7 in mlir::triton::AllocationAnalysis::getValuesAndSizes() third_party/triton/lib/Analysis/Allocation.cpp:341:16
    #18 0x556ee2cf4852 in run third_party/triton/lib/Analysis/Allocation.cpp:182:5
    #19 0x556ee2cf4852 in AllocationAnalysis third_party/triton/lib/Analysis/Allocation.cpp:169:5
    #20 0x556ee2cf4852 in mlir::Allocation::run(llvm::DenseMap<mlir::FunctionOpInterface, mlir::Allocation, llvm::DenseMapInfo<mlir::FunctionOpInterface, void>, llvm::detail::DenseMapPair<mlir::FunctionOpInterface, mlir::Allocation>>&) third_party/triton/lib/Analysis/Allocation.cpp:627:3
    #21 0x556ee1677402 in operator() third_party/triton/include/triton/Analysis/Allocation.h:227:26
    #22 0x556ee1677402 in void mlir::CallGraph<mlir::Allocation>::doWalk<(mlir::WalkOrder)0, (mlir::WalkOrder)1, mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::CallOpInterface, mlir::FunctionOpInterface), mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::FunctionOpInterface)>(mlir::FunctionOpInterface, llvm::DenseSet<mlir::FunctionOpInterface, llvm::DenseMapInfo<mlir::FunctionOpInterface, void>>&, mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::CallOpInterface, mlir::FunctionOpInterface), mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::FunctionOpInterface)) third_party/triton/include/triton/Analysis/Utility.h:350:7
    #23 0x556ee16756b3 in walk<(mlir::WalkOrder)0, (mlir::WalkOrder)1, (lambda at third_party/triton/include/triton/Analysis/Allocation.h:222:9), (lambda at third_party/triton/include/triton/Analysis/Allocation.h:224:9)> third_party/triton/include/triton/Analysis/Utility.h:242:7
    #24 0x556ee16756b3 in mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp) third_party/triton/include/triton/Analysis/Allocation.h:220:5
    #25 0x556ee2c2bf18 in (anonymous namespace)::AllocateSharedMemory::runOnOperation() third_party/triton/lib/Conversion/TritonGPUToLLVM/AllocateSharedMemory.cpp:26:22
...
UndefinedBehaviorSanitizer: invalid-shift-exponent third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30
```

</p>
</details>
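The reported shift exponent, 18446744073709551615, is the value an unsigned 64-bit `0 - 1` wraps to, which would fit a zero-sized dimension reaching shift arithmetic (the f2reduce line itself is not shown here, so that reading is an assumption). The early-return fix described above can be sketched as follows. This is not Triton's `getMatrixRank` and does not call f2reduce; `rankF2` is a hypothetical stand-in that computes the rank over GF(2) with plain Gaussian elimination, assuming each row fits in one 64-bit word.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: rank over GF(2) of a matrix whose rows are packed
// into 64-bit words, with bit j of a word holding column j.
std::size_t rankF2(std::vector<std::uint64_t> rows, std::size_t numCols) {
  assert(numCols <= 64 && "this sketch packs one row per 64-bit word");

  // Early return: an empty matrix has rank 0, so the elimination code
  // (and any shift arithmetic in it) never sees a zero-sized dimension.
  if (rows.empty() || numCols == 0)
    return 0;

  std::size_t rank = 0;
  for (std::size_t col = 0; col < numCols && rank < rows.size(); ++col) {
    // col < numCols <= 64, so this shift is always well defined.
    const std::uint64_t bit = std::uint64_t{1} << col;

    // Find a row at or below `rank` with a 1 in this column.
    std::size_t pivot = rank;
    while (pivot < rows.size() && (rows[pivot] & bit) == 0)
      ++pivot;
    if (pivot == rows.size())
      continue;  // no pivot in this column

    std::swap(rows[rank], rows[pivot]);
    // Clear this column in every other row (addition over GF(2) is XOR).
    for (std::size_t r = 0; r < rows.size(); ++r)
      if (r != rank && (rows[r] & bit) != 0)
        rows[r] ^= rows[rank];
    ++rank;
  }
  return rank;
}
```

Placing the guard in the caller, before any reduction routine runs, keeps the third-party code untouched, which matches the approach the description outlines.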

* Handle tensor memory allocation in a naive bump pointer way * Handle basic lowering of tensor memory ld * Handle lowering of the base pointer for the tensor memory

wujunkai166 and others added 8 commits February 10, 2017 10:33

integrate intelblas_gemm.

ce1c65d

Change-Id: Ic9edf18a3ae0f41b21c2ac374d50000fc5d4e6f3

Added log printing of external kernels

bf28d43

Change-Id: I89f632e2598594805e24b6aa2d084dcfa1c4f218

Modified external template bench print for more than one template for…

f04f866

… the same operation.

Fix some minor issues and update json files for Intel platforms.

4b5cf48

Introduce some new dispatch APIs for OCL.

b882fe9

We need to use image related API for Intel's image based kernels. Also fixed a memory leak issue in the image based kernel. Signed-off-by: Zhigang Gong <[email protected]>

Add image block version of gemm kernel. No need to fall back to

1bc3be6

buffer kernel when input sizes are too large. v2: fix minor issues when releasing images, and avoid using the OpenCL library directly. Signed-off-by: Zhigang Gong <[email protected]>

Added intelblas python external scripts and updated json files

bc9121d

Ocl building warnings removed

461b83d

ptillet merged commit 9b97c56 into triton-lang:master Feb 16, 2017

codego7250 pushed a commit to codego7250/triton that referenced this pull request Nov 21, 2022

Merge pull request triton-lang#19 from ROCmSoftwarePlatform/unskip_te…

cc6b518

…st_reduce reduce the skips for test_reduce functions

goostavz pushed a commit to goostavz/triton that referenced this pull request Aug 4, 2023

[TRANSFORM] Pipeline + TMA (triton-lang#19)

75c9c5a

ptillet pushed a commit that referenced this pull request Apr 1, 2024

Merge pull request #19 from ROCmSoftwarePlatform/unskip_test_reduce

77b1332

reduce the skips for test_reduce functions

ptillet added a commit that referenced this pull request Apr 1, 2024

Merge pull request #19 from listenlink/upstream

e4daa7d

intelblas_gemm clean patch

oraluben pushed a commit to oraluben/triton that referenced this pull request Sep 11, 2024

[CPU] Dump human-readable asm code in TRITON_CACHE_DIR (triton-lang#19)

b89b2e7

* [CPU] Dump human-readable asm code in TRITON_CACHE_DIR * Don't touch the main compiler.py

gglin001 pushed a commit to gglin001/triton that referenced this pull request Nov 13, 2024

[CPU] Dump human-readable asm code in TRITON_CACHE_DIR (triton-lang#19)

e264ba8

* [CPU] Dump human-readable asm code in TRITON_CACHE_DIR * Don't touch the main compiler.py

stephen-huan pushed a commit to stephen-huan/triton that referenced this pull request Dec 24, 2024

[CPU] Dump human-readable asm code in TRITON_CACHE_DIR (triton-lang#19)

0a033bf

* [CPU] Dump human-readable asm code in TRITON_CACHE_DIR * Don't touch the main compiler.py
