Dmitriim/rebase main #217

Draft: wants to merge 165 commits into base: dev-upstream_main

Conversation

@Devjiu (Collaborator) commented Feb 21, 2025

To verify the rebase.

bertmaher and others added 30 commits February 19, 2025 15:48
* [BACKEND][CPU] Implement the empty cpu backend

* Run clang-format

* Fix yapf error

A quick addition on how to use it.
Summary: This is still mostly boilerplate and basic lowering for the first milestone (compiling vector addition). This PR first lowers `tt.func` and `tt.return`.

Test Plan: It can safely compile an empty kernel.

```
@triton.jit
def add_kernel(x_ptr,  y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    return
```

> TRITON_ENABLE_LLVM_DEBUG=1 TRITON_CPU_BACKEND=1 python3 empty_kernel.py

```
//===-------------------------------------------===//
Legalizing operation : 'tt.func'(0x73be2a0) {
  * Fold {
  } -> FAILURE : unable to fold

  * Pattern : 'tt.func -> ()' {
Trying to match "(anonymous namespace)::FuncOpConversion"
    ** Insert  : 'llvm.func'(0x6c04c70)
    ** Insert Block into : 'llvm.func'(0x6c04c70)
    ** Insert Block into : 'llvm.func'(0x6c04c70)
    ** Erase   : 'tt.func'(0x73be2a0)
"(anonymous namespace)::FuncOpConversion" result 1

    //===-------------------------------------------===//
    Legalizing operation : 'llvm.func'(0x6c04c70) {
    } -> SUCCESS : operation marked legal by the target
    //===-------------------------------------------===//

...

//===-------------------------------------------===//
Legalizing operation : 'tt.return'(0x73efeb0) {
  "tt.return"() : () -> ()

  * Fold {
  } -> FAILURE : unable to fold

  * Pattern : 'tt.return -> ()' {
Trying to match "(anonymous namespace)::ReturnOpConversion"
    ** Insert  : 'llvm.return'(0x73c0f00)
    ** Replace : 'tt.return'(0x73efeb0)
"(anonymous namespace)::ReturnOpConversion" result 1

    //===-------------------------------------------===//
    Legalizing operation : 'llvm.return'(0x73c0f00) {
      "llvm.return"() : () -> ()

    } -> SUCCESS : operation marked legal by the target
    //===-------------------------------------------===//
  } -> SUCCESS : pattern applied successfully
```
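
For completeness, here is a minimal end-to-end script for this path (a sketch assuming a built triton-cpu and torch; run with `TRITON_CPU_BACKEND=1`):

```
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    return  # intentionally empty: only tt.func/tt.return are lowered so far


x = torch.zeros(16)
# Launch on a 1x1x1 grid; the kernel compiles through the CPU backend and does nothing.
add_kernel[(1,)](x, x, x, x.numel(), BLOCK_SIZE=16)
```
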
…riton-lang#1)

Summary: As the title says, `tl.program_id` needs to be supported first. For now, we expect the pid to be provided as additional function arguments to the kernel, so getting the program_id maps to reading one of the last three arguments.

I also quickly implemented `tl.device_print` (or `print`), only for scalar types, for quick "Hello, World!" testing.
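
To illustrate the scheme, a hedged host-side sketch (not the actual launcher code): the grid coordinates are appended as three trailing arguments, and `tl.program_id(axis)` reads the corresponding slot.

```
# Hypothetical launch loop for this scheme; names are assumptions.
# The compiled kernel takes its normal arguments plus (pid_x, pid_y, pid_z).
def launch(kernel, args, grid):
    grid_x, grid_y, grid_z = grid
    for z in range(grid_z):
        for y in range(grid_y):
            for x in range(grid_x):
                kernel(*args, x, y, z)  # tl.program_id(0) reads the x slot
```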

Test Plan: Tested with a simple example:

```
@triton.jit
def add_kernel(...):
    pid = tl.program_id(axis=0)  # We use a 1D launch grid so axis is 0.
    foo = pid + 42
    tl.device_print("Hello, World!", foo, pid)
```

The resulting .llir is valid:
```
@printfFormat_1 = internal constant [31 x i8] c"pid (%u, %u, %u) test: %u, %u\0A\00"

declare !dbg !3 i32 @printf(ptr, ...)

define void @add_kernel(ptr addrspace(1) %0, ptr addrspace(1) %1, ptr addrspace(1) %2, i32 %3, i32 %4, i32 %5, i32 %6) !dbg !7 {
  %8 = add i32 %4, 42, !dbg !8
  %9 = call i32 (ptr, ...) @printf(ptr @printfFormat_0, i32 %4, i32 %5, i32 %6, i32 %8, i32 %4)
  ret void, !dbg !9
}
```

Tried to compile with a fake main function:
```
> % cat main.c
extern void add_kernel(float*, float*, float*, int, int, int, int);

int main() {
    add_kernel(0, 0, 0, 4, 5, 6, 7);
}

> % llc -filetype=obj add_kernel.llir && clang -o a.out add_kernel.llir.o main.c
> % ./a.out
pid (5, 6, 7) Hello, World!: 47, 5
```
…ion flows (triton-lang#6)

* Support basic lowering through vector dialect in CPU backend.

Signed-off-by: Ilya Enkovich <[email protected]>

* Use axis info in memory op lowering.

Signed-off-by: Ilya Enkovich <[email protected]>

* Mark test_ptx_cast as enabled for CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support umulhi operation.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support tl.clamp, tl.minimum, tl.maximum.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add enable_fp_fusion opt for CPU (only affects ASM dump now).

Signed-off-by: Ilya Enkovich <[email protected]>

* Fix kernel args passing for propagated constants.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add permutations support.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support 2-D transfer_read/transfer_write lowering.

Signed-off-by: Ilya Enkovich <[email protected]>

* Introduce shape info analysis and use it for loads/stores by block pointers.

Delay scalar pointers lowering.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support 'other' arg for loads.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support tl.join.

Signed-off-by: Ilya Enkovich <[email protected]>

* Minor renaming.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
…ent (triton-lang#8)

* [BACKEND][CPU] Make it buildable and runnable in a different environment

* Revert seemingly inconsistent python code formatting
Signed-off-by: Ilya Enkovich <[email protected]>
Co-authored-by: Minjang Kim <[email protected]>
…iton-lang#11)

* [CPU] Support flexible active driver + update vector-add tutorial

* Update vector-add to run CPU always + optional GPU

* Update do_bench for CPU
…ng#17)

* Fixed yaml syntax

Signed-off-by: Gregory Shimansky <[email protected]>

* Removed cpu label from runs-on

Signed-off-by: Gregory Shimansky <[email protected]>

* Added missing zlib-dev

Signed-off-by: Gregory Shimansky <[email protected]>

* Added missing apt-get update

Signed-off-by: Gregory Shimansky <[email protected]>

* Remove pip cache because it slows things down on the self-hosted runner

Signed-off-by: Gregory Shimansky <[email protected]>

* Corrected path to tests

Signed-off-by: Gregory Shimansky <[email protected]>

* Added installation of torch==2.1.2

Signed-off-by: Gregory Shimansky <[email protected]>

---------

Signed-off-by: Gregory Shimansky <[email protected]>
* [CPU] Add OpenMP launcher

* Address the comments

* Fix induction variable type

* Always use preallocated output buffer for CPU with torch.add
* [CPU] Dump human-readable asm code in TRITON_CACHE_DIR

* Don't touch the main compiler.py
…-lang#23)

* Add unmasked tiled matrix multiplication for triton-cpu

* clean and add comment

* move test under tutorials
* Fix RelWithDebInfo build.

Signed-off-by: Ilya Enkovich <[email protected]>

* Skip fp8 cast tests on CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Fix segfault.

Signed-off-by: Ilya Enkovich <[email protected]>

* [BACKEND] Update LLVM version to llvm/llvm-project@765206e (triton-lang#4059)

* Add -s option to pytest run.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add a workaround for LLVM bug causing test failure on Skylake CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add a workaround for LLVM fpext bug causing test failure on Skylake CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Fix formatting.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
Co-authored-by: Pablo Zimmermann <[email protected]>
digantdesai and others added 30 commits February 20, 2025 17:07
Summary: Follow the triton-lang#165 example and update one macro for building on linux-aarch64.
We only use Sleef on non-x86 platforms. Sleef APIs are not fully agnostic
of the underlying architecture. For example, `Sleef_sinf8_u10` does not
exist on Arm.

This PR makes the `MathToVecLibPass` aware of the
CPU SIMD architecture by accepting `cpu_features` as a new optional
argument.
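
For intuition, a sketch of how such a dispatch could look (assumed names, not the actual `MathToVecLibPass` code):

```
# Illustrative only: pick a Sleef symbol for an op and lane count
# based on detected CPU features. Names here are assumptions.
def sleef_symbol(op, lanes, cpu_features):
    if "neon" in cpu_features:
        lanes = min(lanes, 4)  # Arm NEON vectors are 128-bit (4 x f32)
    return f"Sleef_{op}f{lanes}_u10"  # e.g. Sleef_sinf8_u10 (AVX2), Sleef_sinf4_u10 (NEON)
```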

No change is expected on the x86 side.
Tested on my M1 Mac as:

```
OMP_NUM_THREADS=8                                          \
TRITON_LOCAL_LIBOMP_PATH="<path..to>/site-packages/torch/" \
CC=$(which clang)                                          \
TRITON_CPU_BACKEND=1                                       \
$(which python3)                                           \
python/tutorials/02-fused-softmax-cpu.py
```
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
* Fix isSigned in PrintOp

* Add float16 support for print

* Support float16 printing for old compilers
* Add pass to decompose matmul to FMA operations (see the sketch below).

Signed-off-by: Ilya Enkovich <[email protected]>

* Use block pointers and padding in 03-matrix-multiplication-cpu.py.

* Fix review comments.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
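
As referenced above, a scalar sketch of the matmul-to-FMA decomposition (illustrative arithmetic only; the actual pass rewrites vector-dialect ops):

```
import numpy as np

def matmul_fma(A, B):
    # C[m, n] = sum_k A[m, k] * B[k, n], computed as one
    # multiply-accumulate (FMA-style) rank-1 update per k.
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for k in range(K):
        C += np.outer(A[:, k], B[k, :])
    return C
```
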
* Improve AMX lowering to minimize loads and stores.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support bfloat16 in CPU matmul tutorials.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
* [Setup] Skip hatchet pip package for now

This does not exist for Darwin + Arm64.
TODO: Enable this selectively when possible.

* [CPU][driver] Skip non-existent sys paths

* [mac-arm64] Add GH CI support

- look into faster triton install
- enable bf16 tests
- enable openmp
First try to look up the Target in the given module.
If that fails, use the default target and set it in the module.

Rationale: Issue triton-lang#207
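
A minimal sketch of that fallback (the attribute key and helper here are assumptions, not the actual code):

```
def get_or_set_target(module, default_target):
    # First look for a target already recorded in the module.
    target = module.attributes.get("triton_cpu.target")  # hypothetical attribute key
    if target is None:
        # Fall back to the default target and record it in the module.
        target = default_target
        module.attributes["triton_cpu.target"] = target
    return target
```
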
This PR introduces a ukernels API to allow the use of third-party libraries such as oneDNN. These libraries provide efficient implementations of brgemm/transform and some other ops, so where possible the `triton_cpu.dot` op is replaced with a call to a kernel from the library.
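
A rough sketch of the replacement idea (hypothetical helper names; the real pass operates on MLIR ops):

```
def try_replace_dot(dot_op, ukernel_supports, build_library_call):
    # Replace triton_cpu.dot with a library call only when the ukernel
    # library (e.g. oneDNN brgemm) supports the shapes/dtypes involved;
    # otherwise keep the default lowering.
    if not ukernel_supports(dot_op.dtype, dot_op.shape):
        return False
    call = build_library_call("brgemm", dot_op.operands)
    dot_op.replace_all_uses_with(call)
    dot_op.erase()
    return True
```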

Signed-off-by: Dmitrii Makarenko <[email protected]>
Co-authored-by: Ilya Enkovich <[email protected]>
This commit adds missing headers to the runtime files.

Resolves: triton-lang#180

Signed-off-by: Dmitrii Makarenko <[email protected]>