forked from triton-lang/triton
Dmitriim/rebase main #217 (Draft)

Devjiu wants to merge 165 commits into triton-lang:dev-upstream_main from Devjiu:dmitriim/rebase_main
Conversation
* [BACKEND][CPU] Implement the empty CPU backend
* Run clang-format
* Fix yapf error
A quick addition on how to use it.
Summary: This is still boilerplate and basic lowering for the first milestone (compiling vector addition). This PR first lowers `tt.func` and `tt.return`.

Test Plan: It can safely compile an empty kernel.

```
@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    return
```

> TRITON_ENABLE_LLVM_DEBUG=1 TRITON_CPU_BACKEND=1 python3 empty_kerne.py

```
//===-------------------------------------------===//
Legalizing operation : 'tt.func'(0x73be2a0) {
  * Fold {
  } -> FAILURE : unable to fold
  * Pattern : 'tt.func -> ()' {
    Trying to match "(anonymous namespace)::FuncOpConversion"
    ** Insert : 'llvm.func'(0x6c04c70)
    ** Insert Block into : 'llvm.func'(0x6c04c70)
    ** Insert Block into : 'llvm.func'(0x6c04c70)
    ** Erase : 'tt.func'(0x73be2a0)
    "(anonymous namespace)::FuncOpConversion" result 1

  //===-------------------------------------------===//
  Legalizing operation : 'llvm.func'(0x6c04c70) {
  } -> SUCCESS : operation marked legal by the target
  //===-------------------------------------------===//
...
//===-------------------------------------------===//
Legalizing operation : 'tt.return'(0x73efeb0) {
  "tt.return"() : () -> ()
  * Fold {
  } -> FAILURE : unable to fold
  * Pattern : 'tt.return -> ()' {
    Trying to match "(anonymous namespace)::ReturnOpConversion"
    ** Insert : 'llvm.return'(0x73c0f00)
    ** Replace : 'tt.return'(0x73efeb0)
    "(anonymous namespace)::ReturnOpConversion" result 1

  //===-------------------------------------------===//
  Legalizing operation : 'llvm.return'(0x73c0f00) {
    "llvm.return"() : () -> ()
  } -> SUCCESS : operation marked legal by the target
  //===-------------------------------------------===//
  } -> SUCCESS : pattern applied successfully
```
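The legalization trace above shows MLIR's dialect-conversion driver trying fold, then pattern-matching `FuncOpConversion` and `ReturnOpConversion`. As a rough, toy Python sketch of that rewrite loop (plain dicts stand in for MLIR ops; all names are illustrative, not the real MLIR API):

```python
# Toy sketch of pattern-based op legalization, loosely mirroring the
# dialect-conversion trace above. Real Triton lowering works on MLIR IR.

def func_op_conversion(op):
    # tt.func -> llvm.func, keeping the body for further legalization.
    if op["name"] == "tt.func":
        return {"name": "llvm.func", "body": op["body"]}
    return None

def return_op_conversion(op):
    # tt.return -> llvm.return.
    if op["name"] == "tt.return":
        return {"name": "llvm.return", "body": []}
    return None

PATTERNS = [func_op_conversion, return_op_conversion]
LEGAL = {"llvm.func", "llvm.return"}  # ops the LLVM target accepts

def legalize(op):
    # Apply the first matching pattern, then check target legality and
    # recurse into the (possibly rewritten) body.
    for pattern in PATTERNS:
        replacement = pattern(op)
        if replacement is not None:
            op = replacement
            break
    if op["name"] not in LEGAL:
        raise RuntimeError(f"failed to legalize {op['name']}")
    op["body"] = [legalize(child) for child in op["body"]]
    return op

kernel = {"name": "tt.func", "body": [{"name": "tt.return", "body": []}]}
lowered = legalize(kernel)
```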
…riton-lang#1)

Summary: As title, `tl.program_id` needs to be supported first. As of now, we think pid will be provided as additional function arguments to the kernel, so getting program_id is mapped to reading one of the last three arguments. I also quickly implemented `tl.device_print` (or `print`), only for scalar types, for a quick "Hello, World!" testing.

Test Plan: Tested with a simple example:

```
@triton.jit
def add_kernel(...):
    pid = tl.program_id(axis=0)  # We use a 1D launch grid so axis is 0.
    foo = pid + 42
    tl.device_print("Hello, World!", foo, pid)
```

The resulting .llir is valid:

```
@printfFormat_1 = internal constant [31 x i8] c"pid (%u, %u, %u) test: %u, %u\0A\00"

declare !dbg !3 i32 @printf(ptr, ...)

define void @add_kernel(ptr addrspace(1) %0, ptr addrspace(1) %1, ptr addrspace(1) %2, i32 %3, i32 %4, i32 %5, i32 %6) !dbg !7 {
  %8 = add i32 %4, 42, !dbg !8
  %9 = call i32 (ptr, ...) @printf(ptr @printfFormat_0, i32 %4, i32 %5, i32 %6, i32 %8, i32 %4)
  ret void, !dbg !9
}
```

Tried to compile with a fake main function:

```
> % cat main.c
extern void add_kernel(float*, float*, float*, int, int, int, int);
int main() { add_kernel(0, 0, 0, 4, 5, 6, 7); }
> % llc -filetype=obj add_kernel.llir && clang -o a.out add_kernel.llir.o main.c
> % ./a.out
pid (5, 6, 7) Hello, World!: 47, 5
```
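Per the summary above, program IDs arrive as the last three kernel arguments appended by the host-side launcher. A minimal pure-Python sketch of that calling convention (the `launch` helper and `device_print` stand-in are hypothetical, for illustration only):

```python
def device_print(prefix, *values):
    # Stand-in for tl.device_print: the real backend emits a printf call.
    print(prefix, *values)

def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, pid_x, pid_y, pid_z):
    # tl.program_id(axis=0) lowers to reading the first trailing argument.
    pid = pid_x
    foo = pid + 42
    device_print("Hello, World!", foo, pid)
    return foo, pid

def launch(kernel, grid, *args):
    # Host-side launcher: append (pid_x, pid_y, pid_z) to the user args
    # for every point in the launch grid.
    results = []
    for x in range(grid[0]):
        for y in range(grid[1]):
            for z in range(grid[2]):
                results.append(kernel(*args, x, y, z))
    return results

out = launch(add_kernel, (2, 1, 1), None, None, None, 4)
```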
Signed-off-by: Ilya Enkovich <[email protected]>
Co-authored-by: Shane Nay <[email protected]>
…n-lang#4) Signed-off-by: Ilya Enkovich <[email protected]>
…ion flows (triton-lang#6)

* Support basic lowering through vector dialect in CPU backend.
* Use axis info in memory op lowering.
* Mark test_ptx_cast as enabled for CPU.
* Support umulhi operation.
* Support tl.clamp, tl.minimum, tl.maximum.
* Add enable_fp_fusion opt for CPU (only affects ASM dump now).
* Fix kernel args passing for propagated constants.
* Add permutations support.
* Support 2-D transfer_read/transfer_write lowering.
* Introduce shape info analysis and use it for loads/stores by block pointers. Delay scalar pointers lowering.
* Support 'other' arg for loads.
* Support tl.join.
* Minor renaming.

Signed-off-by: Ilya Enkovich <[email protected]>
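One item in this commit is support for the `other` argument of loads. Semantically, `tl.load(ptr + offsets, mask=m, other=v)` yields the loaded element where the mask is true and `v` elsewhere, so masked-off lanes never touch memory. A pure-Python sketch of that semantics (a list stands in for device memory):

```python
def masked_load(buffer, offsets, mask, other=0):
    # Mirrors tl.load semantics: masked-off (e.g. out-of-bounds) lanes
    # produce `other` instead of reading memory.
    return [buffer[o] if m else other for o, m in zip(offsets, mask)]

data = [10, 20, 30, 40]
n_elements = 3                     # logical length; last slot is padding
offsets = [0, 1, 2, 3]
mask = [o < n_elements for o in offsets]
loaded = masked_load(data, offsets, mask, other=-1)
```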
Signed-off-by: Ilya Enkovich <[email protected]>
…ent (triton-lang#8)

* [BACKEND][CPU] Make it buildable and runnable in a different environment
* Revert seemingly inconsistent python code formatting
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]> Co-authored-by: Minjang Kim <[email protected]>
…iton-lang#11)

* [CPU] Support flexible active driver + update vector-add tutorial
* Update vector-add to run CPU always + optional GPU
* Update do_bench for CPU
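The tutorial update above runs the CPU path unconditionally and the GPU path only when one is available. A hedged sketch of that selection logic (the `TRITON_CPU_BACKEND` variable is the one used throughout this PR; the helper itself is illustrative):

```python
import os

def pick_devices(gpu_available, force_cpu=False):
    # CPU is always exercised; GPU is optional and skipped when the
    # CPU backend is forced.
    devices = ["cpu"]
    if gpu_available and not force_cpu:
        devices.append("cuda")
    return devices

# TRITON_CPU_BACKEND=1 forces the CPU backend even when a GPU exists.
force = os.environ.get("TRITON_CPU_BACKEND") == "1"
chosen = pick_devices(gpu_available=False, force_cpu=force)
```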
…ng#16) Signed-off-by: Gregory Shimansky <[email protected]>
…ng#17)

* Fixed yaml syntax
* Removed cpu label from run-on
* Added missing zlib-dev
* Added missing apt-get update
* Remove pip cache because on self-hosted runner it slows things down
* Corrected path to tests
* Added installation of torch==2.1.2

Signed-off-by: Gregory Shimansky <[email protected]>
* [CPU] Add OpenMP launcher
* Address the comments
* Fix induction variable type
* Always use preallocated output buffer for CPU with torch.add
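The OpenMP launcher iterates the launch grid and runs kernel instances across threads. A pure-Python analogue using a thread pool (the real launcher is generated C++ using an OpenMP parallel-for over the flattened grid; this sketch only illustrates the shape of that loop):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def launch_grid(kernel, grid, args, num_threads=4):
    # Enumerate all program ids, then run kernel instances in parallel,
    # like a parallel-for over the flattened (x, y, z) grid.
    pids = list(product(range(grid[0]), range(grid[1]), range(grid[2])))
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(kernel, *args, *pid) for pid in pids]
        return [f.result() for f in futures]

def add_one(buf, pid_x, pid_y, pid_z):
    # Each instance writes only its own slot, so there are no data races.
    buf[pid_x] = pid_x + 1

buf = [0] * 4
launch_grid(add_one, (4, 1, 1), (buf,))
```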
Signed-off-by: Ilya Enkovich <[email protected]>
* [CPU] Dump human-readable asm code in TRITON_CACHE_DIR
* Don't touch the main compiler.py
Signed-off-by: Gregory Shimansky <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
…-lang#23)

* Add un-masked tiled matrix-multiplication for triton-cpu
* Clean and add comment
* Move test under tutorials
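The "un-masked" tiled matmul above assumes the matrix dimensions divide the tile size evenly, so no boundary masks are needed. A pure-Python sketch of that tiling scheme (tile size and loop nest are illustrative, not the tutorial's actual block sizes):

```python
def tiled_matmul(a, b, m, n, k, tile=2):
    # Un-masked variant: assumes m, n, k are multiples of `tile`,
    # so every tile is full and no boundary masking is required.
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # Accumulate one (tile x tile) block of C from
                # blocks of A and B.
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        for kk in range(k0, k0 + tile):
                            c[i][j] += a[i][kk] * b[kk][j]
    return c

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = tiled_matmul(a, b, 2, 2, 2)
```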
* Fix RelWithDebInfo build.
* Skip fp8 cast tests on CPU.
* Fix segfault.
* [BACKEND] Update LLVM version to llvm/llvm-project@765206e (triton-lang#4059)
* Add -s option to pytest run.
* Add a workaround for LLVM bug causing test failure on Skylake CPU.
* Add a workaround for LLVM fpext bug causing test failure on Skylake CPU.
* Fix formatting.

Signed-off-by: Ilya Enkovich <[email protected]>
Co-authored-by: Pablo Zimmermann <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
) Signed-off-by: Ilya Enkovich <[email protected]>
Summary: Following the triton-lang#165 example, update one macro for building on linux-aarch64.
We only use Sleef on non-x86 platforms. Sleef APIs are not fully agnostic of the underlying architecture; for example, `Sleef_sinf8_u10` does not exist on Arm. This PR makes the `MathToVecLibPass` aware of the CPU SIMD architecture by accepting `cpu_features` as a new optional argument. No change is expected on the x86 side.
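Making the pass aware of the SIMD architecture amounts to choosing a vector-library symbol per (ISA, operation, lane count). A hedged sketch of such a lookup, assuming a small mapping table (the Arm symbol name below is illustrative; consult Sleef's headers for the actual advsimd names):

```python
# Hypothetical mapping table: (isa, op, lanes) -> vector-library symbol.
# Sleef_sinf8_u10 (8-lane f32 sin) exists for x86 AVX but has no direct
# Arm counterpart; NEON/AdvSIMD uses 4-lane variants instead.
VECLIB_SYMBOLS = {
    ("avx2", "sin", 8): "Sleef_sinf8_u10",
    ("neon", "sin", 4): "Sleef_sinf4_u10advsimd",  # illustrative name
}

def pick_veclib_symbol(cpu_features, op, lanes):
    # Derive the ISA from the cpu_features set, then look up a symbol.
    # Returning None means: keep the scalar/generic lowering.
    isa = "avx2" if "avx2" in cpu_features else "neon"
    return VECLIB_SYMBOLS.get((isa, op, lanes))
```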
Tested on my M1 Mac as:

```
OMP_NUM_THREADS=8 \
TRITON_LOCAL_LIBOMP_PATH="<path..to>/site-packages/torch/" \
CC=$(which clang) \
TRITON_CPU_BACKEND=1 \
$(which python3) python/tutorials/02-fused-softmax-cpu.py
```
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
* Fix isSigned in PrintOp
* Add float16 support for print
* Support float16 printing for old compilers
Signed-off-by: Ilya Enkovich <[email protected]>
* Add pass to decompose matmul to FMA operations.
* Use block pointers and padding in 03-matrix-multiplication-cpu.py.
* Fix review comments.

Signed-off-by: Ilya Enkovich <[email protected]>
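Decomposing a matmul into FMA operations replaces the dot op with a chain of fused multiply-adds over the reduction dimension: each output element is built by repeatedly fusing one multiply with the running accumulator. A pure-Python sketch (the `fma` helper models the fused op as `a * b + c`; hardware performs it in one instruction without intermediate rounding):

```python
def fma(a, b, c):
    # Model of a fused multiply-add; real hardware fuses this into a
    # single instruction with one rounding step.
    return a * b + c

def matmul_as_fma(a, b, m, n, k):
    # Replace the dot op with k chained FMAs per output element:
    # C[i][j] = fma(A[i][k-1], B[k-1][j], ... fma(A[i][0], B[0][j], 0)).
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for kk in range(k):
            for j in range(n):
                c[i][j] = fma(a[i][kk], b[kk][j], c[i][j])
    return c

c = matmul_as_fma([[1, 2]], [[3, 4], [5, 6]], 1, 2, 2)
```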
* Improve AMX lowering to minimize loads and stores.
* Support bfloat16 in CPU matmul tutorials.

Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
* [Setup] Skip hatchet pip package for now. This does not exist for Darwin + Arm64. TODO: Enable this selectively when possible.
* [CPU][driver] Skip non-existent sys paths
* [mac-arm64] Add GH CI support
  - look into faster triton install
  - enable bf16 tests
  - enable openmp
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
First, try to look up the Target in the given module. If that fails, use the default target and set it in the module. Rationale: Issue triton-lang#207
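A minimal sketch of that lookup-then-default logic (a dict stands in for the module's attributes; names are illustrative):

```python
DEFAULT_TARGET = "x86_64-unknown-linux-gnu"  # illustrative default

def resolve_target(module):
    # First try the target recorded in the module; on a miss, fall back
    # to the default and record it so later passes see the same target.
    target = module.get("target")
    if target is None:
        target = DEFAULT_TARGET
        module["target"] = target
    return target

mod = {}
resolved = resolve_target(mod)
```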
This PR introduces a ukernels API to allow the use of third-party libraries such as oneDNN. These libraries provide efficient implementations of brgemm/transform and some other ops, so `triton_cpu.dot` ops are replaced, where possible, with calls to library kernels.

Signed-off-by: Dmitrii Makarenko <[email protected]>
Co-authored-by: Ilya Enkovich <[email protected]>
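A hedged sketch of the substitution described above: a rewrite pass that swaps a `triton_cpu.dot` op for a library ukernel call when the library supports it (the support predicate, op records, and `ukernel.brgemm` name are all illustrative; the real pass rewrites MLIR operations):

```python
def onednn_supports(op):
    # Illustrative predicate: pretend the library handles f32/bf16 dots.
    return op["name"] == "triton_cpu.dot" and op["dtype"] in ("f32", "bf16")

def replace_with_ukernels(ops):
    # Walk the op list; where the library can take over, swap the
    # generic dot for a brgemm ukernel call, otherwise keep the op.
    out = []
    for op in ops:
        if onednn_supports(op):
            out.append({"name": "ukernel.brgemm", "dtype": op["dtype"]})
        else:
            out.append(op)
    return out

ops = [{"name": "triton_cpu.dot", "dtype": "f32"},
       {"name": "triton_cpu.dot", "dtype": "f8"}]
rewritten = replace_with_ukernels(ops)
```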
This commit adds missing headers to runtime files. Resolves: triton-lang#180

Signed-off-by: Dmitrii Makarenko <[email protected]>
Signed-off-by: Dmitrii Makarenko <[email protected]>
To Verify rebase