Dmitriim/rebase main #217

Draft: wants to merge 165 commits into base: dev-upstream_main

Conversation

@Devjiu (Collaborator) commented Feb 21, 2025

To verify the rebase.

bertmaher and others added 30 commits February 19, 2025 15:48
* [BACKEND][CPU] Implement the empty cpu backend

* Run clang-format

* Fix yapf error

A quick addition on how to use it.
Summary: This is still mostly boilerplate and basic lowering for the first milestone (compiling vector addition). This PR first lowers `tt.func` and `tt.return`.

Test Plan: It can safely compile an empty kernel.

```
@triton.jit
def add_kernel(x_ptr,  y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    return
```

> TRITON_ENABLE_LLVM_DEBUG=1 TRITON_CPU_BACKEND=1 python3 empty_kernel.py

```
//===-------------------------------------------===//
Legalizing operation : 'tt.func'(0x73be2a0) {
  * Fold {
  } -> FAILURE : unable to fold

  * Pattern : 'tt.func -> ()' {
Trying to match "(anonymous namespace)::FuncOpConversion"
    ** Insert  : 'llvm.func'(0x6c04c70)
    ** Insert Block into : 'llvm.func'(0x6c04c70)
    ** Insert Block into : 'llvm.func'(0x6c04c70)
    ** Erase   : 'tt.func'(0x73be2a0)
"(anonymous namespace)::FuncOpConversion" result 1

    //===-------------------------------------------===//
    Legalizing operation : 'llvm.func'(0x6c04c70) {
    } -> SUCCESS : operation marked legal by the target
    //===-------------------------------------------===//

...

//===-------------------------------------------===//
Legalizing operation : 'tt.return'(0x73efeb0) {
  "tt.return"() : () -> ()

  * Fold {
  } -> FAILURE : unable to fold

  * Pattern : 'tt.return -> ()' {
Trying to match "(anonymous namespace)::ReturnOpConversion"
    ** Insert  : 'llvm.return'(0x73c0f00)
    ** Replace : 'tt.return'(0x73efeb0)
"(anonymous namespace)::ReturnOpConversion" result 1

    //===-------------------------------------------===//
    Legalizing operation : 'llvm.return'(0x73c0f00) {
      "llvm.return"() : () -> ()

    } -> SUCCESS : operation marked legal by the target
    //===-------------------------------------------===//
  } -> SUCCESS : pattern applied successfully
```
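
For completeness, here is a minimal end-to-end script for this path (a sketch assuming a built triton-cpu and torch; run with `TRITON_CPU_BACKEND=1`):

```
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    return  # intentionally empty: only tt.func/tt.return are lowered so far


x = torch.zeros(16)
# Launch on a 1x1x1 grid; the kernel compiles through the CPU backend and does nothing.
add_kernel[(1,)](x, x, x, x.numel(), BLOCK_SIZE=16)
```
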
…riton-lang#1)

Summary: As the title says, `tl.program_id` needs to be supported first. For now, we expect the pid to be provided as additional function arguments to the kernel, so getting the program_id maps to reading one of the last three arguments.

I also quickly implemented `tl.device_print` (or `print`), only for scalar types, for quick "Hello, World!" testing.
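
To illustrate the scheme, a hedged host-side sketch (not the actual launcher code): the grid coordinates are appended as three trailing arguments, and `tl.program_id(axis)` reads the corresponding slot.

```
# Hypothetical launch loop for this scheme; names are assumptions.
# The compiled kernel takes its normal arguments plus (pid_x, pid_y, pid_z).
def launch(kernel, args, grid):
    grid_x, grid_y, grid_z = grid
    for z in range(grid_z):
        for y in range(grid_y):
            for x in range(grid_x):
                kernel(*args, x, y, z)  # tl.program_id(0) reads the x slot
```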

Test Plan: Tested with a simple example:

```
@triton.jit
def add_kernel(...):
    pid = tl.program_id(axis=0)  # We use a 1D launch grid so axis is 0.
    foo = pid + 42
    tl.device_print("Hello, World!", foo, pid)
```

The resulting .llir is valid:
```
@printfFormat_1 = internal constant [31 x i8] c"pid (%u, %u, %u) test: %u, %u\0A\00"

declare !dbg !3 i32 @printf(ptr, ...)

define void @add_kernel(ptr addrspace(1) %0, ptr addrspace(1) %1, ptr addrspace(1) %2, i32 %3, i32 %4, i32 %5, i32 %6) !dbg !7 {
  %8 = add i32 %4, 42, !dbg !8
  %9 = call i32 (ptr, ...) @printf(ptr @printfFormat_0, i32 %4, i32 %5, i32 %6, i32 %8, i32 %4)
  ret void, !dbg !9
}
```

Tried to compile with a fake main function:
```
> % cat main.c
extern void add_kernel(float*, float*, float*, int, int, int, int);

int main() {
    add_kernel(0, 0, 0, 4, 5, 6, 7);
}

> % llc -filetype=obj add_kernel.llir && clang -o a.out add_kernel.llir.o main.c
> % ./a.out
pid (5, 6, 7) Hello, World!: 47, 5
```
…ion flows (triton-lang#6)

* Support basic lowering through vector dialect in CPU backend.

Signed-off-by: Ilya Enkovich <[email protected]>

* Use axis info in memory op lowering.

Signed-off-by: Ilya Enkovich <[email protected]>

* Mark test_ptx_cast as enabled for CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support umulhi operation.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support tl.clamp, tl.minimum, tl.maximum.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add enable_fp_fusion opt for CPU (only affects ASM dump now).

Signed-off-by: Ilya Enkovich <[email protected]>

* Fix kernel args passing for propagated constants.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add permutations support.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support 2-D transfer_read/transfer_write lowering.

Signed-off-by: Ilya Enkovich <[email protected]>

* Introduce shape info analysis and use it for loads/stores by block pointers.

Delay scalar pointers lowering.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support 'other' arg for loads.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support tl.join.

Signed-off-by: Ilya Enkovich <[email protected]>

* Minor renaming.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
…ent (triton-lang#8)

* [BACKEND][CPU] Make it buildable and runnable in a different environment

* Revert seemingly inconsistent python code formatting
Signed-off-by: Ilya Enkovich <[email protected]>
Co-authored-by: Minjang Kim <[email protected]>
…iton-lang#11)

* [CPU] Support flexible active driver + update vector-add tutorial

* Update vector-add to run CPU always + optional GPU

* Update do_bench for CPU
…ng#17)

* Fixed yaml syntax

Signed-off-by: Gregory Shimansky <[email protected]>

* Removed cpu label from runs-on

Signed-off-by: Gregory Shimansky <[email protected]>

* Added missing zlib-dev

Signed-off-by: Gregory Shimansky <[email protected]>

* Added missing apt-get update

Signed-off-by: Gregory Shimansky <[email protected]>

* Remove pip cache because it slows things down on the self-hosted runner

Signed-off-by: Gregory Shimansky <[email protected]>

* Corrected path to tests

Signed-off-by: Gregory Shimansky <[email protected]>

* Added installation of torch==2.1.2

Signed-off-by: Gregory Shimansky <[email protected]>

---------

Signed-off-by: Gregory Shimansky <[email protected]>
* [CPU] Add OpenMP launcher

* Address the comments

* Fix induction variable type

* Always use preallocated output buffer for CPU with torch.add
* [CPU] Dump human-readable asm code in TRITON_CACHE_DIR

* Don't touch the main compiler.py
…-lang#23)

* Add unmasked tiled matrix multiplication for triton-cpu

* clean and add comment

* move test under tutorials
* Fix RelWithDebInfo build.

Signed-off-by: Ilya Enkovich <[email protected]>

* Skip fp8 cast tests on CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Fix segfault.

Signed-off-by: Ilya Enkovich <[email protected]>

* [BACKEND] Update LLVM version to llvm/llvm-project@765206e (triton-lang#4059)

* Add -s option to pytest run.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add a workaround for LLVM bug causing test failure on Skylake CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add a workaround for LLVM fpext bug causing test failure on Skylake CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Fix formatting.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
Co-authored-by: Pablo Zimmermann <[email protected]>
digantdesai and others added 30 commits February 20, 2025 17:07
Summary: Follow the triton-lang#165 example and update one macro for building on linux-aarch64.
We only use Sleef on non-x86 platforms. Sleef APIs are not fully agnostic
of the underlying architecture. For example, `Sleef_sinf8_u10` does not
exist on Arm.

This PR makes the `MathToVecLibPass` aware of the
CPU SIMD architecture by accepting `cpu_features` as a new optional
argument.
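
For intuition, a sketch of how such a dispatch could look (assumed names, not the actual `MathToVecLibPass` code):

```
# Illustrative only: pick a Sleef symbol for an op and lane count
# based on detected CPU features. Names here are assumptions.
def sleef_symbol(op, lanes, cpu_features):
    if "neon" in cpu_features:
        lanes = min(lanes, 4)  # Arm NEON vectors are 128-bit (4 x f32)
    return f"Sleef_{op}f{lanes}_u10"  # e.g. Sleef_sinf8_u10 (AVX2), Sleef_sinf4_u10 (NEON)
```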

No change is expected on the x86 side.
Tested on my M1 Mac as:

```
OMP_NUM_THREADS=8                                          \
TRITON_LOCAL_LIBOMP_PATH="<path..to>/site-packages/torch/" \
CC=$(which clang)                                          \
TRITON_CPU_BACKEND=1                                       \
$(which python3)                                           \
python/tutorials/02-fused-softmax-cpu.py
```
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
* Fix isSigned in PrintOp

* Add float16 support for print

* Support float16 printing for old compilers
* Add pass to decompose matmul to FMA operations (see the sketch below).

Signed-off-by: Ilya Enkovich <[email protected]>

* Use block pointers and padding in 03-matrix-multiplication-cpu.py.

* Fix review comments.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
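
As referenced above, a scalar sketch of the matmul-to-FMA decomposition (illustrative arithmetic only; the actual pass rewrites vector-dialect ops):

```
import numpy as np

def matmul_fma(A, B):
    # C[m, n] = sum_k A[m, k] * B[k, n], computed as one
    # multiply-accumulate (FMA-style) rank-1 update per k.
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for k in range(K):
        C += np.outer(A[:, k], B[k, :])
    return C
```
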
* Improve AMX lowering to minimize loads and stores.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support bfloat16 in CPU matmul tutorials.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
* [Setup] Skip hatchet pip package for now

This does not exist for Darwin + Arm64.
TODO: Enable this selectively when possible.

* [CPU][driver] Skip non-existent sys paths

* [mac-arm64] Add GH CI support

- look into faster triton install
- enable bf16 tests
- enable openmp
First try to look up the Target in the given module.
If that fails, use the default target and set it in the module.

Rationale: Issue triton-lang#207
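
A minimal sketch of that fallback (the attribute key and helper here are assumptions, not the actual code):

```
def get_or_set_target(module, default_target):
    # First look for a target already recorded in the module.
    target = module.attributes.get("triton_cpu.target")  # hypothetical attribute key
    if target is None:
        # Fall back to the default target and record it in the module.
        target = default_target
        module.attributes["triton_cpu.target"] = target
    return target
```
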
This PR introduces a ukernels API to allow the use of third-party libraries such as oneDNN. These libraries provide efficient implementations of brgemm/transform and some other ops, so where possible the `triton_cpu.dot` op is replaced with a call to a kernel from the library.
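
A rough sketch of the replacement idea (hypothetical helper names; the real pass operates on MLIR ops):

```
def try_replace_dot(dot_op, ukernel_supports, build_library_call):
    # Replace triton_cpu.dot with a library call only when the ukernel
    # library (e.g. oneDNN brgemm) supports the shapes/dtypes involved;
    # otherwise keep the default lowering.
    if not ukernel_supports(dot_op.dtype, dot_op.shape):
        return False
    call = build_library_call("brgemm", dot_op.operands)
    dot_op.replace_all_uses_with(call)
    dot_op.erase()
    return True
```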

Signed-off-by: Dmitrii Makarenko <[email protected]>
Co-authored-by: Ilya Enkovich <[email protected]>
This commit adds missing headers to the runtime files.

Resolves: triton-lang#180

Signed-off-by: Dmitrii Makarenko <[email protected]>