-
Notifications
You must be signed in to change notification settings - Fork 14
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
271 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,271 @@ | ||
+++ | ||
title = "Technical preview: Metal.jl" | ||
author = "Tim Besard" | ||
abstract = """ | ||
Julia has gained a new GPU back-end: Metal.jl, for working with Apple's M1 | ||
GPUs. The back-end is built on the same foundations that make up existing | ||
GPU packages like CUDA.jl and AMDGPU.jl, so it should be familiar to anybody | ||
who's already programmed GPUs in Julia. In the following post I'll demonstrate | ||
some of that functionality and explain how it works. | ||
""" | ||
+++ | ||
|
||
{{abstract}} | ||
|
||
But first, note that **[Metal.jl](https://github.com/JuliaGPU/Metal.jl) is under | ||
heavy development**: The package is considered experimental for now, as we're | ||
still working on squashing bugs and adding essential functionality. We also | ||
haven't optimized for performance yet. If you're interesting in using Metal.jl, | ||
please consider contributing to its development! Most of the package is written | ||
in Julia, and checking-out the source code is a single `Pkg.develop` away :-) | ||
|
||
## Quick start | ||
|
||
Start by getting a hold of the upcoming [Julia | ||
1.8](https://julialang.org/downloads/#upcoming_release), launch it, and enter | ||
the package manager by pressing `]`: | ||
|
||
```text | ||
julia> ] | ||
pkg> add Metal | ||
Installed Metal | ||
``` | ||
|
||
Installation is as easy as that, and we'll automatically download the necessary | ||
binary artifacts (a C wrapper for the Metal APIs, and an LLVM back-end). Then, | ||
leave the package manager by pressing backspace, import the Metal package, and | ||
e.g. call the `versioninfo()` method for some details on the toolchain: | ||
|
||
```text | ||
julia> using Metal | ||
julia> Metal.versioninfo() | ||
macOS 13.0.0, Darwin 21.3.0 | ||
Toolchain: | ||
- Julia: 1.8.0-rc1 | ||
- LLVM: 13.0.1 | ||
1 device: | ||
- Apple M1 Pro (64.000 KiB allocated) | ||
``` | ||
|
||
And there we go! You'll note here that I'm using the upcoming macOS 13 | ||
(Ventura); this is currently the only supported operating system. We also only | ||
support M-series GPUs, even though Metal does support other GPUs. These choices | ||
were made to simplify development, and aren't technical limitations. In fact, | ||
Metal.jl _does_ work on e.g. macOS Monterey with an Intel GPU, but it's an | ||
untested combination that may suffer from bugs. | ||
|
||
|
||
## Array programming | ||
|
||
Just like our other GPU back-ends, Metal.jl offers an array abstraction that | ||
greatly simplifies GPU programming. The abstraction centers around the | ||
`MtlArray` type that can be used to manage memory and perform GPU computations: | ||
|
||
``` | ||
# allocate + initialize | ||
julia> a = MtlArray(rand(Float32, 2, 2)) | ||
2×2 MtlArray{Float32, 2}: | ||
0.158752 0.836366 | ||
0.535798 0.153554 | ||
# perform some GPU-accelerated operations | ||
julia> b = a * a | ||
2×2 MtlArray{Float32, 2}: | ||
0.473325 0.261202 | ||
0.167333 0.471702 | ||
# back to the CPU | ||
julia> Array(b) | ||
2×2 Matrix{Float32}: | ||
0.473325 0.261202 | ||
0.167333 0.471702 | ||
``` | ||
|
||
Beyond these simple operations, Julia's higher-order array abstractions can be | ||
used to express more complex operations without ever having to write a kernel: | ||
|
||
``` | ||
julia> mapreduce(sin, +, a; dims=1) | ||
1×2 MtlArray{Float32, 2}: | ||
1.15276 0.584146 | ||
julia> cos.(a .+ 2) .* 3 | ||
2×2 MtlArray{Float32, 2}: | ||
-2.0472 -1.25332 | ||
-2.96594 -2.60351 | ||
``` | ||
|
||
Much of this functionality comes from the | ||
[GPUArrays.jl](https://github.com/JuliaGPU/GPUArrays.jl/) package, which | ||
provides vendor-neutral implementations of common array operations. As a result, | ||
`MtlArray` is already pretty capable, and should be usable with realistic | ||
array-based applications. | ||
|
||
|
||
## Kernel programming | ||
|
||
Metal.jl's array operations are implemented in Julia, using our native kernel | ||
programming capabilities and accompanying JIT-compiler. A small demonstration: | ||
|
||
```julia | ||
# a simple kernel that sets elements of an array to a value | ||
function memset_kernel(array, value) | ||
i = thread_position_in_grid_1d() | ||
if i <= length(array) | ||
@inbounds array[i] = value | ||
end | ||
return | ||
end | ||
|
||
a = MtlArray{Float32}(undef, 512) | ||
@metal threads=512 grid=2 memset_kernel(a, 42) | ||
|
||
# verify | ||
@assert all(isequal(42), Array(a)) | ||
``` | ||
|
||
As can be seen here, we've opted to deviate slightly from the Metal Shading | ||
Language, instead providing a programming experience that's similar to Julia's | ||
existing back-ends. Some key differences: | ||
|
||
- we use intrinsic functions instead of special kernel function arguments to | ||
access properties like the thread position, grid size, ...; | ||
- all types of arguments (buffers, indirect buffers, value-typed inputs) are | ||
transparently converted to a GPU-compatible structure[^1]; | ||
- global (task-bound) state is used to keep track of the active device and a | ||
queue; | ||
- compute pipeline set-up and command encoding is hidden behind a single macro. | ||
|
||
Behind the scenes, we compile Julia to LLVM IR and use a [tiny LLVM | ||
back-end](https://github.com/JuliaGPU/llvm-metal) (based on | ||
[@a2flo](https://github.com/a2flo)'s [libfloor](https://github.com/a2flo/floor)) | ||
that (re)writes the bitcode to a Metal-compatible library containing LLVM 5 | ||
bitcode. You can inspect the generated IR using `@device_code_metal`: | ||
|
||
``` | ||
julia> @device_code_metal @metal threads=512 grid=2 memset_kernel(a, 42) | ||
``` | ||
```text | ||
[header] | ||
program_count: 1 | ||
... | ||
[program] | ||
name: julia_memset_kernel | ||
type: kernel | ||
... | ||
``` | ||
```llvm | ||
target datalayout = "..." | ||
target triple = "air64-apple-macosx13.0.0" | ||
; the (rewritten) kernel function: | ||
; - %value argument passed by reference | ||
; - %thread_position_in_grid argument added | ||
; - sitofp rewritten to AIR-specific intrinsic | ||
define void @julia_memset_kernel( | ||
{ i8 addrspace(1)*, [1 x i64] } addrspace(1)* %array, | ||
i64 addrspace(1)* %value, | ||
i32 %thread_position_in_grid) { | ||
... | ||
%9 = tail call float @air.convert.f.f32.s.i64(i64 %7) | ||
... | ||
ret void | ||
} | ||
; minimal required argument metadata | ||
!air.kernel = !{!10} | ||
!10 = !{void ({ i8 addrspace(1)*, [1 x i64] } addrspace(1)*, | ||
i64 addrspace(1)*, i32)* @julia_memset_kernel, !11, !12} | ||
!12 = !{!13, !14, !15} | ||
!13 = !{i32 0, !"air.buffer", !"air.location_index", i32 0, i32 1, | ||
!"air.read_write", !"air.address_space", i32 1, | ||
!"air.arg_type_size", i32 16, !"air.arg_type_align_size", i32 8} | ||
!14 = !{i32 1, !"air.buffer", !"air.location_index", i32 1, i32 1, | ||
!"air.read_write", !"air.address_space", i32 1, | ||
!"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8} | ||
!15 = !{i32 0, !"air.thread_position_in_grid"} | ||
; other metadata not shown, for brevity | ||
``` | ||
|
||
Shout-out to [@max-Hawkins](https://github.com/max-Hawkins) for exploring Metal | ||
code generation during his internship at Julia Computing! | ||
|
||
|
||
## Metal APIs in Julia | ||
|
||
Lacking an Objective C or C++ FFI, we interface with the Metal libraries using | ||
[a shim C library](https://github.com/recp/cmt). Most users won't have to | ||
interface with Metal directly -- the array abstraction is sufficient for many -- | ||
but more experienced developers can make use of the high-level wrappers that | ||
we've designed for the Metal APIs: | ||
|
||
```julia | ||
julia> dev = MtlDevice(1) | ||
MtlDevice: | ||
name: Apple M1 Pro | ||
lowpower: false | ||
headless: false | ||
removable: false | ||
unified memory: true | ||
|
||
julia> desc = MtlHeapDescriptor() | ||
MtlHeapDescriptor: | ||
type: MtHeapTypeAutomatic | ||
storageMode: MtStorageModePrivate | ||
size: 0 | ||
|
||
julia> desc.size = 16384 | ||
16384 | ||
|
||
julia> heap = MtlHeap(dev, desc) | ||
MtlHeap: | ||
type: MtHeapTypeAutomatic | ||
size: 16384 | ||
usedSize: 0 | ||
currentAllocatedSize: 16384 | ||
|
||
# etc | ||
``` | ||
|
||
These wrappers are based on [@PhilipVinc](https://github.com/PhilipVinc)'s | ||
excellent work on MetalCore.jl, which formed the basis for (and has been folded | ||
into) Metal.jl. | ||
|
||
|
||
## What's next? | ||
|
||
The current release of Metal.jl focusses on code generation capabilities, and is | ||
meant as a preview for users and developers to try out on their system or with | ||
their specific GPU application. It is not production-ready yet, and is lacking | ||
some crucial features: | ||
|
||
- performance optimization | ||
- integration with Metal Performance Shaders | ||
- integration / documentation for use with Xcode tools | ||
- fleshing out the array abstraction based on user feedback | ||
|
||
**Please consider helping out with any of these!** Since Metal.jl and its | ||
dependencies are almost entirely implemented in Julia, any experience with the | ||
language is sufficient to contribute. If you're not certain, or have any | ||
questions, please drop by the `#gpu` channel on [the JuliaLang | ||
Slack](https://julialang.org/slack/), ask questions on our | ||
[Discourse](https://discourse.julialang.org/c/domain/gpu/11), or chat to us | ||
during the [GPU office hours](https://julialang.org/community/#events) every | ||
other Monday. | ||
|
||
If you encounter any bugs, feel free to let us know on the [Metal.jl issue | ||
tracker](https://github.com/JuliaGPU/Metal.jl/issues). For information on | ||
upcoming releases, [subscribe](https://juliagpu.org/post/) to this website's | ||
blog were we post about significant developments in Julia's GPU world. | ||
|
||
--- | ||
|
||
[^1]: This relies on Metal 3 from macOS 13, which introduced bindless argument | ||
buffers, as we didn't fully figure out how to reliably encode | ||
arbitrarily-nested indirect buffers in argument encoder metadata. |