+++
title = "Technical preview: Metal.jl"
author = "Tim Besard"
abstract = """
Julia has gained a new GPU back-end: Metal.jl, for working with Apple's M1
GPUs. The back-end is built on the same foundations that make up existing
GPU packages like CUDA.jl and AMDGPU.jl, so it should be familiar to anybody
who's already programmed GPUs in Julia. In the following post I'll demonstrate
some of that functionality and explain how it works.
"""
+++

{{abstract}}

But first, note that **[Metal.jl](https://github.com/JuliaGPU/Metal.jl) is under
heavy development**: The package is considered experimental for now, as we're
still working on squashing bugs and adding essential functionality. We also
haven't optimized for performance yet. If you're interested in using Metal.jl,
please consider contributing to its development! Most of the package is written
in Julia, and checking out the source code is a single `Pkg.develop` away :-)
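
For example, a minimal sketch using the standard package-manager workflow (the checkout location mentioned in the comment is just the default):

```julia
using Pkg

# check out the Metal.jl sources for local development
# (by default this clones into ~/.julia/dev/Metal)
Pkg.develop("Metal")
```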

## Quick start

Start by getting a hold of the upcoming [Julia
1.8](https://julialang.org/downloads/#upcoming_release), launch it, and enter
the package manager by pressing `]`:

```text
julia> ]
pkg> add Metal
Installed Metal
```

Installation is as easy as that, and we'll automatically download the necessary
binary artifacts (a C wrapper for the Metal APIs, and an LLVM back-end). Then,
leave the package manager by pressing backspace, import the Metal package, and
e.g. call the `versioninfo()` method for some details on the toolchain:

```text
julia> using Metal
julia> Metal.versioninfo()
macOS 13.0.0, Darwin 21.3.0
Toolchain:
- Julia: 1.8.0-rc1
- LLVM: 13.0.1
1 device:
- Apple M1 Pro (64.000 KiB allocated)
```

And there we go! You'll note here that I'm using the upcoming macOS 13
(Ventura); this is currently the only supported operating system. We also only
support M-series GPUs, even though Metal does support other GPUs. These choices
were made to simplify development, and aren't technical limitations. In fact,
Metal.jl _does_ work on e.g. macOS Monterey with an Intel GPU, but it's an
untested combination that may suffer from bugs.


## Array programming

Just like our other GPU back-ends, Metal.jl offers an array abstraction that
greatly simplifies GPU programming. The abstraction centers around the
`MtlArray` type that can be used to manage memory and perform GPU computations:

```
# allocate + initialize
julia> a = MtlArray(rand(Float32, 2, 2))
2×2 MtlArray{Float32, 2}:
0.158752 0.836366
0.535798 0.153554
# perform some GPU-accelerated operations
julia> b = a * a
2×2 MtlArray{Float32, 2}:
0.473325 0.261202
0.167333 0.471702
# back to the CPU
julia> Array(b)
2×2 Matrix{Float32}:
0.473325 0.261202
0.167333 0.471702
```

Beyond these simple operations, Julia's higher-order array abstractions can be
used to express more complex operations without ever having to write a kernel:

```
julia> mapreduce(sin, +, a; dims=1)
1×2 MtlArray{Float32, 2}:
1.15276 0.584146
julia> cos.(a .+ 2) .* 3
2×2 MtlArray{Float32, 2}:
-2.0472 -1.25332
-2.96594 -2.60351
```

Much of this functionality comes from the
[GPUArrays.jl](https://github.com/JuliaGPU/GPUArrays.jl/) package, which
provides vendor-neutral implementations of common array operations. As a result,
`MtlArray` is already pretty capable, and should be usable with realistic
array-based applications.
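
As a small illustration (the `relu_sumsq` name is made up for this example), a routine written generically against `AbstractArray` also works on an `MtlArray`, because broadcasting and `mapreduce` dispatch to the GPUArrays.jl implementations:

```julia
using Metal

# nothing Metal-specific here: plain broadcasting and a reduction
relu_sumsq(xs::AbstractArray) = sum(abs2, max.(xs, 0f0))

x_cpu = randn(Float32, 1024)
x_gpu = MtlArray(x_cpu)

relu_sumsq(x_cpu)  # runs on the CPU
relu_sumsq(x_gpu)  # runs on the GPU, returning a scalar to the CPU
@assert relu_sumsq(x_cpu) ≈ relu_sumsq(x_gpu)
```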


## Kernel programming

Metal.jl's array operations are implemented in Julia, using our native kernel
programming capabilities and accompanying JIT-compiler. A small demonstration:

```julia
# a simple kernel that sets elements of an array to a value
function memset_kernel(array, value)
    i = thread_position_in_grid_1d()
    if i <= length(array)
        @inbounds array[i] = value
    end
    return
end

a = MtlArray{Float32}(undef, 512)
@metal threads=512 grid=2 memset_kernel(a, 42)

# verify
@assert all(isequal(42), Array(a))
```

As can be seen here, we've opted to deviate slightly from the Metal Shading
Language, instead providing a programming experience that's similar to Julia's
existing GPU back-ends. Some key differences (see the sketch after this list):

- we use intrinsic functions instead of special kernel function arguments to
access properties like the thread position, grid size, ...;
- all types of arguments (buffers, indirect buffers, value-typed inputs) are
transparently converted to a GPU-compatible structure[^1];
- global (task-bound) state is used to keep track of the active device and a
queue;
- compute pipeline setup and command encoding are hidden behind a single macro.
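
To make the argument-conversion point concrete, here's a minimal sketch (the kernel and struct names are made up for this example) that passes a plain isbits struct to a kernel as a value-typed argument alongside a regular buffer, relying only on the intrinsics and the `@metal` macro shown above:

```julia
using Metal

# an ordinary isbits struct, passed to the kernel by value
struct AffineParams
    scale::Float32
    offset::Float32
end

# x[i] = scale * x[i] + offset
function affine_kernel(x, p::AffineParams)
    i = thread_position_in_grid_1d()
    if i <= length(x)
        @inbounds x[i] = p.scale * x[i] + p.offset
    end
    return
end

x = MtlArray(ones(Float32, 256))
@metal threads=256 affine_kernel(x, AffineParams(2f0, 1f0))

# verify: 2 * 1 + 1 == 3
@assert all(isequal(3f0), Array(x))
```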

Behind the scenes, we compile Julia to LLVM IR and use a [tiny LLVM
back-end](https://github.com/JuliaGPU/llvm-metal) (based on
[@a2flo](https://github.com/a2flo)'s [libfloor](https://github.com/a2flo/floor))
that (re)writes the bitcode to a Metal-compatible library containing LLVM 5
bitcode. You can inspect the generated IR using `@device_code_metal`:

```
julia> @device_code_metal @metal threads=512 grid=2 memset_kernel(a, 42)
```
```text
[header]
program_count: 1
...
[program]
name: julia_memset_kernel
type: kernel
...
```
```llvm
target datalayout = "..."
target triple = "air64-apple-macosx13.0.0"
; the (rewritten) kernel function:
; - %value argument passed by reference
; - %thread_position_in_grid argument added
; - sitofp rewritten to AIR-specific intrinsic
define void @julia_memset_kernel(
{ i8 addrspace(1)*, [1 x i64] } addrspace(1)* %array,
i64 addrspace(1)* %value,
i32 %thread_position_in_grid) {
...
%9 = tail call float @air.convert.f.f32.s.i64(i64 %7)
...
ret void
}
; minimal required argument metadata
!air.kernel = !{!10}
!10 = !{void ({ i8 addrspace(1)*, [1 x i64] } addrspace(1)*,
i64 addrspace(1)*, i32)* @julia_memset_kernel, !11, !12}
!12 = !{!13, !14, !15}
!13 = !{i32 0, !"air.buffer", !"air.location_index", i32 0, i32 1,
!"air.read_write", !"air.address_space", i32 1,
!"air.arg_type_size", i32 16, !"air.arg_type_align_size", i32 8}
!14 = !{i32 1, !"air.buffer", !"air.location_index", i32 1, i32 1,
!"air.read_write", !"air.address_space", i32 1,
!"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!15 = !{i32 0, !"air.thread_position_in_grid"}
; other metadata not shown, for brevity
```

Shout-out to [@max-Hawkins](https://github.com/max-Hawkins) for exploring Metal
code generation during his internship at Julia Computing!


## Metal APIs in Julia

Lacking an Objective-C or C++ FFI, we interface with the Metal libraries using
[a shim C library](https://github.com/recp/cmt). Most users won't have to
interface with Metal directly -- the array abstraction is sufficient for many --
but more experienced developers can make use of the high-level wrappers that
we've designed for the Metal APIs:

```julia
julia> dev = MtlDevice(1)
MtlDevice:
name: Apple M1 Pro
lowpower: false
headless: false
removable: false
unified memory: true

julia> desc = MtlHeapDescriptor()
MtlHeapDescriptor:
type: MtHeapTypeAutomatic
storageMode: MtStorageModePrivate
size: 0

julia> desc.size = 16384
16384

julia> heap = MtlHeap(dev, desc)
MtlHeap:
type: MtHeapTypeAutomatic
size: 16384
usedSize: 0
currentAllocatedSize: 16384

# etc
```

These wrappers are based on [@PhilipVinc](https://github.com/PhilipVinc)'s
excellent work on MetalCore.jl, which formed the basis for (and has been folded
into) Metal.jl.


## What's next?

The current release of Metal.jl focusses on code generation capabilities, and is
meant as a preview for users and developers to try out on their system or with
their specific GPU application. It is not production-ready yet, and is lacking
some crucial features:

- performance optimization
- integration with Metal Performance Shaders
- integration / documentation for use with Xcode tools
- fleshing out the array abstraction based on user feedback

**Please consider helping out with any of these!** Since Metal.jl and its
dependencies are almost entirely implemented in Julia, any experience with the
language is sufficient to contribute. If you're not certain, or have any
questions, please drop by the `#gpu` channel on [the JuliaLang
Slack](https://julialang.org/slack/), ask questions on our
[Discourse](https://discourse.julialang.org/c/domain/gpu/11), or chat to us
during the [GPU office hours](https://julialang.org/community/#events) every
other Monday.

If you encounter any bugs, feel free to let us know on the [Metal.jl issue
tracker](https://github.com/JuliaGPU/Metal.jl/issues). For information on
upcoming releases, [subscribe](https://juliagpu.org/post/) to this website's
blog where we post about significant developments in Julia's GPU world.

---

[^1]: This relies on Metal 3 from macOS 13, which introduced bindless argument
buffers, as we didn't fully figure out how to reliably encode
arbitrarily-nested indirect buffers in argument encoder metadata.
