+++
title = "oneAPI.jl 1.0: oneMKL, Intel Arc and Julia 1.9"
author = "Tim Besard"
abstract = """
The release of oneAPI.jl 1.0 adds integration with the oneAPI Math Kernel
Library (oneMKL) to accelerate linear algebra operations on Intel GPUs.
It also brings support for Julia 1.9 and Intel Arc GPUs."""
+++

{{abstract}}

## oneMKL integration

oneAPI.jl now uses the Intel oneAPI Math Kernel Library (oneMKL), automatically downloaded
as part of `oneAPI_Support_jll.jl`, to accelerate a great number of BLAS and LAPACK
operations on Intel GPUs. As in our other GPU back-ends, these wrappers are available at
different levels of abstraction.

At the lowest level, we use a C library that wraps the oneMKL C++ APIs. For example, the
`oneapi::mkl::blas::column_major::gemm` function for matrix-matrix multiplication is wrapped
by the C functions `onemklSgemm`, `onemklDgemm`, etc. These wrappers are used to implement
low-level methods like `oneMKL.gemm!`:

```julia-repl
julia> using oneAPI

julia> A = oneArray(rand(Float32, 2, 3))
2×3 oneMatrix{Float32, oneAPI.oneL0.DeviceBuffer}:
 0.44302   0.125576  0.859145
 0.674291  0.428346  0.0400119

julia> B = oneArray(rand(Float32, 3, 4))
3×4 oneMatrix{Float32, oneAPI.oneL0.DeviceBuffer}:
 0.592748   0.529413   0.0323396  0.659528
 0.22489    0.0872259  0.253291   0.376519
 0.0121506  0.591135   0.706755   0.751686

julia> C = similar(B, (2, 4));

julia> oneMKL.gemm!('N', 'N', true, A, B, false, C)
2×4 oneMatrix{Float32, oneAPI.oneL0.DeviceBuffer}:
 0.301279  0.753365  0.65334   0.985274
 0.496501  0.417994  0.158581  0.63607

julia> Array(C) ≈ Array(A) * Array(B)
true
```

Of course, these low-level functions aren't very user-friendly, so we also integrate with
Julia's standard libraries where possible:

```julia-repl
julia> using LinearAlgebra

julia> A = oneArray(rand(Float32, 2, 3));

julia> B = oneArray(rand(Float32, 3, 4));

julia> C = A * B;

julia> Array(C) ≈ Array(A) * Array(B)
true
```

The most frequently used oneMKL BLAS functions have been wrapped and integrated with Julia's
standard linear algebra libraries. If you run into a missing function, please file a request
to add it, or take a look at the source and contribute to oneAPI.jl! The current state of
the wrappers should make it easy to extend their functionality, and forms a good basis for
integrating with other libraries like oneDNN.
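
Beyond `*`, other standard-library entry points dispatch to these wrappers in the same way.
Here's a minimal sketch; `axpy!` and `mul!` are assumed here to be among the wrapped
routines, which may differ from the actual coverage:

```julia
using oneAPI, LinearAlgebra

x = oneArray(rand(Float32, 1024))
y = oneArray(rand(Float32, 1024))
axpy!(2f0, x, y)             # y .= 2f0 .* x .+ y, ideally backed by oneMKL's axpy

A = oneArray(rand(Float32, 4, 8))
B = oneArray(rand(Float32, 8, 4))
C = oneArray(zeros(Float32, 4, 4))
mul!(C, A, B)                # in-place matrix product, lowering to onemklSgemm
```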


## Intel Arc support

The new Arc series of discrete Intel GPUs is now fully supported by oneAPI.jl. These GPUs
offer a significant performance improvement over their integrated predecessors:

```julia-repl
julia> using oneAPI

julia> oneAPI.versioninfo()
1 device:
- Intel(R) Arc(TM) A770 Graphics [0x56a0]

julia> T = Float32;

julia> n = p = m = 2048;

julia> a = oneArray(rand(T, n, p));

julia> b = oneArray(rand(T, p, m));

julia> c = oneArray(zeros(T, n, m));

julia> using BenchmarkTools, LinearAlgebra

julia> bench = @benchmark oneAPI.@sync mul!(c, a, b)
BenchmarkTools.Trial: 1510 samples with 1 evaluation.
 Range (min … max):  3.233 ms … 3.791 ms   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.298 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.308 ms ± 48.426 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁▃▄▇█▅▄▃▂ ▁▁▁
  ▁▁▃▃▅▇██████████████████▇▇▇▅▆▄▅▅▄▂▃▂▂▂▂▂▂▁▂▂▂▁▂▁▂▁▂▂▂▂▁▁▂▂ ▃
  3.23 ms        Histogram: frequency by time        3.47 ms <

 Memory estimate: 272 bytes, allocs estimate: 11.

julia> flops = n*m*(2p-1)
17175674880

julia> flops / (minimum(bench.times)/1e9)
5.3131281169900205e12
```

Here, we're getting over 5 TFlops of Float32 performance, more than 10x faster than the
Intel Xe Graphics G7 we had previously been using for oneAPI.jl development. At the same
time, the A770 used above should be able to deliver close to 20 TFlops, so there's still
room for improvement in our software stack.
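
Where does that 20 TFlops figure come from? A back-of-the-envelope calculation; the ALU
count and boost clock below are approximate public specifications for the A770, not
measured values:

```julia
# Rough FP32 peak for an Arc A770, from approximate public specs:
alus  = 4096              # number of FP32 ALUs
clock = 2.4e9             # boost clock in Hz (approximate)
peak  = alus * 2 * clock  # an FMA counts as 2 FLOPs per ALU per cycle

peak / 1e12               # ≈ 19.7 TFlops theoretical peak
5.31e12 / peak            # ≈ 0.27: we currently reach about a quarter of peak
```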

To use oneAPI.jl with an Arc series GPU, you need to run Linux 6.2. At the time of writing,
that kernel is still in beta, so refer to your distribution's documentation for how to
install it. For example, on Arch Linux you can use the [`linux-mainline` package from the
AUR](https://aur.archlinux.org/packages/linux-mainline), Ubuntu has the [`kernel-ppa`
archive](https://wiki.ubuntu.com/Kernel/MainlineBuilds), Fedora provides the [`stable-rc`
repository](https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories), etc.
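
After booting the new kernel, it's easy to check whether the GPU is picked up. A minimal
sanity check, assuming `oneAPI.functional()` is available (mirroring the equivalent helper
in our other back-ends):

```julia
using oneAPI

println(read(`uname -r`, String))  # should report 6.2 or newer for Arc

@assert oneAPI.functional()        # true when the driver and toolkit are usable
oneAPI.versioninfo()               # lists the detected GPUs
```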


## Other changes

- Support for Julia 1.9 has been added.
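
As always, you can install or upgrade through the Julia package manager; for example:

```julia
using Pkg
Pkg.add("oneAPI")      # fresh install from the General registry
# Pkg.update("oneAPI") # or upgrade an existing installation to 1.0
```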