+++
title = "oneAPI.jl 1.0: oneMKL, Intel Arc and Julia 1.9"
author = "Tim Besard"
abstract = """
The release of oneAPI.jl 1.0 adds integration with the oneAPI Math Kernel
Library (oneMKL) to accelerate linear algebra operations on Intel GPUs.
It also brings support for Julia 1.9 and Intel Arc GPUs."""
+++

{{abstract}}

## oneMKL integration

oneAPI.jl now uses the Intel oneAPI Math Kernel Library (oneMKL), automatically downloaded
as part of `oneAPI_Support_jll.jl`, to accelerate a great number of BLAS and LAPACK
operations on Intel GPUs. As in our other GPU back-ends, these wrappers are available at
different levels of abstraction.

At the lowest level, we use a C library that wraps the oneMKL C++ APIs. For example, the
`oneapi::mkl::blas::column_major::gemm` function for matrix-matrix multiplication is wrapped
by the C functions `onemklSgemm`, `onemklDgemm`, etc. These wrappers are used to implement
low-level methods like `oneMKL.gemm!`:

```julia-repl
julia> using oneAPI

julia> A = oneArray(rand(Float32, 2, 3))
2×3 oneMatrix{Float32, oneAPI.oneL0.DeviceBuffer}:
 0.44302   0.125576  0.859145
 0.674291  0.428346  0.0400119

julia> B = oneArray(rand(Float32, 3, 4))
3×4 oneMatrix{Float32, oneAPI.oneL0.DeviceBuffer}:
 0.592748   0.529413   0.0323396  0.659528
 0.22489    0.0872259  0.253291   0.376519
 0.0121506  0.591135   0.706755   0.751686

julia> C = similar(B, (2, 4));

julia> oneMKL.gemm!('N', 'N', true, A, B, false, C)
2×4 oneMatrix{Float32, oneAPI.oneL0.DeviceBuffer}:
 0.301279  0.753365  0.65334   0.985274
 0.496501  0.417994  0.158581  0.63607

julia> Array(C) ≈ Array(A) * Array(B)
true
```

Of course, these low-level functions aren't very user-friendly, so we also integrate with
Julia's standard libraries where possible:

```julia-repl
julia> using LinearAlgebra

julia> A = oneArray(rand(Float32, 2, 3));

julia> B = oneArray(rand(Float32, 3, 4));

julia> C = A * B;

julia> Array(C) ≈ Array(A) * Array(B)
true
```

The most frequently used oneMKL BLAS functions have been wrapped and integrated with Julia's
standard linear algebra libraries. If you run into a missing function, please file a request
to add it, or take a look at the source and contribute to oneAPI.jl! The current state of
the wrappers should make it easy to extend their functionality, and forms a good basis for
integrating with other libraries like oneDNN.
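
Beyond `*`, other standard-library entry points dispatch to these wrappers in the same way.
Here's a minimal sketch; `axpy!` and `mul!` are assumed here to be among the wrapped
routines, which may differ from the actual coverage:

```julia
using oneAPI, LinearAlgebra

x = oneArray(rand(Float32, 1024))
y = oneArray(rand(Float32, 1024))
axpy!(2f0, x, y)             # y .= 2f0 .* x .+ y, ideally backed by oneMKL's axpy

A = oneArray(rand(Float32, 4, 8))
B = oneArray(rand(Float32, 8, 4))
C = oneArray(zeros(Float32, 4, 4))
mul!(C, A, B)                # in-place matrix product, lowering to onemklSgemm
```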


## Intel Arc support

The new Arc series of discrete Intel GPUs is now fully supported by oneAPI.jl. These GPUs
offer a significant performance improvement over their integrated predecessors:

```julia-repl
julia> using oneAPI

julia> oneAPI.versioninfo()
1 device:
- Intel(R) Arc(TM) A770 Graphics [0x56a0]

julia> T = Float32;

julia> n = p = m = 2048;

julia> a = oneArray(rand(T, n, p));

julia> b = oneArray(rand(T, p, m));

julia> c = oneArray(zeros(T, n, m));

julia> using BenchmarkTools, LinearAlgebra

julia> bench = @benchmark oneAPI.@sync mul!(c, a, b)
BenchmarkTools.Trial: 1510 samples with 1 evaluation.
 Range (min … max):  3.233 ms … 3.791 ms   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.298 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.308 ms ± 48.426 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁▃▄▇█▅▄▃▂ ▁▁▁
  ▁▁▃▃▅▇██████████████████▇▇▇▅▆▄▅▅▄▂▃▂▂▂▂▂▂▁▂▂▂▁▂▁▂▁▂▂▂▂▁▁▂▂ ▃
  3.23 ms        Histogram: frequency by time        3.47 ms <

 Memory estimate: 272 bytes, allocs estimate: 11.

julia> flops = n*m*(2p-1)
17175674880

julia> flops / (minimum(bench.times)/1e9)
5.3131281169900205e12
```

Here, we're getting over 5 TFlops of Float32 performance, more than 10x faster than the
Intel Xe Graphics G7 we had previously been using for oneAPI.jl development. At the same
time, the A770 used above should be able to deliver close to 20 TFlops, so there's still
room for improvement in our software stack.
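
Where does that 20 TFlops figure come from? A back-of-the-envelope calculation; the ALU
count and boost clock below are approximate public specifications for the A770, not
measured values:

```julia
# Rough FP32 peak for an Arc A770, from approximate public specs:
alus  = 4096              # number of FP32 ALUs
clock = 2.4e9             # boost clock in Hz (approximate)
peak  = alus * 2 * clock  # an FMA counts as 2 FLOPs per ALU per cycle

peak / 1e12               # ≈ 19.7 TFlops theoretical peak
5.31e12 / peak            # ≈ 0.27: we currently reach about a quarter of peak
```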

To use oneAPI.jl with an Arc series GPU, you need to run Linux 6.2. At the time of writing,
that kernel is still in beta, so refer to your distribution's documentation for how to
install it. For example, on Arch Linux you can use the [`linux-mainline` package from the
AUR](https://aur.archlinux.org/packages/linux-mainline), Ubuntu has the [`kernel-ppa`
archive](https://wiki.ubuntu.com/Kernel/MainlineBuilds), Fedora provides the [`stable-rc`
repository](https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories), etc.
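
After booting the new kernel, it's easy to check whether the GPU is picked up. A minimal
sanity check, assuming `oneAPI.functional()` is available (mirroring the equivalent helper
in our other back-ends):

```julia
using oneAPI

println(read(`uname -r`, String))  # should report 6.2 or newer for Arc

@assert oneAPI.functional()        # true when the driver and toolkit are usable
oneAPI.versioninfo()               # lists the detected GPUs
```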


## Other changes

- Support for Julia 1.9 has been added.
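
As always, you can install or upgrade through the Julia package manager; for example:

```julia
using Pkg
Pkg.add("oneAPI")      # fresh install from the General registry
# Pkg.update("oneAPI") # or upgrade an existing installation to 1.0
```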