Link to CUDA.jl 5.0 blog post. (#38)

JuliaGPU · Sep 26, 2023 · 0fe2b7b · 0fe2b7b
1 parent acdd39b
commit 0fe2b7b
Show file tree

Hide file tree

Showing 2 changed files with 205 additions and 1 deletion.
diff --git a/post/2023-09-19-cuda_5.0.md b/post/2023-09-19-cuda_5.0.md
@@ -0,0 +1,182 @@
++++
+title = "CUDA.jl 5.0: Integrated profiler and task synchronization changes"
+author = "Tim Besard"
+external = true
+abstract = """
+  CUDA.jl 5.0 is an major release that adds an integrated profiler to CUDA.jl, and reworks
+  how tasks are synchronized. The release is slightly breaking, as it changes how local
+  toolkits are handled and raises the minimum Julia and CUDA versions."""
++++
+
+{{abstract}}
+
+{{ redirect "https://info.juliahub.com/cuda-jl-5-0-changes" }}
+
+
+<!--
+
+## Integrated profiler
+
+The most exciting new feature in CUDA.jl 5.0 is [the new integrated
+profiler](https://github.com/JuliaGPU/CUDA.jl/pull/2024), which is similar to the `@profile`
+macro from the Julia standard library. The profiler can be used by simply prefixing any code
+that uses the CUDA libraries with `CUDA.@profile`:
+
+```julia-repl
+julia> CUDA.@profile CUDA.rand(1).+1
+Profiler ran for 268.46 µs, capturing 21 events.
+
+Host-side activity: calling CUDA APIs took 230.79 µs (85.97% of the trace)
+┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
+│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                    │
+├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
+│   76.47% │ 205.28 µs │     1 │ 205.28 µs │ 205.28 µs │ 205.28 µs │ cudaLaunchKernel        │
+│    5.42% │  14.54 µs │     2 │   7.27 µs │   5.01 µs │   9.54 µs │ cuMemAllocFromPoolAsync │
+│    2.93% │   7.87 µs │     1 │   7.87 µs │   7.87 µs │   7.87 µs │ cuLaunchKernel          │
+│    0.36% │ 953.67 ns │     2 │ 476.84 ns │    0.0 ns │ 953.67 ns │ cudaGetLastError        │
+└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘
+
+Device-side activity: GPU was busy for 2.15 µs (0.80% of the trace)
+┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬──────────────────────────────
+│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                        ⋯
+├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼──────────────────────────────
+│    0.44% │   1.19 µs │     1 │   1.19 µs │   1.19 µs │   1.19 µs │ _Z13gen_sequencedI17curandS ⋯
+│    0.36% │ 953.67 ns │     1 │ 953.67 ns │ 953.67 ns │ 953.67 ns │ _Z16broadcast_kernel15CuKer ⋯
+└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴──────────────────────────────
+                                                                                  1 column omitted
+1-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
+ 1.7242923
+```
+
+The output shown above is a summary of what happened during the execution of the code. It is
+split into two sections: **host-side activity**, i.e., API calls to the CUDA libraries, and
+the resulting **device-side activity**. As part of each section, the output shows the time
+spent and the ratio to the total execution time. These ratios are important, and a good tool
+to quickly assess the performance of your code. For example, in the above output, we see
+that most of the time is spent on the host calling the CUDA libraries, and only very little
+time is actually spent computing things on the GPU. This indicates that the GPU is severely
+underutilized, which can be solved by increasing the problem size.
+
+Instead of a summary, it is also possible to view a **chronological trace** by passing the
+`trace=true` keyword argument:
+
+```julia-repl
+julia> CUDA.@profile trace=true CUDA.rand(1).+1;
+Profiler ran for 262.98 µs, capturing 21 events.
+
+Host-side activity: calling CUDA APIs took 227.21 µs (86.40% of the trace)
+┌────┬───────────┬───────────┬─────────────────────────┬────────────────────────┐
+│ ID │     Start │      Time │                    Name │ Details                │
+├────┼───────────┼───────────┼─────────────────────────┼────────────────────────┤
+│  5 │   6.44 µs │   9.06 µs │ cuMemAllocFromPoolAsync │ 4 bytes, device memory │
+│  7 │  19.31 µs │ 715.26 ns │        cudaGetLastError │ -                      │
+│  8 │  22.41 µs │ 204.09 µs │        cudaLaunchKernel │ -                      │
+│  9 │ 227.21 µs │    0.0 ns │        cudaGetLastError │ -                      │
+│ 14 │  232.7 µs │   3.58 µs │ cuMemAllocFromPoolAsync │ 4 bytes, device memory │
+│ 18 │ 250.34 µs │   7.39 µs │          cuLaunchKernel │ -                      │
+└────┴───────────┴───────────┴─────────────────────────┴────────────────────────┘
+
+Device-side activity: GPU was busy for 2.38 µs (0.91% of the trace)
+┌────┬───────────┬─────────┬─────────┬────────┬──────┬────────────────────────────────────────────
+│ ID │     Start │    Time │ Threads │ Blocks │ Regs │ Name                                      ⋯
+├────┼───────────┼─────────┼─────────┼────────┼──────┼────────────────────────────────────────────
+│  8 │ 225.31 µs │ 1.19 µs │      64 │     64 │   38 │ _Z13gen_sequencedI17curandStateXORWOWfiXa ⋯
+│ 18 │ 257.73 µs │ 1.19 µs │       1 │      1 │   18 │ _Z16broadcast_kernel15CuKernelContext13Cu ⋯
+└────┴───────────┴─────────┴─────────┴────────┴──────┴────────────────────────────────────────────
+                                                                                  1 column omitted
+```
+
+Here, we can see a list of events that the profiler captured. Each event has a unique ID,
+which can be used to corelate host-side and device-side events. For example, we can see that
+event 8 on the host is a call to `cudaLaunchKernel`, which corresponds to to the execution
+of a CURAND kernel on the device.
+
+The integrated profiler is a great tool to quickly assess the performance of your GPU
+application, identify bottlenecks, and find opportunities for optimization. For complex
+applications, however, it is still recommended to use NVIDIA's NSight Systems or Compute
+profilers, which provide a more detailed, graphical view of what is happening on the GPU.
+
+
+## Synchronization on worker threads
+
+Another noteworthy change affects how tasks are synchronized. To enable concurrent
+execution, i.e., to make it possible for other Julia tasks to execute while waiting for the
+GPU to finish, CUDA.jl used to rely on so-called stream callbacks. These callbacks were a
+significant source of latency, at least 25us per invocation but sometimes *much* longer, and
+have also been slated for deprecation and eventual removal from the CUDA toolkit.
+
+Instead, on Julia 1.9 and later, CUDA.jl [now
+uses](https://github.com/JuliaGPU/CUDA.jl/pull/2025) worker threads to wait for GPU
+operations to finish. This mechanism is significantly faster, taking around 5us per
+invocation, but more importantly offers a much more reliable and predictable latency. You
+can observe this mechanism using the integrated profiler:
+
+```julia-repl
+julia> a = CUDA.rand(1024, 1024, 1024)
+julia> CUDA.@profile trace=true CUDA.@sync a .+ a
+Profiler ran for 12.29 ms, capturing 527 events.
+
+Host-side activity: calling CUDA APIs took 11.75 ms (95.64% of the trace)
+┌─────┬───────────┬───────────┬────────┬─────────────────────────┐
+│  ID │     Start │      Time │ Thread │                    Name │
+├─────┼───────────┼───────────┼────────┼─────────────────────────┤
+│   5 │   6.91 µs │  13.59 µs │      1 │ cuMemAllocFromPoolAsync │
+│   9 │  36.72 µs │ 199.56 µs │      1 │          cuLaunchKernel │
+│ 525 │ 510.69 µs │  11.75 ms │      2 │     cuStreamSynchronize │
+└─────┴───────────┴───────────┴────────┴─────────────────────────┘
+```
+
+For some users, this may still be too slow, so we have added two mechanisms that disable
+nonblocking synchronization and simply block the calling thread until the GPU operation
+finishes. The first is a global setting, which can be enabled by setting the
+`nonblocking_synchronization` preference to `false`, which can be done using Preferences.jl.
+[The second](https://github.com/JuliaGPU/CUDA.jl/pull/2060) is a fine-grained flag to pass
+to synchronization functions: `synchronize(x; blocking=true)`, `CUDA.@sync blocking=true
+...`, etc. Both these mechanisms should *not* be used widely, and are only intended for use
+in latency-critical code, e.g., when benchmarking or profiling.
+
+
+## Local toolkit discovery
+
+One of the breaking changes involves [how local toolkits are
+discovered](https://github.com/JuliaGPU/CUDA.jl/pull/2058), when opting out of the use of
+artifacts. Previously, this could be enabled by calling
+`CUDA.set_runtime_version!("local")`, which generated a `version = "local"` preference. We
+are now changing this into two separate preferences, `version` and `local`, where the
+`version` preference overrides the version of the CUDA toolkit, and the `local`
+preference independently indicates whether to use a local CUDA toolkit or not.
+
+Concretely, this means that you will now need to call
+`CUDA.set_runtime_version!(local_toolkit=true)` to enable the use of a local toolkit. The
+toolkit version will be auto-detected, but can be overridden by also passing a version:
+`CUDA.set_runtime_version!(version; local_toolkit=true)`. This may be necessary when CUDA
+is not available during precompilation, e.g., on the log-in node of a cluster, or when
+building a container image.
+
+
+## Raised minimum requirements
+
+Finally, CUDA.jl 5.0 raises the minimum Julia and CUDA versions. The minimum Julia version
+is now 1.8, which should be enforced by the Julia package manager. The minimum CUDA toolkit
+version is now 11.4, but this cannot be enforced by the package manager. As a result, if you
+need to use an older version of the CUDA toolkit, you will need to pin CUDA.jl to v4.4 or
+below. [The README](https://github.com/JuliaGPU/CUDA.jl/blob/master/README.md) will maintain
+a table of supported CUDA toolkit versions.
+
+Most users will not be affected by this change: If you use the artifact-provided CUDA
+toolkit, you will automatically get the latest version supported by your CUDA driver.
+
+
+## Other changes
+
+- [Support for CUDA 12.2](https://github.com/JuliaGPU/CUDA.jl/pull/2034);
+- [Memory limits](https://github.com/JuliaGPU/CUDA.jl/pull/2040) are now enforced by CUDA,
+  resulting in better performance;
+- [Support for Julia 1.10](https://github.com/JuliaGPU/CUDA.jl/pull/1946) (with help from
+  [@dkarrasch](https://github.com/dkarrasch));
+- Support for batched [`gemm`](https://github.com/JuliaGPU/CUDA.jl/pull/1975),
+  [`gemv`](https://github.com/JuliaGPU/CUDA.jl/pull/1981) and
+  [`svd`](https://github.com/JuliaGPU/CUDA.jl/pull/2063) (by
+  [@lpawela](https://github.com/lpawela) and [@nikopj](https://github.com/nikopj).
+
+-->
diff --git a/utils.jl b/utils.jl
@@ -27,6 +27,27 @@ function getdate(fname)
     return parse.(Int, (y, m, d))
 end
 
+function hfun_redirect(params)
+    url = params[1]
+
+    # XXX: is this safe?
+    #Franklin.set_var!(Franklin.LOCAL_VARS, "fd_full_url", url)
+
+    return """
+        <meta http-equiv="refresh" content="0;url=$url">
+
+        <!-- fallback 1: use Javscript, in case the browser doesn't like meta tags in the body -->
+        <script>
+            window.onload = function() {
+                window.location.href = "$url";
+            }
+        </script>
+
+        <!-- fallback 2: provide a visual element, in case Javascript is disabled -->
+        <p>This blog post is located at <a href="$url">$url</a></p>
+        """
+end
+
 function hfun_post_date()
     # capture the RSS publication date from the file name
     fd_url = locvar(:fd_url)::String
@@ -72,11 +93,12 @@ function blogpost_entry(fpath)
     if hidden === true
         return nothing
     end
+    ext = something(pagevar(rpath, :external), false)
     title = pagevar(rpath, :title)::String
     y, m, d = getdate(fpath)
     rpath = replace(fpath, r"\.md$" => "")
     date = Date(y, m, d)
-    return (date, blogpost_entry_html("/post/$rpath/", title, y, m, d))
+    return (date, blogpost_entry_html("/post/$rpath/", title, y, m, d; ext))
 end
 
 function blogpost_external_entries()