-
Notifications
You must be signed in to change notification settings - Fork 14
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Link to CUDA.jl 5.0 blog post. (#38)
- Loading branch information
Showing
2 changed files
with
205 additions
and
1 deletion.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,182 @@ | ||
+++ | ||
title = "CUDA.jl 5.0: Integrated profiler and task synchronization changes" | ||
author = "Tim Besard" | ||
external = true | ||
abstract = """ | ||
CUDA.jl 5.0 is an major release that adds an integrated profiler to CUDA.jl, and reworks | ||
how tasks are synchronized. The release is slightly breaking, as it changes how local | ||
toolkits are handled and raises the minimum Julia and CUDA versions.""" | ||
+++ | ||
|
||
{{abstract}} | ||
|
||
{{ redirect "https://info.juliahub.com/cuda-jl-5-0-changes" }} | ||
|
||
|
||
<!-- | ||
## Integrated profiler | ||
The most exciting new feature in CUDA.jl 5.0 is [the new integrated | ||
profiler](https://github.com/JuliaGPU/CUDA.jl/pull/2024), which is similar to the `@profile` | ||
macro from the Julia standard library. The profiler can be used by simply prefixing any code | ||
that uses the CUDA libraries with `CUDA.@profile`: | ||
```julia-repl | ||
julia> CUDA.@profile CUDA.rand(1).+1 | ||
Profiler ran for 268.46 µs, capturing 21 events. | ||
Host-side activity: calling CUDA APIs took 230.79 µs (85.97% of the trace) | ||
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐ | ||
│ Time (%) │ Time │ Calls │ Avg time │ Min time │ Max time │ Name │ | ||
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤ | ||
│ 76.47% │ 205.28 µs │ 1 │ 205.28 µs │ 205.28 µs │ 205.28 µs │ cudaLaunchKernel │ | ||
│ 5.42% │ 14.54 µs │ 2 │ 7.27 µs │ 5.01 µs │ 9.54 µs │ cuMemAllocFromPoolAsync │ | ||
│ 2.93% │ 7.87 µs │ 1 │ 7.87 µs │ 7.87 µs │ 7.87 µs │ cuLaunchKernel │ | ||
│ 0.36% │ 953.67 ns │ 2 │ 476.84 ns │ 0.0 ns │ 953.67 ns │ cudaGetLastError │ | ||
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘ | ||
Device-side activity: GPU was busy for 2.15 µs (0.80% of the trace) | ||
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────────────────── | ||
│ Time (%) │ Time │ Calls │ Avg time │ Min time │ Max time │ Name ⋯ | ||
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────────────────── | ||
│ 0.44% │ 1.19 µs │ 1 │ 1.19 µs │ 1.19 µs │ 1.19 µs │ _Z13gen_sequencedI17curandS ⋯ | ||
│ 0.36% │ 953.67 ns │ 1 │ 953.67 ns │ 953.67 ns │ 953.67 ns │ _Z16broadcast_kernel15CuKer ⋯ | ||
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴────────────────────────────── | ||
1 column omitted | ||
1-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}: | ||
1.7242923 | ||
``` | ||
The output shown above is a summary of what happened during the execution of the code. It is | ||
split into two sections: **host-side activity**, i.e., API calls to the CUDA libraries, and | ||
the resulting **device-side activity**. As part of each section, the output shows the time | ||
spent and the ratio to the total execution time. These ratios are important, and a good tool | ||
to quickly assess the performance of your code. For example, in the above output, we see | ||
that most of the time is spent on the host calling the CUDA libraries, and only very little | ||
time is actually spent computing things on the GPU. This indicates that the GPU is severely | ||
underutilized, which can be solved by increasing the problem size. | ||
Instead of a summary, it is also possible to view a **chronological trace** by passing the | ||
`trace=true` keyword argument: | ||
```julia-repl | ||
julia> CUDA.@profile trace=true CUDA.rand(1).+1; | ||
Profiler ran for 262.98 µs, capturing 21 events. | ||
Host-side activity: calling CUDA APIs took 227.21 µs (86.40% of the trace) | ||
┌────┬───────────┬───────────┬─────────────────────────┬────────────────────────┐ | ||
│ ID │ Start │ Time │ Name │ Details │ | ||
├────┼───────────┼───────────┼─────────────────────────┼────────────────────────┤ | ||
│ 5 │ 6.44 µs │ 9.06 µs │ cuMemAllocFromPoolAsync │ 4 bytes, device memory │ | ||
│ 7 │ 19.31 µs │ 715.26 ns │ cudaGetLastError │ - │ | ||
│ 8 │ 22.41 µs │ 204.09 µs │ cudaLaunchKernel │ - │ | ||
│ 9 │ 227.21 µs │ 0.0 ns │ cudaGetLastError │ - │ | ||
│ 14 │ 232.7 µs │ 3.58 µs │ cuMemAllocFromPoolAsync │ 4 bytes, device memory │ | ||
│ 18 │ 250.34 µs │ 7.39 µs │ cuLaunchKernel │ - │ | ||
└────┴───────────┴───────────┴─────────────────────────┴────────────────────────┘ | ||
Device-side activity: GPU was busy for 2.38 µs (0.91% of the trace) | ||
┌────┬───────────┬─────────┬─────────┬────────┬──────┬──────────────────────────────────────────── | ||
│ ID │ Start │ Time │ Threads │ Blocks │ Regs │ Name ⋯ | ||
├────┼───────────┼─────────┼─────────┼────────┼──────┼──────────────────────────────────────────── | ||
│ 8 │ 225.31 µs │ 1.19 µs │ 64 │ 64 │ 38 │ _Z13gen_sequencedI17curandStateXORWOWfiXa ⋯ | ||
│ 18 │ 257.73 µs │ 1.19 µs │ 1 │ 1 │ 18 │ _Z16broadcast_kernel15CuKernelContext13Cu ⋯ | ||
└────┴───────────┴─────────┴─────────┴────────┴──────┴──────────────────────────────────────────── | ||
1 column omitted | ||
``` | ||
Here, we can see a list of events that the profiler captured. Each event has a unique ID, | ||
which can be used to corelate host-side and device-side events. For example, we can see that | ||
event 8 on the host is a call to `cudaLaunchKernel`, which corresponds to to the execution | ||
of a CURAND kernel on the device. | ||
The integrated profiler is a great tool to quickly assess the performance of your GPU | ||
application, identify bottlenecks, and find opportunities for optimization. For complex | ||
applications, however, it is still recommended to use NVIDIA's NSight Systems or Compute | ||
profilers, which provide a more detailed, graphical view of what is happening on the GPU. | ||
## Synchronization on worker threads | ||
Another noteworthy change affects how tasks are synchronized. To enable concurrent | ||
execution, i.e., to make it possible for other Julia tasks to execute while waiting for the | ||
GPU to finish, CUDA.jl used to rely on so-called stream callbacks. These callbacks were a | ||
significant source of latency, at least 25us per invocation but sometimes *much* longer, and | ||
have also been slated for deprecation and eventual removal from the CUDA toolkit. | ||
Instead, on Julia 1.9 and later, CUDA.jl [now | ||
uses](https://github.com/JuliaGPU/CUDA.jl/pull/2025) worker threads to wait for GPU | ||
operations to finish. This mechanism is significantly faster, taking around 5us per | ||
invocation, but more importantly offers a much more reliable and predictable latency. You | ||
can observe this mechanism using the integrated profiler: | ||
```julia-repl | ||
julia> a = CUDA.rand(1024, 1024, 1024) | ||
julia> CUDA.@profile trace=true CUDA.@sync a .+ a | ||
Profiler ran for 12.29 ms, capturing 527 events. | ||
Host-side activity: calling CUDA APIs took 11.75 ms (95.64% of the trace) | ||
┌─────┬───────────┬───────────┬────────┬─────────────────────────┐ | ||
│ ID │ Start │ Time │ Thread │ Name │ | ||
├─────┼───────────┼───────────┼────────┼─────────────────────────┤ | ||
│ 5 │ 6.91 µs │ 13.59 µs │ 1 │ cuMemAllocFromPoolAsync │ | ||
│ 9 │ 36.72 µs │ 199.56 µs │ 1 │ cuLaunchKernel │ | ||
│ 525 │ 510.69 µs │ 11.75 ms │ 2 │ cuStreamSynchronize │ | ||
└─────┴───────────┴───────────┴────────┴─────────────────────────┘ | ||
``` | ||
For some users, this may still be too slow, so we have added two mechanisms that disable | ||
nonblocking synchronization and simply block the calling thread until the GPU operation | ||
finishes. The first is a global setting, which can be enabled by setting the | ||
`nonblocking_synchronization` preference to `false`, which can be done using Preferences.jl. | ||
[The second](https://github.com/JuliaGPU/CUDA.jl/pull/2060) is a fine-grained flag to pass | ||
to synchronization functions: `synchronize(x; blocking=true)`, `CUDA.@sync blocking=true | ||
...`, etc. Both these mechanisms should *not* be used widely, and are only intended for use | ||
in latency-critical code, e.g., when benchmarking or profiling. | ||
## Local toolkit discovery | ||
One of the breaking changes involves [how local toolkits are | ||
discovered](https://github.com/JuliaGPU/CUDA.jl/pull/2058), when opting out of the use of | ||
artifacts. Previously, this could be enabled by calling | ||
`CUDA.set_runtime_version!("local")`, which generated a `version = "local"` preference. We | ||
are now changing this into two separate preferences, `version` and `local`, where the | ||
`version` preference overrides the version of the CUDA toolkit, and the `local` | ||
preference independently indicates whether to use a local CUDA toolkit or not. | ||
Concretely, this means that you will now need to call | ||
`CUDA.set_runtime_version!(local_toolkit=true)` to enable the use of a local toolkit. The | ||
toolkit version will be auto-detected, but can be overridden by also passing a version: | ||
`CUDA.set_runtime_version!(version; local_toolkit=true)`. This may be necessary when CUDA | ||
is not available during precompilation, e.g., on the log-in node of a cluster, or when | ||
building a container image. | ||
## Raised minimum requirements | ||
Finally, CUDA.jl 5.0 raises the minimum Julia and CUDA versions. The minimum Julia version | ||
is now 1.8, which should be enforced by the Julia package manager. The minimum CUDA toolkit | ||
version is now 11.4, but this cannot be enforced by the package manager. As a result, if you | ||
need to use an older version of the CUDA toolkit, you will need to pin CUDA.jl to v4.4 or | ||
below. [The README](https://github.com/JuliaGPU/CUDA.jl/blob/master/README.md) will maintain | ||
a table of supported CUDA toolkit versions. | ||
Most users will not be affected by this change: If you use the artifact-provided CUDA | ||
toolkit, you will automatically get the latest version supported by your CUDA driver. | ||
## Other changes | ||
- [Support for CUDA 12.2](https://github.com/JuliaGPU/CUDA.jl/pull/2034); | ||
- [Memory limits](https://github.com/JuliaGPU/CUDA.jl/pull/2040) are now enforced by CUDA, | ||
resulting in better performance; | ||
- [Support for Julia 1.10](https://github.com/JuliaGPU/CUDA.jl/pull/1946) (with help from | ||
[@dkarrasch](https://github.com/dkarrasch)); | ||
- Support for batched [`gemm`](https://github.com/JuliaGPU/CUDA.jl/pull/1975), | ||
[`gemv`](https://github.com/JuliaGPU/CUDA.jl/pull/1981) and | ||
[`svd`](https://github.com/JuliaGPU/CUDA.jl/pull/2063) (by | ||
[@lpawela](https://github.com/lpawela) and [@nikopj](https://github.com/nikopj). | ||
--> |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters