diff --git a/README.md b/README.md
index def552540..e6f0d914a 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@
Achieving optimal performance in GPU-centric workflows frequently requires customizing how host and
device memory are allocated. For example, using "pinned" host memory for asynchronous
host <-> device memory transfers, or using a device memory pool sub-allocator to reduce the cost of
-dynamic device memory allocation.
+dynamic device memory allocation.

The goal of the RAPIDS Memory Manager (RMM) is to provide:
- A common interface that allows customizing [device](#device_memory_resource) and

@@ -175,7 +175,7 @@ Thrust. If you want to customize it, you can set the variables

# Using RMM in C++

-The first goal of RMM is to provide a common interface for device and host memory allocation.
+The first goal of RMM is to provide a common interface for device and host memory allocation.
This allows both _users_ and _implementers_ of custom allocation logic to program to a single
interface.

@@ -198,15 +198,15 @@ It has two key functions:
   - Returns a pointer to an allocation of at least `bytes` bytes.
2. `void device_memory_resource::deallocate(void* p, std::size_t bytes, cuda_stream_view s)`
-   - Reclaims a previous allocation of size `bytes` pointed to by `p`.
+   - Reclaims a previous allocation of size `bytes` pointed to by `p`.
   - `p` *must* have been returned by a previous call to `allocate(bytes)`; otherwise, behavior is
     undefined.

It is up to a derived class to provide implementations of these functions. See
-[available resources](#available-resources) for example `device_memory_resource` derived classes.
+[available resources](#available-resources) for example `device_memory_resource` derived classes.

-Unlike `std::pmr::memory_resource`, `rmm::mr::device_memory_resource` does not allow specifying an
-alignment argument. All allocations are required to be aligned to at least 256B. Furthermore,
+Unlike `std::pmr::memory_resource`, `rmm::mr::device_memory_resource` does not allow specifying an
+alignment argument. All allocations are required to be aligned to at least 256B. Furthermore,
`device_memory_resource` adds an additional `cuda_stream_view` argument to allow specifying the stream
on which to perform the (de)allocation.

@@ -214,26 +214,26 @@
`rmm::cuda_stream_view` is a simple non-owning wrapper around a CUDA `cudaStream_t`. This wrapper's
purpose is to provide strong type safety for stream types. (`cudaStream_t` is an alias for a pointer,
-which can lead to ambiguity in APIs when it is assigned `0`.) All RMM stream-ordered APIs take a
+which can lead to ambiguity in APIs when it is assigned `0`.) All RMM stream-ordered APIs take a
`rmm::cuda_stream_view` argument.

-`rmm::cuda_stream` is a simple owning wrapper around a CUDA `cudaStream_t`. This class provides
+`rmm::cuda_stream` is a simple owning wrapper around a CUDA `cudaStream_t`. This class provides
RAII semantics (constructor creates the CUDA stream, destructor destroys it). An `rmm::cuda_stream`
can never represent the CUDA default stream or per-thread default stream; it only ever represents
a single non-default stream. `rmm::cuda_stream` cannot be copied, but can be moved.

## `cuda_stream_pool`

-`rmm::cuda_stream_pool` provides fast access to a pool of CUDA streams. This class can be used to
-create a set of `cuda_stream` objects whose lifetime is equal to the `cuda_stream_pool`. Using the
+`rmm::cuda_stream_pool` provides fast access to a pool of CUDA streams. This class can be used to
+create a set of `cuda_stream` objects whose lifetime is equal to the `cuda_stream_pool`. Using the
stream pool can be faster than creating the streams on the fly. The size of the pool is
configurable.
-Depending on this size, multiple calls to `cuda_stream_pool::get_stream()` may return instances of
+Depending on this size, multiple calls to `cuda_stream_pool::get_stream()` may return instances of
`rmm::cuda_stream_view` that represent identical CUDA streams.
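For illustration, a pool can be created once near startup and streams drawn from it wherever
needed. The following is a minimal sketch only; the `rmm/cuda_stream_pool.hpp` header name and the
stream-recycling behavior of `get_stream()` are assumptions based on the description above:

```c++
#include <rmm/cuda_stream_pool.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <cuda_runtime_api.h>

int main()
{
  // Created once; the 16 pooled streams live exactly as long as the pool object.
  rmm::cuda_stream_pool pool{16};

  for (int i = 0; i < 32; ++i) {
    // Cheap accessor returning a non-owning view of one pooled stream. With a
    // pool of 16, some of these 32 iterations necessarily share a stream.
    rmm::cuda_stream_view s = pool.get_stream();
    cudaStreamSynchronize(s.value());  // value() exposes the raw cudaStream_t
  }
  return 0;
}
```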
### Thread Safety

All current device memory resources are thread safe unless documented otherwise. More specifically,
-calls to memory resource `allocate()` and `deallocate()` methods are safe with respect to calls to
+calls to memory resource `allocate()` and `deallocate()` methods are safe with respect to calls to
either of these functions from other threads. They are _not_ thread safe with respect to
construction and destruction of the memory resource object.

@@ -259,7 +259,7 @@ used internally by a `device_memory_resource` for managing available memory with
synchronization, and it may also be synchronized at a later time, for example using a call to
`cudaStreamSynchronize()`.

-For this reason, it is Undefined Behavior to destroy a CUDA stream that is passed to
+For this reason, it is undefined behavior to destroy a CUDA stream that is passed to
`device_memory_resource::deallocate`. If the stream on which the allocation was last used has been
destroyed before calling `deallocate`, or if it is known that it will be destroyed, it is likely
better to synchronize the stream (before destroying it) and then pass a different stream to
`deallocate`

@@ -268,6 +268,11 @@ to synchronize the stream (before destroying it) and then pass a different strea
Note that device memory data structures such as `rmm::device_buffer` and `rmm::device_uvector`
follow these stream-ordered memory allocation semantics and rules.

+For further information about stream-ordered memory allocation semantics, read
+[Using the NVIDIA CUDA Stream-Ordered Memory
+Allocator](https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/)
+on the NVIDIA Developer Blog.
+
### Available Resources

RMM provides several `device_memory_resource` derived classes to satisfy various user requirements.

@@ -279,10 +284,10 @@
Allocates and frees device memory using `cudaMalloc` and `cudaFree`.

#### `managed_memory_resource`

-Allocates and frees device memory using `cudaMallocManaged` and `cudaFree`.
+Allocates and frees device memory using `cudaMallocManaged` and `cudaFree`.

Note that `managed_memory_resource` cannot be used with NVIDIA Virtual GPU Software (vGPU, for use
-with virtual machines or hypervisors) because [NVIDIA CUDA Unified Memory is not supported by
+with virtual machines or hypervisors) because [NVIDIA CUDA Unified Memory is not supported by
NVIDIA vGPU](https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#cuda-open-cl-support-vgpu).

#### `pool_memory_resource`

@@ -296,13 +301,13 @@
cost is constant.

#### `binning_memory_resource`

-Configurable to use multiple upstream memory resources for allocations that fall within different
+Configurable to use multiple upstream memory resources for allocations that fall within different
bin sizes. Often configured with multiple bins backed by `fixed_size_memory_resource`s and a single
`pool_memory_resource` for allocations larger than the largest bin size.
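To make that composition concrete, here is a minimal sketch that layers a
`binning_memory_resource` over a `pool_memory_resource`. The header paths, constructor arguments,
and the chosen bin exponents are assumptions for illustration (newer RMM releases may require an
explicit initial pool size):

```c++
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/binning_memory_resource.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

int main()
{
  rmm::mr::cuda_memory_resource cuda_mr;

  // A coalescing pool sub-allocator drawing its memory from cudaMalloc.
  rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{&cuda_mr};

  // Power-of-two bins from 2^8 (256 B) to 2^22 (4 MiB), each backed by a
  // fixed_size_memory_resource; larger requests fall through to `pool_mr`.
  rmm::mr::binning_memory_resource<decltype(pool_mr)> binning_mr{&pool_mr, 8, 22};

  rmm::cuda_stream_view stream{};              // default stream, for brevity
  void* p = binning_mr.allocate(1024, stream); // served from the 1 KiB bin
  binning_mr.deallocate(p, 1024, stream);
  return 0;
}
```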
### Default Resources and Per-device Resources

-RMM users commonly need to configure a `device_memory_resource` object to use for all allocations
+RMM users commonly need to configure a `device_memory_resource` object to use for all allocations
where another resource has not explicitly been provided. A common example is configuring a
`pool_memory_resource` for all allocations to provide fast dynamic allocation.

@@ -313,7 +318,7 @@ Accessing and modifying the default resource is done through two functions:
- `device_memory_resource* get_current_device_resource()`
  - Returns a pointer to the default resource for the current CUDA device.
  - The initial default memory resource is an instance of `cuda_memory_resource`.
-  - This function is thread safe with respect to concurrent calls to it and
+  - This function is thread safe with respect to concurrent calls to it and
    `set_current_device_resource()`.
  - For more explicit control, you can use `get_per_device_resource()`, which takes a device ID.

@@ -342,13 +347,13 @@ that was active when the `device_memory_resource` was created. Otherwise behavio
If a `device_memory_resource` is used with a stream associated with a different CUDA device than
the device for which the memory resource was created, behavior is undefined.
-
+
Creating a `device_memory_resource` for each device requires care to set the current device before
creating each resource, and to maintain the lifetime of the resources as long as they are set as
per-device resources. Here is an example loop that creates `unique_ptr`s to `pool_memory_resource`
objects for each device and sets them as the per-device resource for that device.

-```c++
+```c++
std::vector<unique_ptr<pool_memory_resource>> per_device_pools;
for(int i = 0; i < N; ++i) {
  cudaSetDevice(i); // set device i before creating MR

@@ -361,12 +366,12 @@
### Allocators

-C++ interfaces commonly allow customizable memory allocation through an [`Allocator`](https://en.cppreference.com/w/cpp/named_req/Allocator) object.
+C++ interfaces commonly allow customizable memory allocation through an [`Allocator`](https://en.cppreference.com/w/cpp/named_req/Allocator) object.
RMM provides several `Allocator` and `Allocator`-like classes.

#### `polymorphic_allocator`

-A [stream-ordered](#stream-ordered-memory-allocation) allocator similar to [`std::pmr::polymorphic_allocator`](https://en.cppreference.com/w/cpp/memory/polymorphic_allocator).
+A [stream-ordered](#stream-ordered-memory-allocation) allocator similar to [`std::pmr::polymorphic_allocator`](https://en.cppreference.com/w/cpp/memory/polymorphic_allocator).
Unlike the standard C++ `Allocator` interface, the `allocate` and `deallocate` functions take a
`cuda_stream_view` indicating the stream on which the (de)allocation occurs.

#### `stream_allocator_adaptor`

@@ -382,17 +387,17 @@
rmm::mr::polymorphic_allocator<int> stream_alloc;

auto adapted = rmm::mr::make_stream_allocator_adaptor(stream_alloc, stream);

// Allocates 100 bytes using `stream_alloc` on `stream`
-auto p = adapted.allocate(100);
+auto p = adapted.allocate(100);
...
// Deallocates using `stream_alloc` on `stream`
-adapted.deallocate(p,100);
+adapted.deallocate(p, 100);
```

#### `thrust_allocator`

`thrust_allocator` is a device memory allocator that uses the strongly typed `thrust::device_ptr`,
making it usable with containers like `thrust::device_vector`.

-See [below](#using-rmm-with-thrust) for more information on using RMM with Thrust.
+See [below](#using-rmm-with-thrust) for more information on using RMM with Thrust.
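For example, `thrust_allocator` can be plugged into `thrust::device_vector` so that the
container's storage comes from RMM. This is a sketch only; it assumes a Thrust version whose
vector constructors accept an allocator argument, and the header path shown here:

```c++
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/thrust_allocator_adaptor.hpp>

#include <thrust/device_vector.h>

int main()
{
  rmm::cuda_stream_view stream{};  // default stream, for brevity

  // Allocator drawing from the current device resource on `stream`.
  rmm::mr::thrust_allocator<int> alloc{stream};

  // 100 device ints whose storage is obtained through RMM instead of
  // Thrust's default cudaMalloc-based allocator.
  thrust::device_vector<int, rmm::mr::thrust_allocator<int>> vec(100, alloc);
  return 0;
}
```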
## Device Data Structures

@@ -405,7 +410,7 @@
An untyped, uninitialized RAII class for stream-ordered device memory allocation.

```c++
cuda_stream_view s{...};
// Allocates at least 100 bytes on stream `s` using the *default* resource
-rmm::device_buffer b{100,s};
+rmm::device_buffer b{100, s};
void* p = b.data(); // Raw, untyped pointer to underlying device memory

kernel<<<..., s.value()>>>(b.data()); // `b` is only safe to use on `s`

@@ -428,11 +433,11 @@
cuda_stream_view s{...};
// default resource
rmm::device_uvector<int32_t> v(100, s);
// Initializes the elements to 0
-thrust::uninitialized_fill(thrust::cuda::par.on(s.value()), v.begin(), v.end(), int32_t{0});
+thrust::uninitialized_fill(thrust::cuda::par.on(s.value()), v.begin(), v.end(), int32_t{0});

rmm::mr::device_memory_resource* mr = new my_custom_resource{...};
// Allocates uninitialized storage for 100 `int32_t` elements on stream `s` using the resource `mr`
-rmm::device_uvector<int32_t> v2{100, s, mr};
+rmm::device_uvector<int32_t> v2{100, s, mr};
```

### `device_scalar`

A typed, RAII class for allocation of a single element in device memory. This is similar to a
`device_uvector` with a single element, but provides convenience functions like modifying the
value in device memory from the host, or retrieving the value from device to host.

@@ -444,7 +449,7 @@
```c++
cuda_stream_view s{...};
// Allocates uninitialized storage for a single `int32_t` in device memory
-rmm::device_scalar a{s};
+rmm::device_scalar<int32_t> a{s};
a.set_value(42, s); // Updates the value in device memory to `42` on stream `s`

kernel<<<..., s.value()>>>(a.data()); // Pass raw pointer to underlying element in device memory

@@ -464,11 +469,11 @@ Similar to `device_memory_resource`, it has two key functions for (de)allocation:
1. `void* host_memory_resource::allocate(std::size_t bytes, std::size_t alignment)`
   - Returns a pointer to an allocation of at least `bytes` bytes aligned to the specified
     `alignment`

2. `void host_memory_resource::deallocate(void* p, std::size_t bytes, std::size_t alignment)`
-   - Reclaims a previous allocation of size `bytes` pointed to by `p`.
+   - Reclaims a previous allocation of size `bytes` pointed to by `p`.

Unlike `device_memory_resource`, the `host_memory_resource` interface and behavior are identical to
-`std::pmr::memory_resource`.
+`std::pmr::memory_resource`.

### Available Resources

@@ -517,7 +522,7 @@
RMM includes two forms of logging: memory event logging and debug logging.

### Memory Event Logging and `logging_resource_adaptor`

Memory event logging writes details of every allocation or deallocation to a CSV (comma-separated
-value) file. In C++, Memory Event Logging is enabled by using the `logging_resource_adaptor` as a
+value) file. In C++, memory event logging is enabled by using the `logging_resource_adaptor` as a
wrapper around any other `device_memory_resource` object.

Each row in the log represents either an allocation or a deallocation. The columns of the file are

@@ -536,7 +541,7 @@ rmm::mr::cuda_memory_resource upstream;
rmm::mr::logging_resource_adaptor<rmm::mr::cuda_memory_resource> log_mr{&upstream, filename};
```

-If a file name is not specified, the environment variable `RMM_LOG_FILE` is queried for the file
+If a file name is not specified, the environment variable `RMM_LOG_FILE` is queried for the file
name. If `RMM_LOG_FILE` is not set, then an exception is thrown by the `logging_resource_adaptor`
constructor.

@@ -546,7 +551,7 @@
set to `True`. The log file name can be set using the `log_file_name` parameter.

### Debug Logging

-RMM includes a debug logger which can be enabled to log trace and debug information to a file. This
+RMM includes a debug logger which can be enabled to log trace and debug information to a file. This
information can show when errors occur, when additional memory is allocated from upstream
resources, etc.

The default log file is `rmm_log.txt` in the current working directory, but the environment
variable `RMM_DEBUG_LOG_FILE` can be set to specify the path and file name.

@@ -558,27 +563,27 @@ of more detailed logging. The default is `INFO`. Available levels are `TRACE`, `
The log relies on the [spdlog](https://github.com/gabime/spdlog.git) library.

Note that to see logging below the `INFO` level, the C++ application must also call
-`rmm::logger().set_level()`, e.g. to enable all levels of logging down to `TRACE`, call
+`rmm::logger().set_level()`, e.g. to enable all levels of logging down to `TRACE`, call
`rmm::logger().set_level(spdlog::level::trace)` (and compile with `-DRMM_LOGGING_LEVEL=TRACE`).

-Note that debug logging is different from the CSV memory allocation logging provided by
+Note that debug logging is different from the CSV memory allocation logging provided by
`rmm::mr::logging_resource_adaptor`. The latter is for logging a history of allocation /
deallocation actions, which can be useful for replay with RMM's replay benchmark.
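Putting the two requirements together, a debug build that wants full trace output would be
compiled with `-DRMM_LOGGING_LEVEL=TRACE` and contain something like the following minimal sketch
(the `rmm/logger.hpp` header path is an assumption; the `set_level` call is quoted from the note
above):

```c++
#include <rmm/logger.hpp>

int main()
{
  // Lower the runtime filter so TRACE and DEBUG messages reach the log file.
  // This cannot re-enable messages that were compiled out by a higher
  // RMM_LOGGING_LEVEL, hence the matching compile-time flag.
  rmm::logger().set_level(spdlog::level::trace);
  return 0;
}
```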
## RMM and CUDA Memory Bounds Checking

Memory allocations taken from a memory resource that allocates a pool of memory (such as
-`pool_memory_resource` and `arena_memory_resource`) are part of the same low-level CUDA memory
-allocation. Therefore, out-of-bounds or misaligned accesses to these allocations are not likely to
-be detected by CUDA tools such as
+`pool_memory_resource` and `arena_memory_resource`) are part of the same low-level CUDA memory
+allocation. Therefore, out-of-bounds or misaligned accesses to these allocations are not likely to
+be detected by CUDA tools such as
[CUDA Compute Sanitizer](https://docs.nvidia.com/cuda/compute-sanitizer/index.html) memcheck.

-Exceptions to this are `cuda_memory_resource`, which wraps `cudaMalloc`, and
-`cuda_async_memory_resource`, which uses `cudaMallocAsync` with CUDA's built-in memory pool
-functionality (CUDA 11.2 or later required). Illegal memory accesses to memory allocated by these
+Exceptions to this are `cuda_memory_resource`, which wraps `cudaMalloc`, and
+`cuda_async_memory_resource`, which uses `cudaMallocAsync` with CUDA's built-in memory pool
+functionality (CUDA 11.2 or later required). Illegal memory accesses to memory allocated by these
resources are detectable with Compute Sanitizer Memcheck.

-It may be possible in the future to add support for memory bounds checking with other memory
+It may be possible in the future to add support for memory bounds checking with other memory
resources using NVTX APIs.

## Using RMM in Python Code

@@ -637,7 +642,7 @@ array([1., 2., 3.])
`MemoryResource` objects are used to configure how device memory allocations are made by RMM.

-By default if a `MemoryResource` is not set explicitly, RMM uses the `CudaMemoryResource`, which
+By default, if a `MemoryResource` is not set explicitly, RMM uses the `CudaMemoryResource`, which
uses `cudaMalloc` for allocating device memory.

`rmm.reinitialize()` provides an easy way to initialize RMM with specific memory resource options

@@ -677,10 +682,10 @@ of 1 GiB and a maximum size of 4 GiB. The pool uses
Other MemoryResources include:
* `FixedSizeMemoryResource` for allocating fixed blocks of memory
-* `BinningMemoryResource` for allocating blocks within specified "bin" sizes from different memory
+* `BinningMemoryResource` for allocating blocks within specified "bin" sizes from different memory
  resources

-MemoryResources are highly configurable and can be composed together in different ways.
+MemoryResources are highly configurable and can be composed together in different ways.
+See `help(rmm.mr)` for more information.
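As one hypothetical composition (the bin exponents and pool size below are illustrative, not
recommendations), small allocations can be served from fixed-size bins while larger ones fall
through to a pool:

```python
import rmm

# Pool sub-allocator on top of cudaMalloc with a 1 GiB initial size.
pool = rmm.mr.PoolMemoryResource(
    rmm.mr.CudaMemoryResource(),
    initial_pool_size=2**30,
)

# Bins of 2^12 (4 KiB) up to 2^22 (4 MiB); anything larger than the largest
# bin is allocated from `pool` directly.
mr = rmm.mr.BinningMemoryResource(pool, 12, 22)
rmm.mr.set_current_device_resource(mr)
```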
+MemoryResources are highly configurable and can be composed together in different ways. See `help(rmm.mr)` for more information. ### Using RMM with CuPy @@ -696,8 +701,8 @@ allocations by setting the CuPy CUDA allocator to ``` -**Note:** This only configures CuPy to use the current RMM resource for allocations. -It does not initialize nor change the current resource, e.g., enabling a memory pool. +**Note:** This only configures CuPy to use the current RMM resource for allocations. +It does not initialize nor change the current resource, e.g., enabling a memory pool. See [here](#memoryresource-objects) for more information on changing the current memory resource. ### Using RMM with Numba @@ -721,6 +726,6 @@ This can be done in two ways: >>> cuda.set_memory_manager(rmm.RMMNumbaManager) ``` -**Note:** This only configures Numba to use the current RMM resource for allocations. -It does not initialize nor change the current resource, e.g., enabling a memory pool. +**Note:** This only configures Numba to use the current RMM resource for allocations. +It does not initialize nor change the current resource, e.g., enabling a memory pool. See [here](#memoryresource-objects) for more information on changing the current memory resource.