Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix stream handling in multi-op device tasks #421

Merged
merged 13 commits into from
Sep 29, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions INSTALL.md
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@ Both methods are supported. However, for most users we _strongly_ recommend to b
- Boost.Range: header-only, *only used for unit testing*
- [BTAS](http://github.com/ValeevGroup/BTAS), tag 3c91f086090390930bba62c6512c4e74a5520e76 . If usable BTAS installation is not found, TiledArray will download and compile
BTAS from source. *This is the recommended way to compile BTAS for all users*.
- [MADNESS](https://github.com/m-a-d-n-e-s-s/madness), tag 3d585293f0094588778dbd3bec24b65e7bbe6a5d .
- [MADNESS](https://github.com/m-a-d-n-e-s-s/madness), tag 1f307ebbe6604539493e165a7a2b00b366711fd8 .
Only the MADworld runtime and BLAS/LAPACK C API component of MADNESS is used by TiledArray.
If usable MADNESS installation is not found, TiledArray will download and compile
MADNESS from source. *This is the recommended way to compile MADNESS for all users*.
Expand All @@ -68,7 +68,7 @@ Optional prerequisites:
- [CUDA compiler and runtime](https://developer.nvidia.com/cuda-zone) -- for execution on NVIDIA's CUDA-enabled accelerators. CUDA 11 or later is required.
- [HIP/ROCm compiler and runtime](https://developer.nvidia.com/cuda-zone) -- for execution on AMD's ROCm-enabled accelerators. Note that TiledArray does not use ROCm directly but its C++ Heterogeneous-Compute Interface for Portability, `HIP`; although HIP can also be used to program CUDA-enabled devices, in TiledArray it is used only to program ROCm devices, hence ROCm and HIP will be used interchangeably.
- [LibreTT](github.com/victor-anisimov/LibreTT) -- free tensor transpose library for CUDA, ROCm, and SYCL platforms that is based on the [original cuTT library](github.com/ap-hynninen/cutt) extended to provide thread-safety improvements (via github.com/ValeevGroup/cutt) and extended to non-CUDA platforms by [@victor-anisimov](github.com/victor-anisimov) (tag 6eed30d4dd2a5aa58840fe895dcffd80be7fbece).
- [Umpire](github.com/LLNL/Umpire) -- portable memory manager for heterogeneous platforms (tag f9640e0fa4245691cdd434e4f719ac5f7d455f82).
- [Umpire](github.com/LLNL/Umpire) -- portable memory manager for heterogeneous platforms (tag 20839b2e8e8972070dd8f75c7f00d50d6c399716).
- [Doxygen](http://www.doxygen.nl/) -- for building documentation (version 1.8.12 or later).
- [ScaLAPACK](http://www.netlib.org/scalapack/) -- a distributed-memory linear algebra package. If detected, the following C++ components will also be sought and downloaded, if missing:
- [scalapackpp](https://github.com/wavefunction91/scalapackpp.git) -- a modern C++ (C++17) wrapper for ScaLAPACK (tag 6397f52cf11c0dfd82a79698ee198a2fce515d81); pulls and builds the following additional prerequisite
Expand Down
10 changes: 4 additions & 6 deletions doc/dox/dev/Optimization-Guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,10 +18,8 @@ is devoted to communication. [Default = number of cores reported by ]

## MPI

## CUDA
## GPU/Device compute runtimes

In addition to [the environment variables that control the CUDA runtime behavior](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars), several environment variables control specifically the execution of TiledArray on CUDA devices:
* `TA_CUDA_NUM_STREAMS` -- The number of [CUDA streams](https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf) used to execute tasks on each device. Each stream can be viewed as a thread in a threadpool, with tasks in a given stream executing in order, but each stream executing independently of others. For small tasks this may need to be increased. [Default=3]
* `CUDA_VISIBLE_DEVICES` -- This CUDA runtime environment variable is queried by TiledArray to determine whether CUDA devices on a multi-GPU node have been pre-mapped to MPI ranks.
* By default (i.e. when # of MPI ranks on a node <= # of _available_ CUDA devices) TiledArray will map 1 device (in the order of increasing rank) to each MPI rank.
* If # of available CUDA devices < # of MPI ranks on a node _and_ `CUDA_VISIBLE_DEVICES` is set TiledArray will assume that the user mapped the devices to the MPI ranks appropriately (e.g. using a resource manager like `jsrun`) and only checks that each rank has access to 1 CUDA device.
In addition to the environment variables that control the runtime behavior of [CUDA](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) and [HIP/ROCm](https://rocm.docs.amd.com/en/latest/search.html?q=environment+variables), several environment variables control specifically the execution of TiledArray on compute devices:
* `TA_DEVICE_NUM_STREAMS` -- The number of [compute streams](https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf) used to execute tasks on each device. Each stream can be viewed as a thread in a threadpool, with tasks in a given stream executing in order, but each stream executing independently of others. For small tasks this may need to be increased. In addition stream for compute tasks TiledArray also creates 2 dedicated streams for data transfers to/from each device. [Default=3]
* `CUDA_VISIBLE_DEVICES`/`HIP_VISIBLE_DEVICES` -- These runtime environment variables are can be used to map CUDA/HIP devices, respectively, on a multi-device node to MPI ranks. It is usually the responsibility of the resource manager to control this mapping, thus normally it should not be needed. By default TiledArray will assign compute devices on a multidevice node round robin to each MPI rank.
27 changes: 12 additions & 15 deletions examples/device/device_task.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -15,8 +15,8 @@ using tile_type = TA::Tile<tensor_type>;
/// verify the elements in tile is equal to value
void verify(const tile_type& tile, value_type value, std::size_t index) {
// const auto size = tile.size();
std::string message = "verify Tensor: " + std::to_string(index) + '\n';
std::cout << message;
// std::string message = "verify Tensor: " + std::to_string(index) + '\n';
// std::cout << message;
for (auto& num : tile) {
if (num != value) {
std::string error("Error: " + std::to_string(num) + " " +
Expand All @@ -29,33 +29,31 @@ void verify(const tile_type& tile, value_type value, std::size_t index) {
}

tile_type scale(const tile_type& arg, value_type a,
const TiledArray::device::stream_t* stream, std::size_t index) {
DeviceSafeCall(TiledArray::device::setDevice(
TiledArray::deviceEnv::instance()->current_device_id()));
TiledArray::device::Stream stream, std::size_t index) {
/// make result Tensor
using Storage = typename tile_type::tensor_type::storage_type;
Storage result_storage;
auto result_range = arg.range();
make_device_storage(result_storage, arg.size(), *stream);
make_device_storage(result_storage, arg.size(), stream);

typename tile_type::tensor_type result(std::move(result_range),
std::move(result_storage));

/// copy the original Tensor
auto& queue = TiledArray::BLASQueuePool::queue(*stream);
auto& queue = TiledArray::BLASQueuePool::queue(stream);

blas::copy(result.size(), arg.data(), 1, device_data(result.storage()), 1,
queue);

blas::scal(result.size(), a, device_data(result.storage()), 1, queue);

// std::stringstream stream_str;
// stream_str << *stream;
// stream_str << stream;
// std::string message = "run scale on Tensor: " + std::to_string(index) +
// "on stream: " + stream_str.str() + '\n';
// std::cout << message;

TiledArray::device::synchronize_stream(stream);
TiledArray::device::sync_madness_task_with(stream);

return tile_type(std::move(result));
}
Expand All @@ -65,10 +63,10 @@ void process_task(madness::World* world, std::size_t ntask) {
const std::size_t M = 1000;
const std::size_t N = 1000;

std::size_t n_stream = TiledArray::deviceEnv::instance()->num_streams();
std::size_t n_stream = TiledArray::deviceEnv::instance()->num_streams_total();

for (std::size_t i = 0; i < iter; i++) {
auto& stream = TiledArray::deviceEnv::instance()->stream(i % n_stream);
auto stream = TiledArray::deviceEnv::instance()->stream(i % n_stream);

TiledArray::Range range{M, N};

Expand All @@ -77,12 +75,11 @@ void process_task(madness::World* world, std::size_t ntask) {
const double scale_factor = 2.0;

// function pointer to the scale function to call
tile_type (*scale_fn)(const tile_type&, double,
const TiledArray::device::stream_t*, std::size_t) =
&::scale;
tile_type (*scale_fn)(const tile_type&, double, TiledArray::device::Stream,
std::size_t) = &::scale;

madness::Future<tile_type> scale_future = madness::add_device_task(
*world, ::scale, tensor, scale_factor, &stream, ntask * iter + i);
*world, ::scale, tensor, scale_factor, stream, ntask * iter + i);

/// this should start until scale_taskfn is finished
world->taskq.add(verify, scale_future, scale_factor, ntask * iter + i);
Expand Down
4 changes: 2 additions & 2 deletions examples/device/ta_dense_device.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -303,13 +303,13 @@ int try_main(int argc, char **argv) {
<< runtimeVersion << std::endl;

{ // print device properties
int num_devices = TA::deviceEnv::instance()->num_devices();
int num_devices = TA::deviceEnv::instance()->num_visible_devices();

if (num_devices <= 0) {
throw std::runtime_error("No GPUs Found!\n");
}

int device_id = TA::deviceEnv::instance()->current_device_id();
const int device_id = TA::deviceEnv::instance()->current_device_id();

int mpi_size = world.size();
int mpi_rank = world.rank();
Expand Down
7 changes: 4 additions & 3 deletions examples/device/ta_reduce_device.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -17,9 +17,10 @@
*
*/

#include <TiledArray/device/btas_um_tensor.h>
#include <tiledarray.h>

#include <TiledArray/device/btas_um_tensor.h>

template <typename Tile>
void do_main_body(TiledArray::World &world, const long Nm, const long Bm,
const long Nn, const long Bn, const long nrepeat) {
Expand Down Expand Up @@ -291,13 +292,13 @@ int try_main(int argc, char **argv) {
<< runtimeVersion << std::endl;

{ // print device properties
int num_devices = TA::deviceEnv::instance()->num_devices();
int num_devices = TA::deviceEnv::instance()->num_visible_devices();

if (num_devices <= 0) {
throw std::runtime_error("No GPUs Found!\n");
}

int device_id = TA::deviceEnv::instance()->current_device_id();
const int device_id = TA::deviceEnv::instance()->current_device_id();

int mpi_size = world.size();
int mpi_rank = world.rank();
Expand Down
4 changes: 2 additions & 2 deletions examples/device/ta_vector_device.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -308,13 +308,13 @@ int try_main(int argc, char **argv) {
<< runtimeVersion << std::endl;

{ // print device properties
int num_devices = TA::deviceEnv::instance()->num_devices();
int num_devices = TA::deviceEnv::instance()->num_visible_devices();

if (num_devices <= 0) {
throw std::runtime_error("No GPUs Found!\n");
}

int device_id = TA::deviceEnv::instance()->current_device_id();
const int device_id = TA::deviceEnv::instance()->current_device_id();

int mpi_size = world.size();
int mpi_rank = world.rank();
Expand Down
9 changes: 5 additions & 4 deletions external/umpire.cmake
Original file line number Diff line number Diff line change
Expand Up @@ -48,9 +48,9 @@ else()
set(enable_umpire_asserts ON)
endif()

# as of now BLT only supports up to C++17, so limit CMAKE_CXX_STANDARD
# as of now BLT only supports up to C++20, so limit CMAKE_CXX_STANDARD
set(BLT_CXX_STD ${CMAKE_CXX_STANDARD})
set(BLT_CXX_STD_MAX 17)
set(BLT_CXX_STD_MAX 20)
if (BLT_CXX_STD GREATER ${BLT_CXX_STD_MAX})
set(BLT_CXX_STD ${BLT_CXX_STD_MAX})
endif()
Expand Down Expand Up @@ -161,7 +161,8 @@ else()
)

# TiledArray_UMPIRE target depends on existence of these directories to be usable from the build tree at configure time
execute_process(COMMAND ${CMAKE_COMMAND} -E make_directory "${EXTERNAL_SOURCE_DIR}/src/umpire/tpl/camp/include")
execute_process(COMMAND ${CMAKE_COMMAND} -E make_directory "${EXTERNAL_SOURCE_DIR}/src/tpl/umpire/camp/include")
execute_process(COMMAND ${CMAKE_COMMAND} -E make_directory "${EXTERNAL_BUILD_DIR}/src/tpl/umpire/camp/include")
execute_process(COMMAND ${CMAKE_COMMAND} -E make_directory "${EXTERNAL_BUILD_DIR}/include")

# do install of Umpire as part of building TiledArray's install target
Expand Down Expand Up @@ -190,7 +191,7 @@ set_target_properties(
TiledArray_UMPIRE
PROPERTIES
INTERFACE_INCLUDE_DIRECTORIES
"$<BUILD_INTERFACE:${EXTERNAL_SOURCE_DIR}/src>;$<BUILD_INTERFACE:${EXTERNAL_SOURCE_DIR}/src/umpire/tpl/camp/include>;$<BUILD_INTERFACE:${EXTERNAL_BUILD_DIR}/include>;$<INSTALL_INTERFACE:${_UMPIRE_INSTALL_DIR}/include>"
"$<BUILD_INTERFACE:${EXTERNAL_SOURCE_DIR}/src>;$<BUILD_INTERFACE:${EXTERNAL_SOURCE_DIR}/src/tpl>;$<BUILD_INTERFACE:${EXTERNAL_SOURCE_DIR}/src/tpl/umpire/camp/include>;$<BUILD_INTERFACE:${EXTERNAL_BUILD_DIR}/src/tpl/umpire/camp/include>;$<BUILD_INTERFACE:${EXTERNAL_BUILD_DIR}/include>;$<INSTALL_INTERFACE:${_UMPIRE_INSTALL_DIR}/include>"
INTERFACE_LINK_LIBRARIES
"$<BUILD_INTERFACE:${UMPIRE_BUILD_BYPRODUCTS}>;$<INSTALL_INTERFACE:${_UMPIRE_INSTALL_DIR}/lib/libumpire${UMPIRE_DEFAULT_LIBRARY_SUFFIX}>"
)
Expand Down
8 changes: 4 additions & 4 deletions external/versions.cmake
Original file line number Diff line number Diff line change
Expand Up @@ -19,8 +19,8 @@ set(TA_INSTALL_EIGEN_PREVIOUS_VERSION 3.3.7)
set(TA_INSTALL_EIGEN_URL_HASH SHA256=b4c198460eba6f28d34894e3a5710998818515104d6e74e5cc331ce31e46e626)
set(TA_INSTALL_EIGEN_PREVIOUS_URL_HASH MD5=b9e98a200d2455f06db9c661c5610496)

set(TA_TRACKED_MADNESS_TAG 3d585293f0094588778dbd3bec24b65e7bbe6a5d)
set(TA_TRACKED_MADNESS_PREVIOUS_TAG 4785f17bec34e08f10fa4de84c7359f0404a4d78)
set(TA_TRACKED_MADNESS_TAG 1f307ebbe6604539493e165a7a2b00b366711fd8)
set(TA_TRACKED_MADNESS_PREVIOUS_TAG 3d585293f0094588778dbd3bec24b65e7bbe6a5d)
set(TA_TRACKED_MADNESS_VERSION 0.10.1)
set(TA_TRACKED_MADNESS_PREVIOUS_VERSION 0.10.1)

Expand All @@ -30,8 +30,8 @@ set(TA_TRACKED_BTAS_PREVIOUS_TAG 5a45699b78d0540b490c8c769b61033bd4d4f49c)
set(TA_TRACKED_LIBRETT_TAG 6eed30d4dd2a5aa58840fe895dcffd80be7fbece)
set(TA_TRACKED_LIBRETT_PREVIOUS_TAG 354e0ccee54aeb2f191c3ce2c617ebf437e49d83)

set(TA_TRACKED_UMPIRE_TAG f9640e0fa4245691cdd434e4f719ac5f7d455f82)
set(TA_TRACKED_UMPIRE_PREVIOUS_TAG v6.0.0)
set(TA_TRACKED_UMPIRE_TAG 20839b2e8e8972070dd8f75c7f00d50d6c399716)
set(TA_TRACKED_UMPIRE_PREVIOUS_TAG v2023.06.0)

set(TA_TRACKED_SCALAPACKPP_TAG 6397f52cf11c0dfd82a79698ee198a2fce515d81)
set(TA_TRACKED_SCALAPACKPP_PREVIOUS_TAG 711ef363479a90c88788036f9c6c8adb70736cbf )
Expand Down
32 changes: 26 additions & 6 deletions src/TiledArray/conversions/clone.h
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,10 @@
#ifndef TILEDARRAY_CONVERSIONS_CLONE_H__INCLUDED
#define TILEDARRAY_CONVERSIONS_CLONE_H__INCLUDED

#ifdef TILEDARRAY_HAS_DEVICE
#include "TiledArray/device/device_task_fn.h"
#endif

namespace TiledArray {

/// Forward declarations
Expand Down Expand Up @@ -53,12 +57,28 @@ inline DistArray<Tile, Policy> clone(const DistArray<Tile, Policy>& arg) {
if (arg.is_zero(index)) continue;

// Spawn a task to clone the tiles
Future<value_type> tile = world.taskq.add(
[](const value_type& tile) -> value_type {
using TiledArray::clone;
return clone(tile);
},
arg.find(index));

Future<value_type> tile;
if constexpr (!detail::is_device_tile_v<value_type>) {
tile = world.taskq.add(
[](const value_type& tile) -> value_type {
using TiledArray::clone;
return clone(tile);
},
arg.find(index));
} else {
#ifdef TILEDARRAY_HAS_DEVICE
tile = madness::add_device_task(
world,
[](const value_type& tile) -> value_type {
using TiledArray::clone;
return clone(tile);
},
arg.find(index));
#else
abort(); // unreachable
#endif
}

// Store result tile
result.set(index, tile);
Expand Down
14 changes: 7 additions & 7 deletions src/TiledArray/device/blas.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -31,27 +31,27 @@ bool BLASQueuePool::initialized() { return !queues_.empty(); }

void BLASQueuePool::initialize() {
if (initialized()) return;
queues_.reserve(deviceEnv::instance()->num_streams());
for (std::size_t sidx = 0; sidx != deviceEnv::instance()->num_streams();
queues_.reserve(deviceEnv::instance()->num_streams_total());
for (std::size_t sidx = 0; sidx != deviceEnv::instance()->num_streams_total();
++sidx) {
auto stream = deviceEnv::instance()->stream(
auto q = deviceEnv::instance()->stream(
sidx); // blaspp forsome reason wants non-const lvalue ref to stream
queues_.emplace_back(std::make_unique<blas::Queue>(0, stream));
queues_.emplace_back(std::make_unique<blas::Queue>(q.device, q.stream));
}
}

void BLASQueuePool::finalize() { queues_.clear(); }

blas::Queue& BLASQueuePool::queue(std::size_t ordinal) {
TA_ASSERT(initialized());
TA_ASSERT(ordinal < deviceEnv::instance()->num_streams());
TA_ASSERT(ordinal < deviceEnv::instance()->num_streams_total());
return *(queues_[ordinal]);
}

blas::Queue& BLASQueuePool::queue(device::stream_t const& stream) {
blas::Queue& BLASQueuePool::queue(device::Stream const& stream) {
TA_ASSERT(initialized());
for (auto&& q : queues_) {
if (q->stream() == stream) return *q;
if (q->device() == stream.device && q->stream() == stream.stream) return *q;
}
throw TiledArray::Exception(
"no matching device stream found in the BLAS queue pool");
Expand Down
33 changes: 18 additions & 15 deletions src/TiledArray/device/blas.h
Original file line number Diff line number Diff line change
Expand Up @@ -28,19 +28,13 @@

#ifdef TILEDARRAY_HAS_DEVICE

#include <TiledArray/external/device.h>

#include <TiledArray/error.h>
#include <TiledArray/external/device.h>
#include <TiledArray/tensor/complex.h>

#include <TiledArray/math/blas.h>
#include <blas/device.hh>

namespace TiledArray {

/*
* cuBLAS interface functions
*/

/**
* BLASQueuePool is a singleton controlling a pool of blas::Queue objects:
* - queues map to stream 1-to-1, so do not call Queue::set_stream to maintain
Expand All @@ -54,20 +48,29 @@ struct BLASQueuePool {
static void finalize();

static blas::Queue &queue(std::size_t ordinal = 0);
static blas::Queue &queue(const device::stream_t &stream);
static blas::Queue &queue(const device::Stream &s);

private:
static std::vector<std::unique_ptr<blas::Queue>> queues_;
};

namespace detail {
/// maps a (tile) Range to blas::Queue; if had already pushed work into a
/// device::Stream (as indicated by madness_task_current_stream() )
/// will return that Stream instead
/// @param[in] range will determine the device::Stream to compute an object
/// associated with this Range object
/// @return the device::Stream to use for creating tasks generating work
/// associated with Range \p range
template <typename Range>
blas::Queue &get_blasqueue_based_on_range(const Range &range) {
// TODO better way to get stream based on the id of tensor
auto stream_ord = range.offset() % device::Env::instance()->num_streams();
return BLASQueuePool::queue(stream_ord);
blas::Queue &blasqueue_for(const Range &range) {
auto stream_opt = device::madness_task_current_stream();
if (!stream_opt) {
auto stream_ord =
range.offset() % device::Env::instance()->num_streams_total();
return BLASQueuePool::queue(stream_ord);
} else
return BLASQueuePool::queue(*stream_opt);
}
} // namespace detail

} // namespace TiledArray

Expand Down
Loading