ValeevGroup · evaleev · Sep 29, 2023 · Sep 25, 2023 · Sep 26, 2023 · Sep 26, 2023
diff --git a/INSTALL.md b/INSTALL.md
@@ -42,7 +42,7 @@ Both methods are supported. However, for most users we _strongly_ recommend to b
   - Boost.Range: header-only, *only used for unit testing*
 - [BTAS](http://github.com/ValeevGroup/BTAS), tag 3c91f086090390930bba62c6512c4e74a5520e76 . If usable BTAS installation is not found, TiledArray will download and compile
   BTAS from source. *This is the recommended way to compile BTAS for all users*.
-- [MADNESS](https://github.com/m-a-d-n-e-s-s/madness), tag 3d585293f0094588778dbd3bec24b65e7bbe6a5d .
+- [MADNESS](https://github.com/m-a-d-n-e-s-s/madness), tag 1f307ebbe6604539493e165a7a2b00b366711fd8 .
   Only the MADworld runtime and BLAS/LAPACK C API component of MADNESS is used by TiledArray.
   If usable MADNESS installation is not found, TiledArray will download and compile
   MADNESS from source. *This is the recommended way to compile MADNESS for all users*.
@@ -68,7 +68,7 @@ Optional prerequisites:
     - [CUDA compiler and runtime](https://developer.nvidia.com/cuda-zone) -- for execution on NVIDIA's CUDA-enabled accelerators. CUDA 11 or later is required.
     - [HIP/ROCm compiler and runtime](https://developer.nvidia.com/cuda-zone) -- for execution on AMD's ROCm-enabled accelerators. Note that TiledArray does not use ROCm directly but its C++ Heterogeneous-Compute Interface for Portability, `HIP`; although HIP can also be used to program CUDA-enabled devices, in TiledArray it is used only to program ROCm devices, hence ROCm and HIP will be used interchangeably.
   - [LibreTT](github.com/victor-anisimov/LibreTT) -- free tensor transpose library for CUDA, ROCm, and SYCL platforms that is based on the [original cuTT library](github.com/ap-hynninen/cutt) extended to provide thread-safety improvements (via github.com/ValeevGroup/cutt) and extended to non-CUDA platforms by [@victor-anisimov](github.com/victor-anisimov) (tag 6eed30d4dd2a5aa58840fe895dcffd80be7fbece).
-  - [Umpire](github.com/LLNL/Umpire) -- portable memory manager for heterogeneous platforms (tag f9640e0fa4245691cdd434e4f719ac5f7d455f82).
+  - [Umpire](github.com/LLNL/Umpire) -- portable memory manager for heterogeneous platforms (tag 20839b2e8e8972070dd8f75c7f00d50d6c399716).
 - [Doxygen](http://www.doxygen.nl/) -- for building documentation (version 1.8.12 or later).
 - [ScaLAPACK](http://www.netlib.org/scalapack/) -- a distributed-memory linear algebra package. If detected, the following C++ components will also be sought and downloaded, if missing:
   - [scalapackpp](https://github.com/wavefunction91/scalapackpp.git) -- a modern C++ (C++17) wrapper for ScaLAPACK (tag 6397f52cf11c0dfd82a79698ee198a2fce515d81); pulls and builds the following additional prerequisite

diff --git a/doc/dox/dev/Optimization-Guide.md b/doc/dox/dev/Optimization-Guide.md
@@ -18,10 +18,8 @@ is devoted to communication. [Default = number of cores reported by ]
 
 ## MPI
 
-## CUDA
+## GPU/Device compute runtimes
 
-In addition to [the environment variables that control the CUDA runtime behavior](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars), several environment variables control specifically the execution of TiledArray on CUDA devices:
-* `TA_CUDA_NUM_STREAMS` -- The number of [CUDA streams](https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf) used to execute tasks on each device. Each stream can be viewed as a thread in a threadpool, with tasks in a given stream executing in order, but each stream executing independently of others. For small tasks this may need to be increased. [Default=3]
-* `CUDA_VISIBLE_DEVICES` -- This CUDA runtime environment variable is queried by TiledArray to determine whether CUDA devices on a multi-GPU node have been pre-mapped to MPI ranks.
-  * By default (i.e. when # of MPI ranks on a node <= # of _available_ CUDA devices) TiledArray will map 1 device (in the order of increasing rank) to each MPI rank.
-  * If # of available CUDA devices < # of MPI ranks on a node _and_ `CUDA_VISIBLE_DEVICES` is set TiledArray will assume that the user mapped the devices to the MPI ranks appropriately (e.g. using a resource manager like `jsrun`) and only checks that each rank has access to 1 CUDA device.
+In addition to the environment variables that control the runtime behavior of [CUDA](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) and [HIP/ROCm](https://rocm.docs.amd.com/en/latest/search.html?q=environment+variables), several environment variables control specifically the execution of TiledArray on compute devices:
+* `TA_DEVICE_NUM_STREAMS` -- The number of [compute streams](https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf) used to execute tasks on each device. Each stream can be viewed as a thread in a threadpool, with tasks in a given stream executing in order, but each stream executing independently of others. For small tasks this may need to be increased. In addition stream for compute tasks TiledArray also creates 2 dedicated streams for data transfers to/from each device. [Default=3]
+* `CUDA_VISIBLE_DEVICES`/`HIP_VISIBLE_DEVICES` -- These runtime environment variables are can be used to map CUDA/HIP devices, respectively, on a multi-device node to MPI ranks. It is usually the responsibility of the resource manager to control this mapping, thus normally it should not be needed. By default TiledArray will assign compute devices on a multidevice node round robin to each MPI rank.
diff --git a/examples/device/device_task.cpp b/examples/device/device_task.cpp
@@ -15,8 +15,8 @@ using tile_type = TA::Tile<tensor_type>;
 /// verify the elements in tile is equal to value
 void verify(const tile_type& tile, value_type value, std::size_t index) {
   //  const auto size = tile.size();
-  std::string message = "verify Tensor: " + std::to_string(index) + '\n';
-  std::cout << message;
+  //  std::string message = "verify Tensor: " + std::to_string(index) + '\n';
+  //  std::cout << message;
   for (auto& num : tile) {
     if (num != value) {
       std::string error("Error: " + std::to_string(num) + " " +
@@ -29,33 +29,31 @@ void verify(const tile_type& tile, value_type value, std::size_t index) {
 }
 
 tile_type scale(const tile_type& arg, value_type a,
-                const TiledArray::device::stream_t* stream, std::size_t index) {
-  DeviceSafeCall(TiledArray::device::setDevice(
-      TiledArray::deviceEnv::instance()->current_device_id()));
+                TiledArray::device::Stream stream, std::size_t index) {
   /// make result Tensor
   using Storage = typename tile_type::tensor_type::storage_type;
   Storage result_storage;
   auto result_range = arg.range();
-  make_device_storage(result_storage, arg.size(), *stream);
+  make_device_storage(result_storage, arg.size(), stream);
 
   typename tile_type::tensor_type result(std::move(result_range),
                                          std::move(result_storage));
 
   /// copy the original Tensor
-  auto& queue = TiledArray::BLASQueuePool::queue(*stream);
+  auto& queue = TiledArray::BLASQueuePool::queue(stream);
 
   blas::copy(result.size(), arg.data(), 1, device_data(result.storage()), 1,
              queue);
 
   blas::scal(result.size(), a, device_data(result.storage()), 1, queue);
 
   //  std::stringstream stream_str;
-  //  stream_str << *stream;
+  //  stream_str << stream;
   //  std::string message = "run scale on Tensor: " + std::to_string(index) +
   //                        "on stream: " + stream_str.str() + '\n';
   //  std::cout << message;
 
-  TiledArray::device::synchronize_stream(stream);
+  TiledArray::device::sync_madness_task_with(stream);
 
   return tile_type(std::move(result));
 }
@@ -65,10 +63,10 @@ void process_task(madness::World* world, std::size_t ntask) {
   const std::size_t M = 1000;
   const std::size_t N = 1000;
 
-  std::size_t n_stream = TiledArray::deviceEnv::instance()->num_streams();
+  std::size_t n_stream = TiledArray::deviceEnv::instance()->num_streams_total();
 
   for (std::size_t i = 0; i < iter; i++) {
-    auto& stream = TiledArray::deviceEnv::instance()->stream(i % n_stream);
+    auto stream = TiledArray::deviceEnv::instance()->stream(i % n_stream);
 
     TiledArray::Range range{M, N};
 
@@ -77,12 +75,11 @@ void process_task(madness::World* world, std::size_t ntask) {
     const double scale_factor = 2.0;
 
     // function pointer to the scale function to call
-    tile_type (*scale_fn)(const tile_type&, double,
-                          const TiledArray::device::stream_t*, std::size_t) =
-        &::scale;
+    tile_type (*scale_fn)(const tile_type&, double, TiledArray::device::Stream,
+                          std::size_t) = &::scale;
 
     madness::Future<tile_type> scale_future = madness::add_device_task(
-        *world, ::scale, tensor, scale_factor, &stream, ntask * iter + i);
+        *world, ::scale, tensor, scale_factor, stream, ntask * iter + i);
 
     /// this should start until scale_taskfn is finished
     world->taskq.add(verify, scale_future, scale_factor, ntask * iter + i);

diff --git a/examples/device/ta_dense_device.cpp b/examples/device/ta_dense_device.cpp
@@ -303,13 +303,13 @@ int try_main(int argc, char **argv) {
             << runtimeVersion << std::endl;
 
   {  // print device properties
-    int num_devices = TA::deviceEnv::instance()->num_devices();
+    int num_devices = TA::deviceEnv::instance()->num_visible_devices();
 
     if (num_devices <= 0) {
       throw std::runtime_error("No GPUs Found!\n");
     }
 
-    int device_id = TA::deviceEnv::instance()->current_device_id();
+    const int device_id = TA::deviceEnv::instance()->current_device_id();
 
     int mpi_size = world.size();
     int mpi_rank = world.rank();

diff --git a/examples/device/ta_reduce_device.cpp b/examples/device/ta_reduce_device.cpp
@@ -17,9 +17,10 @@
  *
  */
 
-#include <TiledArray/device/btas_um_tensor.h>
 #include <tiledarray.h>
 
+#include <TiledArray/device/btas_um_tensor.h>
+
 template <typename Tile>
 void do_main_body(TiledArray::World &world, const long Nm, const long Bm,
                   const long Nn, const long Bn, const long nrepeat) {
@@ -291,13 +292,13 @@ int try_main(int argc, char **argv) {
             << runtimeVersion << std::endl;
 
   {  // print device properties
-    int num_devices = TA::deviceEnv::instance()->num_devices();
+    int num_devices = TA::deviceEnv::instance()->num_visible_devices();
 
     if (num_devices <= 0) {
       throw std::runtime_error("No GPUs Found!\n");
     }
 
-    int device_id = TA::deviceEnv::instance()->current_device_id();
+    const int device_id = TA::deviceEnv::instance()->current_device_id();
 
     int mpi_size = world.size();
     int mpi_rank = world.rank();

diff --git a/examples/device/ta_vector_device.cpp b/examples/device/ta_vector_device.cpp
@@ -308,13 +308,13 @@ int try_main(int argc, char **argv) {
             << runtimeVersion << std::endl;
 
   {  // print device properties
-    int num_devices = TA::deviceEnv::instance()->num_devices();
+    int num_devices = TA::deviceEnv::instance()->num_visible_devices();
 
     if (num_devices <= 0) {
       throw std::runtime_error("No GPUs Found!\n");
     }
 
-    int device_id = TA::deviceEnv::instance()->current_device_id();
+    const int device_id = TA::deviceEnv::instance()->current_device_id();
 
     int mpi_size = world.size();
     int mpi_rank = world.rank();

diff --git a/external/umpire.cmake b/external/umpire.cmake
@@ -48,9 +48,9 @@ else()
         set(enable_umpire_asserts ON)
     endif()
 
-    # as of now BLT only supports up to C++17, so limit CMAKE_CXX_STANDARD
+    # as of now BLT only supports up to C++20, so limit CMAKE_CXX_STANDARD
     set(BLT_CXX_STD ${CMAKE_CXX_STANDARD})
-    set(BLT_CXX_STD_MAX 17)
+    set(BLT_CXX_STD_MAX 20)
     if (BLT_CXX_STD GREATER ${BLT_CXX_STD_MAX})
         set(BLT_CXX_STD ${BLT_CXX_STD_MAX})
     endif()
@@ -161,7 +161,8 @@ else()
             )
 
     # TiledArray_UMPIRE target depends on existence of these directories to be usable from the build tree at configure time
-    execute_process(COMMAND ${CMAKE_COMMAND} -E make_directory "${EXTERNAL_SOURCE_DIR}/src/umpire/tpl/camp/include")
+    execute_process(COMMAND ${CMAKE_COMMAND} -E make_directory "${EXTERNAL_SOURCE_DIR}/src/tpl/umpire/camp/include")
+    execute_process(COMMAND ${CMAKE_COMMAND} -E make_directory "${EXTERNAL_BUILD_DIR}/src/tpl/umpire/camp/include")
     execute_process(COMMAND ${CMAKE_COMMAND} -E make_directory "${EXTERNAL_BUILD_DIR}/include")
 
     # do install of Umpire as part of building TiledArray's install target
@@ -190,7 +191,7 @@ set_target_properties(
         TiledArray_UMPIRE
         PROPERTIES
         INTERFACE_INCLUDE_DIRECTORIES
-        "$<BUILD_INTERFACE:${EXTERNAL_SOURCE_DIR}/src>;$<BUILD_INTERFACE:${EXTERNAL_SOURCE_DIR}/src/umpire/tpl/camp/include>;$<BUILD_INTERFACE:${EXTERNAL_BUILD_DIR}/include>;$<INSTALL_INTERFACE:${_UMPIRE_INSTALL_DIR}/include>"
+        "$<BUILD_INTERFACE:${EXTERNAL_SOURCE_DIR}/src>;$<BUILD_INTERFACE:${EXTERNAL_SOURCE_DIR}/src/tpl>;$<BUILD_INTERFACE:${EXTERNAL_SOURCE_DIR}/src/tpl/umpire/camp/include>;$<BUILD_INTERFACE:${EXTERNAL_BUILD_DIR}/src/tpl/umpire/camp/include>;$<BUILD_INTERFACE:${EXTERNAL_BUILD_DIR}/include>;$<INSTALL_INTERFACE:${_UMPIRE_INSTALL_DIR}/include>"
         INTERFACE_LINK_LIBRARIES
         "$<BUILD_INTERFACE:${UMPIRE_BUILD_BYPRODUCTS}>;$<INSTALL_INTERFACE:${_UMPIRE_INSTALL_DIR}/lib/libumpire${UMPIRE_DEFAULT_LIBRARY_SUFFIX}>"
         )

diff --git a/external/versions.cmake b/external/versions.cmake
@@ -19,8 +19,8 @@ set(TA_INSTALL_EIGEN_PREVIOUS_VERSION 3.3.7)
 set(TA_INSTALL_EIGEN_URL_HASH SHA256=b4c198460eba6f28d34894e3a5710998818515104d6e74e5cc331ce31e46e626)
 set(TA_INSTALL_EIGEN_PREVIOUS_URL_HASH MD5=b9e98a200d2455f06db9c661c5610496)
 
-set(TA_TRACKED_MADNESS_TAG 3d585293f0094588778dbd3bec24b65e7bbe6a5d)
-set(TA_TRACKED_MADNESS_PREVIOUS_TAG 4785f17bec34e08f10fa4de84c7359f0404a4d78)
+set(TA_TRACKED_MADNESS_TAG 1f307ebbe6604539493e165a7a2b00b366711fd8)
+set(TA_TRACKED_MADNESS_PREVIOUS_TAG 3d585293f0094588778dbd3bec24b65e7bbe6a5d)
 set(TA_TRACKED_MADNESS_VERSION 0.10.1)
 set(TA_TRACKED_MADNESS_PREVIOUS_VERSION 0.10.1)
 
@@ -30,8 +30,8 @@ set(TA_TRACKED_BTAS_PREVIOUS_TAG 5a45699b78d0540b490c8c769b61033bd4d4f49c)
 set(TA_TRACKED_LIBRETT_TAG 6eed30d4dd2a5aa58840fe895dcffd80be7fbece)
 set(TA_TRACKED_LIBRETT_PREVIOUS_TAG 354e0ccee54aeb2f191c3ce2c617ebf437e49d83)
 
-set(TA_TRACKED_UMPIRE_TAG f9640e0fa4245691cdd434e4f719ac5f7d455f82)
-set(TA_TRACKED_UMPIRE_PREVIOUS_TAG v6.0.0)
+set(TA_TRACKED_UMPIRE_TAG 20839b2e8e8972070dd8f75c7f00d50d6c399716)
+set(TA_TRACKED_UMPIRE_PREVIOUS_TAG v2023.06.0)
 
 set(TA_TRACKED_SCALAPACKPP_TAG 6397f52cf11c0dfd82a79698ee198a2fce515d81)
 set(TA_TRACKED_SCALAPACKPP_PREVIOUS_TAG 711ef363479a90c88788036f9c6c8adb70736cbf )

diff --git a/src/TiledArray/conversions/clone.h b/src/TiledArray/conversions/clone.h
@@ -26,6 +26,10 @@
 #ifndef TILEDARRAY_CONVERSIONS_CLONE_H__INCLUDED
 #define TILEDARRAY_CONVERSIONS_CLONE_H__INCLUDED
 
+#ifdef TILEDARRAY_HAS_DEVICE
+#include "TiledArray/device/device_task_fn.h"
+#endif
+
 namespace TiledArray {
 
 /// Forward declarations
@@ -53,12 +57,28 @@ inline DistArray<Tile, Policy> clone(const DistArray<Tile, Policy>& arg) {
     if (arg.is_zero(index)) continue;
 
     // Spawn a task to clone the tiles
-    Future<value_type> tile = world.taskq.add(
-        [](const value_type& tile) -> value_type {
-          using TiledArray::clone;
-          return clone(tile);
-        },
-        arg.find(index));
+
+    Future<value_type> tile;
+    if constexpr (!detail::is_device_tile_v<value_type>) {
+      tile = world.taskq.add(
+          [](const value_type& tile) -> value_type {
+            using TiledArray::clone;
+            return clone(tile);
+          },
+          arg.find(index));
+    } else {
+#ifdef TILEDARRAY_HAS_DEVICE
+      tile = madness::add_device_task(
+          world,
+          [](const value_type& tile) -> value_type {
+            using TiledArray::clone;
+            return clone(tile);
+          },
+          arg.find(index));
+#else
+      abort();  // unreachable
+#endif
+    }
 
     // Store result tile
     result.set(index, tile);

diff --git a/src/TiledArray/device/blas.cpp b/src/TiledArray/device/blas.cpp
@@ -31,27 +31,27 @@ bool BLASQueuePool::initialized() { return !queues_.empty(); }
 
 void BLASQueuePool::initialize() {
   if (initialized()) return;
-  queues_.reserve(deviceEnv::instance()->num_streams());
-  for (std::size_t sidx = 0; sidx != deviceEnv::instance()->num_streams();
+  queues_.reserve(deviceEnv::instance()->num_streams_total());
+  for (std::size_t sidx = 0; sidx != deviceEnv::instance()->num_streams_total();
        ++sidx) {
-    auto stream = deviceEnv::instance()->stream(
+    auto q = deviceEnv::instance()->stream(
         sidx);  // blaspp forsome reason wants non-const lvalue ref to stream
-    queues_.emplace_back(std::make_unique<blas::Queue>(0, stream));
+    queues_.emplace_back(std::make_unique<blas::Queue>(q.device, q.stream));
   }
 }
 
 void BLASQueuePool::finalize() { queues_.clear(); }
 
 blas::Queue& BLASQueuePool::queue(std::size_t ordinal) {
   TA_ASSERT(initialized());
-  TA_ASSERT(ordinal < deviceEnv::instance()->num_streams());
+  TA_ASSERT(ordinal < deviceEnv::instance()->num_streams_total());
   return *(queues_[ordinal]);
 }
 
-blas::Queue& BLASQueuePool::queue(device::stream_t const& stream) {
+blas::Queue& BLASQueuePool::queue(device::Stream const& stream) {
   TA_ASSERT(initialized());
   for (auto&& q : queues_) {
-    if (q->stream() == stream) return *q;
+    if (q->device() == stream.device && q->stream() == stream.stream) return *q;
   }
   throw TiledArray::Exception(
       "no matching device stream found in the BLAS queue pool");

diff --git a/src/TiledArray/device/blas.h b/src/TiledArray/device/blas.h
@@ -28,19 +28,13 @@
 
 #ifdef TILEDARRAY_HAS_DEVICE
 
-#include <TiledArray/external/device.h>
-
 #include <TiledArray/error.h>
+#include <TiledArray/external/device.h>
 #include <TiledArray/tensor/complex.h>
-
-#include <TiledArray/math/blas.h>
+#include <blas/device.hh>
 
 namespace TiledArray {
 
-/*
- * cuBLAS interface functions
- */
-
 /**
  * BLASQueuePool is a singleton controlling a pool of blas::Queue objects:
  * - queues map to stream 1-to-1, so do not call Queue::set_stream to maintain
@@ -54,20 +48,29 @@ struct BLASQueuePool {
   static void finalize();
 
   static blas::Queue &queue(std::size_t ordinal = 0);
-  static blas::Queue &queue(const device::stream_t &stream);
+  static blas::Queue &queue(const device::Stream &s);
 
  private:
   static std::vector<std::unique_ptr<blas::Queue>> queues_;
 };
 
-namespace detail {
+/// maps a (tile) Range to blas::Queue; if had already pushed work into a
+/// device::Stream (as indicated by madness_task_current_stream() )
+/// will return that Stream instead
+/// @param[in] range will determine the device::Stream to compute an object
+/// associated with this Range object
+/// @return the device::Stream to use for creating tasks generating work
+/// associated with Range \p range
 template <typename Range>
-blas::Queue &get_blasqueue_based_on_range(const Range &range) {
-  // TODO better way to get stream based on the id of tensor
-  auto stream_ord = range.offset() % device::Env::instance()->num_streams();
-  return BLASQueuePool::queue(stream_ord);
+blas::Queue &blasqueue_for(const Range &range) {
+  auto stream_opt = device::madness_task_current_stream();
+  if (!stream_opt) {
+    auto stream_ord =
+        range.offset() % device::Env::instance()->num_streams_total();
+    return BLASQueuePool::queue(stream_ord);
+  } else
+    return BLASQueuePool::queue(*stream_opt);
 }
-}  // namespace detail
 
 }  // namespace TiledArray