Fdat/mpi test implementation #1

Merged: 50 commits merged into master on Aug 1, 2023

Conversation


@fda-tome fda-tome commented Aug 1, 2023

No description provided.

ranocha and others added 30 commits March 14, 2023 13:46
The worker scheduler would previously assume that it was fine to
schedule infinite amounts of work onto the same processor at once, which
is only efficient when tasks do lots of `yield`ing. Because most tasks
do not actually exhibit low occupancy, we want to teach at least the
worker scheduler to limit its eagerness when executing high-occupancy
tasks.

This commit teaches `@spawn` and the worker scheduler about a new
`occupancy` task option, which (on the user side) is a value between 0
and 1 which approximates how fully the task occupies the processor. If
the occupancy is 0.2, then 5 such tasks can execute concurrently and
fully occupy the processor.
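
As a hedged sketch of how this option might be used (the exact `@spawn` option syntax and the `simulate_io` helper are assumptions based on the description above):

```julia
using Dagger

# Hypothetical low-occupancy workload: spends most of its time waiting.
simulate_io(i) = (sleep(0.5); i)

# Each task declares ~20% occupancy, so the scheduler may run up to five
# of them concurrently on a single processor.
tasks = [Dagger.@spawn occupancy=0.2 simulate_io(i) for i in 1:10]
fetch.(tasks)
```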

Processors now operate primarily from a single controlling task per
processor, and work is executed in a lowest-occupancy-first manner to
attempt to maximize throughput.

With processors using occupancy estimates to limit oversubscription,
it's now quite easy for tasks to become starved for work. This commit
also adds work-stealing logic to each processor, allowing a starved
processor to steal scope-compatible tasks from other busy processors.
Processors will be able to steal so long as they are not fully occupied.
APIs like `delayed` and `spawn` assumed that any kwargs passed to them were
options for the scheduler, which is somewhat confusing for users and
precludes passing kwargs to user functions.

This commit changes those APIs, as well as `@spawn`, to instead pass
kwargs directly to the user's function. Options are now passed in an
`Options` struct to `delayed` and `spawn` as the second argument (the
first being the function), while `@spawn` still keeps them before the
call (which is generally more convenient).

Internally, `Thunk`'s `inputs` field is now a
`Vector{Pair{Union{Symbol,Nothing},Any}}`, where the second element of
each pair is the argument, while the first element is a position; if
`nothing`, it's a positional argument, and if a `Symbol`, then it's a
kwarg.
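
A small sketch of the resulting calling convention (the `Options` constructor form and the `single` option shown here are assumptions for illustration):

```julia
using Dagger

f(x, y; scale=1) = scale * (x + y)

# With `@spawn`, scheduler options still come before the call, while kwargs
# after the `;` are now forwarded to the user's function:
t1 = Dagger.@spawn single=1 f(2, 3; scale=10)

# With `spawn`, options go in an `Options` struct as the second argument;
# the remaining args and kwargs are passed through to `f`:
t2 = Dagger.spawn(f, Dagger.Options(single=1), 2, 3; scale=10)

fetch(t1) == fetch(t2) == 50
```
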
The docs were far too wordy for what's supposed to be an introduction to
Dagger. This commit moves most of that information to separate deep-dive
pages, and sets up the quickstart page to have simple examples of how to
do common tasks. The hope is that new users will start by adapting the
simple example, and read more into the deep dive when they want to do
something fancy or learn how the example works.
…docs

Simplify and streamline docs, add Shard iteration
Bumps [actions/checkout](https://github.com/actions/checkout) from 2 to 3.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v2...v3)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
…rrect-eltype

DArray: Figure out correct eltype
jpsamaroo and others added 20 commits July 15, 2023 15:44
…thub_actions/actions/checkout-3

build(deps): bump actions/checkout from 2 to 3
To allow Dagger to fulfill more use cases, it would be helpful if we
could perform some basic transformations on tasks before they get
submitted to the scheduler. For example, we would like to enable
in-order execution semantics for regions of code that execute GPU
kernels, as this matches GPU synchronization semantics, and thus makes
it easier to upgrade GPU code to use Dagger.

Separately, to enable reliable DAG optimizations, we need to be able to
guarantee that a region of user code can be seen as a whole within the
scheduler. The one-at-a-time task submission semantics that we currently
have are insufficient to achieve this, as some tasks in the DAG region
of interest may already have been launched before the optimization can
see enough of the DAG to be useful.

To support these and other use cases, this commit adds a flexible
pre-submit task queueing system, as well as making it possible to add
additional tasks as synchronization dependencies (instead of the default
set from task arguments).

The task queueing system allows a custom task queue to be set in TLS,
which will be used by `@spawn`/`spawn` when submitting tasks. The task
queue is provided one or more task specifications and `EagerThunk`
handles, and is free to delay and/or batch task submission, as well as
to modify the task specification arbitrarily to match the desired
semantics. Task queues are nestable, and tasks submitted within sets of
nested task queues should inherit the semantics of the queue they are
contained within most directly (with further transformations occurring as
tasks move upwards within the nest of queues). The most upstream task
queue submits tasks to the worker 1 eager scheduler, but this is also
expected to be flexible to allow unique task submission semantics.

To support the goal of predictable optimizations, a `LazyTaskQueue` is
added (available via `spawn_bulk`) which batches up multiple task
submissions into just one, and locks the scheduler until all tasks have
been submitted, allowing the scheduler to see the entire DAG structure
all at once. Nesting of `spawn_bulk` queues allows multiple DAG regions
to be combined into a single total region which is submitted all at
once.
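
A minimal sketch of `spawn_bulk` usage (the `process_chunk` helper is hypothetical):

```julia
using Dagger

process_chunk(i) = i^2   # stand-in for real per-chunk work

# All tasks spawned inside the block are submitted to the scheduler as a
# single batch, so it can see the whole DAG region at once.
tasks = Dagger.EagerThunk[]
Dagger.spawn_bulk() do
    for i in 1:8
        push!(tasks, Dagger.@spawn process_chunk(i))
    end
end
fetch.(tasks)
```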

The ability to specify additional task synchronization dependencies is also a
key piece that is orthogonal to task queues. This feature enables the
goal of in-order execution semantics by enabling the creation of an
`InOrderTaskQueue` (available via `spawn_sequential`), which tracks the
last-submitted task or set of tasks, and adds those tasks as additional
synchronization dependencies to the next submitted task or set of tasks,
effectively causing serializing behavior. Nesting of `spawn_sequential`
queues allows separate sequential chains of tasks to be specified, with
deeper-nested chains sequencing after previously-submitted tasks or
chains in shallower-nested queues.

Interestingly, the nesting of `spawn_bulk` within `spawn_sequential`
allows entire DAG regions to explicitly synchronize against each other
(such that one region executes before another), while allowing tasks
within each region to still expose parallelism. The inverse nesting of
`spawn_sequential` within `spawn_bulk` allows a chain of sequential
tasks to be submitted all at once, adding interesting optimization
opportunities.
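
For instance, nesting `spawn_bulk` regions inside `spawn_sequential` might look like this sketch (the stage functions are hypothetical):

```julia
Dagger.spawn_sequential() do
    # Region 1: these tasks may run in parallel with each other.
    Dagger.spawn_bulk() do
        for i in 1:4
            Dagger.@spawn stage1(i)
        end
    end
    # Region 2: sequenced after all of region 1, but internally parallel.
    Dagger.spawn_bulk() do
        for i in 1:4
            Dagger.@spawn stage2(i)
        end
    end
end
```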

Alongside these enhancements, the eager submission pipeline is optimized
by removing the `eager_thunk` submission pathway (which funneled all tasks
through a `Channel`), allowing tasks to be submitted directly to the
scheduler without redirection. This is expected to improve task submission
performance and reduce memory usage.
Implement task queues for configurable task launch
To enable MPI support in the DArray (which is best implemented with each
MPI rank holding only a single local partition), this commit splits the
AbstractBlocks hierarchy into multi-partition and single-partition storage
schemes, where `Blocks <: AbstractMultiBlocks` and a future `MPIBlocks <:
AbstractSingleBlocks`. Additionally, a DArray constructor is added for the
case where only a single subdomain and chunk/thunk is provided.
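
A rough sketch of the resulting partitioner hierarchy (type names follow the commit description; the exact field layout is an assumption):

```julia
abstract type AbstractBlocks{N} end
abstract type AbstractMultiBlocks{N} <: AbstractBlocks{N} end   # many partitions per process
abstract type AbstractSingleBlocks{N} <: AbstractBlocks{N} end  # one local partition per process

# Existing multi-partition scheme: fixed block size along each dimension.
struct Blocks{N} <: AbstractMultiBlocks{N}
    blocksize::NTuple{N,Int}
end

# Future MPI scheme: each rank stores only its single local partition.
struct MPIBlocks{N} <: AbstractSingleBlocks{N}
    blocksize::NTuple{N,Int}
end
```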

For easier post-hoc repartitioning of DArrays, we now store the original
user-provided partitioning scheme within the DArray, and also add it as
a type parameter. This also assists future MPI integration by allowing
for operations to dispatch on the partitioning scheme.

Finally, this commit also adjusts all DArray operations to return
DArrays, so that no lazy operators like MatMul or Map are returned to
the user. While it would be nice to be able to work with these
operators directly, the array ecosystem has generally settled on
propagating arrays of the same or similar types as the inputs to many
operations. The operators themselves are still present behind a single
materializing call, so it should be possible to expose them again in the
future if worthwhile optimization opportunities arise.
…ngle-part

DArray: Operations always return DArrays, and add local partition support for MPI
Allows the `DArray` to participate in MPI operations, by building on the
newly-added support for "MPI-style" partitioning (i.e. only the
rank-local partition is stored), via a new `MPIBlocks` partitioner. This
commit also implements MPI-powered `distribute` and `reduce` operations
for `MPIBlocks`-partitioned arrays, which support a variety of
distribution schemes and data transfer modes.
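
An illustrative sketch of the MPI-style workflow described above (the `MPIBlocks` constructor arguments and the exact `distribute`/`reduce` signatures are assumptions):

```julia
using MPI, Dagger
MPI.Init()

# Each rank contributes data; the resulting DArray stores only the
# rank-local partition under the MPIBlocks partitioner.
A  = rand(100, 100)
DA = Dagger.distribute(A, Dagger.MPIBlocks(50, 50))  # hypothetical partition sizes

# MPI-powered reduction across all ranks' local partitions.
total = reduce(+, DA)
```
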
@fda-tome fda-tome closed this Aug 1, 2023
@fda-tome fda-tome reopened this Aug 1, 2023
@fda-tome fda-tome merged commit c74196f into master Aug 1, 2023