Fdat/mpi test implementation #1

Merged: 50 commits merged into master on Aug 1, 2023

Conversation


@fda-tome fda-tome commented Aug 1, 2023

No description provided.

ranocha and others added 30 commits March 14, 2023 13:46
The worker scheduler would previously assume that it was fine to
schedule infinite amounts of work onto the same processor at once, which
is only efficient when tasks do lots of `yield`ing. Because most tasks
do not actually exhibit low occupancy, we want to teach at least the
worker scheduler to limit its eagerness when executing high-occupancy
tasks.

This commit teaches `@spawn` and the worker scheduler about a new
`occupancy` task option, which (on the user side) is a value between 0
and 1 which approximates how fully the task occupies the processor. If
the occupancy is 0.2, then 5 such tasks can execute concurrently and
fully occupy the processor.
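
As a hedged sketch of how this option might be used (the exact `@spawn` option syntax and the `simulate_io` helper are assumptions based on the description above):

```julia
using Dagger

# Hypothetical low-occupancy workload: spends most of its time waiting.
simulate_io(i) = (sleep(0.5); i)

# Each task declares ~20% occupancy, so the scheduler may run up to five
# of them concurrently on a single processor.
tasks = [Dagger.@spawn occupancy=0.2 simulate_io(i) for i in 1:10]
fetch.(tasks)
```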

Processors now operate primarily from a single controlling task per
processor, and work is executed in a lowest-occupancy-first manner to
attempt to maximize throughput.

With processors using occupancy estimates to limit oversubscription,
it's now quite easy for tasks to become starved for work. This commit
also adds work-stealing logic to each processor, allowing a starved
processor to steal scope-compatible tasks from other busy processors.
Processors will be able to steal so long as they are not fully occupied.
APIs like `delayed` and `spawn` assumed that any kwargs passed to them were
options for the scheduler, which is somewhat confusing for users and
precludes passing kwargs to user functions.

This commit changes those APIs, as well as `@spawn`, to instead pass
kwargs directly to the user's function. Options are now passed in an
`Options` struct to `delayed` and `spawn` as the second argument (the
first being the function), while `@spawn` still keeps them before the
call (which is generally more convenient).

Internally, `Thunk`'s `inputs` field is now a
`Vector{Pair{Union{Symbol,Nothing},Any}}`, where the second element of
each pair is the argument, while the first element is a position; if
`nothing`, it's a positional argument, and if a `Symbol`, then it's a
kwarg.
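
A small sketch of the resulting calling convention (the `Options` constructor form and the `single` option shown here are assumptions for illustration):

```julia
using Dagger

f(x, y; scale=1) = scale * (x + y)

# With `@spawn`, scheduler options still come before the call, while kwargs
# after the `;` are now forwarded to the user's function:
t1 = Dagger.@spawn single=1 f(2, 3; scale=10)

# With `spawn`, options go in an `Options` struct as the second argument;
# the remaining args and kwargs are passed through to `f`:
t2 = Dagger.spawn(f, Dagger.Options(single=1), 2, 3; scale=10)

fetch(t1) == fetch(t2) == 50
```
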
The docs were far too wordy for what's supposed to be an introduction to
Dagger. This commit moves most of that information to separate deep-dive
pages, and sets up the quickstart page to have simple examples of how to
do common tasks. The hope is that new users will start by adapting the
simple example, and read more into the deep dive when they want to do
something fancy or learn how the example works.
…docs

Simplify and streamline docs, add Shard iteration
Bumps [actions/checkout](https://github.com/actions/checkout) from 2 to 3.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v2...v3)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
…rrect-eltype

DArray: Figure out correct eltype
jpsamaroo and others added 20 commits July 15, 2023 15:44
…thub_actions/actions/checkout-3

build(deps): bump actions/checkout from 2 to 3
To allow Dagger to fulfill more use cases, it would be helpful if we
could perform some basic transformations on tasks before they get
submitted to the scheduler. For example, we would like to enable
in-order execution semantics for regions of code that execute GPU
kernels, as this matches GPU synchronization semantics, and thus makes
it easier to upgrade GPU code to use Dagger.

Separately, to enable reliable DAG optimizations, we need to be able to
guarantee that a region of user code can be seen as a whole within the
scheduler. The one-at-a-time task submission semantics that we currently
have are insufficient to achieve this, as some tasks in the DAG region
of interest may already have been launched before the optimization can
see enough of the DAG to be useful.

To support these and other use cases, this commit adds a flexible
pre-submit task queueing system, as well as making it possible to add
additional tasks as synchronization dependencies (instead of the default
set from task arguments).

The task queueing system allows a custom task queue to be set in TLS,
which will be used by `@spawn`/`spawn` when submitting tasks. The task
queue is provided one or more task specifications and `EagerThunk`
handles, and is free to delay and/or batch task submission, as well as
to modify the task specification arbitrarily to match the desired
semantics. Task queues are nestable, and tasks submitted within sets of
nested task queues should inherit the semantics of the queue they are
contained within most directly (with further transformations occurring as
tasks move upwards within the nest of queues). The most upstream task
queue submits tasks to the worker 1 eager scheduler, but this is also
expected to be flexible to allow unique task submission semantics.

To support the goal of predictable optimizations, a `LazyTaskQueue` is
added (available via `spawn_bulk`) which batches up multiple task
submissions into just one, and locks the scheduler until all tasks have
been submitted, allowing the scheduler to see the entire DAG structure
all at once. Nesting of `spawn_bulk` queues allows multiple DAG regions
to be combined into a single total region which is submitted all at
once.
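
A minimal sketch of `spawn_bulk` usage (the `process_chunk` helper is hypothetical):

```julia
using Dagger

process_chunk(i) = i^2   # stand-in for real per-chunk work

# All tasks spawned inside the block are submitted to the scheduler as a
# single batch, so it can see the whole DAG region at once.
tasks = Dagger.EagerThunk[]
Dagger.spawn_bulk() do
    for i in 1:8
        push!(tasks, Dagger.@spawn process_chunk(i))
    end
end
fetch.(tasks)
```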

The ability to specify additional task synchronization dependencies is also a
key piece that is orthogonal to task queues. This feature enables the
goal of in-order execution semantics by enabling the creation of an
`InOrderTaskQueue` (available via `spawn_sequential`), which tracks the
last-submitted task or set of tasks, and adds those tasks as additional
synchronization dependencies to the next submitted task or set of tasks,
effectively causing serializing behavior. Nesting of `spawn_sequential`
queues allows separate sequential chains of tasks to be specified, with
deeper-nested chains sequencing after previously-submitted tasks or
chains in shallower-nested queues.

Interestingly, the nesting of `spawn_bulk` within `spawn_sequential`
allows entire DAG regions to explicitly synchronize against each other
(such that one region executes before another), while allowing tasks
within each region to still expose parallelism. The inverse nesting of
`spawn_sequential` within `spawn_bulk` allows a chain of sequential
tasks to be submitted all at once, adding interesting optimization
opportunities.
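
For instance, nesting `spawn_bulk` regions inside `spawn_sequential` might look like this sketch (the stage functions are hypothetical):

```julia
Dagger.spawn_sequential() do
    # Region 1: these tasks may run in parallel with each other.
    Dagger.spawn_bulk() do
        for i in 1:4
            Dagger.@spawn stage1(i)
        end
    end
    # Region 2: sequenced after all of region 1, but internally parallel.
    Dagger.spawn_bulk() do
        for i in 1:4
            Dagger.@spawn stage2(i)
        end
    end
end
```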

Alongside these enhancements, the eager submission pipeline is optimized
by removing the `eager_thunk` submission pathway (which funneled all tasks
through a `Channel`), allowing tasks to be submitted directly to the
scheduler without redirection. This is expected to improve task submission
performance and reduce memory usage.
Implement task queues for configurable task launch
To enable MPI support in the DArray (which is best implemented with each
MPI rank holding only a single local partition), this commit splits the
AbstractBlocks hierarchy into multi-partition and single-partition storage
schemes, where `Blocks <: AbstractMultiBlocks` and a future `MPIBlocks <:
AbstractSingleBlocks`. Additionally, a DArray constructor is added for the
case where only a single subdomain and chunk/thunk is provided.
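
A rough sketch of the resulting partitioner hierarchy (type names follow the commit description; the exact field layout is an assumption):

```julia
abstract type AbstractBlocks{N} end
abstract type AbstractMultiBlocks{N} <: AbstractBlocks{N} end   # many partitions per process
abstract type AbstractSingleBlocks{N} <: AbstractBlocks{N} end  # one local partition per process

# Existing multi-partition scheme: fixed block size along each dimension.
struct Blocks{N} <: AbstractMultiBlocks{N}
    blocksize::NTuple{N,Int}
end

# Future MPI scheme: each rank stores only its single local partition.
struct MPIBlocks{N} <: AbstractSingleBlocks{N}
    blocksize::NTuple{N,Int}
end
```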

For easier post-hoc repartitioning of DArrays, we now store the original
user-provided partitioning scheme within the DArray, and also add it as
a type parameter. This also assists future MPI integration by allowing
for operations to dispatch on the partitioning scheme.

Finally, this commit also adjusts all DArray operations to return
DArrays, so that no lazy operators like MatMul or Map are returned to
the user. While it would be nice to be able to work with these
operators directly, the array ecosystem has generally settled on
propagating arrays of the same or similar types as the inputs to many
operations. The operators themselves are still present behind a single
materializing call, so it should be possible to expose them again in the
future if worthwhile optimization opportunities arise.
…ngle-part

DArray: Operations always return DArrays, and add local partition support for MPI
Allows the `DArray` to participate in MPI operations, by building on the
newly-added support for "MPI-style" partitioning (i.e. only the
rank-local partition is stored), via a new `MPIBlocks` partitioner. This
commit also implements MPI-powered `distribute` and `reduce` operations
for `MPIBlocks`-partitioned arrays, which support a variety of
distribution schemes and data transfer modes.
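
An illustrative sketch of the MPI-style workflow described above (the `MPIBlocks` constructor arguments and the exact `distribute`/`reduce` signatures are assumptions):

```julia
using MPI, Dagger
MPI.Init()

# Each rank contributes data; the resulting DArray stores only the
# rank-local partition under the MPIBlocks partitioner.
A  = rand(100, 100)
DA = Dagger.distribute(A, Dagger.MPIBlocks(50, 50))  # hypothetical partition sizes

# MPI-powered reduction across all ranks' local partitions.
total = reduce(+, DA)
```
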
@fda-tome fda-tome closed this Aug 1, 2023
@fda-tome fda-tome reopened this Aug 1, 2023
@fda-tome fda-tome merged commit c74196f into master Aug 1, 2023