First-stage parallelization

As of version 3.0, FLUX is first-stage parallelized: critical operations (notably force and neighbor updating) can be spread across processors. The mechanism is the ordinary UNIX fork call: at each parallelized step, several daughter processes are forked off to carry out the work, and each daughter returns its updated piece of the simulation state to the parent process through a pipe.
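
The pattern in miniature looks something like the sketch below -- generic POSIX code for illustration only, not the actual FLUX routines: the parent forks a daughter, the daughter does its share of the work and writes the result into a pipe, and the parent reads it back.

```c
/* Minimal sketch of the fork/pipe round trip described above (generic
 * POSIX illustration, not the actual FLUX code): the daughter does its
 * share of the work and writes the result back through a pipe. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    int fd[2];
    if (pipe(fd) < 0) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {                       /* daughter process        */
        close(fd[0]);                     /* daughter only writes    */
        const char *result = "updated state from the daughter";
        write(fd[1], result, strlen(result) + 1);
        _exit(0);
    }

    close(fd[1]);                         /* parent only reads       */
    char buf[64];
    read(fd[0], buf, sizeof(buf));        /* pull the result back    */
    close(fd[0]);
    waitpid(pid, NULL, 0);                /* reap the daughter       */
    printf("parent received: \"%s\"\n", buf);
    return 0;
}
```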

Parallelization is controlled via the concurrency field of the Flux::World object, which contains the number of daughter processes to spawn. Most entry points to the FLUX library are not parallelized: only the ones with _parallel suffixes are.

Mechanism

The parallelization is somewhat crude at present -- it uses the three routines parallel_prep, parallel_fluxon_spawn_springboard, and parallel_finish, which in turn rely on the binary dump routines in io.c. Because the prep/springboard/finish sequence uses some global variables, it is not nestable -- don't try to re-parallelize an already parallel task!

The parallelization code parallelizes arena-wide tasks by fluxon: the spawner attempts to break up the simulation approximately equally by number of vertices. It accumulates a DUMBLIST of fluxons to be processed, keeping track of the total number of vertices in the collection. When the number of accumulated vertices exceeds the per-process quota, the code opens a pipe and calls fork; the daughter process then goes to work on the fluxons in the current DUMBLIST. When it has finished, it dumps the associated vertices over the pipe, using the binary_dump_fluxon_pipe routine in io.c, and exits.
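
The quota step can be pictured with a small standalone sketch (illustrative names and numbers only, not FLUX's own data structures): fluxons are accumulated into the current batch until the running vertex count passes the per-process quota, which is the point at which a daughter would be forked.

```c
/* Standalone sketch of the vertex-count quota logic only (no fork here,
 * and all names/numbers are made up): fluxons accumulate into a batch
 * until the running vertex total exceeds the per-process quota. */
#include <stdio.h>

int main(void) {
    int vert_count[] = { 40, 30, 50, 20, 45, 60, 35, 20 }; /* vertices per fluxon */
    int nflux = (int)(sizeof(vert_count) / sizeof(vert_count[0]));
    int concurrency = 3;                 /* number of daughters to spawn */

    long total = 0;
    for (int i = 0; i < nflux; i++) total += vert_count[i];
    long quota = total / concurrency;    /* per-process vertex quota     */

    long batch_verts = 0;
    int batch = 0;
    printf("batch 0:");
    for (int i = 0; i < nflux; i++) {
        printf(" fluxon%d", i);          /* add fluxon i to the batch    */
        batch_verts += vert_count[i];
        if (batch_verts > quota && i < nflux - 1) {
            /* quota exceeded: this is where fork() would be called */
            batch++;
            printf("  [%ld vertices -> fork daughter]\nbatch %d:", batch_verts, batch);
            batch_verts = 0;
        }
    }
    printf("  [%ld vertices -> fork daughter]\n", batch_verts);
    return 0;
}
```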

When all the daughter processes have been spawned, the master process waits for each daughter to exit and then snarfs up all of the VERTEXes from its pipe, using binary_read_dumpfile in io.c. As a side effect, binary_read_dumpfile updates the WORLD's vertex statistical accumulators, so that (e.g.) the curvature and force min/max info are stored without having to make an additional pass through the whole arena.
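
The collect-and-accumulate phase can be sketched generically as well (made-up names again, not the real binary_read_dumpfile): the master reaps each daughter in turn, reads its result from the pipe, and folds per-daughter statistics into a running arena-wide accumulator.

```c
/* Generic sketch of the collect phase: each daughter reports a per-batch
 * maximum over its pipe; the master waits for the daughter to exit,
 * reads the value, and folds it into a running accumulator. */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

#define NPROC 3

int main(void) {
    int    rfd[NPROC];
    pid_t  pid[NPROC];
    double fake_batch_max[NPROC] = { 1.7, 4.2, 2.9 };   /* stand-in "force maxima" */

    for (int i = 0; i < NPROC; i++) {
        int fd[2];
        if (pipe(fd) < 0) { perror("pipe"); return 1; }
        pid[i] = fork();
        if (pid[i] < 0) { perror("fork"); return 1; }
        if (pid[i] == 0) {                               /* daughter: report, exit */
            close(fd[0]);
            write(fd[1], &fake_batch_max[i], sizeof(double));
            _exit(0);
        }
        close(fd[1]);
        rfd[i] = fd[0];
    }

    /* Master: reap each daughter, read its dump, update the accumulator. */
    double global_max = 0;
    for (int i = 0; i < NPROC; i++) {
        double batch_max;
        waitpid(pid[i], NULL, 0);
        read(rfd[i], &batch_max, sizeof(double));
        close(rfd[i]);
        if (batch_max > global_max) global_max = batch_max;
    }
    printf("arena-wide maximum: %g\n", global_max);
    return 0;
}
```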

Extending the parallelization

To write new parallelized methods, place them in model.c so that you can update the static global variables in that module. Follow the model of world_relax_step_parallel, which sets the global fluxon_calc_step_sb to point to the appropriate springboard routine and then makes use of the parallel_prep, parallel_fluxon_spawn_springboard, and parallel_finish routines.

If you do something other than modify all the vertices in place, you will need to write a different binary dump/snarf routine pair to get the state info back to the master process (the current one, binary_dump_fluxon_pipe, assumes that no vertices are created or deleted). Note that, because the daughter processes are clones of the parent, pointer values of pre-existing data structures are preserved -- that makes binary_dump_fluxon_pipe quite a bit faster, because there is no need to search by label for each data structure being updated.
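
As a structural sketch only (every name and signature below is a hypothetical stand-in, not the real FLUX API), a new parallel method boils down to aiming a file-scope "springboard" function pointer at the per-fluxon worker and then running the shared prep/spawn/finish driver:

```c
/* Structural sketch only: hypothetical names throughout.  The shape of
 * the pattern is what matters -- a file-scope springboard pointer is
 * aimed at the per-fluxon worker, then a shared prep/spawn/finish
 * driver does the forking and collection (here, just a serial loop). */
#include <stdio.h>

typedef struct { int label; double force; } FAKE_FLUXON;

/* File-scope springboard pointer, analogous in role to fluxon_calc_step_sb. */
static void (*springboard)(FAKE_FLUXON *f) = NULL;

/* Stand-ins for parallel_prep / parallel_fluxon_spawn_springboard /
 * parallel_finish; here they just run the springboard over each fluxon. */
static void prep(void)   { printf("prep: set up global state\n"); }
static void spawn(FAKE_FLUXON *list, int n) {
    for (int i = 0; i < n; i++) springboard(&list[i]);   /* fork would happen here */
}
static void finish(void) { printf("finish: reap daughters, snarf vertices\n"); }

/* The per-fluxon worker for the new method. */
static void my_new_step(FAKE_FLUXON *f) {
    f->force *= 0.5;                      /* pretend "update in place" */
    printf("fluxon %d -> force %g\n", f->label, f->force);
}

/* The new entry point, following the shape of world_relax_step_parallel. */
void my_new_step_parallel(FAKE_FLUXON *list, int n) {
    springboard = my_new_step;            /* aim the springboard       */
    prep();
    spawn(list, n);
    finish();
}

int main(void) {
    FAKE_FLUXON arena[] = { {1, 2.0}, {2, 6.0}, {3, 10.0} };
    my_new_step_parallel(arena, 3);
    return 0;
}
```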