
Distributed mixed FFT / vertical tridiagonal solver #2538

Closed
glwagner wants to merge 46 commits

Conversation

glwagner
Member

@glwagner glwagner commented May 8, 2022

This PR builds off #2536 and implements a distributed Poisson solver that uses horizontal FFTs and a vertical tridiagonal solve, with more help from @jipolanco.

When the domain is distributed in (x, y), this kind of solver is more expensive than a purely FFT-based solver, because it requires 4 additional transpositions and the associated communication.

For problems that are distributed only in x or y (e.g., a slab decomposition), we can avoid the additional transpositions. Implementing that optimization is a TODO for this PR.

Some of the details are discussed on jipolanco/PencilFFTs.jl#44.

Future work, which would require abstracting the implementation of hydrostatic pressure in NonhydrostaticModel (and, for friendliness, forbidding the use of VerticallyImplicitTimeDiscretization), could in principle support a more efficient version of this solver with pencil decomposition in (y, z) or (x, z). That memory layout would increase performance for very large problems that require a 2D domain decomposition, since decomposing in (y, z) or (x, z) reduces the number of transposes needed by 4 compared with (x, y). This feature is easy to code but might take some time to test. We've already noticed on #1910 that lumping hydrostatic and nonhydrostatic pressure produces different (perhaps lower-quality) solutions.

TODO:

  • Implement a more efficient algorithm for 1D "slab" decompositions
  • Add tests
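
For orientation, here is a minimal serial sketch of the strategy described above: FFTs in x and y diagonalize the horizontal Laplacian, leaving one independent vertical tridiagonal system per horizontal wavenumber. This is an illustration only (uniform grid, second-order differences, periodic in x and y, homogeneous Neumann in z), not the PR's distributed implementation, and the function and variable names are made up:

```julia
using FFTW, LinearAlgebra

"Solve ∇²ϕ = R: FFTs in (x, y) followed by a tridiagonal solve in z."
function solve_poisson!(ϕ, R, Lx, Ly, Lz)
    Nx, Ny, Nz = size(R)
    Δx, Δy, Δz = Lx / Nx, Ly / Ny, Lz / Nz

    # Eigenvalues of the second-order discrete Laplacian in x and y (periodic)
    λx = [-(2 * sin(π * i / Nx) / Δx)^2 for i in 0:Nx-1]
    λy = [-(2 * sin(π * j / Ny) / Δy)^2 for j in 0:Ny-1]

    R̂ = fft(R, (1, 2))  # horizontal transforms only; z stays in physical space
    ϕ̂ = similar(R̂)

    for j in 1:Ny, i in 1:Nx
        # One vertical tridiagonal system per horizontal wavenumber, with
        # homogeneous Neumann conditions folded into the boundary rows.
        lower = fill(1 / Δz^2, Nz - 1)
        upper = fill(1 / Δz^2, Nz - 1)
        main  = fill(-2 / Δz^2 + λx[i] + λy[j], Nz)
        main[1]   += 1 / Δz^2
        main[end] += 1 / Δz^2

        if i == 1 && j == 1
            ϕ̂[i, j, :] .= 0  # the horizontal-mean mode is determined only up to a constant
        else
            ϕ̂[i, j, :] .= Tridiagonal(lower, main, upper) \ R̂[i, j, :]
        end
    end

    ϕ .= real.(ifft(ϕ̂, (1, 2)))
    return ϕ
end
```

The point of the distributed version is that, with a suitable decomposition, the vertical dimension stays local to each process, so the tridiagonal solves above require no communication once the horizontal transforms (and their transpositions) are done.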

@glwagner glwagner marked this pull request as draft May 8, 2022 15:01
@glwagner
Member Author

glwagner commented May 8, 2022

@jipolanco I just sent up another big commit --- I realized that we could use extra_dims when we use a 1D process grid + a tridiagonal solve to save a lot of communication:

```julia
else # z is local, and we are good to go!
    b̂ = solver.unpermuted_right_hand_side
    x̂ = solver.tridiagonal_storage
    b̂ .=
    solve!(x̂, solver.tridiagonal_vertical_solver, b̂)
end
```

and

```julia
    if effective_topology[3] isa Flat
        # If the _effective_ z-topology is Flat --- which occurs either because topology(global_grid, 3)
        # is actually Flat, or because we are using the tridiagonal_vertical_solver --- we use "extra_dims"
        # to represent the vertical (3rd) dimension to avoid a needless transform / permutation.
        #
        # With this option, we have to remove the last element from both the permuted size and the process grid.
        # Note that setting extra_dims = tuple(gNz) means that input / storage will have size (N1, N2, gNz).
        extra_dims = tuple(gNz)
        permuted_size = permuted_size[1:2]
        processors_per_dimension = tuple(processors_per_dimension[1])
    else
        extra_dims = ()
    end
end

# Reduce inputs for vertical tridiagonal solver
Ntransforms = length(permuted_size)
transform_dimensions = Tuple(input_permutation[i] for i = 1:Ntransforms)
transforms = Tuple(infer_transform(effective_topology[d]) for d in transform_dimensions)

@info "Building PencilFFTPlan with transforms $transforms, process grid $processors_per_dimension, and extra_dims $extra_dims..."

plan = PencilFFTPlan(permuted_size, transforms, processors_per_dimension, communicator; extra_dims)
```

Let me know what you think.
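
As a concrete illustration of the extra_dims idea in the snippet above: with a 1D process grid and the vertical extent passed via extra_dims, only the two horizontal directions are transformed (and transposed). This is a hypothetical, standalone sketch with made-up sizes, intended to be launched under MPI:

```julia
using MPI, PencilFFTs
using LinearAlgebra: mul!

MPI.Init()
comm = MPI.COMM_WORLD

Nx, Ny, Nz = 64, 64, 32
transforms = (Transforms.FFT(), Transforms.FFT())  # horizontal transforms only
proc_dims  = (MPI.Comm_size(comm),)                # 1D "slab" process grid

# extra_dims = (Nz,): input and output have size (N1, N2, Nz), but the third
# dimension is never transformed or transposed.
plan = PencilFFTPlan((Nx, Ny), transforms, proc_dims, comm; extra_dims = (Nz,))

u = allocate_input(plan)   # local slab of an (Nx, Ny, Nz) complex array
û = allocate_output(plan)
fill!(u, 0)                # fill with the right-hand side in a real application

mul!(û, plan, u)           # forward horizontal FFTs; each z column stays local
# ...vertical tridiagonal solves on û would go here, followed by ldiv!(u, plan, û)
```

This is the mechanism that lets the slab-decomposed case skip the extra transpositions mentioned in the PR description.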

@glwagner glwagner changed the title [WIP] Distributed mixed FFT / vertical tridiagonal solver Distributed mixed FFT / vertical tridiagonal solver May 8, 2022
@glwagner glwagner marked this pull request as ready for review May 8, 2022 20:13
@glwagner glwagner requested a review from simone-silvestri May 8, 2022 20:14
@tomchor
Collaborator

tomchor commented Jul 18, 2022

@glwagner it seems the only reason the tests aren't passing is that mpiexecjl isn't properly linked:

/bin/bash: /storage5/buildkite-agent/.julia-7523/bin/mpiexecjl: No such file or directory 

Maybe fix that and merge since (apparently) this PR is otherwise ready to go?
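
For anyone reproducing this locally: MPI.jl provides an installer for the mpiexecjl wrapper. The sketch below assumes MPI.jl's default install path; the test file name is only illustrative.

```julia
using MPI

# Installs the mpiexecjl wrapper script, by default into joinpath(DEPOT_PATH[1], "bin")
# (typically ~/.julia/bin). The wrapper forwards to the MPI launcher associated with
# the active Julia project.
MPI.install_mpiexecjl()

# It can then be invoked from a shell, for example:
#   ~/.julia/bin/mpiexecjl --project=. -n 4 julia test/test_distributed_models.jl
```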

@glwagner
Member Author

There's a less trivial error here: https://buildkite.com/clima/oceananigans/builds/7523#311535ff-f56d-410e-8571-c3b1d9757daf

I'll try to restart the whole build and see what happens.

@tomchor
Collaborator

tomchor commented Jul 25, 2022

Just realized the distributed tests have been running for 6 days. I guess it's fair to say there's still something to fix lol

Just killed it to save resources

@glwagner
Member Author

@tomchor are you able to test locally? I believe these passed locally for me, so the problem might be relatively easy to solve.

@tomchor
Collaborator

tomchor commented Jul 25, 2022

> @tomchor are you able to test locally? I believe these passed locally for me, so the problem might be relatively easy to solve.

I've never tested anything in parallel locally, but I can definitely try

@tomchor
Collaborator

tomchor commented Jul 25, 2022

@glwagner I ran the tests locally and they got stuck in the same place as the CI run. So it appears that there's something to be fixed here...

@navidcy
Collaborator

navidcy commented Sep 19, 2022

@glwagner how do you run the tests locally? Do you use mpiexecjl?

@glwagner
Member Author

This is stale and has been superseded.

@glwagner glwagner closed this Sep 26, 2024