
Add a CUDA devcontainer #751

Merged
merged 19 commits into development from chong/dual-container
Sep 29, 2024

Conversation

@chongchonghe (Contributor) commented Sep 25, 2024

Description

Adds a new CUDA devcontainer. Now you can choose from gcc-container and cuda-container.
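
For context, the usual VS Code convention for offering multiple dev containers is one subfolder per configuration under .devcontainer/, each with its own devcontainer.json. Based on the container names above, the layout is presumably something like the following sketch (not a listing of the actual tree):

.devcontainer/
├── gcc-container/
│   └── devcontainer.json
└── cuda-container/
    ├── devcontainer.json
    └── Dockerfile   # builds on an NVIDIA nvhpc CUDA image (see below)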

When using the default compiler and MPI library in the CUDA container, it is necessary to add these options to the CMake configuration flags when building Quokka:

$ export PATH=$PATH:/workspaces/quokka/scripts
$ cmake .. -DCMAKE_TEST_LAUNCHER=mpirun-singleton-wrapper -DDISABLE_FMAD=OFF

For details, see the CMake documentation for CMAKE_TEST_LAUNCHER.
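
CMAKE_TEST_LAUNCHER tells CTest to prepend a launcher command to every test it runs, so each test executable can be started under mpirun rather than as a bare MPI singleton. As a rough illustration only (a guess at the idea, not the actual contents of scripts/mpirun-singleton-wrapper), such a wrapper could be as simple as:

#!/bin/bash
# Hypothetical sketch of a CTest launcher script: run the wrapped
# test executable (passed in "$@") as a single-rank MPI job.
exec mpirun -np 1 "$@"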

Related issues

Closes #745.

Checklist

Before this pull request can be reviewed, all of these tasks should be completed. Denote completed tasks with an x inside the square brackets [ ] in the Markdown source below:

  • I have added a description (see above).
  • I have added a link to any related issues (see above).
  • I have read the Contributing Guide.
  • I have added tests for any new physics that this PR adds to the code.
  • I have tested this PR on my local computer and all tests pass.
  • I have manually triggered the GPU tests with the magic comment /azp run.
  • I have requested a reviewer for this PR.

@BenWibking (Collaborator)

The issue is that Python is not installed. extern_parameters.H is generated by a Python script inside Microphysics.

@chongchonghe (Contributor, Author)

> The issue is that Python is not installed. extern_parameters.H is generated by a Python script inside Microphysics.

I see. I thought that since we are not using matplotlib in GPU runs, we didn't need to install Python.

@BenWibking (Collaborator)

> > The issue is that Python is not installed. extern_parameters.H is generated by a Python script inside Microphysics.
>
> I see. I thought that since we are not using matplotlib in GPU runs, we didn't need to install Python.

No, unfortunately Python is a required build-time dependency now, since it is required by Microphysics.
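
For anyone hitting the same build failure, the container-side fix is to install Python when the image is built. A minimal sketch for an Ubuntu-based image (the actual Dockerfile change in this PR may differ):

# Hypothetical addition to the cuda-container Dockerfile (Ubuntu base):
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*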

@BenWibking (Collaborator)

We should probably update the README to note that Python is required, rather than optional.

@BenWibking (Collaborator)

I've created a PR to update the README: #753

@chongchonghe (Contributor, Author)

@BenWibking did you get it to work? I got the following error:

------
Dockerfile:2
--------------------
   1 |     # Use the NVIDIA CUDA image as the base image
   2 | >>> FROM nvcr.io/nvidia/nvhpc:24.7-devel-cuda12.5-ubuntu22.04
   3 |     # FROM nvcr.io/nvidia/nvhpc:24.9-runtime-cuda11.8-ubuntu22.04
   4 |
--------------------
ERROR: failed to solve: nvcr.io/nvidia/nvhpc:24.7-devel-cuda12.5-ubuntu22.04: failed to resolve source metadata for nvcr.io/nvidia/nvhpc:24.7-devel-cuda12.5-ubuntu22.04: failed to do request: Head "https://nvcr.io/v2/nvidia/nvhpc/manifests/24.7-devel-cuda12.5-ubuntu22.04": net/http: TLS handshake timeout

I also tried a few other images on this page and got the same error.

@BenWibking (Collaborator)

The full list is here: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nvhpc/tags.

I am able to download the nvcr.io/nvidia/nvhpc:24.9-devel-cuda12.6-ubuntu24.04 container:

[41382 ms] Start: Run: docker ps -q -a --filter label=devcontainer.local_folder=/Users/benwibking/quokka
[41399 ms] Start: Run: docker inspect --type image nvcr.io/nvidia/nvhpc:24.9-devel-cuda12.6-ubuntu24.04
[41414 ms] Loading 6 extra certificates from /tmp/vsch/certificates-812b88c53c4680de95d0d10aa2d5e82da39186556a15b6d5a04bc6e202dbb1f6.pem.
[41808 ms] Request 'https://nvcr.io/v2/nvidia/nvhpc/manifests/24.9-devel-cuda12.6-ubuntu24.04' failed
[42011 ms] Request 'https://nvcr.io/v2/nvidia/nvhpc/manifests/24.9-devel-cuda12.6-ubuntu24.04' failed
[42011 ms] Error fetching image details: No manifest found for nvcr.io/nvidia/nvhpc:24.9-devel-cuda12.6-ubuntu24.04.
[42012 ms] Start: Run: docker pull nvcr.io/nvidia/nvhpc:24.9-devel-cuda12.6-ubuntu24.04
24.9-devel-cuda12.6-ubuntu24.04: Pulling from nvidia/nvhpc
6e59cb05818e: Pull complete 
61e6133d186d: Pull complete 
8f4445af7d02: Downloading    159MB/166.6MB

@BenWibking (Collaborator)

Maybe it's a trans-Pacific internet latency issue? Can you try again and see if it works?
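
If the devcontainer build keeps timing out, pre-pulling the image by hand takes the devcontainer tooling out of the loop and makes transient network failures easier to spot. A sketch, using the tag from the Dockerfile above:

# Retry the pull a few times; transient TLS-handshake timeouts often clear up.
for i in 1 2 3; do
  docker pull nvcr.io/nvidia/nvhpc:24.7-devel-cuda12.5-ubuntu22.04 && break
  sleep 30
done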

@chongchonghe (Contributor, Author)

> Maybe it's a trans-Pacific internet latency issue? Can you try again and see if it works?

It's probably an internet issue, because the errors are not consistent across multiple tries.

If you tested it and it worked fine, you can approve this PR if you want. I'll do more tests another time.

@BenWibking BenWibking marked this pull request as ready for review September 27, 2024 18:53
@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Sep 27, 2024
BenWibking previously approved these changes Sep 27, 2024
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Sep 27, 2024
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Sep 27, 2024
@BenWibking (Collaborator) commented Sep 27, 2024

With nvc++, ScalarAdvectionSemiEllipse fails due to an FPE trap:

 1/41 Test  #2: ScalarAdvectionSemiEllipse .......***Failed    3.49 sec
Initializing AMReX (24.09)...
MPI initialized with 1 MPI processes
MPI initialized with thread support level 0
AMReX (24.09) initialized
Erroneous arithmetic operation
See Backtrace.0 file for details

PrimordialChem also fails:

35/41 Test #41: PrimordialChem ...................***Failed    4.14 sec
Initializing AMReX (24.09)...
MPI initialized with 1 MPI processes
MPI initialized with thread support level 0
AMReX (24.09) initialized
Writing plotfile plt00000

[Warning] [Performance] The grid blocking factor (1) is too small for reasonable performance. It should be 32 (or greater) when running on GPUs, and 16 (or greater) when running on CPUs.

[Warning] [Performance] The maximum grid size (1) is too small for reasonable performance. It should be 128 (or greater) when running on GPUs, and 64 (or greater) when running on CPUs.

Coarse STEP 1 at t = 0 (0%) starts ...
Erroneous arithmetic operation
See Backtrace.0 file for details

This can be fixed later.
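
One way to triage this later: AMReX's floating-point exception traps can be relaxed at runtime through its standard amrex.fpe_trap_* inputs, which helps distinguish a genuine numerical bug from an nvc++ code-generation difference (e.g. FMA contraction, cf. the DISABLE_FMAD flag above). A sketch, where the binary and inputs-file names are made up:

# Hypothetical invocation; amrex.fpe_trap_* are standard AMReX runtime inputs.
$ ./ScalarAdvectionSemiEllipse semiellipse.in \
    amrex.fpe_trap_invalid=0 amrex.fpe_trap_zero=0 amrex.fpe_trap_overflow=0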

@chongchonghe (Contributor, Author)

@markkrumholz Can you approve this? This PR needs your approval before it can be merged, because both Ben and I have pushed commits.

@chongchonghe chongchonghe added this pull request to the merge queue Sep 29, 2024
Merged via the queue into development with commit c77f80c Sep 29, 2024
20 checks passed
@chongchonghe chongchonghe deleted the chong/dual-container branch October 6, 2024 02:22