
Add a CUDA devcontainer #751

Merged
merged 19 commits into development from chong/dual-container
Sep 29, 2024

Conversation

@chongchonghe (Contributor) commented Sep 25, 2024

Description

Adds a new CUDA devcontainer. Now you can choose from gcc-container and cuda-container.
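
For context, the usual VS Code convention for offering multiple dev containers is one subfolder per configuration under .devcontainer/, each with its own devcontainer.json. Based on the container names above, the layout is presumably something like the following sketch (not a listing of the actual tree):

.devcontainer/
├── gcc-container/
│   └── devcontainer.json
└── cuda-container/
    ├── devcontainer.json
    └── Dockerfile   # builds on an NVIDIA nvhpc CUDA image (see below)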

When using the default compiler and MPI library in the CUDA container, it is necessary to add these options to the CMake configuration flags when building Quokka:

$ export PATH=$PATH:/workspaces/quokka/scripts
$ cmake .. -DCMAKE_TEST_LAUNCHER=mpirun-singleton-wrapper -DDISABLE_FMAD=OFF

For details, see the CMake documentation for CMAKE_TEST_LAUNCHER.
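
CMAKE_TEST_LAUNCHER tells CTest to prepend a launcher command to every test it runs, so each test executable can be started under mpirun rather than as a bare MPI singleton. As a rough illustration only (a guess at the idea, not the actual contents of scripts/mpirun-singleton-wrapper), such a wrapper could be as simple as:

#!/bin/bash
# Hypothetical sketch of a CTest launcher script: run the wrapped
# test executable (passed in "$@") as a single-rank MPI job.
exec mpirun -np 1 "$@"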

Related issues

Closes #745.

Checklist

Before this pull request can be reviewed, all of these tasks should be completed. Denote completed tasks with an x inside the square brackets [ ] in the Markdown source below:

  • I have added a description (see above).
  • I have added a link to any related issues (see above).
  • I have read the Contributing Guide.
  • I have added tests for any new physics that this PR adds to the code.
  • I have tested this PR on my local computer and all tests pass.
  • I have manually triggered the GPU tests with the magic comment /azp run.
  • I have requested a reviewer for this PR.

@BenWibking (Collaborator)

The issue is that Python is not installed. extern_parameters.H is generated by a Python script inside Microphysics.

@chongchonghe (Contributor, Author)

> The issue is that Python is not installed. extern_parameters.H is generated by a Python script inside Microphysics.

I see. I thought that since we are not using matplotlib in GPU runs, we didn't need to install Python.

@BenWibking (Collaborator)

> > The issue is that Python is not installed. extern_parameters.H is generated by a Python script inside Microphysics.
>
> I see. I thought that since we are not using matplotlib in GPU runs, we didn't need to install Python.

No, unfortunately Python is a required build-time dependency now, since it is required by Microphysics.
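
For anyone hitting the same build failure, the container-side fix is to install Python when the image is built. A minimal sketch for an Ubuntu-based image (the actual Dockerfile change in this PR may differ):

# Hypothetical addition to the cuda-container Dockerfile (Ubuntu base):
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*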

@BenWibking (Collaborator)

We should probably update the README to note that Python is required, rather than optional.

@BenWibking (Collaborator)

I've created a PR to update the README: #753

@chongchonghe (Contributor, Author)

@BenWibking did you get it to work? I got the following error:

------
Dockerfile:2
--------------------
   1 |     # Use the NVIDIA CUDA image as the base image
   2 | >>> FROM nvcr.io/nvidia/nvhpc:24.7-devel-cuda12.5-ubuntu22.04
   3 |     # FROM nvcr.io/nvidia/nvhpc:24.9-runtime-cuda11.8-ubuntu22.04
   4 |
--------------------
ERROR: failed to solve: nvcr.io/nvidia/nvhpc:24.7-devel-cuda12.5-ubuntu22.04: failed to resolve source metadata for nvcr.io/nvidia/nvhpc:24.7-devel-cuda12.5-ubuntu22.04: failed to do request: Head "https://nvcr.io/v2/nvidia/nvhpc/manifests/24.7-devel-cuda12.5-ubuntu22.04": net/http: TLS handshake timeout

I also tried a few other images on this page and got the same error.

@BenWibking (Collaborator)

The full list is here: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nvhpc/tags.

I am able to download the nvcr.io/nvidia/nvhpc:24.9-devel-cuda12.6-ubuntu24.04 container:

[41382 ms] Start: Run: docker ps -q -a --filter label=devcontainer.local_folder=/Users/benwibking/quokka
[41399 ms] Start: Run: docker inspect --type image nvcr.io/nvidia/nvhpc:24.9-devel-cuda12.6-ubuntu24.04
[41414 ms] Loading 6 extra certificates from /tmp/vsch/certificates-812b88c53c4680de95d0d10aa2d5e82da39186556a15b6d5a04bc6e202dbb1f6.pem.
[41808 ms] Request 'https://nvcr.io/v2/nvidia/nvhpc/manifests/24.9-devel-cuda12.6-ubuntu24.04' failed
[42011 ms] Request 'https://nvcr.io/v2/nvidia/nvhpc/manifests/24.9-devel-cuda12.6-ubuntu24.04' failed
[42011 ms] Error fetching image details: No manifest found for nvcr.io/nvidia/nvhpc:24.9-devel-cuda12.6-ubuntu24.04.
[42012 ms] Start: Run: docker pull nvcr.io/nvidia/nvhpc:24.9-devel-cuda12.6-ubuntu24.04
24.9-devel-cuda12.6-ubuntu24.04: Pulling from nvidia/nvhpc
6e59cb05818e: Pull complete 
61e6133d186d: Pull complete 
8f4445af7d02: Downloading    159MB/166.6MB

@BenWibking (Collaborator)

Maybe it's a trans-Pacific internet latency issue? Can you try again and see if it works?
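
If the devcontainer build keeps timing out, pre-pulling the image by hand takes the devcontainer tooling out of the loop and makes transient network failures easier to spot. A sketch, using the tag from the Dockerfile above:

# Retry the pull a few times; transient TLS-handshake timeouts often clear up.
for i in 1 2 3; do
  docker pull nvcr.io/nvidia/nvhpc:24.7-devel-cuda12.5-ubuntu22.04 && break
  sleep 30
done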

@chongchonghe (Contributor, Author)

> Maybe it's a trans-Pacific internet latency issue? Can you try again and see if it works?

It's probably an internet issue, because the errors are not consistent across multiple tries.

If you tested it and it worked fine, you can approve this PR if you want. I'll do more tests another time.

@BenWibking BenWibking marked this pull request as ready for review September 27, 2024 18:53
@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Sep 27, 2024
BenWibking previously approved these changes Sep 27, 2024
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Sep 27, 2024
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Sep 27, 2024
@BenWibking (Collaborator) commented Sep 27, 2024

With nvc++, ScalarAdvectionSemiEllipse fails due to an FPE trap:

 1/41 Test  #2: ScalarAdvectionSemiEllipse .......***Failed    3.49 sec
Initializing AMReX (24.09)...
MPI initialized with 1 MPI processes
MPI initialized with thread support level 0
AMReX (24.09) initialized
Erroneous arithmetic operation
See Backtrace.0 file for details

PrimordialChem also fails:

35/41 Test #41: PrimordialChem ...................***Failed    4.14 sec
Initializing AMReX (24.09)...
MPI initialized with 1 MPI processes
MPI initialized with thread support level 0
AMReX (24.09) initialized
Writing plotfile plt00000

[Warning] [Performance] The grid blocking factor (1) is too small for reasonable performance. It should be 32 (or greater) when running on GPUs, and 16 (or greater) when running on CPUs.

[Warning] [Performance] The maximum grid size (1) is too small for reasonable performance. It should be 128 (or greater) when running on GPUs, and 64 (or greater) when running on CPUs.

Coarse STEP 1 at t = 0 (0%) starts ...
Erroneous arithmetic operation
See Backtrace.0 file for details

This can be fixed later.
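
One way to triage this later: AMReX's floating-point exception traps can be relaxed at runtime through its standard amrex.fpe_trap_* inputs, which helps distinguish a genuine numerical bug from an nvc++ code-generation difference (e.g. FMA contraction, cf. the DISABLE_FMAD flag above). A sketch, where the binary and inputs-file names are made up:

# Hypothetical invocation; amrex.fpe_trap_* are standard AMReX runtime inputs.
$ ./ScalarAdvectionSemiEllipse semiellipse.in \
    amrex.fpe_trap_invalid=0 amrex.fpe_trap_zero=0 amrex.fpe_trap_overflow=0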

@chongchonghe (Contributor, Author)

@markkrumholz Can you approve this? This PR needs your approval before it can be merged, because both Ben and I have pushed commits.

@chongchonghe chongchonghe added this pull request to the merge queue Sep 29, 2024
Merged via the queue into development with commit c77f80c Sep 29, 2024
20 checks passed
@chongchonghe chongchonghe deleted the chong/dual-container branch October 6, 2024 02:22