Update A3 mega blueprint to use Slurm-GCP 6.5.12 #2763

Merged

tpdownes
Member

@tpdownes tpdownes commented Jul 16, 2024

This PR updates the A3 Mega slurm image building solution in 2 ways:

  • it will build from the latest tag of GoogleCloudPlatform/slurm-gcp
  • it will "hold" several Google custom packages at the versions shipped in the base Debian-12 image

The first change helps keep the Terraform code aligned with the Python autoscaling code added during image building. We previously (#2746) re-enabled enable_devel: true, which should avoid any problems, but alignment is still preferable.

The second change is critical to address problems introduced by GoogleCloudPlatform/guest-agent#401, which perfectly describes the A3 Mega solution: an ethernet device naming problem stemming from building an image on a Packer VM without local SSD and then provisioning A3 Mega VMs (with local SSD) using the image. The resulting compute nodes cannot join the network or the cluster.
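
As a rough illustration of the "hold" described above: on Debian, pinning installed packages at their current version is usually done with `apt-mark hold`. The sketch below only builds the command lines and does not run them. The package names and the helper `apt_hold_commands` are assumptions for illustration; the actual list held by this PR lives in the blueprint and is not reproduced here.

```python
import shlex

# Hypothetical package names, for illustration only. The packages actually
# held by this PR are defined in the blueprint and are not shown here.
HELD_PACKAGES = ["google-guest-agent", "google-osconfig-agent"]


def apt_hold_commands(packages):
    """Build the apt-mark commands that hold packages and list the holds.

    Returns a list of argv lists; nothing is executed here.
    """
    if not packages:
        raise ValueError("at least one package name is required")
    hold = ["apt-mark", "hold", *packages]  # prevent upgrades of these packages
    show = ["apt-mark", "showhold"]         # print held packages to verify
    return [hold, show]


if __name__ == "__main__":
    for argv in apt_hold_commands(HELD_PACKAGES):
        print(shlex.join(argv))
```

In an image build these commands would run as root before any step that upgrades packages, so the held packages stay at the versions in the base image.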

Before this PR, we observed fewer than 10% of nodes joining initially, with the remaining nodes rarely succeeding on each subsequent ~1000s retry cycle. After this PR, we see 100% of nodes joining on the first try.

Future work can be done to update the base debian-12 image (from April 2024) to one that includes the fix to GoogleCloudPlatform/guest-agent#401.

Submission Checklist

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cloud HPC Toolkit Contribution guidelines

@tpdownes tpdownes self-assigned this Jul 16, 2024
@tpdownes tpdownes added the release-improvements Added to release notes under the "Improvements" heading. label Jul 16, 2024
@tpdownes tpdownes force-pushed the a3mega_slurm_update branch from ce83dfb to 7b286e7 Compare July 17, 2024 17:58
@tpdownes tpdownes requested a review from harshthakkar01 July 17, 2024 18:34
@tpdownes tpdownes assigned harshthakkar01 and unassigned tpdownes Jul 17, 2024
@tpdownes tpdownes marked this pull request as ready for review July 17, 2024 18:34
tpdownes added 2 commits July 17, 2024 18:42
Fix the versions for local Google guest VM services so that they do not
upgrade to versions that are known to have boot-time issues for the
following combination:

- building image using Packer on a build VM without local NVMe devices
- final image used on a VM with local NVMe devices

In this combination, network configurations persist that do not match
the final naming conventions of the network interfaces because of
differing PCI bus layout.
@tpdownes tpdownes force-pushed the a3mega_slurm_update branch from 7b286e7 to bc974c8 Compare July 17, 2024 18:42
@tpdownes tpdownes changed the base branch from develop to release-candidate July 17, 2024 18:42
@tpdownes tpdownes enabled auto-merge July 17, 2024 19:05
@harshthakkar01
Contributor

Ran NCCL tests and verified performance is as per our expectations.

@tpdownes tpdownes merged commit f537091 into GoogleCloudPlatform:release-candidate Jul 17, 2024
10 of 51 checks passed
@tpdownes tpdownes deleted the a3mega_slurm_update branch July 17, 2024 19:50
@ankitkinra ankitkinra mentioned this pull request Jul 19, 2024