feat(gpu): add shared nvidia boost clock logic #2014

Merged: 3 commits merged on Nov 12, 2024

@ndbaker1 ndbaker1 commented Oct 18, 2024

Issue #, if available:

Description of changes:

Adds default clock rates for NVIDIA GPUs, queried directly from nvidia-smi.
The logic is shared by both al2 and al2023.
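A minimal sketch of what the shared logic could look like, based only on the commands visible in the journal output under Testing Done. The actual script in this PR is not shown here, so the query fields (`clocks.max.memory`, `clocks.max.graphics`), the `NVIDIA_SMI` override variable, and the helper names `format_clocks` and `set_default_clocks` are assumptions for illustration. The journal shows the real service invoking nvidia-smi through sudo, which this sketch omits.

```shell
# Assumption: allow the binary to be overridden, e.g. for dry runs.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

# Turn one CSV line such as "6251, 1710" (memory, graphics in MHz)
# into the "6251,1710" pair that --applications-clocks expects.
format_clocks() {
  printf '%s' "$1" | tr -d '[:space:]'
}

# Hypothetical helper: query the first GPU's maximum clocks and apply them.
set_default_clocks() {
  raw="$("$NVIDIA_SMI" --query-gpu=clocks.max.memory,clocks.max.graphics \
    --format=csv,noheader,nounits | head -n 1)"
  # Bail out if the query returned nothing (no GPU, driver not loaded).
  [ -n "$raw" ] || return 1

  "$NVIDIA_SMI" --persistence-mode=1
  "$NVIDIA_SMI" --auto-boost-default=0
  "$NVIDIA_SMI" --applications-clocks "$(format_clocks "$raw")"
}
```

Reading only the first line assumes every GPU on the instance is the same model, which holds for the single-GPU g5.8xlarge used in testing but would need checking elsewhere.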

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Testing Done


built an al2023 ami using:

make k8s=1.31 ami_name=al23-nvidia-clocks-unit os_distro=al2023 enable_accelerator=nvidia enable_efa=true nvidia_driver_major_version=560

launched a g5.8xlarge and checked the status of the service:

[root@ip-172-31-6-36 bin]# systemctl status set-nvidia-clocks
● set-nvidia-clocks.service - Configure NVIDIA GPU clock rate
     Loaded: loaded (/etc/systemd/system/set-nvidia-clocks.service; enabled; preset: disabled)
     Active: active (exited) since Sat 2024-10-19 21:16:51 UTC; 1min 29s ago
   Main PID: 6301 (code=exited, status=0/SUCCESS)
        CPU: 164ms

Oct 19 21:16:50 ip-172-31-6-36.us-west-2.compute.internal sudo[6311]:     root : PWD=/ ; USER=root ; COMMAND=/usr/bin/nvidia-smi --persistence-mode=1
Oct 19 21:16:50 ip-172-31-6-36.us-west-2.compute.internal sudo[6311]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Oct 19 21:16:50 ip-172-31-6-36.us-west-2.compute.internal sudo[6311]: pam_unix(sudo:session): session closed for user root
Oct 19 21:16:51 ip-172-31-6-36.us-west-2.compute.internal sudo[6378]:     root : PWD=/ ; USER=root ; COMMAND=/usr/bin/nvidia-smi --auto-boost-default=0
Oct 19 21:16:51 ip-172-31-6-36.us-west-2.compute.internal sudo[6378]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Oct 19 21:16:51 ip-172-31-6-36.us-west-2.compute.internal sudo[6378]: pam_unix(sudo:session): session closed for user root
Oct 19 21:16:51 ip-172-31-6-36.us-west-2.compute.internal sudo[6381]:     root : PWD=/ ; USER=root ; COMMAND=/usr/bin/nvidia-smi --applications-clocks 6251,1710
Oct 19 21:16:51 ip-172-31-6-36.us-west-2.compute.internal sudo[6381]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=0)
Oct 19 21:16:51 ip-172-31-6-36.us-west-2.compute.internal sudo[6381]: pam_unix(sudo:session): session closed for user root
Oct 19 21:16:51 ip-172-31-6-36.us-west-2.compute.internal systemd[1]: Finished set-nvidia-clocks.service - Configure NVIDIA GPU clock rate.
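The journal confirms the commands ran, but not that the GPU kept the requested values. A hedged sketch of a follow-up check: the helper name `clocks_applied` is hypothetical, and the `clocks.applications.*` query fields in the usage comment are assumed to be available on the installed driver.

```shell
# Compare an expected "memory,graphics" pair against one CSV line of
# nvidia-smi query output, ignoring the whitespace nvidia-smi inserts.
clocks_applied() {
  expected="$1"
  actual="$(printf '%s' "$2" | tr -d '[:space:]')"
  [ "$expected" = "$actual" ]
}

# Usage on a GPU instance (not executed here):
#   clocks_applied "6251,1710" "$(nvidia-smi \
#     --query-gpu=clocks.applications.memory,clocks.applications.graphics \
#     --format=csv,noheader,nounits | head -n 1)" && echo "clocks applied"
```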

built an al2 ami with:

make k8s=1.31 ami_name=al2-nvidia-clocks-unit os_distro=al2

launched an instance with the al2 ami and checked that the unit was not started:

[root@ip-172-31-6-246 bin]# systemctl status set-nvidia-clocks
● set-nvidia-clocks.service - Configure NVIDIA GPU clock rate
   Loaded: loaded (/etc/systemd/system/set-nvidia-clocks.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
Condition: start condition failed at Mon 2024-10-21 04:30:15 UTC; 1min 32s ago

Oct 21 04:30:14 localhost systemd[1]: Cannot add dependency job for unit set-nvidia-clocks.service, ignoring: Unit not found.

since this actually failed to start because nvidia-persistenced.service was missing, i removed the After and Requires clauses to exercise the ConditionPathExists condition on its own:

[root@ip-172-31-6-246 bin]# systemctl status set-nvidia-clocks
● set-nvidia-clocks.service - Configure NVIDIA GPU clock rate
   Loaded: loaded (/etc/systemd/system/set-nvidia-clocks.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
Condition: start condition failed at Mon 2024-10-21 04:35:42 UTC; 1s ago
           ConditionPathExists=/usr/bin/nvidia-smi was not met

Oct 21 04:30:14 localhost systemd[1]: Cannot add dependency job for unit set-nvidia-clocks.service, ignoring: Unit not found.
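The same check can be scripted instead of read off the status output. This is a sketch, assuming `systemctl show --property=ConditionResult` prints a single `ConditionResult=yes` or `ConditionResult=no` line; the helper names are hypothetical.

```shell
# Succeeds only when systemd reports that the unit's start conditions passed.
condition_met() {
  case "$1" in
    ConditionResult=yes) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: ask systemd about a unit by name.
unit_condition_met() {
  condition_met "$(systemctl show "$1" --property=ConditionResult)"
}

# Usage (not executed here):
#   unit_condition_met set-nvidia-clocks.service || echo "skipped: condition not met"
```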

See this guide for recommended testing for PRs. Some tests may not apply. Completing tests and providing additional validation steps are not required, but it is recommended and may reduce review time and time to merge.

@ndbaker1 ndbaker1 marked this pull request as ready for review October 18, 2024 20:20
@ndbaker1 ndbaker1 changed the title feat: add shared nvidia boost clock logic feat(gpu): add shared nvidia boost clock logic Oct 21, 2024
After=nvidia-persistenced.service
Requires=nvidia-persistenced.service

ConditionPathExists=/usr/bin/nvidia-smi

loooove that

@cartermckinnon

/ci

github-actions bot commented Nov 9, 2024

@cartermckinnon roger that! I've dispatched a workflow. 👍

github-actions bot commented Nov 9, 2024

@cartermckinnon the workflow that you requested has completed. 🎉

| AMI variant | Build | Test |
|---|---|---|
| 1.24 / al2 | success ✅ | success ✅ |
| 1.24 / al2023 | success ✅ | success ✅ |
| 1.25 / al2 | success ✅ | success ✅ |
| 1.25 / al2023 | success ✅ | success ✅ |
| 1.26 / al2 | success ✅ | success ✅ |
| 1.26 / al2023 | success ✅ | success ✅ |
| 1.27 / al2 | success ✅ | success ✅ |
| 1.27 / al2023 | success ✅ | success ✅ |
| 1.28 / al2 | success ✅ | success ✅ |
| 1.28 / al2023 | success ✅ | success ✅ |
| 1.29 / al2 | success ✅ | success ✅ |
| 1.29 / al2023 | success ✅ | success ✅ |
| 1.30 / al2 | success ✅ | success ✅ |
| 1.30 / al2023 | success ✅ | success ✅ |
| 1.31 / al2 | success ✅ | success ✅ |
| 1.31 / al2023 | success ✅ | success ✅ |

@cartermckinnon cartermckinnon merged commit 6b5a944 into awslabs:main Nov 12, 2024
10 checks passed
@ndbaker1 ndbaker1 deleted the gpu-clocks branch November 12, 2024 00:59