Add GPU CI/CD #253

Merged
sarahyurick merged 34 commits into NVIDIA:main
Oct 9, 2024

Collaborator

@sarahyurick sarahyurick commented Sep 18, 2024

This PR enables gpuCI for our PyTests marked with @pytest.mark.gpu.

To trigger the GPU tests, the gpuci label has to be added to the PR. Only users with write access can add labels to PRs.

If more commits are pushed to the PR, the gpuci label has to be removed and re-added to run the GPU tests again. This is for security reasons: for example, an outside contributor could open a PR, wait for a user with write access to add the gpuci label so that the GPU tests run on the reviewed code, and then push malicious code. As long as no one with write access re-adds the gpuci label, the newly pushed code won't run on the GPU runner at all.
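To make the policy above concrete, here is a minimal Python sketch of the gating decision. It is only a model: the real trigger lives in gpuci.yml, which is not shown in this thread. The function name `should_run_gpu_tests` and the event dictionaries are hypothetical, loosely shaped like GitHub's pull_request webhook payload (an `action` field plus a `label` object with a `name`).

```python
def should_run_gpu_tests(event: dict, required_label: str = "gpuci") -> bool:
    """Decide whether a pull_request event should start the GPU tests.

    Only the act of adding the label counts. A new push arrives as a
    "synchronize" event and is ignored, even if the label is still attached.
    """
    if event.get("action") != "labeled":
        return False
    label = event.get("label") or {}
    return label.get("name") == required_label


# A maintainer adds the label: the GPU tests run.
labeled = {"action": "labeled", "label": {"name": "gpuci"}}
# The contributor pushes another commit: nothing runs until the label is re-added.
pushed = {"action": "synchronize"}

print(should_run_gpu_tests(labeled))  # True
print(should_run_gpu_tests(pushed))   # False
```

Keying the decision on the labeling action, rather than on the label merely being present, is what forces a maintainer to look at the latest commits before each GPU run.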

Collaborator Author

@sarahyurick sarahyurick left a comment

I need to fix the trailing whitespace in gpuci.yml and make the spacing in auto-label.yml consistent with gpuci.yml.

Also, maybe I should rename it to auto-label-gpuci.yml.


- name: Run PyTests with GPU mark
  run: |
    python -m pytest -m gpu
Collaborator Author

We run GPU CI only on the PyTests that carry the pytest.mark.gpu decorator. We already have quite a few of them in our repository.
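For readers unfamiliar with the marker, the following is a hypothetical illustration of how a test opts in to the GPU run. It is not taken from the repository: both test names are invented, and the use of cuDF is an assumption made only to give the marked test something GPU-specific to do.

```python
import pytest


@pytest.mark.gpu
def test_series_doubling_on_gpu():
    # Hypothetical GPU test: skip cleanly on machines without cuDF installed.
    cudf = pytest.importorskip("cudf")
    series = cudf.Series([1, 2, 3])
    assert (series * 2).sum() == 12


def test_plain_python_doubling():
    # Unmarked test: deselected by `python -m pytest -m gpu`.
    assert [2 * v for v in [1, 2, 3]] == [2, 4, 6]
```

With `python -m pytest -m gpu`, only the first test is selected. pytest emits a warning for unregistered markers, so a project using this pattern would normally declare `gpu` under `markers` in its pytest configuration.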

Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added the gpuci Run GPU CI/CD on PR label Sep 19, 2024
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Sep 19, 2024
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Sep 19, 2024
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Sep 19, 2024
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Sep 19, 2024
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Sep 19, 2024
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Sep 19, 2024
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Sep 19, 2024
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Oct 7, 2024
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Oct 7, 2024
Collaborator

@praateekmahajan praateekmahajan left a comment

This is great and helps us catch issues much earlier! Thank you for pushing through this!

ayushdg previously requested changes Oct 7, 2024
Collaborator

@ayushdg ayushdg left a comment

Minor suggestions

@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Oct 7, 2024
@sarahyurick
Collaborator Author

@ko3n1g is disconnecting our runner for a bit. Should be back up tomorrow, then I can rerun the checks and merge.

Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Oct 8, 2024
Collaborator

@ko3n1g ko3n1g left a comment

This is important since it impacts other projects that use this infra

Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Oct 9, 2024
@sarahyurick sarahyurick dismissed ayushdg’s stale review October 9, 2024 21:38

Addressed review

@sarahyurick sarahyurick merged commit 0bbdc06 into NVIDIA:main Oct 9, 2024
5 checks passed
@sarahyurick sarahyurick deleted the gpuci branch October 25, 2024 20:45
vinay-raman pushed a commit to vinay-raman/NeMo-Curator that referenced this pull request Nov 12, 2024
* add yaml files to gh workflows

Signed-off-by: Sarah Yurick <[email protected]>

* edit spacing

Signed-off-by: Sarah Yurick <[email protected]>

* no cache dir

Signed-off-by: Sarah Yurick <[email protected]>

* cmake

Signed-off-by: Sarah Yurick <[email protected]>

* fasttext wheel

Signed-off-by: Sarah Yurick <[email protected]>

* python3 dev

Signed-off-by: Sarah Yurick <[email protected]>

* get update

Signed-off-by: Sarah Yurick <[email protected]>

* c installs

Signed-off-by: Sarah Yurick <[email protected]>

* setuptools pip upgrade

Signed-off-by: Sarah Yurick <[email protected]>

* use stable rapids

Signed-off-by: Sarah Yurick <[email protected]>

* remove wheel see what happens

Signed-off-by: Sarah Yurick <[email protected]>

* edit readme and remove autolabel for now

Signed-off-by: Sarah Yurick <[email protected]>

* add container logic

Signed-off-by: Sarah Yurick <[email protected]>

* add dockerfile and oliver's other suggestions

Signed-off-by: Sarah Yurick <[email protected]>

* fix run format

Signed-off-by: Sarah Yurick <[email protected]>

* forked repo url

Signed-off-by: Sarah Yurick <[email protected]>

* docker run with all gpus

Signed-off-by: Sarah Yurick <[email protected]>

* remove running container

Signed-off-by: Sarah Yurick <[email protected]>

* Update .github/workflows/gpuci.yml

Co-authored-by: oliver könig <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>

* re add test

Signed-off-by: Sarah Yurick <[email protected]>

* debug attempt

Signed-off-by: Sarah Yurick <[email protected]>

* remove it

Signed-off-by: Sarah Yurick <[email protected]>

* add library path

Signed-off-by: Sarah Yurick <[email protected]>

* remove nvcc check

Signed-off-by: Sarah Yurick <[email protected]>

* more debugging

Signed-off-by: Sarah Yurick <[email protected]>

* specify curator dir

Signed-off-by: Sarah Yurick <[email protected]>

* more debugging

Signed-off-by: Sarah Yurick <[email protected]>

* try pytorch container

Signed-off-by: Sarah Yurick <[email protected]>

* use rapids container

Signed-off-by: Sarah Yurick <[email protected]>

* fix RUN instructions

Signed-off-by: Sarah Yurick <[email protected]>

* add comments and review suggestions

Signed-off-by: Sarah Yurick <[email protected]>

* update runners

Signed-off-by: Sarah Yurick <[email protected]>

* move args

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Co-authored-by: oliver könig <[email protected]>
Signed-off-by: Vinay Raman <[email protected]>
vinay-raman pushed a commit to vinay-raman/NeMo-Curator that referenced this pull request Nov 13, 2024
Labels
gpuci Run GPU CI/CD on PR
5 participants