Add GPU-enabled CI #170

Merged
merged 24 commits into main on Jan 30, 2025

Conversation

mattwthompson
Member

Fixes #

Changes made in this Pull Request:

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@codecov-commenter

codecov-commenter commented Jan 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.53%. Comparing base (350dcbd) to head (a6340d8).
Report is 3 commits behind head on main.


@mattwthompson
Member Author

The default disk allocation isn't enough for this conda environment. I checked with @ethanholz and the runner action doesn't expose an option for that yet. There might be a release (0.5.0?) in a week or so that includes this among other changes.

There might also be a way to slim the dependencies down; I have not looked into this yet.

@mattwthompson
Member Author

They made the release last Friday; upgrading to it is the next step here.

@ethanholz

As noted in our release notes, you will need to update the permissions on the AWS side to enable this functionality.

@mattwthompson
Member Author

I think I need somebody (@j-wags?) to add me to our AWS infrastructure so that I can do that.

The doc page is here: https://github.com/omsf-eco-infra/gha-runner/blob/v0.4.0/docs/aws.md

And the relevant change is: omsf-eco-infra/gha-runner@8be6d03

@j-wags
Member

j-wags commented Jan 17, 2025

I think I've updated this, could you let me know if it doesn't work?

@mattwthompson
Member Author

Unfortunately it doesn't seem to be working - do we need to update something in the YAML?

@mattwthompson mentioned this pull request Jan 17, 2025
@ethanholz

Yes. As specified in the docs, you would need to choose a disk size larger than the default size of the AMI.
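For reference, a minimal sketch of what that could look like in the workflow job that starts the runner. The instance type, AMI, region, secrets, and especially the disk-size input name are assumptions for illustration rather than values from this repo - check the gha-runner docs for the exact keys:

```yaml
  start-aws-runner:
    runs-on: ubuntu-latest
    steps:
      - name: Start self-hosted EC2 runner
        id: start-runner
        # Hypothetical sketch: pin to the release that exposes a disk-size option
        uses: omsf-eco-infra/gha-runner@v0.5.0
        with:
          # GPU-capable instance; placeholder AMI and region
          aws_image_id: ami-0123456789abcdef0
          aws_instance_type: g4dn.xlarge
          aws_region_name: us-east-1
          # Root volume (GiB) larger than the AMI default, since the default
          # disk was too small for this conda environment.
          # "aws_root_device_size" is an assumed input name, not confirmed here.
          aws_root_device_size: 64
        env:
          GH_PAT: ${{ secrets.GH_PAT }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```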

@mattwthompson
Member Author

Ah, thanks - forgot about that.

@jameseastwood assigned lilyminium and unassigned j-wags Jan 17, 2025
@mattwthompson
Member Author

@lilyminium, something looks wrong with how Torch is set up here even though it's listed in the env - could you glance through briefly and see if anything is obviously wrong?

@lilyminium
Collaborator

I think that did it, Matt 🎉 Will the dgl channel have to keep getting updated?

@mattwthompson
Member Author

Wow ... I context-switched without coming back to see if the failures were genuine.

I do think the dgl channel will need to keep being updated - but I can live with that.
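For context, a rough sketch of the kind of channel pin being discussed. The channel label, environment name, and package list are illustrative (DGL publishes its CUDA builds under per-version channel labels), not copied from this repo's actual environment file:

```yaml
# Illustrative GPU test environment - not the repo's actual file
name: nagl-gpu-test
channels:
  - conda-forge
  # DGL publishes CUDA builds under versioned channel labels, so this is the
  # line that tends to need updating as new torch/CUDA combinations come out
  - dglteam/label/th24_cu124
dependencies:
  - python
  - pytorch-gpu  # conda-forge's CUDA-enabled PyTorch metapackage
  - dgl
  - pytest
```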

@mattwthompson marked this pull request as ready for review January 29, 2025 22:48
@lilyminium (Collaborator) left a comment

A few questions, Matt:

  • all the other CI was deleted -- was that just to help with debugging, or are we planning to move all CI onto GPU?
  • I assume the intent is to run all tests on GPU, not just the one trial test?

@lilyminium
Collaborator

Just took the liberty of reverting the additional test I added to check for CUDA availability.

@mattwthompson
Member Author

> all the other CI was deleted

Just forgot to put it back.

> I assume the intent is to run all tests on GPU, not just the one trial test?

This was intentional, but only because that test was tagged as requiring a GPU. I guess we could run the entire test suite? (Are there other tests that are GPU-only, in your memory?)
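To make the distinction concrete, here is a hedged sketch of the two options as workflow steps; the `gpu` marker name and test path are assumptions for illustration, not this repo's actual configuration:

```yaml
      # Option 1: only tests explicitly tagged as requiring a GPU
      - name: Run GPU-marked tests
        run: python -m pytest -v -m gpu openff/nagl/

      # Option 2: the entire suite on the GPU runner, which would also catch
      # GPU-only regressions in tests that are not explicitly tagged
      - name: Run full test suite
        run: python -m pytest -n 4 -v openff/nagl/
```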

@mattwthompson marked this pull request as draft January 30, 2025 12:50
Comment on lines -76 to -77
```yaml
        run: |
          python -m pytest -n 4 -v --cov=openff/nagl --cov-config=setup.cfg --cov-append --cov-report=xml --color=yes openff/nagl/
```
mattwthompson (Member Author)

Small quirk: since setup.cfg exists, this won't throw an error (until that changes, anyway), but it won't pick up any actual config, as that now lives in pyproject.toml. A leftover detail from #169.
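A sketch of the corrected step, assuming the coverage settings now live in pyproject.toml (per #169). Alternatively, the flag could probably be dropped entirely, since coverage.py can pick up [tool.coverage] settings from pyproject.toml on its own:

```yaml
        run: |
          python -m pytest -n 4 -v --cov=openff/nagl --cov-config=pyproject.toml --cov-append --cov-report=xml --color=yes openff/nagl/
```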

@mattwthompson linked an issue Jan 30, 2025 that may be closed by this pull request
@mattwthompson
Member Author

Not sure why the other CI was failing earlier ... re-kicked it in case it was just a bad channel. If all checks (except docs) are green, this is good to go in my view.

@mattwthompson marked this pull request as ready for review January 30, 2025 23:02
@lilyminium
Collaborator

> (Are there other tests that are GPU-only in your memory?)

Not in the test suite, but there was a GPU-only bug previously (#81), so it would be handy to run all tests to safeguard against that in the future!

@lilyminium (Collaborator) left a comment

LGTM -- thanks @mattwthompson for setting this up!!

@mattwthompson merged commit fbcf4f2 into main Jan 30, 2025
136 of 137 checks passed
Successfully merging this pull request may close these issues: Add GPU CI

5 participants