Add GPU-enabled CI #170
Conversation
The default disk allocation isn't enough for this conda environment. I checked with @ethanholz and the runner action doesn't expose an option for that yet. There might be a release (0.5.0?) in a week or so that includes this among other changes. There might also be a way to slim the dependencies down, but I haven't looked into that yet.
They made the release last Friday; upgrading to it is the next step here.
As noted in our release notes, you will need to update the permissions on the AWS side to enable this functionality.
I think I need somebody (@j-wags?) to add me into our AWS infrastructure to enable me to do that. The doc page is here: https://github.com/omsf-eco-infra/gha-runner/blob/v0.4.0/docs/aws.md and the relevant change is omsf-eco-infra/gha-runner@8be6d03.
I think I've updated this; could you let me know if it doesn't work?
Unfortunately it doesn't seem to be working; do we need to update something in the YAML?
Yes. As specified in the docs, you need to choose a disk size larger than the default size of the AMI.
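For reference, a rough sketch of what the start-runner step could look like once the workflow is bumped to the new release. The input names (especially the disk-size one) and all placeholder values below are assumptions, so check the gha-runner v0.5.0 docs for the actual names:

```yaml
# Sketch only: input names are assumptions, not confirmed against the
# gha-runner v0.5.0 docs; AMI/instance/secret values are placeholders.
jobs:
  start-aws-runner:
    runs-on: ubuntu-latest
    steps:
      - name: Start GPU runner
        uses: omsf-eco-infra/gha-runner@v0.5.0
        with:
          aws_image_id: ami-0123456789abcdef0   # placeholder GPU AMI
          aws_instance_type: g4dn.xlarge        # placeholder instance type
          aws_root_device_size: 64              # GiB; must exceed the AMI's default root volume
        env:
          GH_PAT: ${{ secrets.GH_PAT }}         # hypothetical secret name
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```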
Ah thanks, I forgot about that.
@lilyminium, something looks wrong with how Torch is set up here even though it's listed in the env. Could you glance through briefly and see if anything is obviously wrong?
I think that did it, Matt 🎉 Will the dgl channel have to keep getting updated?
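For context, the dgl conda packages live on the `dglteam` channel under CUDA-specific labels, so the pin does tend to need bumping when new builds come out. A sketch of the relevant environment-file excerpt; the exact label string below is an assumption and will vary with the dgl/CUDA versions in use:

```yaml
# Illustrative excerpt of a conda environment file; the dgl channel label
# is an assumption and changes between releases, which is why it needs
# periodic updating.
channels:
  - dglteam/label/cu121   # hypothetical label; bump when dgl publishes new CUDA builds
  - conda-forge
dependencies:
  - pytorch
  - dgl
```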
Wow ... I context-switched without coming back to see if the failures were genuine. I do think the ...
A few questions, Matt:
- all the other CI was deleted -- was that just to help with debugging, or are we planning to move all CI onto GPU?
- I assume the intent is to run all tests on GPU, not just the one trial test?
This reverts commit 6c5f42c.
Just took the liberty of reverting the additional test I added to check for CUDA availability.
I just forgot to put it back.
This was intentional, but only because that test was tagged as requiring a GPU. I guess we could run the entire test suite? (Are there other GPU-only tests that you remember?)
run: |
  python -m pytest -n 4 -v --cov=openff/nagl --cov-config=setup.cfg --cov-append --cov-report=xml --color=yes openff/nagl/
Small quirk: since setup.cfg exists, this won't throw an error (until that changes, anyway), but it won't pick up any actual configs, as they're in pyproject.toml. A leftover detail from #169.
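If it's worth fixing here, one possible tweak (assuming the coverage settings live under `[tool.coverage]` in pyproject.toml) is to point the flag there, or drop it and let coverage.py auto-discover the file:

```yaml
# Possible fix, not the actual change in this PR: point --cov-config at
# pyproject.toml (where the coverage settings live) instead of setup.cfg.
run: |
  python -m pytest -n 4 -v --cov=openff/nagl --cov-config=pyproject.toml --cov-append --cov-report=xml --color=yes openff/nagl/
```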
Not sure why the other CI was failing earlier ... I re-kicked it in case it was just a bad channel. If all checks (except docs) are green, this is good to go in my view.
Not in the test suite, but there was a GPU-only bug previously (#81), so it would be handy to run all tests to safeguard against that in the future!
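In that case the GPU job's test step could simply run the full suite rather than a marker-filtered subset. A minimal sketch, assuming a hypothetical `gpu` pytest marker is what currently selects the CUDA test:

```yaml
# Sketch of running the full test suite on the GPU runner instead of only
# GPU-marked tests; the `gpu` marker name here is an assumption.
- name: Run full test suite on GPU runner
  run: |
    python -m pytest -n 4 -v --color=yes openff/nagl/        # full suite
    # instead of: python -m pytest -m gpu -v openff/nagl/    # GPU-only subset
```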
LGTM -- thanks @mattwthompson for setting this up!!
Fixes #
Changes made in this Pull Request:
PR Checklist