Add GPU-enabled CI #170

Merged
merged 24 commits into main on Jan 30, 2025

Conversation

mattwthompson
Member

Fixes #

Changes made in this Pull Request:

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@codecov-commenter

codecov-commenter commented Jan 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.53%. Comparing base (350dcbd) to head (a6340d8).
Report is 3 commits behind head on main.


@mattwthompson
Member Author

The default disk allocation isn't enough for this conda environment. I checked with @ethanholz and the runner action doesn't expose an option for that yet. There might be a release (0.5.0?) in a week or so that includes this among other changes.

There might also be a way to slim the dependencies down; I have not looked into this yet.

@mattwthompson
Member Author

They made the release last Friday; upgrading to it is the next step here.

@ethanholz

As noted in our release notes, you will need to update the permissions on the AWS side to enable this functionality.

@mattwthompson
Member Author

I think I need somebody (@j-wags?) to add me to our AWS infrastructure so that I can do that.

The doc page is here: https://github.com/omsf-eco-infra/gha-runner/blob/v0.4.0/docs/aws.md

And the relevant change is: omsf-eco-infra/gha-runner@8be6d03

@j-wags
Member

j-wags commented Jan 17, 2025

I think I've updated this, could you let me know if it doesn't work?

@mattwthompson
Member Author

Unfortunately it doesn't seem to be working - do we need to update something in the YAML?

@mattwthompson mentioned this pull request Jan 17, 2025
@ethanholz

Yes. As specified in the docs, you would need to choose a disk size larger than the default size of the AMI.
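For reference, a minimal sketch of what that could look like in the workflow job that starts the runner. The instance type, AMI, region, secrets, and especially the disk-size input name are assumptions for illustration rather than values from this repo - check the gha-runner docs for the exact keys:

```yaml
  start-aws-runner:
    runs-on: ubuntu-latest
    steps:
      - name: Start self-hosted EC2 runner
        id: start-runner
        # Hypothetical sketch: pin to the release that exposes a disk-size option
        uses: omsf-eco-infra/gha-runner@v0.5.0
        with:
          # GPU-capable instance; placeholder AMI and region
          aws_image_id: ami-0123456789abcdef0
          aws_instance_type: g4dn.xlarge
          aws_region_name: us-east-1
          # Root volume (GiB) larger than the AMI default, since the default
          # disk was too small for this conda environment.
          # "aws_root_device_size" is an assumed input name, not confirmed here.
          aws_root_device_size: 64
        env:
          GH_PAT: ${{ secrets.GH_PAT }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```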

@mattwthompson
Member Author

Ah, thanks - forgot about that.

@jameseastwood assigned lilyminium and unassigned j-wags Jan 17, 2025
@mattwthompson
Member Author

@lilyminium, something looks wrong with how Torch is set up here even though it's listed in the env - could you glance through briefly and see if anything is obviously wrong?

@lilyminium
Collaborator

I think that did it, Matt 🎉 Will the dgl channel have to keep getting updated?

@mattwthompson
Member Author

Wow ... I context-switched without coming back to see if the failures were genuine.

I do think the dgl channel will need to keep being updated - but I can live with that.
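For context, a rough sketch of the kind of channel pin being discussed. The channel label, environment name, and package list are illustrative (DGL publishes its CUDA builds under per-version channel labels), not copied from this repo's actual environment file:

```yaml
# Illustrative GPU test environment - not the repo's actual file
name: nagl-gpu-test
channels:
  - conda-forge
  # DGL publishes CUDA builds under versioned channel labels, so this is the
  # line that tends to need updating as new torch/CUDA combinations come out
  - dglteam/label/th24_cu124
dependencies:
  - python
  - pytorch-gpu  # conda-forge's CUDA-enabled PyTorch metapackage
  - dgl
  - pytest
```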

@mattwthompson marked this pull request as ready for review January 29, 2025 22:48
@lilyminium (Collaborator) left a comment

A few questions, Matt:

  • all the other CI was deleted -- was that just to help with debugging, or are we planning to move all CI onto GPU?
  • I assume the intent is to run all tests on GPU, not just the one trial test?

@lilyminium
Collaborator

Just took the liberty of reverting the additional test I added to check for CUDA availability.

@mattwthompson
Member Author

> all the other CI was deleted

Just forgot to put it back.

> I assume the intent is to run all tests on GPU, not just the one trial test?

This was intentional, but only because that test was tagged as requiring a GPU. I guess we could run the entire test suite? (Are there other tests that are GPU-only, in your memory?)
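To make the distinction concrete, here is a hedged sketch of the two options as workflow steps; the `gpu` marker name and test path are assumptions for illustration, not this repo's actual configuration:

```yaml
      # Option 1: only tests explicitly tagged as requiring a GPU
      - name: Run GPU-marked tests
        run: python -m pytest -v -m gpu openff/nagl/

      # Option 2: the entire suite on the GPU runner, which would also catch
      # GPU-only regressions in tests that are not explicitly tagged
      - name: Run full test suite
        run: python -m pytest -n 4 -v openff/nagl/
```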

@mattwthompson marked this pull request as draft January 30, 2025 12:50
Comment on lines -76 to -77
```yaml
        run: |
          python -m pytest -n 4 -v --cov=openff/nagl --cov-config=setup.cfg --cov-append --cov-report=xml --color=yes openff/nagl/
```
mattwthompson (Member Author)

Small quirk: since setup.cfg exists, this won't throw an error (until that changes, anyway), but it won't pick up any actual config, as that now lives in pyproject.toml. A leftover detail from #169.
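A sketch of the corrected step, assuming the coverage settings now live in pyproject.toml (per #169). Alternatively, the flag could probably be dropped entirely, since coverage.py can pick up [tool.coverage] settings from pyproject.toml on its own:

```yaml
        run: |
          python -m pytest -n 4 -v --cov=openff/nagl --cov-config=pyproject.toml --cov-append --cov-report=xml --color=yes openff/nagl/
```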

@mattwthompson linked an issue Jan 30, 2025 that may be closed by this pull request
@mattwthompson
Member Author

Not sure why the other CI was failing earlier ... re-kicked it in case it was just a bad channel. If all checks (except docs) are green, this is good to go in my view.

@mattwthompson marked this pull request as ready for review January 30, 2025 23:02
@lilyminium
Collaborator

> (Are there other tests that are GPU-only in your memory?)

Not in the test suite, but there was a GPU-only bug previously (#81), so it would be handy to run all tests to safeguard against that in the future!

@lilyminium (Collaborator) left a comment

LGTM -- thanks @mattwthompson for setting this up!!

@mattwthompson merged commit fbcf4f2 into main Jan 30, 2025
136 of 137 checks passed
Successfully merging this pull request may close these issues: Add GPU CI

5 participants