
Fix GHA Linux GPU Job not running tests on CUDA #6957

Closed
vfdev-5 opened this issue Nov 16, 2022 · 8 comments

vfdev-5 commented Nov 16, 2022

Currently "Unit-tests on Linux GPU" is not running cuda tests but supposed to.

For example, on main:

test/test_functional_tensor.py::TestRotate::test_rotate[fn1-fill3-False--146-dt2-center1-32-26-cuda] SKIPPED [ 53%]
test/test_functional_tensor.py::TestRotate::test_rotate[fn1-fill3-False--146-dt2-center2-7-33-cpu] PASSED [ 53%]
test/test_functional_tensor.py::TestRotate::test_rotate[fn1-fill3-False--146-dt2-center2-7-33-cuda] SKIPPED [ 53%]
test/test_functional_tensor.py::TestRotate::test_rotate[fn1-fill3-False--146-dt2-center2-26-26-cpu] PASSED [ 53%]
test/test_functional_tensor.py::TestRotate::test_rotate[fn1-fill3-False--146-dt2-center2-26-26-cuda] SKIPPED [ 53%]
test/test_functional_tensor.py::TestRotate::test_rotate[fn1-fill3-False--146-dt2-center2-32-26-cpu] PASSED [ 53%]
test/test_functional_tensor.py::TestRotate::test_rotate[fn1-fill3-False--146-dt2-center2-32-26-cuda] SKIPPED [ 53%]
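
For context, these SKIPPED entries come from the usual device-parametrization pattern: when torch.cuda.is_available() is False on the runner, every cuda-parametrized variant gets skipped. A minimal sketch of that pattern (a hypothetical test, not the actual torchvision test helpers):

import pytest
import torch

# Hypothetical example for illustration: parametrize over devices and skip the
# CUDA variant when no GPU is visible to torch.
@pytest.mark.parametrize("device", ["cpu", "cuda"])
def test_rotate_sketch(device):
    if device == "cuda" and not torch.cuda.is_available():
        pytest.skip("CUDA not available on this runner")
    img = torch.rand(3, 32, 26, device=device)
    assert img.device.type == device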

Context: #6804 (comment)

cc @seemethere

pmeier commented Nov 22, 2022

The issue is twofold.

  1. The CUDA setup seems to be wrong; at least torch.cuda.is_available() is False. I've looked into it and it seems the Nvidia driver is either not installed or not set up correctly. See skip CPU tests on GPU GHA jobs #6970 (comment). @osalpekar could you have a look? My suggestion is to include a python -c "import torch; exit(not torch.cuda.is_available())" check after the torch installation to make sure everything is in order.

  2. Our current GPU tests are somewhat "hardcoded" to CircleCI:

    vision/test/conftest.py

    Lines 15 to 24 in 4a310f2

    def pytest_collection_modifyitems(items):
        # This hook is called by pytest after it has collected the tests (google its name to check out its doc!)
        # We can ignore some tests as we see fit here, or add marks, such as a skip mark.
        #
        # Typically here, we try to optimize CI time. In particular, the GPU CI instances don't need to run the
        # tests that don't need CUDA, because those tests are extensively tested in the CPU CI instances already.
        # This is true for both CircleCI and the fbcode internal CI.
        # In the fbcode CI, we have an additional constraint: we try to avoid skipping tests. So instead of relying on
        # pytest.mark.skip, in fbcode we literally just remove those tests from the `items` list, and it's as if
        # these tests never existed.

    The tests will still run on GHA, but the CPU ones will not be skipped there. Once 1. above is resolved, skip CPU tests on GPU GHA jobs #6970 resolves this; a rough sketch of that direction follows below.
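
For reference, here is a rough sketch of the direction #6970 could take for point 2 (the environment variable name and the selection criterion below are assumptions for illustration, not the actual implementation):

import os

import torch


def pytest_collection_modifyitems(items):
    # Hypothetical flag that the GHA GPU workflow could export for its jobs.
    on_gpu_ci = os.environ.get("GPU_CI") == "1" and torch.cuda.is_available()
    if on_gpu_ci:
        # Keep only the tests that exercise CUDA; the CPU-only ones are
        # already covered extensively by the CPU CI instances.
        items[:] = [item for item in items if "cuda" in item.name]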

osalpekar commented Dec 7, 2022

@pmeier @vfdev-5 I investigated this issue a bit. I see that torch.cuda.is_available() is True. I'm also able to create a torch tensor on the GPU and check its device affinity, and I can print the output of nvidia-smi. Here is the link to the job for reference: https://github.com/pytorch/test-infra/actions/runs/3643372591/jobs/6151532786. Perhaps something changed in the Linux job or our build containers under the hood in the past few weeks. Shall we revisit the fix for #2 now that #1 appears to be working?
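
For reference, the verification described above amounts to something like the following (a minimal sketch; the workflow may run these as separate shell steps):

import subprocess

import torch

# Sanity checks mirroring the ones described above.
print(torch.cuda.is_available())    # expected: True on the GPU runner
t = torch.ones(3, device="cuda")    # create a tensor on the GPU
print(t.device)                     # expected: cuda:0
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)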

vfdev-5 commented Dec 8, 2022

@osalpekar thanks for the investigation! Which PRs are you mentioning? The ones linked in your message (#1 and #2) are probably unrelated.
I restarted the GPU job for transforms v2: https://github.com/pytorch/vision/actions/runs/3643938504/jobs/6152659453, https://github.com/pytorch/vision/actions/runs/3644078231/jobs/6152944170, so let's see if it is working now.

@osalpekar

@vfdev-5 Ah, good point! By # 1 and # 2, I was referring to @pmeier's two points in the previous comment. Thanks for kicking off that job. Would love to see if #6970 is unblocked now that the CUDA drivers seem to be working.

vfdev-5 commented Dec 8, 2022

@osalpekar the job is still failing due to missing CUDA: https://github.com/pytorch/vision/actions/runs/3644078231/jobs/6152944170#logs

2022-12-08T00:48:14.8403566Z + python3 -c 'import torch; exit(not torch.cuda.is_available())'
2022-12-08T00:48:16.1029960Z ##[error]Process completed with exit code 1.
2022-12-08T00:48:16.1054680Z Prepare all required actions

@osalpekar

@vfdev-5 A couple of differences I see between the jobs that are working correctly and this one:

  1. It looks like we're using slightly different conda installation commands. Our build pipelines that use the same container for CUDA 11.6 install like this: conda install -v -y -c pytorch-"${CHANNEL}" -c nvidia pytorch pytorch-cuda="${GPU_ARCH_VERSION}". Based on the logs it does seem like a CUDA-enabled pytorch is installed, but the logs are a bit hard to parse because of the many-thousand-line output from conda install.
  2. Our test workflows use pip to install torch nightly instead of conda.
  3. The instance type is different - the jobs that are working use linux.4xlarge.nvidia.gpu, which is a g3 instance, as opposed to the g5 instances used here.

Points 2 and 3 are mostly just observations; ideally, conda installing on any instance in the pool should work fine. The first point might be worth looking into.

pmeier commented Dec 8, 2022

@osalpekar The issue persists for #6970: https://github.com/pytorch/vision/actions/runs/3648214763/jobs/6161369227

Edit: it seems I misunderstood. The issue should be gone after #7019.

vfdev-5 commented Jan 20, 2023

I think this is not relevant anymore. We can close this issue.
