
Fix GHA Linux GPU Job not running tests on CUDA #6957

Closed
vfdev-5 opened this issue Nov 16, 2022 · 8 comments

vfdev-5 commented Nov 16, 2022

Currently "Unit-tests on Linux GPU" is not running cuda tests but supposed to.

For example, on main:

test/test_functional_tensor.py::TestRotate::test_rotate[fn1-fill3-False--146-dt2-center1-32-26-cuda] SKIPPED [ 53%]
test/test_functional_tensor.py::TestRotate::test_rotate[fn1-fill3-False--146-dt2-center2-7-33-cpu] PASSED [ 53%]
test/test_functional_tensor.py::TestRotate::test_rotate[fn1-fill3-False--146-dt2-center2-7-33-cuda] SKIPPED [ 53%]
test/test_functional_tensor.py::TestRotate::test_rotate[fn1-fill3-False--146-dt2-center2-26-26-cpu] PASSED [ 53%]
test/test_functional_tensor.py::TestRotate::test_rotate[fn1-fill3-False--146-dt2-center2-26-26-cuda] SKIPPED [ 53%]
test/test_functional_tensor.py::TestRotate::test_rotate[fn1-fill3-False--146-dt2-center2-32-26-cpu] PASSED [ 53%]
test/test_functional_tensor.py::TestRotate::test_rotate[fn1-fill3-False--146-dt2-center2-32-26-cuda] SKIPPED [ 53%]
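
For context, these SKIPPED entries come from the usual device-parametrization pattern: when torch.cuda.is_available() is False on the runner, every cuda-parametrized variant gets skipped. A minimal sketch of that pattern (a hypothetical test, not the actual torchvision test helpers):

import pytest
import torch

# Hypothetical example for illustration: parametrize over devices and skip the
# CUDA variant when no GPU is visible to torch.
@pytest.mark.parametrize("device", ["cpu", "cuda"])
def test_rotate_sketch(device):
    if device == "cuda" and not torch.cuda.is_available():
        pytest.skip("CUDA not available on this runner")
    img = torch.rand(3, 32, 26, device=device)
    assert img.device.type == device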

Context: #6804 (comment)

cc @seemethere

pmeier commented Nov 22, 2022

The issue is twofold.

  1. The CUDA setup seems to be wrong; at least torch.cuda.is_available() is False. I've looked into it and it seems the Nvidia driver is either not installed or not set up correctly. See skip CPU tests on GPU GHA jobs #6970 (comment). @osalpekar could you have a look? My suggestion is to include a python -c "import torch; exit(not torch.cuda.is_available())" check after the torch installation to make sure everything is in order.

  2. Our current GPU tests are somewhat "hardcoded" to CircleCI:

    vision/test/conftest.py

    Lines 15 to 24 in 4a310f2

    def pytest_collection_modifyitems(items):
        # This hook is called by pytest after it has collected the tests (google its name to check out its doc!)
        # We can ignore some tests as we see fit here, or add marks, such as a skip mark.
        #
        # Typically here, we try to optimize CI time. In particular, the GPU CI instances don't need to run the
        # tests that don't need CUDA, because those tests are extensively tested in the CPU CI instances already.
        # This is true for both CircleCI and the fbcode internal CI.
        # In the fbcode CI, we have an additional constraint: we try to avoid skipping tests. So instead of relying on
        # pytest.mark.skip, in fbcode we literally just remove those tests from the `items` list, and it's as if
        # these tests never existed.

    The tests will still run on GHA, but the CPU ones will not be skipped there. Once 1. above is resolved, skip CPU tests on GPU GHA jobs #6970 resolves this; a rough sketch of that direction follows below.
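
For reference, here is a rough sketch of the direction #6970 could take for point 2 (the environment variable name and the selection criterion below are assumptions for illustration, not the actual implementation):

import os

import torch


def pytest_collection_modifyitems(items):
    # Hypothetical flag that the GHA GPU workflow could export for its jobs.
    on_gpu_ci = os.environ.get("GPU_CI") == "1" and torch.cuda.is_available()
    if on_gpu_ci:
        # Keep only the tests that exercise CUDA; the CPU-only ones are
        # already covered extensively by the CPU CI instances.
        items[:] = [item for item in items if "cuda" in item.name]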

osalpekar commented Dec 7, 2022

@pmeier @vfdev-5 I investigated this issue a bit. I see that torch.cuda.is_available() is True. I'm also able to create a torch tensor on the GPU and check its device affinity, and I can print the output of nvidia-smi. Here is the link to the job for reference: https://github.com/pytorch/test-infra/actions/runs/3643372591/jobs/6151532786. Perhaps something changed in the Linux job or our build containers under the hood in the past few weeks. Shall we revisit the fix for #2 now that #1 appears to be working?
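
For reference, the verification described above amounts to something like the following (a minimal sketch; the workflow may run these as separate shell steps):

import subprocess

import torch

# Sanity checks mirroring the ones described above.
print(torch.cuda.is_available())    # expected: True on the GPU runner
t = torch.ones(3, device="cuda")    # create a tensor on the GPU
print(t.device)                     # expected: cuda:0
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)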

vfdev-5 commented Dec 8, 2022

@osalpekar thanks for the investigation! Which PRs are you mentioning? The ones linked in your message (#1 and #2) are probably unrelated.
I restarted the GPU job for transforms v2: https://github.com/pytorch/vision/actions/runs/3643938504/jobs/6152659453, https://github.com/pytorch/vision/actions/runs/3644078231/jobs/6152944170, so let's see if it is working now.

@osalpekar

@vfdev-5 Ah, good point! By # 1 and # 2, I was referring to @pmeier's two points in the previous comment. Thanks for kicking off that job. Would love to see if #6970 is unblocked now that the CUDA drivers seem to be working.

vfdev-5 commented Dec 8, 2022

@osalpekar the job is still failing due to missing CUDA: https://github.com/pytorch/vision/actions/runs/3644078231/jobs/6152944170#logs

2022-12-08T00:48:14.8403566Z + python3 -c 'import torch; exit(not torch.cuda.is_available())'
2022-12-08T00:48:16.1029960Z ##[error]Process completed with exit code 1.
2022-12-08T00:48:16.1054680Z Prepare all required actions

@osalpekar

@vfdev-5 A couple of differences I see between the jobs that are working correctly and this one:

  1. It looks like we're using slightly different conda installation commands. Our build pipelines that use the same container for CUDA 11.6 install like this: conda install -v -y -c pytorch-"${CHANNEL}" -c nvidia pytorch pytorch-cuda="${GPU_ARCH_VERSION}". Based on the logs it does seem like a CUDA-enabled pytorch is installed, but the logs are a bit hard to parse because of the many-thousand-line output from conda install.
  2. Our test workflows use pip to install torch nightly instead of conda.
  3. The instance type is different - the jobs that are working use linux.4xlarge.nvidia.gpu, which is a g3 instance, as opposed to the g5 instances used here.

Points 2 and 3 are mostly just observations; ideally, conda installing on any instance in the pool should work fine. The first point might be worth looking into.

pmeier commented Dec 8, 2022

@osalpekar The issue persists for #6970: https://github.com/pytorch/vision/actions/runs/3648214763/jobs/6161369227

Edit: it seems I misunderstood. The issue should be gone after #7019.

vfdev-5 commented Jan 20, 2023

I think this is not relevant anymore. We can close this issue.
