This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[BACKPORT]Enable CUDA 11.0 on nightly + CUDA 11.2 on pip (#19295)(#19764) #19930

Merged
merged 4 commits into from
Feb 25, 2021

Conversation

access2rohit
Contributor

access2rohit commented Feb 19, 2021

Remove CUDA 9.x and add CUDA 11.2 support.

Backport #19295, #19764 as a part of effort #19911

@mxnet-bot

Hey @access2rohit, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [windows-gpu, windows-cpu, clang, website, unix-gpu, unix-cpu, edge, miscellaneous, centos-gpu, centos-cpu, sanity]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.
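To make the command syntax above concrete, here is a minimal Python sketch of how a comment such as `@mxnet-bot run ci [unix-gpu, centos-gpu]` could be parsed against the supported job list. It is hypothetical: `parse_ci_command`, the regular expression, and the choice to reject unknown job names with `ValueError` are assumptions, not mxnet-bot's actual implementation.

```python
import re
from typing import Set

# Job names the bot lists as supported for re-triggering.
SUPPORTED_JOBS = {
    "windows-gpu", "windows-cpu", "clang", "website", "unix-gpu",
    "unix-cpu", "edge", "miscellaneous", "centos-gpu", "centos-cpu", "sanity",
}

# Hypothetical pattern for "@mxnet-bot run ci [job1, job2]" comments.
COMMAND_RE = re.compile(r"@mxnet-bot\s+run\s+ci\s+\[([^\]]*)\]")


def parse_ci_command(comment: str) -> Set[str]:
    """Return the set of jobs a comment asks to re-run.

    "[all]" expands to every supported job; unknown names raise ValueError.
    A comment without a bot command yields an empty set.
    """
    match = COMMAND_RE.search(comment)
    if match is None:
        return set()
    names = {name.strip() for name in match.group(1).split(",") if name.strip()}
    if names == {"all"}:
        return set(SUPPORTED_JOBS)
    unknown = names - SUPPORTED_JOBS
    if unknown:
        raise ValueError(f"unsupported job(s): {sorted(unknown)}")
    return names


print(sorted(parse_ci_command("@mxnet-bot run ci [unix-gpu, centos-gpu]")))
```

Validating names up front means a typo in a job name is reported immediately instead of silently triggering nothing.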

@access2rohit
Contributor Author

@waytrue17 @leezu Can you review?

lanking520 added the pr-awaiting-testing label Feb 19, 2021
access2rohit mentioned this pull request Feb 19, 2021
13 tasks
lanking520 added pr-work-in-progress and removed pr-awaiting-testing labels Feb 19, 2021
lanking520 added and removed the pr-awaiting-testing and pr-work-in-progress labels Feb 22, 2021
lanking520 added pr-awaiting-testing and removed pr-work-in-progress labels Feb 22, 2021
access2rohit changed the title from [BACKPORT]Enable CUDA 11.0 on nightly development builds (#19295) to [BACKPORT]Enable CUDA 11.0 on nightly + Cuda11.2 on pip (#19295)(#19764) Feb 22, 2021
access2rohit changed the title from [BACKPORT]Enable CUDA 11.0 on nightly + Cuda11.2 on pip (#19295)(#19764) to [BACKPORT]Enable CUDA 11.0 on nightly + CUDA 11.2 on pip (#19295)(#19764) Feb 23, 2021
lanking520 added pr-awaiting-review and removed pr-awaiting-testing labels Feb 23, 2021
Contributor

josephevans left a comment

Also, since we need to support the last 2 major versions (11.x and 10.x), we should not remove support for 10.0 in this PR. Rather, can we add 11.1 and 11.2 and leave 10.0 intact? Thanks.
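The reviewer's rule (keep every CUDA release from the two newest major versions) can be expressed as a small filter. This is a hypothetical Python sketch: `supported_versions` and its `majors_to_keep` parameter are illustrative names, and the version list is simply the set of CUDA versions mentioned in this PR and its commits.

```python
from typing import List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a CUDA version string such as "11.2" into (11, 2) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def supported_versions(candidates: List[str], majors_to_keep: int = 2) -> List[str]:
    """Keep only versions whose major number is among the newest `majors_to_keep` majors."""
    majors = sorted({parse_version(v)[0] for v in candidates}, reverse=True)
    keep = set(majors[:majors_to_keep])
    kept = [v for v in candidates if parse_version(v)[0] in keep]
    return sorted(kept, key=parse_version)


# CUDA versions mentioned in this PR and its commits.
versions = ["9.2", "10.0", "10.1", "10.2", "11.0", "11.1", "11.2"]
print(supported_versions(versions))
```

Under this rule 9.2 is dropped while 10.0 stays, which is what the review asks for. Sorting by the parsed tuple rather than the raw string avoids ordering "10.0" before "9.2".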

Review comment on ci/docker/docker-compose.yml (outdated, resolved)
@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review and removed pr-awaiting-review PR is waiting for code review pr-awaiting-testing PR is reviewed and waiting CI build and test labels Feb 23, 2021
@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-awaiting-review PR is waiting for code review labels Feb 23, 2021
@lanking520 lanking520 added pr-awaiting-review PR is waiting for code review and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Feb 24, 2021
@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-awaiting-review PR is waiting for code review labels Feb 24, 2021
Contributor

waytrue17 left a comment

LGTM, thanks!

@lanking520 lanking520 added pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress labels Feb 24, 2021
Contributor

Zha0q1 left a comment

Looking good! We can try this PR on a duplicate CD pipeline to verify it works before merging.

@access2rohit
Contributor Author

Blocked on the unix-gpu job, which fails with the following error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown.

@Zha0q1 @josephevans are looking into this issue.

lanking520 added pr-awaiting-merge and removed pr-awaiting-testing labels Feb 25, 2021
Zha0q1 merged commit 57dddd1 into apache:v1.x Feb 25, 2021
@Zha0q1
Contributor

Zha0q1 commented Feb 25, 2021

The base image nvidia/cuda:11.2-cudnn8-devel-ubuntu16.04 does not exist, so we probably need to try another image.

@access2rohit
Contributor Author

access2rohit commented Feb 25, 2021

> the base image nvidia/cuda:11.2-cudnn8-devel-ubuntu16.04 does not exist, so we probably need to try another image

I checked https://hub.docker.com/r/nvidia/cuda and the tag 11.2.1-cudnn8-runtime-ubuntu16.04 exists, so nvidia/cuda:11.2.1-cudnn8-runtime-ubuntu16.04 should work.
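The tag check above can be scripted before a Dockerfile is changed. A minimal Python sketch, with several assumptions: the tag layout `<cuda>-cudnn<N>-<flavor>-ubuntu<release>` is inferred from the two tags in this thread, the Docker Hub endpoint in `HUB_TAG_URL` is assumed rather than confirmed here, and `cuda_image_tag` / `tag_exists` are hypothetical helpers. Requiring a full `major.minor.patch` CUDA version is a conservative choice of this sketch (in this thread only the patch-level tag was found), not a Docker Hub rule.

```python
import urllib.error
import urllib.request

# Assumed Docker Hub endpoint for looking up a single tag of a repository.
HUB_TAG_URL = "https://hub.docker.com/v2/repositories/{repo}/tags/{tag}"


def cuda_image_tag(cuda: str, cudnn: int, flavor: str, ubuntu: str) -> str:
    """Build a tag such as '11.2.1-cudnn8-runtime-ubuntu16.04'."""
    if cuda.count(".") != 2:
        # Sketch's own policy: '11.2-...' was missing while '11.2.1-...' existed.
        raise ValueError(f"expected major.minor.patch, got {cuda!r}")
    return f"{cuda}-cudnn{cudnn}-{flavor}-ubuntu{ubuntu}"


def tag_exists(tag: str, repo: str = "nvidia/cuda", timeout: float = 10.0) -> bool:
    """Ask Docker Hub whether repo:tag exists (needs network access).

    A 404 is treated as "missing"; any other HTTP error is re-raised.
    """
    url = HUB_TAG_URL.format(repo=repo, tag=tag)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise


tag = cuda_image_tag("11.2.1", 8, "runtime", "16.04")
print(f"nvidia/cuda:{tag}")
# tag_exists(tag) performs the actual network lookup.
```

Note that the suggested replacement uses the `runtime` flavor, whereas the missing image was a `devel` one; the sketch builds either, depending on the `flavor` argument.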

access2rohit added a commit to access2rohit/incubator-mxnet that referenced this pull request Mar 10, 2021
apache#19764) (apache#19930)

* Enable CUDA 11.0 on nightly development builds (apache#19295)

Remove CUDA 9.2 and CUDA 10.0

* [PIP] add build variant for cuda 11.2 (apache#19764)

* adding ci docker files for cu111 and cu112

* removing previous CUDA make versions and adding support for cuda11.2

Co-authored-by: waytrue17 <[email protected]>
Co-authored-by: Sheng Zha <[email protected]>
Co-authored-by: Rohit Kumar Srivastava <[email protected]>
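The commits refer to CUDA build variants by short names (cu100, cu101, cu102, cu111, cu112). These appear to follow a "cu" + major + minor pattern. A hypothetical Python helper, `cuda_variant`, illustrating that assumed scheme (the actual CD scripts may derive the names differently):

```python
def cuda_variant(version: str) -> str:
    """Map a CUDA version like '11.2' or '11.2.1' to a variant name like 'cu112'.

    Assumed scheme: 'cu' + major + minor; any patch component is ignored.
    """
    parts = version.split(".")
    if len(parts) < 2 or not all(part.isdigit() for part in parts):
        raise ValueError(f"not a CUDA version: {version!r}")
    major, minor = parts[0], parts[1]
    return f"cu{major}{minor}"


for v in ["10.0", "10.1", "10.2", "11.0", "11.1", "11.2", "11.2.1"]:
    print(v, "->", cuda_variant(v))
```

Dropping the patch component means 11.2 and 11.2.1 map to the same cu112 variant.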
access2rohit added a commit to access2rohit/incubator-mxnet that referenced this pull request Mar 12, 2021
apache#19764) (apache#19930)

* Enable CUDA 11.0 on nightly development builds (apache#19295)

Remove CUDA 9.2 and CUDA 10.0

* [PIP] add build variant for cuda 11.2 (apache#19764)

* adding ci docker files for cu111 and cu112

* removing previous CUDA make versions and adding support for cuda11.2

Co-authored-by: waytrue17 <[email protected]>
Co-authored-by: Sheng Zha <[email protected]>
Co-authored-by: Rohit Kumar Srivastava <[email protected]>
mseth10 added a commit that referenced this pull request Mar 14, 2021
…20015)

* [BACKPORT]Enable CUDA 11.0 on nightly + CUDA 11.2 on pip (#19295)(#19764) (#19930)

* Enable CUDA 11.0 on nightly development builds (#19295)

Remove CUDA 9.2 and CUDA 10.0

* [PIP] add build variant for cuda 11.2 (#19764)

* adding ci docker files for cu111 and cu112

* removing previous CUDA make versions and adding support for cuda11.2

Co-authored-by: waytrue17 <[email protected]>
Co-authored-by: Sheng Zha <[email protected]>
Co-authored-by: Rohit Kumar Srivastava <[email protected]>

* [FEATURE]Migrating all CD pipelines to Ninja build + fix cu112 CD pipeline (#19974)

* migrating cd builds to ninja + removing static links to nvidia libs and legacy cuda versions

* installing NCCL manually for cuda11.2 container

* set MSHADOW_USE_CUDNN=1 in CMakeLists of mshadow to build properly for CUDNN support

* adding coverage to cd requirements file to fix cu100, cu101 and cu102 tests

* updating cd_test containers to ubuntu 18

* adding cmake config for linux native and adding USE_KV_STORE in linux_cpu

* updating zmq builds to statically link to libmxnet.so

* updating toolchains for R, Clang and LLVM for Ubuntu 18. OpenBLAS static link for 'distribution' build type only. Fix Caffe build to use OpenCV 3. Remove legacy Clang 3.9 from CI

* fix versions for pip install in ubuntu_core_sh and add new search path for cuDNN

* fixing cudnn link problem for CUDA<=11.0

* adding library paths for libjpegturbo and lapack to fix failing CI on ubuntu 18 images

* removing ASAN integration test from miscellaneous CI as it's not required

* fix lapack path for gpu builds

* correctly installing libjpegturbo for ubuntu 18

* updating docker images of R, Jekyll, Julia etc. test containers + fix Java version to 8

* installing libomp.so

* removing debug test as its not required. Code clean-up

* adding alternate URL source for MNIST dataset as original website is down

* skipping flaky tests issue tracked #20011

Co-authored-by: Rohit Kumar Srivastava <[email protected]>

* update cudnn from 7 to 8 for cu102 (#19506)

* update cudnn from 7 to 8 for cu102 (#19522)

* downloading MNIST dataset from alternate URL (#20014)

Co-authored-by: Rohit Kumar Srivastava <[email protected]>

* fixing CI issue with v1.8.x

* addressing review comments

Co-authored-by: waytrue17 <[email protected]>
Co-authored-by: Sheng Zha <[email protected]>
Co-authored-by: Rohit Kumar Srivastava <[email protected]>
Co-authored-by: Manu Seth <[email protected]>
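Several commits above switch the MNIST download to an alternate URL because the original website was down. The general pattern is a mirror fallback, sketched here in Python. The mirror URLs are hypothetical placeholders (not the URLs actually used in MXNet), `download_with_fallback` is an illustrative helper, and `fake_fetch` is an offline stand-in so the example runs without network access.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def fetch_url(url: str, timeout: float = 30.0) -> bytes:
    """Download `url` and return its body (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_with_fallback(filename: str, base_urls: Iterable[str],
                           fetch: Callable[[str], bytes] = fetch_url) -> bytes:
    """Try each mirror in order and return the first successful download.

    urllib.error.URLError subclasses OSError, so catching OSError covers DNS
    failures, refused connections, timeouts and HTTP errors alike.
    """
    errors: List[str] = []
    for base in base_urls:
        url = base.rstrip("/") + "/" + filename
        try:
            return fetch(url)
        except OSError as err:
            errors.append(f"{url}: {err}")
    raise RuntimeError("all mirrors failed:\n" + "\n".join(errors))


# Hypothetical mirrors: the original source first, the alternate one second.
MIRRORS = ["https://primary.example.com/mnist/", "https://mirror.example.org/mnist"]


def fake_fetch(url: str) -> bytes:
    """Offline stand-in for fetch_url that simulates the primary site being down."""
    if "primary" in url:
        raise urllib.error.URLError("site is down")
    return b"payload from " + url.encode()


data = download_with_fallback("train-images-idx3-ubyte.gz", MIRRORS, fetch=fake_fetch)
print(data)
```

Injecting `fetch` keeps the fallback logic testable offline; in real use the default `fetch_url` performs the download.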
mseth10 added a commit to mseth10/incubator-mxnet that referenced this pull request Mar 15, 2021
…pache#20015)

* [BACKPORT]Enable CUDA 11.0 on nightly + CUDA 11.2 on pip (apache#19295)(apache#19764) (apache#19930)

* Enable CUDA 11.0 on nightly development builds (apache#19295)

Remove CUDA 9.2 and CUDA 10.0

* [PIP] add build variant for cuda 11.2 (apache#19764)

* adding ci docker files for cu111 and cu112

* removing previous CUDA make versions and adding support for cuda11.2

Co-authored-by: waytrue17 <[email protected]>
Co-authored-by: Sheng Zha <[email protected]>
Co-authored-by: Rohit Kumar Srivastava <[email protected]>

* [FEATURE]Migrating all CD pipelines to Ninja build + fix cu112 CD pipeline (apache#19974)

* migrating cd builds to ninja + removing static links to nvidia libs and legacy cuda versions

* installing NCCL manually for cuda11.2 container

* set MSHADOW_USE_CUDNN=1 in CMakeLists of mshadow to build properly for CUDNN support

* adding coverage to cd requirements file to fix cu100, cu101 and cu102 tests

* updating cd_test containers to ubuntu 18

* adding cmake config for linux native and adding USE_KV_STORE in linux_cpu

* updating zmq builds to statically link to libmxnet.so

* updating toolchains for R, Clang and LLVM for Ubuntu 18. OpenBLAS static link for 'distribution' build type only. Fix Caffe build to use OpenCV 3. Remove legacy Clang 3.9 from CI

* fix versions for pip install in ubuntu_core_sh and add new search path for cuDNN

* fixing cudnn link problem for CUDA<=11.0

* adding library paths for libjpegturbo and lapack to fix failing CI on ubuntu 18 images

* removing ASAN integration test from miscellaneous CI as it's not required

* fix lapack path for gpu builds

* correctly installing libjpegturbo for ubuntu 18

* updating docker images of R, Jekyll, Julia etc. test containers + fix Java version to 8

* installing libomp.so

* removing debug test as its not required. Code clean-up

* adding alternate URL source for MNIST dataset as original website is down

* skipping flaky tests issue tracked apache#20011

Co-authored-by: Rohit Kumar Srivastava <[email protected]>

* update cudnn from 7 to 8 for cu102 (apache#19506)

* update cudnn from 7 to 8 for cu102 (apache#19522)

* downloading MNIST dataset from alternate URL (apache#20014)

Co-authored-by: Rohit Kumar Srivastava <[email protected]>

* fixing CI issue with v1.8.x

* addressing review comments

Co-authored-by: waytrue17 <[email protected]>
Co-authored-by: Sheng Zha <[email protected]>
Co-authored-by: Rohit Kumar Srivastava <[email protected]>
Co-authored-by: Manu Seth <[email protected]>
Labels
pr-awaiting-merge Review and CI is complete. Ready to Merge

8 participants