CI Failures of release branch release/0.15 #7299

Closed
6 of 8 tasks
NicolasHug opened this issue Feb 21, 2023 · 12 comments

@NicolasHug (Member) commented Feb 21, 2023

The latest status of the release/0.15 CI can be checked in #7292. Current failures are:

2023-02-21T17:06:25.6104129Z Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.
2023-02-21T17:06:25.6104506Z 	Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
  File "/opt/conda/lib/python3.10/site-packages/conda/core/package_cache_data.py", line 803, in execute
    raise CondaMultiError(exceptions)
conda.CondaMultiError: Downloaded bytes did not match Content-Length
  url: https://conda.anaconda.org/nvidia/linux-64/libcusparse-11.7.4.91-0.tar.bz2
test/test_backbone_utils.py::TestFxFeatureExtraction::test_forward_backward[regnet_y_1_6gf] PASSED [  0%]
/var/folders/yh/q3_29drn3d3bw226r0xgt9180000gn/T/tmpj5jgojmn: line 3: 94052 Killed: 9               python3 -u -mpytest -v --tb=long --durations 20
ERROR conda.cli.main_run:execute(47): `conda run python3 -u -mpytest -v --tb=long --durations 20` failed. (See above for error)
test/test_backbone_utils.py::TestFxFeatureExtraction::test_forward_backward[regnet_y_32gf] 
Error: Process completed with exit code 137.
CMake Error in /home/circleci/project/cpp_build/CMakeFiles/CMakeTmp/CMakeLists.txt:
  CUDA_ARCHITECTURES is empty for target "cmTC_4ad9f".


CMake Error in /home/circleci/project/cpp_build/CMakeFiles/CMakeTmp/CMakeLists.txt:
  CUDA_ARCHITECTURES is empty for target "cmTC_4ad9f".


CMake Error at /usr/local/lib64/python3.6/site-packages/cmake/data/share/cmake-3.18/Modules/CMakeDetermineCompilerABI.cmake:48 (try_compile):
  Failed to generate test project build system.
Call Stack (most recent call first):
  /usr/local/lib64/python3.6/site-packages/cmake/data/share/cmake-3.18/Modules/CMakeTestCUDACompiler.cmake:19 (CMAKE_DETERMINE_COMPILER_ABI)
  CMakeLists.txt:12 (enable_language)
RuntimeError: 
The detected CUDA version (10.2) mismatches the version that was used to compile
PyTorch (11.7). Please make sure to use the same CUDA versions.
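
For the MKL failure above, the error message itself suggests two workarounds: import numpy before anything that loads libgomp, or set MKL_SERVICE_FORCE_INTEL. A minimal sketch of the env-var variant (untested on this CI, just illustrating what the message suggests):

# Force the Intel threading layer before numpy/torch pull in MKL.
import os
os.environ.setdefault("MKL_SERVICE_FORCE_INTEL", "1")  # must run before the imports below

import numpy  # importing numpy first is the alternative workaround the message suggests
import torch
import torchvision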

@atalman @malfet @seemethere we'd be grateful for your help on the issues listed above, thanks!

cc @seemethere

@pmeier (Collaborator) commented Feb 22, 2023

Note that none of this is specific to the release branch; it has been going on for a while on main:

#7180

#7184

Expected, since we are pulling dependencies from the conda default channel, which only very recently added Python 3.11 (a few months late, I have to mention), and so far none of our dependencies have been updated for it. I'm actively working on changing the way we procure our dependencies, but this won't be ready for the release. So I guess we'll have to have yet another release without support for 3.11?

@atalman (Contributor) commented Feb 27, 2023

PR #7332 addressed all of the CircleCI infra failures:

CMake Error in /home/circleci/project/cpp_build/CMakeFiles/CMakeTmp/CMakeLists.txt:
  CUDA_ARCHITECTURES is empty for target "cmTC_4ad9f".


CMake Error in /home/circleci/project/cpp_build/CMakeFiles/CMakeTmp/CMakeLists.txt:
  CUDA_ARCHITECTURES is empty for target "cmTC_4ad9f".


CMake Error at /usr/local/lib64/python3.6/site-packages/cmake/data/share/cmake-3.18/Modules/CMakeDetermineCompilerABI.cmake:48 (try_compile):
  Failed to generate test project build system.
Call Stack (most recent call first):
  /usr/local/lib64/python3.6/site-packages/cmake/data/share/cmake-3.18/Modules/CMakeTestCUDACompiler.cmake:19 (CMAKE_DETERMINE_COMPILER_ABI)
  CMakeLists.txt:12 (enable_language)
RuntimeError: 
The detected CUDA version (10.2) mismatches the version that was used to compile
PyTorch (11.7). Please make sure to use the same CUDA versions.
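
For reference, a quick way to confirm this class of mismatch on a builder is to compare the CUDA version PyTorch was compiled with against the locally installed toolkit; a minimal sketch (assumes nvcc is on PATH):

# Compare the CUDA version PyTorch was built with to the local toolkit.
import subprocess
import torch

print("PyTorch built with CUDA:", torch.version.cuda)
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)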

@NicolasHug (Member Author):

Thanks a lot @atalman, I updated the list above. Looks like the remaining issues are Unit-tests on Linux CPU / tests (3.10) / linux-job and the cmake_windows_gpu and cmake_windows_cpu jobs.

@malfet (Contributor) commented Feb 27, 2023

The 3.10 unit tests were initially broken by #7288, which was later cherry-picked into the release branch and obscured all other errors. The question is: why was it landed despite introducing regressions?

@NicolasHug (Member Author):

@malfet, #7288 did not break the 3.10 unittest. It is in fact surfacing an existing issue that was previously missed. #7288 is not the problem; it is merely the messenger.

The test we added in #7288 makes sure (among other things) that no warning is raised when one writes import torchvision. If there is a warning, the test fails. The failure we're observing on the 3.10 job is that the test fails because of the MKL-related warning that is thrown. This does seem to be an env/CI problem: @pmeier is consolidating this GitHub Actions workflow in #7189 (with a different setup) and those tests run fine there.
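
For illustration, a minimal sketch of how such an import-time warning check could look (hypothetical code, not the actual test from #7288):

# Run the import in a fresh interpreter with warnings escalated to errors,
# so any warning emitted while importing torchvision fails the test.
import subprocess
import sys

def test_import_torchvision_emits_no_warnings():
    result = subprocess.run(
        [sys.executable, "-W", "error", "-c", "import torchvision"],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0, result.stderr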

@atalman (Contributor) commented Mar 1, 2023

The Windows cmake issues are resolved.

The following job is flaky:
Build Linux Conda / pytorch/vision / conda-py3_9-cuda11_7

Resolved on rerun: https://github.com/pytorch/vision/actions/runs/4292476260

@malfet (Contributor) commented Mar 1, 2023

By the way, I can't reproduce the warning when importing torchvision by running:

$ conda create -n py310 -y python==3.10 torchvision cpuonly -c pytorch-test
$ conda run -n py310 python -c "import sys,torchvision;print(sys.version_info, torchvision.__version__, torchvision.torch.__version__)"
sys.version_info(major=3, minor=10, micro=0, releaselevel='final', serial=0) 0.15.0 2.0.0

@malfet (Contributor) commented Mar 1, 2023

@malfet, #7288 did not break the 3.10 unittest. It is in fact surfacing an existing issue that was previously missed. #7288 is not the problem; it is merely the messenger.

#7288 obscures the signal. If you know that something will fail, you can create an issue and at the same time file a PR that adds an @unittest.expectedFailure decorator. But turning a green signal into a red one results in lost signal, opening up the possibility for more regressions and building a habit of ignoring the PR CI signal, which is IMO bad. I.e., if you don't care about the signal being green, why not disable CI altogether?
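
For illustration, a minimal sketch of that pattern (hypothetical test, not tied to any specific file):

import unittest

class TestKnownIssue(unittest.TestCase):
    # The decorator keeps the run green while the bug is tracked in a separate
    # issue; the test still executes and is reported as an expected failure.
    @unittest.expectedFailure
    def test_known_broken_behavior(self):
        self.assertEqual(1 + 1, 3)  # placeholder for the known-broken check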

@NicolasHug (Member Author):

I share your concern that the signal is obscured, but this is not a common situation. Considering this seemed like a legitimate, critical CI/packaging issue, and with the release date approaching, the risk of hiding this failure seemed greater to me than the cost of keeping the signal red for a few more days.

@malfet (Contributor) commented Mar 1, 2023

Regarding the M1 failures: it looks like one of the tests intermittently crashes: https://hud.pytorch.org/hud/pytorch/vision/main/1?per_page=50&name_filter=unit-tests%20on%20M1
Is this a release blocker?

@NicolasHug (Member Author):

I opened #7373 to skip the failing test on 3.10, and also #7372 to keep track of the issue.
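
For illustration, a version-conditional skip could look roughly like this (hypothetical sketch, not necessarily what #7373 actually does):

import sys
import pytest

@pytest.mark.skipif(
    sys.version_info >= (3, 10),
    reason="MKL-related warning on the 3.10 CI job, tracked in #7372",
)
def test_import_emits_no_warnings():
    ...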

@NicolasHug (Member Author):

All issues have been resolved or circumvented, so I'll close this issue. Thanks a lot, everyone, for your help!
