
[Navi21][HIP] test_pooling2d and other unit tests are failing #1141

Closed

junliume opened this issue Sep 7, 2021 · 10 comments

Comments

@junliume
Contributor

junliume commented Sep 7, 2021

The following tests are failing consistently on gfx1030; please take a look:
http://micimaster.amd.com/blue/rest/organizations/jenkins/pipelines/MLLibs/pipelines/MIOpen/branches/ci_gfx1030_ocl_to_hip/runs/7/nodes/10/steps/40/log/?start=0

[2021-09-02T00:03:15.364Z] 3/82 Test #37: test_pooling2d .............................***Failed 6.92 sec
[2021-09-02T00:03:15.364Z] Memory access fault by GPU node-2 (Agent handle: 0x69c8b0) on address 0x7fc8ad3f7000. Reason: Page not present or supervisor privilege.
[2021-09-02T00:03:15.364Z] CMake Error at test_test_pooling2d.cmake:7 (message):
[2021-09-02T00:03:15.364Z] Test failed
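
For local triage, a minimal way to rerun only this test from an existing MIOpen build directory (a sketch, assuming the standard CMake/CTest setup shown later in this thread) is:

cd build
ctest --output-on-failure -R test_pooling2d

This reruns just the registered test_pooling2d test and prints its output on failure, which should show the same memory access fault if it reproduces outside the full suite.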

Moreover, when running the HIP check on gfx1030, several more unit tests fail, for example:

[2021-09-02T00:18:40.264Z] /var/jenkins/workspace/ibs_MIOpen_ci_gfx1030_ocl_to_hip/build/bin/test_conv2d --float --cmode conv --pmode default --group-count 1 --input 64 160 73 73 --weights 64 160 1 1 --batch_size 64 --input_channels 160 --output_channels 64 --spatial_dim_elements 73 73 --filter_dims 3 3 --pads_strides_dilations 0 0 1 1 1 1 --trans_output_pads 0 0 --in_layout NCHW --fil_layout NCHW --out_layout NCHW
[2021-09-02T00:18:40.264Z] FAILED: 0.000689474
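
The FAILED value above appears to be the verification error reported by the test driver when the GPU result is compared against the CPU reference. If more detail is needed, one option (a sketch, assuming the MIOPEN_LOG_LEVEL environment variable from MIOpen's debugging documentation is enabled in this build) is to rerun the same command with logging turned up:

MIOPEN_LOG_LEVEL=6 /var/jenkins/workspace/ibs_MIOpen_ci_gfx1030_ocl_to_hip/build/bin/test_conv2d --float --cmode conv --pmode default --group-count 1 --input 64 160 73 73 --weights 64 160 1 1 --batch_size 64 --input_channels 160 --output_channels 64 --spatial_dim_elements 73 73 --filter_dims 3 3 --pads_strides_dilations 0 0 1 1 1 1 --trans_output_pads 0 0 --in_layout NCHW --fil_layout NCHW --out_layout NCHW

This should at least help show which solver/kernel is selected for the failing configuration.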

Let's discuss the priority of these failures separately; this issue is to track the problems we will be facing soon.

@carlushuang
Contributor

@junliume I tested both rocm-4.1 and 4.3 with both the HIP and OCL backends, but I can't seem to reproduce this Navi crash of test_pooling2d. Could it be related to the CI hardware? I tested on ixt-sjc2-63.

@junliume
Contributor Author

junliume commented Sep 8, 2021

@junliume I tested both rocm-4.1 and 4.3 with both the HIP and OCL backends, but I can't seem to reproduce this Navi crash of test_pooling2d. Could it be related to the CI hardware? I tested on ixt-sjc2-63.

@carlushuang it is still failing pretty consistently (the following pipeline was run just now):
http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/ci_gfx1030_ocl_to_hip/8/pipeline

@carlushuang
Contributor

@junliume Hi, I tested on ixt-sjc2-63 with the latest MIOpen: I rebuilt the docker image and MIOpen and ran the tests (all commands are the same as in that Jenkins stage), but everything still passes. Below are the commands; the log is attached:
log.txt

cd /dockerx/repo/MIOpen/
mkdir build && cd build
CXX=/opt/rocm/llvm/bin/clang++ CXXFLAGS=-Werror cmake -DMIOPEN_TEST_FLAGS="--disable-verification-cache" -DCMAKE_BUILD_TYPE=release -DBUILD_DEV=Off -DCMAKE_INSTALL_PREFIX=../install -DMIOPEN_GPU_SYNC=Off -DMIOPEN_TEST_ALL=On ..
CTEST_PARALLEL_LEVEL=4 MIOPEN_CONV_PRECISE_ROCBLAS_TIMING=0 MIOPEN_DEBUG_CONV_IMPLICIT_GEMM_HIP_FWD_V4R1=0 dumb-init make -j192 install check 2>&1 | tee log.txt

@junliume
Contributor Author

junliume commented Sep 9, 2021

@junliume Hi, I tested on ixt-sjc2-63 with the latest MIOpen: I rebuilt the docker image and MIOpen and ran the tests (all commands are the same as in that Jenkins stage), but everything still passes. Below are the commands; the log is attached:
log.txt

cd /dockerx/repo/MIOpen/
mkdir build && cd build
CXX=/opt/rocm/llvm/bin/clang++ CXXFLAGS=-Werror cmake -DMIOPEN_TEST_FLAGS="--disable-verification-cache" -DCMAKE_BUILD_TYPE=release -DBUILD_DEV=Off -DCMAKE_INSTALL_PREFIX=../install -DMIOPEN_GPU_SYNC=Off -DMIOPEN_TEST_ALL=On ..
CTEST_PARALLEL_LEVEL=4 MIOPEN_CONV_PRECISE_ROCBLAS_TIMING=0 MIOPEN_DEBUG_CONV_IMPLICIT_GEMM_HIP_FWD_V4R1=0 dumb-init make -j192 install check 2>&1 | tee log.txt

Indeed an interesting and tricky one. You can check the pipelines here:
http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/ci_gfx1030_ocl_to_hip/11/pipeline/
All of the pipelines fail pretty consistently at similar locations. They actually get stuck for hours and I have to kill the jobs manually.
I will look more into them tomorrow :)

@carlushuang
Contributor

Hi @junliume, yes, I checked this pipeline, and it seems even the clinfo output of that machine and ixt-sjc2-63 is the same. Very strange.

@junliume
Contributor Author

@atamazov and @carlushuang
A trial with CTEST_PARALLEL_LEVEL=1 seems to have stabilized the CI tests for gfx1030, including the HIP backend, which never passed before.
I will monitor
http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/fix_WA1053/11/pipeline/689/
and
http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/fix_%231167/1/pipeline/

I am also trying CTEST_PARALLEL_LEVEL=2 to see whether it stabilizes the CI as well, because running the tests serially is really slow.
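
For reference, a sketch of the serial-run workaround, assuming the same build directory and make check target used above: CTest picks up CTEST_PARALLEL_LEVEL from the environment, so the suite can be forced to run serially with

CTEST_PARALLEL_LEVEL=1 make -j192 check

or, invoking CTest directly from the build directory,

ctest -j 1 --output-on-failure

The CTEST_PARALLEL_LEVEL=2 experiment is the same command with the value changed.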

@junliume
Contributor Author

test_pooling2d and the other unit tests are stable with a serial run.

@atamazov
Contributor

@junliume Do we know the root cause of the issue?

@junliume
Contributor Author

@junliume Do we know the root cause of the issue?

No, unfortunately. It is unlikely to be MIOpen's own issue, since it is very platform-specific; meanwhile, if we try to submit a ticket against the runtime/compiler, it is hard to persuade them to accept it, since these are MIOpen's unit tests ...
@carlushuang @asroy @zjing14 @JehandadKhan any suggestions, please?

@atamazov
Contributor

Let's agree that the root cause of this issue is #1148. In that case we can keep this issue closed.
