
OnnxRuntime_test_all fails when building with TensorRT #5232

Closed
mayani-nv opened this issue Sep 21, 2020 · 18 comments
Labels
ep:TensorRT issues related to TensorRT execution provider

@mayani-nv

I followed the instructions described here to build with TensorRT and got the following error

1:   YOU HAVE 4 DISABLED TESTS
1:
1:
1: ----- MEMORY LEAKS: 216 bytes of memory leaked in 6 allocations
1/3 Test #1: onnxruntime_test_all .............***Failed  229.37 sec
test 2
    Start 2: onnx_test_pytorch_converted

2: Test command: C:\Users\AzureUser\onnxruntime\build\Windows\Debug\Debug\onnx_test_runner.exe "C:/Users/AzureUser/onnxruntime/cmake/external/onnx/onnx/backend/test/data/pytorch-converted"
2: Test timeout computed to be: 10000000
2: 2020-09-20 16:46:09.6963656 [E:onnxruntime:Default, testcase_driver.cc:41 onnxruntime::test::TestCaseDriver::RunParallel] Running tests in parallel: at most 6 models at any time
2: 2020-09-20 16:46:17.2699163 [E:onnxruntime:Default, testcase_driver.cc:63 onnxruntime::test::TestCaseDriver::RunModelsAsync] Running tests finished. Generating report
2: result:
2:      Models: 57
2:      Total test cases: 57
2:              Succeeded: 57
2:              Not implemented: 0
2:              Failed: 0
2:      Stats by Operator type:
2:              Not implemented(0):
2:              Failed:
2: Failed Test Cases:
2/3 Test #2: onnx_test_pytorch_converted ......   Passed    9.40 sec
test 3
    Start 3: onnx_test_pytorch_operator

3: Test command: C:\Users\AzureUser\onnxruntime\build\Windows\Debug\Debug\onnx_test_runner.exe "C:/Users/AzureUser/onnxruntime/cmake/external/onnx/onnx/backend/test/data/pytorch-operator"
3: Test timeout computed to be: 10000000
3: 2020-09-20 16:46:18.6645857 [E:onnxruntime:Default, testcase_driver.cc:41 onnxruntime::test::TestCaseDriver::RunParallel] Running tests in parallel: at most 6 models at any time
3: 2020-09-20 16:46:22.0578970 [E:onnxruntime:Default, testcase_driver.cc:63 onnxruntime::test::TestCaseDriver::RunModelsAsync] Running tests finished. Generating report
3: result:
3:      Models: 24
3:      Total test cases: 24
3:              Succeeded: 24
3:              Not implemented: 0
3:              Failed: 0
3:      Stats by Operator type:
3:              Not implemented(0):
3:              Failed:
3: Failed Test Cases:
3/3 Test #3: onnx_test_pytorch_operator .......   Passed    4.78 sec

67% tests passed, 1 tests failed out of 3

Total Test time (real) = 243.60 sec

The following tests FAILED:
          1 - onnxruntime_test_all (Failed)
Errors while running CTest
Traceback (most recent call last):
  File "C:\Users\AzureUser\onnxruntime\\tools\ci_build\build.py", line 1800, in <module>
    sys.exit(main())
  File "C:\Users\AzureUser\onnxruntime\\tools\ci_build\build.py", line 1741, in main
    run_onnxruntime_tests(args, source_dir, ctest_path, build_dir, configs)
  File "C:\Users\AzureUser\onnxruntime\\tools\ci_build\build.py", line 1195, in run_onnxruntime_tests
    run_subprocess(ctest_cmd, cwd=cwd, dll_path=dll_path)
  File "C:\Users\AzureUser\onnxruntime\\tools\ci_build\build.py", line 446, in run_subprocess
    env=my_env, shell=shell)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\lib\subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['C:\\Program Files\\CMake\\bin\\ctest.EXE', '--build-config', 'Debug', '--verbose']' returned non-zero exit status 8.

System information

OS Platform and Distribution (e.g., Linux Ubuntu 18.04): WS 2016
Visual Studio : 2017 Community
ONNX Runtime installed from (source or binary): source
ONNX Runtime version: 1.7
Python version: 3.7
CUDA version: 10.1
cuDNN version: 7.6.5
GPU card: P40
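
For reference, a TensorRT-enabled Windows build is typically invoked along the lines of the sketch below. This is illustrative only, not the exact command that was run; the CUDA, cuDNN, and TensorRT paths are placeholders, and the flag names follow the linked build instructions, so adjust them if they have since changed.

.\build.bat --config Debug --parallel --use_tensorrt --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1" --cudnn_home "C:\tools\cudnn" --tensorrt_home "C:\tools\TensorRT-7.1.3.4"

build.py then runs the unit tests through CTest after the build (unless --skip_tests is passed), which is where the onnxruntime_test_all failure above is reported.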

Thanks in advance

@stevenlix
Contributor

We've upgraded TensorRT to 7.1.3.4. Please download that package and use it with CUDA 11 and cuDNN 8 on Windows.

@mayani-nv
Author

@stevenlix Thanks for the suggestion. I was already using TensorRT 7.1.3.4 and got the error I posted in the previous message. So do you mean that upgrading from CUDA 10.2 to CUDA 11 and from cuDNN 7.6.5 to cuDNN 8 will resolve this error?

@RandySheriffH added the ep:TensorRT (issues related to TensorRT execution provider) label Sep 21, 2020
@ppyun

ppyun commented Sep 21, 2020

@stevenlix, thanks for your reply, Steven. Mohit (@mayani-nv) at Nvidia is trying to build ONNXRT on Windows. His question is whether ORT-TRT with TensorRT 7.1.3.4, CUDA 10.2, and cuDNN 7.6.5 on Windows is verified as working or not.

@jywu-msft
Member

jywu-msft commented Sep 21, 2020

"His question is whether ORT-TRT with TensorRT 7.1.3.4, CUDA 10.2, and cuDNN 7.6.5 on Windows is verified as working or not."

On Windows, I thought TensorRT 7.1.3.4 was only available with CUDA 11.0 (at least that is what the Nvidia download page indicates), and that is the configuration we've tested.

@mayani-nv
Author

@jywu-msft and @stevenlix, thanks for the suggestion. I updated to CUDA 11.0, cuDNN 8.0, and TensorRT 7.1.3.4 and still got the same error:

67% tests passed, 1 tests failed out of 3

Total Test time (real) = 521.52 sec

The following tests FAILED:
          1 - onnxruntime_test_all (Failed)
Errors while running CTest
Traceback (most recent call last):
  File "C:\Users\AzureUser\onnxruntime\\tools\ci_build\build.py", line 1800, in <module>
    sys.exit(main())
  File "C:\Users\AzureUser\onnxruntime\\tools\ci_build\build.py", line 1741, in main
    run_onnxruntime_tests(args, source_dir, ctest_path, build_dir, configs)
  File "C:\Users\AzureUser\onnxruntime\\tools\ci_build\build.py", line 1195, in run_onnxruntime_tests
    run_subprocess(ctest_cmd, cwd=cwd, dll_path=dll_path)
  File "C:\Users\AzureUser\onnxruntime\\tools\ci_build\build.py", line 446, in run_subprocess
    env=my_env, shell=shell)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\lib\subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['C:\\Program Files\\CMake\\bin\\ctest.EXE', '--build-config', 'Debug', '--verbose']' returned non-zero exit status 8.

@jywu-msft
Member

Can you attach the full output log?

@kzuiderveld

kzuiderveld commented Sep 22, 2020

Please note that I submitted a bug report weeks ago that TensorRT was broken on Windows with the latest NVIDIA binaries (#4841). I have been waiting for a follow-up.

Can someone confirm that TensorRT 7.1.3.4/Cuda11.0/CuDnn8 is now working correctly on Windows?

@mayani-nv
Author

@jywu-msft The following is the full output log.
log.txt

@snnn
Member

snnn commented Sep 23, 2020

" ----- MEMORY LEAKS: 216 bytes of memory leaked in 6 allocations"

That's the reason.

Please try running 'onnxruntime_test_all' in Visual Studio; it will give you more information.
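
If the Visual Studio debugger is inconvenient, the gtest binary can also be run directly from the build output directory as a first step to narrow things down. This is only a sketch: the path matches the one in the log above, and the --gtest_filter pattern is just an illustration of the standard gtest option. Running it under Visual Studio, as suggested above, is what gives the extra per-allocation detail.

cd C:\Users\AzureUser\onnxruntime\build\Windows\Debug\Debug
onnxruntime_test_all.exe --gtest_filter=*Tensorrt*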

@mayani-nv
Author

@snnn Thanks for the suggestion. I tried running 'onnxruntime_test_all' in Visual Studio and got the following information:
vslog.txt

@snnn
Member

snnn commented Sep 25, 2020

@mayani-nv Thank you for the valuable information; we'll start fixing these.

@Linux13524

" ----- MEMORY LEAKS: 216 bytes of memory leaked in 6 allocations"

I have the exact same error message, but I'm not building with TensorRT:

.\build.bat --parallel --config Debug --build_dir build --cmake_generator "Visual Studio 16 2019" 

full.log

@snnn
Member

snnn commented Nov 4, 2020

Please try running 'onnxruntime_test_all' in Visual Studio; it will tell us which allocation leaked.

@snnn
Member

snnn commented Nov 4, 2020

@Linux13524 Could you please give us the onnxruntime commit id you used in the build?

@Linux13524

@snnn Here is the log from Visual Studio: vs.log

The commit id is 5de47af, which is v1.5.1.

@snnn self-assigned this Nov 5, 2020
@snnn
Member

snnn commented Nov 5, 2020

Hi @Linux13524

Thank you. I can reproduce the error, but I think the latest master is fine. As a workaround, you may go to https://github.com/microsoft/onnxruntime/blob/master/tools/ci_build/build.py#L965 and set "-Donnxruntime_ENABLE_MEMLEAK_CHECKER=OFF" for all the cases.
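
If you would rather not edit build.py in place, the same define can usually be passed on the command line through build.py's --cmake_extra_defines option. A sketch, reusing the build command from above; note that build.py also sets this define itself at the linked line, so verify that the OFF value is the one CMake actually picks up (editing the script as suggested is the surest route):

.\build.bat --parallel --config Debug --build_dir build --cmake_generator "Visual Studio 16 2019" --cmake_extra_defines onnxruntime_ENABLE_MEMLEAK_CHECKER=OFF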

@Linux13524

@snnn Thanks a lot for the info! I will try the latest master and see if it works; otherwise I will use your workaround.

@Linux13524

Latest master works. Thanks again @snnn!

@snnn closed this as completed Nov 9, 2020