TensorRT 7.1.3 / Cuda11 / CuDNN 8 - supported yet? #4841
Could you please provide more information? We plan to release ORT 1.5 with TensorRT 7.1, it should work. |
ONNX Runtime version: a3c953. Attached is one of the output channels. Visual inspection shows the results are broadly similar, but they differ noticeably in places, yielding different final results. The only change in the code is switching from the default provider to the TensorRT provider, so the differences originate inside ONNX Runtime or TensorRT. TensorRT fp16 and fp32 gave virtually identical results (both different from the CPU version). Any suggestions on how to troubleshoot this most effectively? |
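One way to make "similar, but quite different in places" concrete is to compare the two providers' outputs numerically instead of visually. Below is a minimal sketch of such a comparison helper; it is pure NumPy, and the arrays at the bottom are synthetic stand-ins for outputs fetched from CPU and TensorRT sessions (the model, input, and session setup are assumed and not shown). The tolerances are illustrative, not official values.

```python
import numpy as np

def compare_outputs(ref, test, rtol=1e-3, atol=1e-4):
    """Report how far `test` deviates from `ref`, element-wise."""
    ref = np.asarray(ref, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    abs_err = np.abs(ref - test)
    # Guard the denominator so relative error is defined near zero.
    rel_err = abs_err / np.maximum(np.abs(ref), atol)
    # Count elements outside the combined absolute/relative tolerance.
    mismatched = int(np.sum(abs_err > atol + rtol * np.abs(ref)))
    return {
        "max_abs_err": float(abs_err.max()),
        "max_rel_err": float(rel_err.max()),
        "mismatched": mismatched,
        "total": int(ref.size),
    }

# Synthetic example: a tiny perturbation, well within tolerance.
ref = np.linspace(-1.0, 1.0, 1000)
close = ref + 1e-6
report = compare_outputs(ref, close)
print(report["mismatched"])  # prints 0
```

A genuine provider bug typically shows up as a large `max_abs_err` on a meaningful fraction of elements, whereas ordinary fp16/fp32 rounding stays within a small relative tolerance.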
@stevenlix Do you know why? |
we're unaware of a general systemic issue with ORT + TensorRT EP. |
Hello, I redid my port to TensorRT 7.1.3.4 - I don't believe it is working correctly; lots of errors suggest that tests are failing. Comments? I uploaded the output of onnxruntime_test_all.exe here: https://1drv.ms/u/s!AtdjyUUyTjd8_RzBG2s9B4RMYlt9?e=S7EVyA Here are my installation notes:
83% tests passed, 1 tests failed out of 6 Total Test time (real) = 1322.91 sec The following tests FAILED: F:\onnxruntime> |
Thanks for the details. |
@jywu-msft Yes, I think we should. |
FYI, a colleague of mine also reported this kind of issue (i.e., a mismatch far beyond floating-point error) with the TensorRT provider. |
for a specific model? or is it the same case here, where the unit tests fail for certain environments? |
can you re-test with the latest master? |
@jywu-msft |
can you please be more specific. |
I was specific in the bug report I originally filed. Here is the output of a test run using the latest branch: https://1drv.ms/u/s!AtdjyUUyTjd8gP44fraVGsr0N4_tUQ?e=d3fSjl Same issue remains: TensorRT does not give accurate results and is unusable. The tests confirm this. |
I examined your logfile. Yes, it does seem like a lot of tests are failing with incorrect results. |
@jywu-msft |
thanks for confirming. |
While the tests are successful now, the TensorRT path gives results that differ significantly from its CUDA and CPU counterparts. We're not there yet.
Karel
From: George Wu <[email protected]>
Sent: Wednesday, September 23, 2020 16:12
To: microsoft/onnxruntime <[email protected]>
Cc: kzuiderveld <[email protected]>; State change <[email protected]>
Subject: Re: [microsoft/onnxruntime] TensorRT 7.1.3 / Cuda11 / CuDNN 8 - supported yet? (#4841)
@jywu-msft
Ding, ding, ding! Yes, that environment variable was enabled and set to 1. Setting it to zero fixed the issue.
thanks for confirming.
yes, that feature is experimental and still has some kinks. we will fix it.
@stevenlix, @chilo-ms FYI
—
You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
is this running a specific model? |
I'm reopening this issue as the TensorRT results are still incorrect. |
how do we gain access to the model or a stripped down repro test case? |
I can provide the model if it is kept confidential.
On Sep 28, 2020, at 17:37, George Wu ***@***.***> wrote:
how do we gain access to the model or a stripped down repro test case?
maybe the latter would be easier if you don't want to share your model.
|
It would be great if you could share your model for us to debug the issue. In the meantime, I wonder if you could try ORT rel-1.4.0 (integrated with TRT 7.0) to see if the accuracy issue is there as well. |
we started a thread off-line about model sharing. |
I tried the model with ORT + TRT 7.0 and there is no accuracy issue, so this seems to be a regression in TRT 7.1 itself. We will work with Nvidia on further investigation. |
Update: we found the issue may be related to opset 11 support in TRT 7.1. When the model is converted to opset 10, the accuracy issue seems to go away. Could you convert your model back to opset 10 and try again? |
Thanks for the update, wonderful to hear that something was broken indeed. I'll try again ASAP. |
Our scientists say that converting to opset 10 is not possible without changing the UNet model (hardcoding the input values of the padding function call). That would be our plan B.
The issue is still under investigation by Nvidia. |
The issue has been confirmed by Nvidia; it is caused by the opset 11 Resize operator. There is a PR (#5442) in ORT that includes the fix. Please try this PR, or pull master after it's merged. I've tested the model using random data and the accuracy issue seems to be gone. |
That's awesome news. I'll wait until #5442 is merged into master and then test whether the issue has been resolved. |
The good news: the issue I reported has been fixed, and TensorRT now yields results similar to CUDA and CPU. The bad news: the TensorRT implementation is about 2.6x slower than the CUDA implementation (Quadro RTX 4000, fp16). When I tried TensorRT earlier this year, I saw a significant performance improvement over CUDA that has now completely vanished. What could the root cause of this performance degradation be? Does the fix disable part of the TensorRT path and fall back to the CPU (or perhaps CUDA) provider? |
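When comparing providers like this, it helps to time steady-state runs only, since TensorRT builds its engine during the first calls and a benchmark that includes them will look far slower than it is. A minimal, provider-agnostic timing sketch; `run` stands in for a bound `session.run(...)` call, which is assumed and not shown:

```python
import statistics
import time

def bench(run, warmup=5, iters=50):
    """Median latency in seconds of `run()` after `warmup` discarded calls."""
    for _ in range(warmup):
        run()  # discarded: lets TensorRT finish engine building, caches warm
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

# Usage with a dummy workload standing in for session.run:
lat = bench(lambda: sum(range(10_000)))
print(lat > 0.0)  # True
```

If the steady-state gap persists, verbose session logs are the next place to look: they generally show which nodes each execution provider was assigned, so a partition that quietly fell back to CUDA or CPU becomes visible.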
@stevenlix Any update on a fix from NVIDIA? I noticed that TRT 7.2.1 arrived... |
Does ONNX Runtime compile with TRT 7.2.1? |
@stevenlix and others This issue now has been resolved. |
great. thanks for the update! |
|
Hi,
I pulled the latest ONNX Runtime version and compiled it against the latest NVIDIA libraries (see title). The tests passed, except for the Python test (which is expected).
Unfortunately, the TensorRT path gives significantly different results from the CPU path (the differences are visible on visual inspection of the activation output).
Are there currently known compatibility issues with the latest TensorRT/cuDNN/CUDA libraries? If so, is there an ETA for a fix? If not, is there someone I can work with to troubleshoot this further?
Karel