
Error running quantized onnx model #1543

Closed
hossein1387 opened this issue Aug 1, 2019 · 15 comments
Comments

@hossein1387

hossein1387 commented Aug 1, 2019

Describe the bug
Using the quantization tool, I quantized VGG.onnx and got VGG_Quant.onnx. However, when I try to run the quantized model I get:

RuntimeError: [ONNXRuntimeError] : 1 : GENERAL ERROR : Load model from VGG_Quant.onnx failed:[ShapeInferenceError] Incompatible dimensions

Running the original onnx model (VGG.onnx) with the same setup (same dataset) does not produce any error. The error occurs when I try to create an InferenceSession; here is how I run my code:

import onnxruntime as onnxrt

options = onnxrt.SessionOptions()
sess = onnxrt.InferenceSession('VGG_Quant.onnx', options)
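
For reference, the same failure can usually be surfaced with onnx's own shape inference, without going through onnxruntime (a minimal sketch, assuming VGG_Quant.onnx is in the working directory; whether a broken graph raises or only warns here depends on the onnx version):

import onnx
from onnx import checker, shape_inference

# Validate the quantized model and run ONNX shape inference on it;
# an inconsistent graph should surface a ShapeInferenceError here as well.
quant_model = onnx.load('VGG_Quant.onnx')
checker.check_model(quant_model)
inferred = shape_inference.infer_shapes(quant_model)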

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04.5 LTS x86_64 GNU/Linux
  • ONNX Runtime installed from (source or binary): pip
  • ONNX Runtime version: 0.4.0
  • Python version: Python 3.7.3

The original model was downloaded from the PyTorch model zoo and then converted to ONNX (the converted model again runs perfectly fine with onnxruntime).
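
For context, a typical way such an fp32 ONNX file is produced from a torchvision model looks like this (a sketch; the exact architecture and input size used for this issue are assumptions):

import torch
import torchvision

# Export a pretrained VGG from the PyTorch model zoo to ONNX.
model = torchvision.models.vgg16(pretrained=True)
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)  # NCHW dummy input for tracing
torch.onnx.export(model, dummy_input, 'VGG.onnx')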

@faxu faxu added the bug label Aug 1, 2019
@askhade askhade self-assigned this Aug 1, 2019
@askhade
Contributor

askhade commented Aug 1, 2019

@hossein1387 Can you share the fp32 onnx model and the quantization script input params you chose while quantizing the model?

@hossein1387
Author

Here is the original fp32 model, and here is the quantized version. I quantized the model using the following code:

import onnx
from quantize import quantize, QuantizationMode

# Load the onnx model
model = onnx.load('VGG.onnx')
# Quantize
quantized_model = quantize(model, quantization_mode=QuantizationMode.IntegerOps)
# Save the quantized model
onnx.save(quantized_model, 'VGG_Quant.onnx')

@hossein1387 hossein1387 changed the title Error running quantized onnx Error running quantized onnx model Aug 1, 2019
@hossein1387
Author

I think there is a serious bug in the quantizer module. I trained an MLP model, passed the trained model through the quantizer, and realized the quantizer did not do anything. After going through the code I found out that the quantizer only quantizes Conv and MatMul nodes. I don't know why a Gemm operator (which can be found in MLP and FC layers) is not treated as a MatMul. Anyway, I then trained a network with only one Conv layer and one FC layer. The following shows the graph of my network:

(image: graph of the fp32 model)

(image: graph of the quantized model)

On top is the fp32 version and the bottom graph shows the quantized version. The model passed through the quantizer successfully, but again I was unable to run it with ONNX Runtime and got the same error as before.
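
One way to check which node types the quantizer actually rewrote is to count the operator types before and after quantization (a minimal sketch; file names follow the script above):

import onnx
from collections import Counter

# Compare operator histograms: quantized ops such as ConvInteger/MatMulInteger
# should appear in the quantized graph, while untouched ops (e.g. Gemm) remain.
for path in ('VGG.onnx', 'VGG_Quant.onnx'):
    graph = onnx.load(path).graph
    print(path, Counter(node.op_type for node in graph.node))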

@askhade
Contributor

askhade commented Aug 2, 2019

@hossein1387 : Thanks for the detailed info...
We are working towards strengthening support for quantization including the quantization tooling.

I will update the quantization script to include GEMM as well and will update you once I root cause the shape inference bug in the quantized model.

@askhade
Contributor

askhade commented Aug 6, 2019

@hossein1387 : The reason for the shape inference failure is an invalid default value in the quantize script which is not supported by the runtime yet... We do plan to add per-channel quantization support but it is not available today...
My PR referenced above should resolve this issue; in the meanwhile you can also try this instead:
quantized_model = quantize(model, quantization_mode=QuantizationMode.IntegerOps, per_channel=False)
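
Applied to the quantization script from earlier in this thread, the suggested workaround would look like this (a sketch based on the call above):

import onnx
from quantize import quantize, QuantizationMode

model = onnx.load('VGG.onnx')
# Disable per-channel quantization until the runtime supports it.
quantized_model = quantize(model,
                           quantization_mode=QuantizationMode.IntegerOps,
                           per_channel=False)
onnx.save(quantized_model, 'VGG_Quant.onnx')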

@hossein1387
Author

hossein1387 commented Aug 6, 2019

Thanks for the update. I ran two models with and without quantization; both models are VGG-like and both use CIFAR100. Here are some results:

Model1      FP32      Quantized
Accuracy    72.28%    72.27%
Exec time   14.4 ms   53.9 ms
Size        77 MB     19 MB

Model2      FP32      Quantized
Accuracy    73.99%    73.97%
Exec time   7.6 ms    50.29 ms
Size        20 MB     5.2 MB

I don't understand why the quantized version takes so much longer than the original model. Shouldn't the 8-bit quantized model take less time than the FP32 model?
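
For reference, a minimal sketch of how this kind of latency comparison can be measured with onnxruntime (the input name and shape are assumptions and depend on how the model was exported):

import time
import numpy as np
import onnxruntime as onnxrt

def average_latency_ms(model_path, input_name='input', shape=(1, 3, 32, 32), runs=100):
    # Run repeated single-image inferences and return the mean latency in ms.
    sess = onnxrt.InferenceSession(model_path)
    x = np.random.rand(*shape).astype(np.float32)
    start = time.time()
    for _ in range(runs):
        sess.run(None, {input_name: x})
    return (time.time() - start) / runs * 1000.0

for path in ('VGG.onnx', 'VGG_Quant.onnx'):
    print(path, round(average_latency_ms(path), 2), 'ms')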

@hossein1387
Author

@askhade any idea why I am getting these results?

@askhade
Contributor

askhade commented Aug 21, 2019

@hossein1387 : which platform are you running on? We don't have optimized kernel support for Windows yet... this work is in progress. On Linux the perf should be better than on Windows, but we only support single-threaded kernels...

@faxu
Contributor

faxu commented Sep 9, 2019

@hossein1387 could you provide more info on this if you still require assistance?

@faxu faxu added the pending label Sep 9, 2019
@hossein1387
Author

hossein1387 commented Sep 11, 2019

Thanks for your responses @askhade @faxu.
Here is my platform information:

python version: 3.7.3
python build version: ('default', 'Mar 27 2019 22:11:17')
python compiler version: GCC 7.3.0
python implementation: CPython
os: Linux
os kernel version: #201806252030 SMP Tue Jun 26 00:33:17 UTC 2018
os release version: 4.17.3-041703-generic
os platform: Linux-4.17.3-041703-generic-x86_64-with-debian-stretch-sid
linux distribution: Debian
uname: uname_result(system='Linux', node='TANDEM-TL0275U', release='4.17.3-041703-generic', version='#201806252030 SMP Tue Jun 26 00:33:17 UTC 2018', machine='x86_64', processor='x86_64')
architecture: ('64bit', '')
machine: x86_64

When I check my quantized model, the ONNX graph has many more operations compared to the original floating-point model. I am not sure why that is, and why we can't just use 8-bit operations/operators, but as a result of this design choice the 8-bit model's execution time is much higher than the floating-point model's. It would be awesome if someone could explain these design choices to me.

@WilliamZhaoz

I have the same problem and the same question. How can I get inference acceleration with onnxruntime?

@faxu faxu removed the pending label Oct 1, 2019
@hariharans29 hariharans29 added the quantization issues related to quantization label Oct 10, 2019
@askhade
Contributor

askhade commented Oct 30, 2019

@hossein1387 :
Regarding the extra node additions: ONNX does not have a lot of quantized operators yet, so we need to resort to FP32-to-8-bit conversions in between... We are planning to add more ops to the quantized ops list, which will improve this situation, and we are also adding fusions to fuse these extra nodes into single nodes... Both of these will reduce the number of ops we add.

Regarding achieving acceleration with quantized models... As part of the ORT 1.0 release we added optimized kernels for MatMul operations; however, optimized kernel work for convolutions is still in progress... Once this is done, the model should see a significant speedup compared to today...

@hossein1387
Author

@askhade Thank you very much for your response.

@askhade askhade closed this as completed Nov 6, 2019
@gbolin

gbolin commented Apr 21, 2020

(quoting @hossein1387's benchmark results above)

I have the same issue; the quantized model is about twice as slow and takes more time.

@datpham270198

ONNX quantized models are still very slow right now. Has this issue been solved? Please reply. Thanks a lot. @askhade
