
Error running quantized onnx model #1543

Closed
hossein1387 opened this issue Aug 1, 2019 · 15 comments
Comments

@hossein1387

hossein1387 commented Aug 1, 2019

Describe the bug
Using the quantization tool, I quantized VGG.onnx and got VGG_Quant.onnx. However, when I try to run the quantized model I get:

RuntimeError: [ONNXRuntimeError] : 1 : GENERAL ERROR : Load model from VGG_Quant.onnx failed:[ShapeInferenceError] Incompatible dimensions

Running the original onnx model (VGG.onnx) with the same setup (same dataset) does not produce any error. The error occurs when I try to create an InferenceSession; here is how I run my code:

import onnxruntime as onnxrt

options = onnxrt.SessionOptions()
sess = onnxrt.InferenceSession('VGG_Quant.onnx', options)
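
For reference, the same failure can usually be surfaced with onnx's own shape inference, without going through onnxruntime (a minimal sketch, assuming VGG_Quant.onnx is in the working directory; whether a broken graph raises or only warns here depends on the onnx version):

import onnx
from onnx import checker, shape_inference

# Validate the quantized model and run ONNX shape inference on it;
# an inconsistent graph should surface a ShapeInferenceError here as well.
quant_model = onnx.load('VGG_Quant.onnx')
checker.check_model(quant_model)
inferred = shape_inference.infer_shapes(quant_model)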

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04.5 LTS x86_64 GNU/Linux
  • ONNX Runtime installed from (source or binary): pip
  • ONNX Runtime version: 0.4.0
  • Python version: Python 3.7.3

The original model was downloaded from the PyTorch model zoo and then converted to ONNX (the converted model again runs perfectly fine with onnxruntime).
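
For context, a typical way such an fp32 ONNX file is produced from a torchvision model looks like this (a sketch; the exact architecture and input size used for this issue are assumptions):

import torch
import torchvision

# Export a pretrained VGG from the PyTorch model zoo to ONNX.
model = torchvision.models.vgg16(pretrained=True)
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)  # NCHW dummy input for tracing
torch.onnx.export(model, dummy_input, 'VGG.onnx')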

@faxu faxu added the bug label Aug 1, 2019
@askhade askhade self-assigned this Aug 1, 2019
@askhade
Contributor

askhade commented Aug 1, 2019

@hossein1387 Can you share the fp32 onnx model and the quantization script input params you chose while quantizing the model?

@hossein1387
Author

Here is the original fp32 model, and here is the quantized version. I quantized the model using the following code:

import onnx
from quantize import quantize, QuantizationMode

# Load the onnx model
model = onnx.load('VGG.onnx')
# Quantize
quantized_model = quantize(model, quantization_mode=QuantizationMode.IntegerOps)
# Save the quantized model
onnx.save(quantized_model, 'VGG_Quant.onnx')

@hossein1387 hossein1387 changed the title Error running quantized onnx Error running quantized onnx model Aug 1, 2019
@hossein1387
Author

I think there is a serious bug in the quantizer module. I trained an MLP model, passed the trained model through the quantizer, and realized the quantizer did not do anything. After going through the code I found out that the quantizer only quantizes Conv and MatMul nodes. I don't know why a Gemm operator (which can be found in MLP and FC layers) is not treated as a MatMul. Anyway, I then trained a network with only one Conv layer and one FC layer. The following shows the graph of my network:

(image: graph of the fp32 model)

(image: graph of the quantized model)

On top is the fp32 version and the bottom graph shows the quantized version. The model passed through the quantizer successfully, but again I was unable to run it with ONNX Runtime and got the same error as before.
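
One way to check which node types the quantizer actually rewrote is to count the operator types before and after quantization (a minimal sketch; file names follow the script above):

import onnx
from collections import Counter

# Compare operator histograms: quantized ops such as ConvInteger/MatMulInteger
# should appear in the quantized graph, while untouched ops (e.g. Gemm) remain.
for path in ('VGG.onnx', 'VGG_Quant.onnx'):
    graph = onnx.load(path).graph
    print(path, Counter(node.op_type for node in graph.node))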

@askhade
Contributor

askhade commented Aug 2, 2019

@hossein1387 : Thanks for the detailed info...
We are working towards strengthening support for quantization including the quantization tooling.

I will update the quantization script to include GEMM as well and will update you once I root cause the shape inference bug in the quantized model.

@askhade
Contributor

askhade commented Aug 6, 2019

@hossein1387 : The reason for the shape inference failure is an invalid default value in the quantize script which is not supported by the runtime yet... We do plan to add per-channel quantization support but it is not available today...
My PR referenced above should resolve this issue; in the meanwhile you can also try this instead:
quantized_model = quantize(model, quantization_mode=QuantizationMode.IntegerOps, per_channel=False)
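
Applied to the quantization script from earlier in this thread, the suggested workaround would look like this (a sketch based on the call above):

import onnx
from quantize import quantize, QuantizationMode

model = onnx.load('VGG.onnx')
# Disable per-channel quantization until the runtime supports it.
quantized_model = quantize(model,
                           quantization_mode=QuantizationMode.IntegerOps,
                           per_channel=False)
onnx.save(quantized_model, 'VGG_Quant.onnx')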

@hossein1387
Author

hossein1387 commented Aug 6, 2019

Thanks for the update. I ran two models with and without quantization; both models are VGG-like and both use CIFAR100. Here are some results:

Model1      FP32      Quantized
Accuracy    72.28%    72.27%
Exec time   14.4 ms   53.9 ms
Size        77 MB     19 MB

Model2      FP32      Quantized
Accuracy    73.99%    73.97%
Exec time   7.6 ms    50.29 ms
Size        20 MB     5.2 MB

I don't understand why the quantized version takes so much longer than the original model. Shouldn't the 8-bit quantized model take less time than the FP32 model?
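
For reference, a minimal sketch of how this kind of latency comparison can be measured with onnxruntime (the input name and shape are assumptions and depend on how the model was exported):

import time
import numpy as np
import onnxruntime as onnxrt

def average_latency_ms(model_path, input_name='input', shape=(1, 3, 32, 32), runs=100):
    # Run repeated single-image inferences and return the mean latency in ms.
    sess = onnxrt.InferenceSession(model_path)
    x = np.random.rand(*shape).astype(np.float32)
    start = time.time()
    for _ in range(runs):
        sess.run(None, {input_name: x})
    return (time.time() - start) / runs * 1000.0

for path in ('VGG.onnx', 'VGG_Quant.onnx'):
    print(path, round(average_latency_ms(path), 2), 'ms')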

@hossein1387
Author

@askhade any idea why I am getting these results?

@askhade
Contributor

askhade commented Aug 21, 2019

@hossein1387 : which platform are you running on? We don't have optimized kernel support for Windows yet... this work is in progress. On Linux the perf should be better than on Windows, but we only support single-threaded kernels...

@faxu
Contributor

faxu commented Sep 9, 2019

@hossein1387 could you provide more info on this if you still require assistance?

@faxu faxu added the pending label Sep 9, 2019
@hossein1387
Author

hossein1387 commented Sep 11, 2019

Thanks for your responses @askhade @faxu.
Here is my platform information:

python version: 3.7.3
python build version: ('default', 'Mar 27 2019 22:11:17')
python compiler version: GCC 7.3.0
python implementation: CPython
os: Linux
os kernel version: #201806252030 SMP Tue Jun 26 00:33:17 UTC 2018
os release version: 4.17.3-041703-generic
os platform: Linux-4.17.3-041703-generic-x86_64-with-debian-stretch-sid
linux distribution: Debian
uname: uname_result(system='Linux', node='TANDEM-TL0275U', release='4.17.3-041703-generic', version='#201806252030 SMP Tue Jun 26 00:33:17 UTC 2018', machine='x86_64', processor='x86_64')
architecture: ('64bit', '')
machine: x86_64

When I check my quantized model, the ONNX graph has many more operations compared to the original floating-point model. I am not sure why that is, and why we can't just use 8-bit operations/operators, but as a result of this design choice the 8-bit model's execution time is much higher than the floating-point model's. It would be awesome if someone could explain these design choices to me.

@WilliamZhaoz

I have the same problem and the same question. How can I get inference acceleration with onnxruntime?

@faxu faxu removed the pending label Oct 1, 2019
@hariharans29 hariharans29 added the quantization issues related to quantization label Oct 10, 2019
@askhade
Contributor

askhade commented Oct 30, 2019

@hossein1387 :
Regarding the extra node additions: ONNX does not have a lot of quantized operators yet, so we need to resort to FP32-to-8-bit conversions in between... We are planning to add more ops to the quantized ops list, which will improve this situation, and we are also adding fusions to fuse these extra nodes into single nodes... Both of these will reduce the number of ops we add.

Regarding achieving acceleration with quantized models... As part of the ORT 1.0 release we added optimized kernels for MatMul operations; however, optimized kernel work for convolutions is still in progress... Once this is done, the model should see a significant speedup compared to today...

@hossein1387
Author

@askhade Thank you very much for your response.

@askhade askhade closed this as completed Nov 6, 2019
@gbolin

gbolin commented Apr 21, 2020

(quoting @hossein1387's benchmark results above)

I have the same issue; the quantized model is about twice as slow and takes more time.

@datpham270198

ONNX quantized models are still very slow right now. Has this issue been solved? Please reply. Thanks a lot. @askhade
