How to modify Detect layer to allow for converting yolov5 to Qualcomm's SNPE format? #4790
@evdoks I haven't used the SNPE converter myself, so I can't help directly, but I do see Qualcomm compatibility with YOLOv5 officially mentioned here in the Snapdragon Neural Engine SDK release notes from March 2021:
@glenn-jocher Thanks for the link. I saw this, and this is why I was hoping the conversion would work. However, I was not able to find anyone who could successfully do it on a trained yolo model, and there are questions on Qualcomm's dev forum from people hitting the same wall. The conversion works if one removes the Detect layer (e.g. by exporting with --train).
@evdoks I think you are not understanding --train. All models export with all layers; there are no circumstances in which export omits the Detect layer.
@glenn-jocher, you are right, I expressed it incorrectly, but my understanding is that when using --train, the inference-time decoding part of the Detect layer is skipped.
@evdoks yes in --train mode the grid for inference output is not constructed (as it is not needed for loss computation), so there's something isolated in that area that is causing the issue. The 5D reshape is still present in --train mode though on L55, so it's probably not the source of the problem. You might try turning self.inplace on or off to see if it has an effect. Lines 50 to 71 in b74dd4b
@glenn-jocher thanks for looking into it, but it didn't help. Neither exporting the model to onnx with --train nor toggling self.inplace made a difference. Qualcomm's dev forum seems to be a dead place - some people have already posted questions there regarding yolov5 compatibility but got no response.
@evdoks Hi, did you manage to solve this in the end?
@jayer95 unfortunately not. Switched to ResNet (which totally sucks). Let us know here if you get any breakthroughs. Qualcomm keeps updating the converter, but I haven't noticed anything that could be relevant to the issue with YOLO in the release notes of the latest versions.
@evdoks I don't quite understand how the "--train" parameter proposed by the author of YOLOv5 shields the 5D layers of the network model. @glenn-jocher
@evdoks Detect() does not have a --train mode itself; --train simply exports the model in training mode, so the decoding below is skipped: Lines 50 to 71 in 562191f
@glenn-jocher Could YOLOv5 officially support exporting to Qualcomm's DLC format?
@jayer95 sorry, I don't actually know what DLC is. We have a mobile developer who's working on our upcoming HUB app, but for Android we are using established TFLite export workflows and not yet targeting specific backends. What's the benefit of going this export route? Does it provide better access to NNAPI or Hexagon delegates? If the main issue is simply the 5D nature of the tensors in Detect, there's certainly a workaround you could do to handle the reshaping/permutation ops differently. You'd want to create an x-y fused grid (1d rather than 2d), and then of course also create the offsets/gains you need in 1d rather than 2d; then your tensor would be 4d (batch, anchors, xy, outputs).
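A minimal sketch of one way to realize that fused-grid idea (my own assumptions, not code from the repo; requires PyTorch >= 1.10 for meshgrid's indexing argument, and the shapes below are for a stride-8 head of a 640-input yolov5s):

```python
import torch

bs, na, no, ny, nx = 1, 3, 85, 80, 80    # batch, anchors, outputs, grid h/w
x = torch.randn(bs, na * no, ny, nx)     # raw conv output of one detection head

# Keep everything 4D by fusing the two spatial dims into a single axis.
x4 = x.view(bs, na, no, ny * nx).transpose(2, 3)   # (bs, na, ny*nx, no)

# Offsets as a fused 1D grid instead of the usual 2D (ny, nx) grid.
yv, xv = torch.meshgrid(torch.arange(ny), torch.arange(nx), indexing="ij")
grid1d = torch.stack((xv, yv), 2).reshape(1, 1, ny * nx, 2).float()

stride = 8.0
xy = (x4[..., 0:2].sigmoid() * 2 - 0.5 + grid1d) * stride   # decoded box centers
```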
@evdoks @jayer95 It is possible to convert yolov5 to .dlc format. You'd need to use the version 3.1 yolov5s and specify the output nodes as the convolution layer outputs before the 5D reshape. Check out the SNPE release notes, page 20.
I came as far as getting the 4D output in the NativeCpp settings but made zero progress on extracting inferences. Has anyone made any progress?
Hello! May I ask how to convert a yolov5 .pt model file into dlc? Thank you very much.
@wwxzxd sorry, what is dlc?
@jayer95 got it, thanks! @wwxzxd @jayer95 @evdoks The main step we could take here would be to try to add official Snapdragon dlc export support to export.py. We currently support 10 different model formats, and there is a system in place for export and inference of each. From TFLite, ONNX, CoreML, TensorRT Export #251: YOLOv5 export is supported for the following formats:
The fastest and easiest way to incorporate your ideas into the official codebase is to submit a Pull Request (PR) implementing your idea, and if applicable providing before and after profiling/inference/training results to help us understand the improvement your feature provides. This allows us to directly see the changes in the code and to understand how they affect workflows and performance. Please see our ✅ Contributing Guide to get started. Thank you!
@glenn-jocher thank you for your reply. I'm somewhat relieved to know I'm not alone in this search. The models are converted to .dlc format via snpe tools (https://developer.qualcomm.com/sites/default/files/docs/snpe/tools.html). I've tried to convert the yolov5.pb model by exporting to onnx and tensorflow models. The issue arises when the converters reach the following line in yolo.py (the 5D view/permute in Detect). It seems SNPE has a problem converting the permute function. Could this line be rewritten without using the permute function? I think as long as we pass this part, we will have the dlc model.
FYI @hansoullee20, I guess one workaround for this is just to remove this line in the Detect module: Lines 53 to 54 in db6ec66
@zhiqwang Thanks - what exactly does the exported model lose if we set --train?
Seems that it will remove the decoding part (Lines 56 to 68 in db6ec66) only and return a list containing the 3 intermediate heads if you set --train like the following:

```
python export.py --weights path/to/your/model.pt --include onnx --simplify --train
```

And if SNPE doesn't support the 5D reshape, specifying the output nodes before it (as mentioned above) should also work.
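As a quick check of what the --train export actually produces, something like this works (a sketch, assuming onnxruntime is installed; the file name is whatever your export produced, and the shapes are for a 640x640 yolov5s):

```python
# Minimal sketch: list the outputs of the --train export to confirm it ends in
# the three raw heads rather than the decoded 1x25200x85 tensor.
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")  # path from your own export
for out in sess.get_outputs():
    print(out.name, out.shape)
# Expected for a 640x640 yolov5s --train export: three 5D heads, e.g.
# (1, 3, 80, 80, 85), (1, 3, 40, 40, 85), (1, 3, 20, 20, 85)
```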
So, good news! Seems like yolov5 is now compatible with SNPE! Pull from the master branch, export to onnx, and convert to dlc without specifying out_node. Would appreciate any inputs on how to proceed from here in SNPE :)
At present, yolov5 v6.0 can be converted to SNPE correctly.

onnx==1.6.0

```
git clone https://github.com/ultralytics/yolov5.git
python export.py --weights yolov5n.pt --optimize --opset 11 --simplify
```

Please use Netron to view the exported yolov5n.onnx; you will find that the layers above the 5D output nodes are the 4D output nodes Conv_198, Conv_232, Conv_266, and their output names are 326, 379, 432, so we need to specify these 3 output nodes when converting yolov5n.dlc. But at present, a program is still needed to demo the converted yolov5n.dlc.
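If you'd rather locate those nodes programmatically than by clicking through Netron, a sketch like this works (assuming the onnx Python package; node names like Conv_198/326 differ between exports):

```python
import onnx

model = onnx.load("yolov5n.onnx")
producers = {out: node for node in model.graph.node for out in node.output}

# Each detection head ends Conv -> (5D) Reshape; report the Conv feeding each Reshape.
for node in model.graph.node:
    if node.op_type == "Reshape":
        src = producers.get(node.input[0])
        if src is not None and src.op_type == "Conv":
            print("Reshape", node.name, "<- Conv", src.name, "output:", src.output[0])
```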
Hi @jayer95, I have a question here: it seems the anchor decoding part in Detect below will not take effect if we specify the output nodes as 326, 379, 432? Lines 57 to 68 in 6865d19
Hi @zhiqwang, is the reason you converted yolov5n.dlc that you want to load yolov5.dlc with the SNPE SDK and output the post-processed result? The correct conversion steps should be as follows. When we converted yolov5n.onnx to yolov5n.dlc, we specified 3 output nodes: 326, 379, 432 (Conv_198, Conv_232, Conv_266), as shown below.

For the 3 output nodes Conv_198, Conv_232, Conv_266 and the 4D outputs specified by SNPE, please refer to:

SNPE's 4D image output format is: batch_size=1, box_size=4, score_size=1, class_size=80

Conv_266 node: 1x20x20x255 (grid_size=20)

At this point, it has been converted to yolov5n.dlc. The post-processing program for parsing yolov5n.dlc should be developed in C++ on the SNPE SDK or QCS devices; it has nothing to do with the post-processing in "yolov5/models/yolo.py".

I'm using SNPE SDK 1.58 (the latest version at present). When converting yolov5n.dlc, I use "snpe-onnx-to-dlc" under the x86 architecture for model conversion, and use "snpe-dlc-info" to view the model architecture of yolov5n.dlc.

Hi @glenn-jocher,
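For reference, the conversion step described above would look roughly like this (a sketch only; I'm assuming SNPE 1.58's CLI here, and flag spellings vary between SNPE releases, so check `snpe-onnx-to-dlc --help` on your install):

```
snpe-onnx-to-dlc --input_network yolov5n.onnx \
                 --output_path yolov5n.dlc \
                 --out_node 326 --out_node 379 --out_node 432
snpe-dlc-info --input_dlc yolov5n.dlc   # inspect the converted model
```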
Thank you for your sharing. @jayer95 I was stuck using the dsp/aip to run the network because of the unsupported 5-dimension reshape operation, and the speed on CPU (100+ms) is totally unacceptable. After asking Qualcomm staff, the bad news came that there's no option but to change the yolov5 network, especially the detect head part. Which is a little bit tricky.
I have converted the yolov5 model to dlc; now I have to do the 5d reshape and other post-processing outside the model. Could someone share the code for post-processing from the 5d reshape onwards?
@JISHNUSHAJI @glenn-jocher @fwzdev1 Hi all, just to share my recent exploration of running yolov5 with SNPE. I am using SNPE v1.62 and yolov5 release v6.1. My task is to detect a custom class of very small objects, typically 10x10 pixels. The model I chose was yolov5s with the default 640x640 input size, but I think other models are also compatible.

Least Modification

Since the main issue of running yolov5 with SNPE is caused by the unsupported 5d reshape operation, simply changing the 5d reshape to 4d solves the problem. For example, one detection head's 1x3x85x20x20 reshape is unacceptable to SNPE, but becomes acceptable after changing it to a 3x85x20x20 reshape. In a word, just eliminate the batch size. The modifications in yolo.py are shown below.

In the head reshape:

```python
# x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()  # original
x[i] = x[i].view(self.na, self.no, ny, nx).permute(0, 2, 3, 1).contiguous()  # modified
```

In the grid construction:

```python
# grid = torch.stack((xv, yv), 2).expand((1, self.na, ny, nx, 2)).float()  # original
grid = torch.stack((xv, yv), 2).expand((self.na, ny, nx, 2)).float()  # modified

# anchor_grid = (self.anchors[i].clone() * self.stride[i]).view((1, self.na, 1, 1, 2)).expand((1, self.na, ny, nx, 2)).float()  # original
anchor_grid = (self.anchors[i].clone() * self.stride[i]).view((self.na, 1, 1, 2)).expand((self.na, ny, nx, 2)).float()  # modified
```

No other modification is needed: directly convert the original pt model to onnx and then to dlc, without needing to specify the out_nodes. SNPE is able to compute the entire network, including the decoding operations inside Detect. Both CPU and DSP runtimes can execute this network without raising any error. However, it will only give you the correct output on CPU. The precision is affected significantly by 8-bit quantization when using DSP, mainly caused by the decoding operations in Detect.

Running with DSP

If you just need to run with the default CPU, then the above solution may be the simplest one. But I believe most of us choose SNPE because of the acceleration by DSP/AIP, so reimplementing the detection decoding is unavoidable. The modification in Detect is to comment out the decoding and move it to post-processing:

```python
# if self.inplace:
#     y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
#     y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
# else:  # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
#     xy = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
#     wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
#     y = torch.cat((xy, wh, y[..., 4:5]), -1)
```

Also remember to include the change of 5d reshape to 4d discussed above. The final output shape will still be consistent with the original model, which is a 3d tensor (actually 2d): 1x25200x85 for the default yolov5s model. This output tensor can be obtained by the DSP/AIP runtime with acceleration and without a large precision drop. Then we use the CPU to parse this output, performing exactly the same operations that we commented out. Since the output from SNPE is always 1d, a single for loop is enough to do the parsing. An example is shown below, written in Java but easy to convert to C++ etc.

```java
float conf = 0.3f;
float[] values = new float[tensorOut.getSize()];
tensorOut.read(values, 0, values.length);
for (int c=4;c<values.length;c+=85) {
if (values[c]>=conf) {
float cx = values[c-4];
float cy = values[c-3];
float w = values[c-2];
float h = values[c-1];
int gridX, gridY;
int anchor_gridX, anchor_gridY;
int[] anchorX = {10,16,33,30,62,59,116,156,373};
int[] anchorY = {13,30,23,61,45,119,90,198,326};
int[] num_filters = {19200,4800,1200};
int[] filter_size = {80,40,20};
int stride;
int ci = (int)(c/85);
if (ci<num_filters[0]) {
gridX = (ci%(filter_size[0]*filter_size[0]))%filter_size[0];
gridY = (int)((ci%(filter_size[0]*filter_size[0]))/filter_size[0]);
anchor_gridX = anchorX[((int)(ci/(filter_size[0]*filter_size[0])))];
anchor_gridY = anchorY[((int)(ci/(filter_size[0]*filter_size[0])))];
stride = 8;
} else if (ci>=num_filters[0]&&ci<num_filters[0]+num_filters[1]) {
gridX = ((ci-num_filters[0])%(filter_size[1]*filter_size[1]))%filter_size[1];
gridY = (int)(((ci-num_filters[0])%(filter_size[1]*filter_size[1]))/filter_size[1]);
anchor_gridX = anchorX[(int)((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3];
anchor_gridY = anchorY[(int)((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3];
stride = 16;
} else {
gridX = ((ci-num_filters[1])%(filter_size[2]*filter_size[2]))%filter_size[2];
gridY = (int)(((ci-num_filters[1])%(filter_size[2]*filter_size[2]))/filter_size[2]);
anchor_gridX = anchorX[(int)((ci-num_filters[1])/(filter_size[2]*filter_size[2]))+6];
anchor_gridY = anchorY[(int)((ci-num_filters[1])/(filter_size[2]*filter_size[2]))+6];
stride = 32;
}
cx = (float)(cx*2-0.5+gridX)*stride;
cy = (float)(cy*2-0.5+gridY)*stride;
w = w*2*w*2*anchor_gridX;
h = h*2*h*2*anchor_gridY;
float left = cx-w/2;
float top = cy-h/2;
float right = cx+w/2;
float bottom = cy+h/2;
float obj_conf = values[c];
}
}
```

The locations of the bounding boxes are represented by left, top, right and bottom, computed from the decoded centers and sizes above. In my Android app, running the net with DSP takes about 120ms on a Snapdragon 870 platform compared to 500+ms using CPU, and the accuracy is nearly the same. However, this speed is still a bit slow for real-time tasks, probably because I was using the SNPE Java SDK instead of C++. Still, further optimization can be made to achieve a faster speed.

Optimization for SNPE

When looking into the yolov5 models released recently, the activation layer used after each convolution is SiLU. Simply change the SiLU activations to the commonly used LeakyReLU, which is optimized by SNPE, by modifying the base Conv module:

```python
# self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())  # original
self.act = nn.LeakyReLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())  # modified
```

Re-training is required and the original checkpoint cannot be used. Training one epoch is faster, but it will take more epochs for the model to converge. Switching to LeakyReLU activations results in slightly lower mAP but faster execution, so it is a performance trade-off. For my specific task of detecting a single class of small objects, I pruned the other two detection heads for medium and large objects. In addition, I select only the first 5 columns to output just the x,y,w,h,conf results, so the output shape becomes 1x19200x5 instead of 1x25200x85. This further speeds up both the network execution and the detection post-processing. After all these optimizations, the final execution time on DSP drops to 25ms (almost real time) on the same 870 device. The precision is also not much affected, although it has less robustness and stability than the original yolov5s model. If your main concern is speed, then apply these optimizations to your model; otherwise just use the original one. Even faster execution may be achieved by switching to C++ and the yolov5n model. Good luck!
@eeyzl5 awesome, thanks for the detailed feedback!
@eeyzl5 thanks for the detailed explanation
Thank you for sharing the details with us. I am also trying to use DSP on an embedded system. I followed your advice and made the modifications in yolo.py, but I am unable to run the train.py script. When I run train.py after following your instructions up to the Running with DSP section, I get the following error:

0%| | 0/44 [00:02<?, ?it/s]

Have you made any other modifications you haven't shared with us? Thank you for reading.
Hi, just a reminder that the modifications are only valid for deployment, after you've already got a trained model and wish to export it to an SNPE-compatible format. So keep the original code while training, then apply the modifications when you export to onnx format (refer to this), and then export to dlc format.
Thank you very much for your comment. We were able to implement the model on the device and execute on DSP. However, we are encountering a serious issue where nothing gets detected afterwards. Have you had similar issues in the past? When we run the model on CPU, the model seems to detect something, but accuracy and speed are still highly compromised. Any recommendations would be much appreciated.
Hi - is there any reference implementation for integrating YOLOv5 into an Android app using SNPE? We are able to successfully convert the DLC and run it on a Snapdragon device using the ARM CPU, GPU, and DSP runtimes.
Hi, I was able to get correct detections. If you try to run with DSP, please refer to the "Running with DSP" section in my comment above; otherwise you may not get correct results, especially if you don't do the post-processing on CPU. Post-processing includes all the operations after the 5d reshape. Again, you may refer to my sample code. My suggestion is to start with the default official model and compare the raw output values from the PC and from your snpe device.
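In case it's useful, one way to do that comparison (a sketch under my own assumptions: the device output was saved by snpe-net-run as float32 .raw files, the ONNX input is named "images", and all paths are hypothetical placeholders):

```python
import numpy as np
import onnxruntime as ort

# PC-side reference from the same preprocessed input that was fed to the device.
img = np.fromfile("input.raw", dtype=np.float32).reshape(1, 640, 640, 3)  # NHWC, as fed to SNPE
sess = ort.InferenceSession("yolov5s.onnx")
ref = sess.run(None, {"images": img.transpose(0, 3, 1, 2).copy()})[0].ravel()  # NCHW for ONNX

# Device-side raw output written by snpe-net-run (path depends on your setup).
dev = np.fromfile("output/Result_0/output.raw", dtype=np.float32)
print("max abs diff:", np.abs(ref - dev).max())
```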
I ran into a problem running the code from @eeyzl5's detailed answer (linked here for brevity) but found a likely solution. There's a bug in the "Running with DSP" section where, in the final else branch, the index offsets need to subtract both num_filters[0] and num_filters[1]. The corrected branch:

```java
gridX = ((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))%filter_size[2];
gridY = (int)(((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))/filter_size[2]);
anchor_gridX = anchorX[(int)((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6];
anchor_gridY = anchorY[(int)((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6];
stride = 32;
```

This is what it looks like inside the entire code snippet:

```java
float conf = 0.3f;
float[] values = new float[tensorOut.getSize()];
tensorOut.read(values, 0, values.length);
for (int c=4;c<values.length;c+=85) {
if (values[c]>=conf) {
float cx = values[c-4];
float cy = values[c-3];
float w = values[c-2];
float h = values[c-1];
int gridX, gridY;
int anchor_gridX, anchor_gridY;
int[] anchorX = {10,16,33,30,62,59,116,156,373};
int[] anchorY = {13,30,23,61,45,119,90,198,326};
int[] num_filters = {19200,4800,1200};
int[] filter_size = {80,40,20};
int stride;
int ci = (int)(c/85);
if (ci<num_filters[0]) {
gridX = (ci%(filter_size[0]*filter_size[0]))%filter_size[0];
gridY = (int)((ci%(filter_size[0]*filter_size[0]))/filter_size[0]);
anchor_gridX = anchorX[((int)(ci/(filter_size[0]*filter_size[0])))];
anchor_gridY = anchorY[((int)(ci/(filter_size[0]*filter_size[0])))];
stride = 8;
} else if (ci>=num_filters[0]&&ci<num_filters[0]+num_filters[1]) {
gridX = ((ci-num_filters[0])%(filter_size[1]*filter_size[1]))%filter_size[1];
gridY = (int)(((ci-num_filters[0])%(filter_size[1]*filter_size[1]))/filter_size[1]);
anchor_gridX = anchorX[(int)((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3];
anchor_gridY = anchorY[(int)((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3];
stride = 16;
} else {
gridX = ((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))%filter_size[2];
gridY = (int)(((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))/filter_size[2]);
anchor_gridX = anchorX[(int)((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6];
anchor_gridY = anchorY[(int)((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6];
stride = 32;
}
cx = (float)(cx*2-0.5+gridX)*stride;
cy = (float)(cy*2-0.5+gridY)*stride;
w = w*2*w*2*anchor_gridX;
h = h*2*h*2*anchor_gridY;
float left = cx-w/2;
float top = cy-h/2;
float right = cx+w/2;
float bottom = cy+h/2;
float obj_conf = values[c];
}
}
```

And since I happened to need this in Python, here's that too, in case it's useful (it returns a copy instead of operating in-place):

```python
def postprocess_raw_output(
values,
anchorX=[10,16,33,30,62,59,116,156,373],
anchorY=[13,30,23,61,45,119,90,198,326],
num_filters=[19200,4800,1200],
filter_size=[80,40,20],
last_dim_size=85
):
ret = values.copy()
for c in range(4, values.size, last_dim_size):
cx = values[c-4]
cy = values[c-3]
w = values[c-2]
h = values[c-1]
ci = int(c / last_dim_size)
if ci < num_filters[0]:
gridX = (ci%(filter_size[0]*filter_size[0]))%filter_size[0]
gridY = int((ci%(filter_size[0]*filter_size[0]))/filter_size[0])
anchor_gridX = anchorX[int(ci/(filter_size[0]*filter_size[0]))]
anchor_gridY = anchorY[int(ci/(filter_size[0]*filter_size[0]))]
stride = 8
elif ci>=num_filters[0] and ci<(num_filters[0]+num_filters[1]):
gridX = ((ci-num_filters[0])%(filter_size[1]*filter_size[1]))%filter_size[1]
gridY = int(((ci-num_filters[0])%(filter_size[1]*filter_size[1]))/filter_size[1])
anchor_gridX = anchorX[int((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3]
anchor_gridY = anchorY[int((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3]
stride = 16
else:
gridX = ((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))%filter_size[2]
gridY = int(((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))/filter_size[2])
anchor_gridX = anchorX[int((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6]
anchor_gridY = anchorY[int((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6]
stride = 32
cx = float(cx*2-0.5+gridX)*stride
cy = float(cy*2-0.5+gridY)*stride
w = w*2*w*2*anchor_gridX
h = h*2*h*2*anchor_gridY
ret[c-4:c] = [cx, cy, w, h]
    return ret
```

Hopefully this is of use, particularly to @hansoullee20.
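A hypothetical usage example (the file path and threshold are my own placeholders, assuming the default 85-column yolov5s output):

```python
import numpy as np

values = np.fromfile("Result_0/output.raw", dtype=np.float32)  # raw SNPE output
decoded = postprocess_raw_output(values).reshape(-1, 85)
detections = decoded[decoded[:, 4] > 0.3]   # keep rows above the objectness threshold
print(detections[:, :4])                    # decoded cx, cy, w, h in pixels
```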
Thank you so much for such a detailed reply... or even a report! I was doing exactly the same thing as you, detecting a single class of objects smaller than 10*10. What I did to the network is almost the same as you explained, except for changing SiLU to LeakyReLU (I just made it ReLU instead). I've tested yolov5n with backbone scale = 0.2 or 0.3, and it took about 8-10 ms for the network on DSP (Snapdragon 855, 640*640). I finally chose nanodet-plus with the same custom changes, for more convenient pre- and post-processing code (the nanodet repository contains official pre- and post-processing code for SNPE). But there is another problem. It took about 10ms for network inference, which is acceptable. When it comes to pre- and post-processing, things changed: I remember it was about 3ms for pre and 4ms for post, so the total time was over 17ms for a single image. I wonder whether you were bothered by this issue as well. Finally, great appreciation for your sharing!
Hey, thanks for the help. I am able to convert the way you mentioned. I am trying to demo using gstreamer and qtimlesnpe. I can see the model running, as the pipeline is taking some time, but there are no bounding boxes on the video. I have seen this behavior before, and going to a previous version of libqtioverlay.so solved the issue. Not sure how it solved it, but it worked. @jayer95 @Mohit-Ak Any idea how I can deal with this, or could you check which version of the libqtioverlay.so library was used on your end?
@saadwaraich1 The code of libqtioverlay.so and the other qtimlesnpe plugins needs to be rewritten and replaced. Are you a Qualcomm chip buyer? Please contact Qualcomm's customer support directly and ask Qualcomm's technical staff by raising a case.
Hello, how far did you get? Is there a specific part of the thread above that you don't understand?

On Thu, May 11, 2023, 11:06 AM teddy wrote:
> So, good news! Seems like yolov5 is now compatible with SNPE! Pull from the master branch, export to onnx, and convert to dlc without specifying out_node. Would appreciate any inputs on how to proceed from here in SNPE :)
> @hansoullee20 hi hansoul im also trying to run yolov5 on snpe-sdk may i email to you?
@wofvh Hello, and thank you for the information. As I understand it, yolov5 is now compatible with SNPE, which resolves the issue. I would appreciate it if you could share additional details on how to proceed. Thank you.
@hansoullee20 Thank you so much for the reply. Following the Qualcomm tutorial, I have gotten as far as converting the inceptionv3 model to dlc on Linux 18.04 x86_64 and quantizing it to reduce its size. I now have yolov5 .onnx weights, and I would like to know how to convert them to dlc with the snpe-sdk and run them on a Qualcomm RB5 rather than a Snapdragon phone! I would really appreciate your help.
@hansoullee20 Hello, that is good news. It appears that yolov5 is now compatible with SNPE, which resolves the issue. However, since I have not run it myself in detail, it is difficult for me to give exact guidance on how to proceed. If you need more information, analyzing the code or consulting the related documentation would be a good approach. Feel free to ask if you have more specific questions. Thank you.
I'm not sure whether it can run on the Qualcomm RB5. Since you've completed quantization, did you convert from 5D to 4D before generating the DLC and quantizing? Or are you asking about how to convert from onnx to dlc?

On Thu, May 11, 2023, 12:04 PM teddy wrote:
> @hansoullee20 Thank you so much for the reply. Following the Qualcomm tutorial, I've gotten as far as converting the inceptionv3 model to dlc on Linux 18.04 x86_64 and quantizing it to reduce its size. I have yolov5 .onnx weights and would like to know how to convert them to dlc with the snpe-sdk and run them on a Qualcomm RB5 rather than a Snapdragon phone!
@hansoullee20 Using the basic method described on the Qualcomm SDK page, I converted the inception_v3 model from 5D to 4D and quantized it, but I do not know how to convert YOLOv5 weights to a dlc file and then quantize them. I am quite new to this area, so I would appreciate it if you could explain it in simple terms. The onnx file has already been converted. Thank you. (Currently on Ubuntu 18.04, Python 3.6.9.)
Hello @wofvh, I have been following the Qualcomm SDK page and was able to convert the Inception_v3 model from 5D to 4D and perform quantization successfully. However, I am having difficulty with converting YOLOv5 weights to a dlc file and then performing quantization. As I am relatively new to this topic, I was hoping you could provide some guidance on how to proceed with these steps in a relatively easy-to-understand manner. Currently, the onnx file has been converted already. Thank you for your help. Best,
Hi Glenn, we managed to convert the network to dlc and run it after quantization, but when we quantize using real images the detection results degrade significantly. Do you have any suggestions?
@aleshem hi there, it's great to hear that you've managed to convert to dlc and run after quantization. Regarding the issues you're facing when quantizing the network with real images: it's not uncommon to encounter challenges during this process. Quantization can lead to degraded model performance, especially if the quantization process or the selection of calibration images is not optimal. A few general tips: make sure the calibration images closely represent your deployment data, try the quantizer's alternative quantization modes, and compare pre- and post-quantization outputs to locate where precision collapses.

Unfortunately, without more specific details, it's challenging to provide more targeted advice. I recommend reviewing the documentation and resources for the specific quantization tools you're using, as they might offer insights or best practices specific to their methodology. The community and forums dedicated to the specific tools or frameworks you're using can also be valuable resources for advice and troubleshooting.

Best of luck with your quantization efforts, and feel free to reach out if you have more specific questions or issues. Best regards.
Hello, thank you very much for your detailed report. Based on your work, I further tested the quantized YOLOv5 model running on the Qualcomm SNPE DSP. I found that in the final output tensor the object detection boxes were detected properly, but strangely their confidence scores (y[..., 4:6] in the code below) were very low. Combined with SNPE's quantization tool "snpe-dlc-quantize", I speculate that this is because SNPE adopts a rather basic post-training quantization method: Q = round((FP32 − Min) / ScaleFactor) + ZeroPoint.

```python
if self.inplace:
    y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
    y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
else:  # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
    xy = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
    wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
    y = torch.cat((xy, wh, y[..., 4:5]), -1)
```

Before this code block, xy, wh, and conf are tensors ranging from 0 to 1, so quantizing them does not cause much loss of accuracy. After this block, however, xy scales to a range roughly equal to the image size and wh scales to the pixel size of the targets, while only the confidence remains within 0 to 1. All of these variables (xy, wh, and conf) are then concatenated into one tensor, and in SNPE one tensor shares one set of quantization parameters. The quantization scale is therefore dictated by the component with the largest value range, which is xy. That scale causes a severe loss of resolution for the other components, and the confidence tends toward zero. This is the reason why there is no output after quantization. For example, for an image of size 640x640 with Int8 quantization, the scale factor is 640/256 = 2.5, which is even larger than the entire range of conf. This is the fundamental reason for the significant loss of quantization accuracy.

To address this issue, the strategy I employed is a very naive one: I multiply the confidence scores (conf) by a coefficient to scale them into the same range as xy and wh, which prevents excessive loss of precision during quantization. After the final model output, I divide by the same coefficient to obtain the normal detection confidence.

```python
if self.inplace:
    y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
    y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
else:  # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
    xy = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
    wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
    conf = y[..., 4:6] * 1280  # 1280 is the max length of the input size, e.g. the width of the input image
    y = torch.cat((xy, wh, conf), -1)
```

By using this method, adjustments only need to be made to the confidence scores at the end, without needing to add code elsewhere. Through testing, I found that the quantization accuracy loss resulting from this approach is acceptable.
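To illustrate the effect numerically, here is a small self-contained sketch of the shared-scale problem (not SNPE's actual quantizer; it just applies the affine formula above):

```python
import numpy as np

def fake_quant(x, lo, hi, levels=256):
    # Affine quantization Q = round((x - lo) / scale), dequantized back to float.
    scale = (hi - lo) / (levels - 1)
    return np.round((x - lo) / scale) * scale + lo

conf = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
print(fake_quant(conf, 0.0, 640.0))                 # shared scale with xy: everything collapses to 0
print(fake_quant(conf, 0.0, 1.0))                   # per-tensor scale: conf survives
print(fake_quant(conf * 1280, 0.0, 1280.0) / 1280)  # the rescaling workaround above
```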
Hello! Thanks for sharing your exploration and solution to the quantization accuracy loss issue when running YOLOv5 with SNPE. It's insightful to see how adjusting the range of the confidence scores can help mitigate precision loss due to quantization. Your approach of scaling the confidence scores in line with the other tensor ranges and then scaling back for the final output is a clever workaround. This strategy could be beneficial for others facing similar quantization challenges. It's always exciting to see community members contributing novel solutions to complex problems. Keep up the good work, and thank you for contributing to the broader knowledge base around YOLOv5 and SNPE!
BTW, I managed to convert yolov8 using SNPE 2.10 (opset=12 in export) without any major changes. In case your chip supports this version, it may save you a lot of time.
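For anyone trying to reproduce, the export step would be along these lines (assuming the ultralytics package CLI; the SNPE conversion then follows the same snpe-onnx-to-dlc route as above):

```
yolo export model=yolov8n.pt format=onnx opset=12
```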
Hey there! That's fantastic news! 🎉 It's always great to hear about smooth conversions, especially with newer versions like YOLOv8 and SNPE 2.10. Your sharing could indeed save a lot of time for many in the community working on similar projects. If there are specific steps or minor tweaks that helped you along the way, feel free to drop those details. Every bit helps! Thanks for sharing, and happy coding! 👍
The major problem at the moment is that quantization doesn't work well for yolov8-pose; for some reason it ruins the confidence.
Hello! Thanks for reaching out about the quantization issue with YOLOv8-pose. It's not uncommon for quantization to affect model confidence, as precision loss can significantly impact the network's output. 🤔 A potential approach is to experiment with different quantization techniques or tools that might offer better control over precision loss. Considering calibration datasets that closely represent your use case might also help mitigate this issue. It's all about finding the right balance for your specific scenario. If this doesn't resolve the problem, could you share more details about the quantization method you're using? This info might provide further insights for troubleshooting. Best regards!
❔Question
I am trying to convert a trained yolov5s model to an SNPE format in order to run it on a Snapdragon chip. Unfortunately, Qualcomm's ONNX-to-SNPE converter fails at the Detect layer.

I can imagine it may have something to do with the fact that SNPE currently supports 4D input data, where the first dimension is the batch (SNPE doc), while the yolov5 Detect layer performs a 5D reshape.

Would it be possible to modify the Detect layer so that no 5D reshape is performed?
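For reference, the 5D reshape in question is this line in Detect.forward of models/yolo.py (copied from the version discussed later in this thread):

```python
x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
```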