How to modify Detect layer to allow for converting yolov5 to Qualcomm's SNPE format? #4790
@evdoks I haven't used the SNPE converter myself, so I can't help directly, but I do see Qualcomm compatibility with YOLOv5 officially mentioned here in the Snapdragon Neural Engine SDK release notes from March 2021:
@glenn-jocher Thanks for the link. I saw this, and this is why I was hoping the conversion would work. However, I was not able to find anyone who could successfully do it on a trained yolo model, and there are questions on Qualcomm's dev forum from people hitting the same wall. The conversion works if one removes the Detect layer (e.g. by exporting with --train).
@evdoks I think you are not understanding --train. All models export with all layers; there are no circumstances in which export omits the Detect layer.
@glenn-jocher, you are right, I expressed it incorrectly, but my understanding is that when using --train, the inference-time decoding part of the Detect layer is skipped.
@evdoks yes in --train mode the grid for inference output is not constructed (as it is not needed for loss computation), so there's something isolated in that area that is causing the issue. The 5D reshape is still present in --train mode though on L55, so it's probably not the source of the problem. You might try turning self.inplace on or off to see if it has an effect. Lines 50 to 71 in b74dd4b
@glenn-jocher thanks for looking into it, but it didn't help. Neither exporting the model to onnx with --train nor toggling self.inplace made a difference. Qualcomm's dev forum seems to be a dead place - some people have already posted questions there regarding yolov5 compatibility but got no response.
@evdoks Hi, did you manage to solve this in the end?
@jayer95 unfortunately not. Switched to ResNet (which totally sucks). Let us know here if you get any breakthroughs. Qualcomm keeps updating the converter, but I haven't noticed anything that could be relevant to the issue with YOLO in the release notes of the latest versions.
@evdoks I don't quite understand how the "--train" parameter proposed by the author of YOLOv5 shields the 5D layers of the network model. @glenn-jocher
@evdoks Detect() does not have a --train mode itself; --train simply exports the model in training mode, so the decoding below is skipped: Lines 50 to 71 in 562191f
@glenn-jocher Could YOLOv5 officially support exporting to Qualcomm's DLC format?
@jayer95 sorry, I don't actually know what DLC is. We have a mobile developer who's working on our upcoming HUB app, but for Android we are using established TFLite export workflows and not yet targeting specific backends. What's the benefit of going this export route? Does it provide better access to NNAPI or Hexagon delegates? If the main issue is simply the 5D nature of the tensors in Detect, there's certainly a workaround you could do to handle the reshaping/permutation ops differently. You'd want to create an x-y fused grid (1d rather than 2d), and then of course also create the offsets/gains you need in 1d rather than 2d; then your tensor would be 4d (batch, anchors, xy, outputs).
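A minimal sketch of one way to realize that fused-grid idea (my own assumptions, not code from the repo; requires PyTorch >= 1.10 for meshgrid's indexing argument, and the shapes below are for a stride-8 head of a 640-input yolov5s):

```python
import torch

bs, na, no, ny, nx = 1, 3, 85, 80, 80    # batch, anchors, outputs, grid h/w
x = torch.randn(bs, na * no, ny, nx)     # raw conv output of one detection head

# Keep everything 4D by fusing the two spatial dims into a single axis.
x4 = x.view(bs, na, no, ny * nx).transpose(2, 3)   # (bs, na, ny*nx, no)

# Offsets as a fused 1D grid instead of the usual 2D (ny, nx) grid.
yv, xv = torch.meshgrid(torch.arange(ny), torch.arange(nx), indexing="ij")
grid1d = torch.stack((xv, yv), 2).reshape(1, 1, ny * nx, 2).float()

stride = 8.0
xy = (x4[..., 0:2].sigmoid() * 2 - 0.5 + grid1d) * stride   # decoded box centers
```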
@evdoks @jayer95 It is possible to convert yolov5 to .dlc format. You'd need to use the version 3.1 yolov5s and specify the output nodes as the convolution layer outputs before the 5D reshape. Check out the SNPE release notes, page 20.
I came as far as getting the 4D output in the NativeCpp settings but made zero progress on extracting inferences. Has anyone made any progress?
Hello! May I ask how to convert a yolov5 .pt model file into dlc? Thank you very much.
@wwxzxd sorry, what is dlc?
@jayer95 got it, thanks! @wwxzxd @jayer95 @evdoks The main step we could take here would be to try to add official Snapdragon dlc export support to export.py. We currently support 10 different model formats, and there is a system in place for export and inference of each. From TFLite, ONNX, CoreML, TensorRT Export #251: YOLOv5 export is supported for the following formats:
The fastest and easiest way to incorporate your ideas into the official codebase is to submit a Pull Request (PR) implementing your idea, and if applicable providing before and after profiling/inference/training results to help us understand the improvement your feature provides. This allows us to directly see the changes in the code and to understand how they affect workflows and performance. Please see our ✅ Contributing Guide to get started. Thank you!
@glenn-jocher thank you for your reply. I'm somewhat relieved to know I'm not alone in this search. The models are converted to .dlc format via snpe tools (https://developer.qualcomm.com/sites/default/files/docs/snpe/tools.html). I've tried to convert the yolov5.pb model by exporting to onnx and tensorflow models. The issue arises when the converters reach the following line in yolo.py (the 5D view/permute in Detect). It seems SNPE has a problem converting the permute function. Could this line be rewritten without using the permute function? I think as long as we pass this part, we will have the dlc model.
FYI @hansoullee20, I guess one workaround for this is just to remove this line in the Detect module: Lines 53 to 54 in db6ec66
@zhiqwang Thanks - what exactly does the exported model lose if we set --train?
Seems that it will remove the decoding part (Lines 56 to 68 in db6ec66) only and return a list containing the 3 intermediate heads if you set --train like the following:

```
python export.py --weights path/to/your/model.pt --include onnx --simplify --train
```

And if SNPE doesn't support the 5D reshape, specifying the output nodes before it (as mentioned above) should also work.
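As a quick check of what the --train export actually produces, something like this works (a sketch, assuming onnxruntime is installed; the file name is whatever your export produced, and the shapes are for a 640x640 yolov5s):

```python
# Minimal sketch: list the outputs of the --train export to confirm it ends in
# the three raw heads rather than the decoded 1x25200x85 tensor.
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")  # path from your own export
for out in sess.get_outputs():
    print(out.name, out.shape)
# Expected for a 640x640 yolov5s --train export: three 5D heads, e.g.
# (1, 3, 80, 80, 85), (1, 3, 40, 40, 85), (1, 3, 20, 20, 85)
```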
So, good news! Seems like yolov5 is now compatible with SNPE! Pull from the master branch, export to onnx, and convert to dlc without specifying out_node. Would appreciate any inputs on how to proceed from here in SNPE :)
At present, yolov5 v6.0 can be converted to SNPE correctly.

onnx==1.6.0

```
git clone https://github.com/ultralytics/yolov5.git
python export.py --weights yolov5n.pt --optimize --opset 11 --simplify
```

Please use Netron to view the exported yolov5n.onnx; you will find that the layers above the 5D output nodes are the 4D output nodes Conv_198, Conv_232, Conv_266, and their output names are 326, 379, 432, so we need to specify these 3 output nodes when converting yolov5n.dlc. But at present, a program is still needed to demo the converted yolov5n.dlc.
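If you'd rather locate those nodes programmatically than by clicking through Netron, a sketch like this works (assuming the onnx Python package; node names like Conv_198/326 differ between exports):

```python
import onnx

model = onnx.load("yolov5n.onnx")
producers = {out: node for node in model.graph.node for out in node.output}

# Each detection head ends Conv -> (5D) Reshape; report the Conv feeding each Reshape.
for node in model.graph.node:
    if node.op_type == "Reshape":
        src = producers.get(node.input[0])
        if src is not None and src.op_type == "Conv":
            print("Reshape", node.name, "<- Conv", src.name, "output:", src.output[0])
```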
Hi @jayer95, I have a question here: it seems the anchor decoding part in Detect below will not take effect if we specify the output nodes as 326, 379, 432? Lines 57 to 68 in 6865d19
Hi @zhiqwang, is the reason you converted yolov5n.dlc that you want to load yolov5.dlc with the SNPE SDK and output the post-processed result? The correct conversion steps should be as follows. When we converted yolov5n.onnx to yolov5n.dlc, we specified 3 output nodes: 326, 379, 432 (Conv_198, Conv_232, Conv_266), as shown below.

For the 3 output nodes Conv_198, Conv_232, Conv_266 and the 4D outputs specified by SNPE, please refer to:

SNPE's 4D image output format is: batch_size=1, box_size=4, score_size=1, class_size=80

Conv_266 node: 1x20x20x255 (grid_size=20)

At this point, it has been converted to yolov5n.dlc. The post-processing program for parsing yolov5n.dlc should be developed in C++ on the SNPE SDK or QCS devices; it has nothing to do with the post-processing in "yolov5/models/yolo.py".

I'm using SNPE SDK 1.58 (the latest version at present). When converting yolov5n.dlc, I use "snpe-onnx-to-dlc" under the x86 architecture for model conversion, and use "snpe-dlc-info" to view the model architecture of yolov5n.dlc.

Hi @glenn-jocher,
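For reference, the conversion step described above would look roughly like this (a sketch only; I'm assuming SNPE 1.58's CLI here, and flag spellings vary between SNPE releases, so check `snpe-onnx-to-dlc --help` on your install):

```
snpe-onnx-to-dlc --input_network yolov5n.onnx \
                 --output_path yolov5n.dlc \
                 --out_node 326 --out_node 379 --out_node 432
snpe-dlc-info --input_dlc yolov5n.dlc   # inspect the converted model
```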
Thank you for your sharing. @jayer95 I was stuck using the dsp/aip to run the network because of the unsupported 5-dimension reshape operation, and the speed on CPU (100+ms) is totally unacceptable. After asking Qualcomm staff, the bad news came that there's no option but to change the yolov5 network, especially the detect head part. Which is a little bit tricky.
I have converted the yolov5 model to dlc; now I have to do the 5d reshape and other post-processing outside the model. Could someone share the code for post-processing from the 5d reshape onwards?
@JISHNUSHAJI @glenn-jocher @fwzdev1 Hi all, just to share my recent exploration of running yolov5 with SNPE. I am using SNPE v1.62 and yolov5 release v6.1. My task is to detect a custom class of very small objects, typically 10x10 pixels. The model I chose was yolov5s with the default 640x640 input size, but I think other models are also compatible.

Least Modification

Since the main issue of running yolov5 with SNPE is caused by the unsupported 5d reshape operation, simply changing the 5d reshape to 4d solves the problem. For example, one detection head's 1x3x85x20x20 reshape is unacceptable to SNPE, but becomes acceptable after changing it to a 3x85x20x20 reshape. In a word, just eliminate the batch size. The modifications in yolo.py are shown below.

In the head reshape:

```python
# x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()  # original
x[i] = x[i].view(self.na, self.no, ny, nx).permute(0, 2, 3, 1).contiguous()  # modified
```

In the grid construction:

```python
# grid = torch.stack((xv, yv), 2).expand((1, self.na, ny, nx, 2)).float()  # original
grid = torch.stack((xv, yv), 2).expand((self.na, ny, nx, 2)).float()  # modified

# anchor_grid = (self.anchors[i].clone() * self.stride[i]).view((1, self.na, 1, 1, 2)).expand((1, self.na, ny, nx, 2)).float()  # original
anchor_grid = (self.anchors[i].clone() * self.stride[i]).view((self.na, 1, 1, 2)).expand((self.na, ny, nx, 2)).float()  # modified
```

No other modification is needed: directly convert the original pt model to onnx and then to dlc, without needing to specify the out_nodes. SNPE is able to compute the entire network, including the decoding operations inside Detect. Both CPU and DSP runtimes can execute this network without raising any error. However, it will only give you the correct output on CPU. The precision is affected significantly by 8-bit quantization when using DSP, mainly caused by the decoding operations in Detect.

Running with DSP

If you just need to run with the default CPU, then the above solution may be the simplest one. But I believe most of us choose SNPE because of the acceleration by DSP/AIP, so reimplementing the detection decoding is unavoidable. The modification in Detect is to comment out the decoding and move it to post-processing:

```python
# if self.inplace:
#     y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
#     y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
# else:  # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
#     xy = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
#     wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
#     y = torch.cat((xy, wh, y[..., 4:5]), -1)
```

Also remember to include the change of 5d reshape to 4d discussed above. The final output shape will still be consistent with the original model, which is a 3d tensor (actually 2d): 1x25200x85 for the default yolov5s model. This output tensor can be obtained by the DSP/AIP runtime with acceleration and without a large precision drop. Then we use the CPU to parse this output, performing exactly the same operations that we commented out. Since the output from SNPE is always 1d, a single for loop is enough to do the parsing. An example is shown below, written in Java but easy to convert to C++ etc.

```java
float conf = 0.3f;
float[] values = new float[tensorOut.getSize()];
tensorOut.read(values, 0, values.length);
for (int c=4;c<values.length;c+=85) {
if (values[c]>=conf) {
float cx = values[c-4];
float cy = values[c-3];
float w = values[c-2];
float h = values[c-1];
int gridX, gridY;
int anchor_gridX, anchor_gridY;
int[] anchorX = {10,16,33,30,62,59,116,156,373};
int[] anchorY = {13,30,23,61,45,119,90,198,326};
int[] num_filters = {19200,4800,1200};
int[] filter_size = {80,40,20};
int stride;
int ci = (int)(c/85);
if (ci<num_filters[0]) {
gridX = (ci%(filter_size[0]*filter_size[0]))%filter_size[0];
gridY = (int)((ci%(filter_size[0]*filter_size[0]))/filter_size[0]);
anchor_gridX = anchorX[((int)(ci/(filter_size[0]*filter_size[0])))];
anchor_gridY = anchorY[((int)(ci/(filter_size[0]*filter_size[0])))];
stride = 8;
} else if (ci>=num_filters[0]&&ci<num_filters[0]+num_filters[1]) {
gridX = ((ci-num_filters[0])%(filter_size[1]*filter_size[1]))%filter_size[1];
gridY = (int)(((ci-num_filters[0])%(filter_size[1]*filter_size[1]))/filter_size[1]);
anchor_gridX = anchorX[(int)((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3];
anchor_gridY = anchorY[(int)((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3];
stride = 16;
} else {
gridX = ((ci-num_filters[1])%(filter_size[2]*filter_size[2]))%filter_size[2];
gridY = (int)(((ci-num_filters[1])%(filter_size[2]*filter_size[2]))/filter_size[2]);
anchor_gridX = anchorX[(int)((ci-num_filters[1])/(filter_size[2]*filter_size[2]))+6];
anchor_gridY = anchorY[(int)((ci-num_filters[1])/(filter_size[2]*filter_size[2]))+6];
stride = 32;
}
cx = (float)(cx*2-0.5+gridX)*stride;
cy = (float)(cy*2-0.5+gridY)*stride;
w = w*2*w*2*anchor_gridX;
h = h*2*h*2*anchor_gridY;
float left = cx-w/2;
float top = cy-h/2;
float right = cx+w/2;
float bottom = cy+h/2;
float obj_conf = values[c];
}
}
```

The locations of the bounding boxes are represented by left, top, right and bottom, computed from the decoded centers and sizes above. In my Android app, running the net with DSP takes about 120ms on a Snapdragon 870 platform compared to 500+ms using CPU, and the accuracy is nearly the same. However, this speed is still a bit slow for real-time tasks, probably because I was using the SNPE Java SDK instead of C++. Still, further optimization can be made to achieve a faster speed.

Optimization for SNPE

When looking into the yolov5 models released recently, the activation layer used after each convolution is SiLU. Simply change the SiLU activations to the commonly used LeakyReLU, which is optimized by SNPE, by modifying the base Conv module:

```python
# self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())  # original
self.act = nn.LeakyReLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())  # modified
```

Re-training is required and the original checkpoint cannot be used. Training one epoch is faster, but it will take more epochs for the model to converge. Switching to LeakyReLU activations results in slightly lower mAP but faster execution, so it is a performance trade-off. For my specific task of detecting a single class of small objects, I pruned the other two detection heads for medium and large objects. In addition, I select only the first 5 columns to output just the x,y,w,h,conf results, so the output shape becomes 1x19200x5 instead of 1x25200x85. This further speeds up both the network execution and the detection post-processing. After all these optimizations, the final execution time on DSP drops to 25ms (almost real time) on the same 870 device. The precision is also not much affected, although it has less robustness and stability than the original yolov5s model. If your main concern is speed, then apply these optimizations to your model; otherwise just use the original one. Even faster execution may be achieved by switching to C++ and the yolov5n model. Good luck!
@eeyzl5 awesome, thanks for the detailed feedback!
@eeyzl5 thanks for the detailed explanation
Thank you for sharing the details with us. I am also trying to use DSP on an embedded system. I followed your advice and made the modifications in yolo.py, but I am unable to run the train.py script. When I run train.py after following your instructions up to the Running with DSP section, I get the following error:

0%| | 0/44 [00:02<?, ?it/s]

Have you made any other modifications you haven't shared with us? Thank you for reading.
Hi, just a reminder that the modifications are only valid for deployment, after you've already got a trained model and wish to export it to an SNPE-compatible format. So keep the original code while training, then apply the modifications when you export to onnx format (refer to this), and then export to dlc format.
Thank you very much for your comment. We were able to implement the model on the device and execute on DSP. However, we are encountering a serious issue where nothing gets detected afterwards. Have you had similar issues in the past? When we run the model on CPU, the model seems to detect something, but accuracy and speed are still highly compromised. Any recommendations would be much appreciated.
Hi - is there any reference implementation for integrating YOLOv5 into an Android app using SNPE? We are able to successfully convert the DLC and run it on a Snapdragon device using the ARM CPU, GPU, and DSP runtimes.
Hi, I was able to get correct detections. If you try to run with DSP, please refer to the "Running with DSP" section in my comment above; otherwise you may not get correct results, especially if you don't do the post-processing on CPU. Post-processing includes all the operations after the 5d reshape. Again, you may refer to my sample code. My suggestion is to start with the default official model and compare the raw output values from the PC and from your snpe device.
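In case it's useful, one way to do that comparison (a sketch under my own assumptions: the device output was saved by snpe-net-run as float32 .raw files, the ONNX input is named "images", and all paths are hypothetical placeholders):

```python
import numpy as np
import onnxruntime as ort

# PC-side reference from the same preprocessed input that was fed to the device.
img = np.fromfile("input.raw", dtype=np.float32).reshape(1, 640, 640, 3)  # NHWC, as fed to SNPE
sess = ort.InferenceSession("yolov5s.onnx")
ref = sess.run(None, {"images": img.transpose(0, 3, 1, 2).copy()})[0].ravel()  # NCHW for ONNX

# Device-side raw output written by snpe-net-run (path depends on your setup).
dev = np.fromfile("output/Result_0/output.raw", dtype=np.float32)
print("max abs diff:", np.abs(ref - dev).max())
```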
I ran into a problem running the code from @eeyzl5's detailed answer (linked here for brevity) but found a likely solution. There's a bug in the "Running with DSP" section where, in the final else branch, the index offsets need to subtract both num_filters[0] and num_filters[1]. The corrected branch:

```java
gridX = ((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))%filter_size[2];
gridY = (int)(((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))/filter_size[2]);
anchor_gridX = anchorX[(int)((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6];
anchor_gridY = anchorY[(int)((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6];
stride = 32;
```

This is what it looks like inside the entire code snippet:

```java
float conf = 0.3f;
float[] values = new float[tensorOut.getSize()];
tensorOut.read(values, 0, values.length);
for (int c=4;c<values.length;c+=85) {
if (values[c]>=conf) {
float cx = values[c-4];
float cy = values[c-3];
float w = values[c-2];
float h = values[c-1];
int gridX, gridY;
int anchor_gridX, anchor_gridY;
int[] anchorX = {10,16,33,30,62,59,116,156,373};
int[] anchorY = {13,30,23,61,45,119,90,198,326};
int[] num_filters = {19200,4800,1200};
int[] filter_size = {80,40,20};
int stride;
int ci = (int)(c/85);
if (ci<num_filters[0]) {
gridX = (ci%(filter_size[0]*filter_size[0]))%filter_size[0];
gridY = (int)((ci%(filter_size[0]*filter_size[0]))/filter_size[0]);
anchor_gridX = anchorX[((int)(ci/(filter_size[0]*filter_size[0])))];
anchor_gridY = anchorY[((int)(ci/(filter_size[0]*filter_size[0])))];
stride = 8;
} else if (ci>=num_filters[0]&&ci<num_filters[0]+num_filters[1]) {
gridX = ((ci-num_filters[0])%(filter_size[1]*filter_size[1]))%filter_size[1];
gridY = (int)(((ci-num_filters[0])%(filter_size[1]*filter_size[1]))/filter_size[1]);
anchor_gridX = anchorX[(int)((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3];
anchor_gridY = anchorY[(int)((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3];
stride = 16;
} else {
gridX = ((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))%filter_size[2];
gridY = (int)(((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))/filter_size[2]);
anchor_gridX = anchorX[(int)((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6];
anchor_gridY = anchorY[(int)((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6];
stride = 32;
}
cx = (float)(cx*2-0.5+gridX)*stride;
cy = (float)(cy*2-0.5+gridY)*stride;
w = w*2*w*2*anchor_gridX;
h = h*2*h*2*anchor_gridY;
float left = cx-w/2;
float top = cy-h/2;
float right = cx+w/2;
float bottom = cy+h/2;
float obj_conf = values[c];
}
}
```

And since I happened to need this in Python, here's that too, in case it's useful (it returns a copy instead of operating in-place):

```python
def postprocess_raw_output(
values,
anchorX=[10,16,33,30,62,59,116,156,373],
anchorY=[13,30,23,61,45,119,90,198,326],
num_filters=[19200,4800,1200],
filter_size=[80,40,20],
last_dim_size=85
):
ret = values.copy()
for c in range(4, values.size, last_dim_size):
cx = values[c-4]
cy = values[c-3]
w = values[c-2]
h = values[c-1]
ci = int(c / last_dim_size)
if ci < num_filters[0]:
gridX = (ci%(filter_size[0]*filter_size[0]))%filter_size[0]
gridY = int((ci%(filter_size[0]*filter_size[0]))/filter_size[0])
anchor_gridX = anchorX[int(ci/(filter_size[0]*filter_size[0]))]
anchor_gridY = anchorY[int(ci/(filter_size[0]*filter_size[0]))]
stride = 8
elif ci>=num_filters[0] and ci<(num_filters[0]+num_filters[1]):
gridX = ((ci-num_filters[0])%(filter_size[1]*filter_size[1]))%filter_size[1]
gridY = int(((ci-num_filters[0])%(filter_size[1]*filter_size[1]))/filter_size[1])
anchor_gridX = anchorX[int((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3]
anchor_gridY = anchorY[int((ci-num_filters[0])/(filter_size[1]*filter_size[1]))+3]
stride = 16
else:
gridX = ((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))%filter_size[2]
gridY = int(((ci-num_filters[1]-num_filters[0])%(filter_size[2]*filter_size[2]))/filter_size[2])
anchor_gridX = anchorX[int((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6]
anchor_gridY = anchorY[int((ci-num_filters[1]-num_filters[0])/(filter_size[2]*filter_size[2]))+6]
stride = 32
cx = float(cx*2-0.5+gridX)*stride
cy = float(cy*2-0.5+gridY)*stride
w = w*2*w*2*anchor_gridX
h = h*2*h*2*anchor_gridY
ret[c-4:c] = [cx, cy, w, h]
    return ret
```

Hopefully this is of use, particularly to @hansoullee20.
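A hypothetical usage example (the file path and threshold are my own placeholders, assuming the default 85-column yolov5s output):

```python
import numpy as np

values = np.fromfile("Result_0/output.raw", dtype=np.float32)  # raw SNPE output
decoded = postprocess_raw_output(values).reshape(-1, 85)
detections = decoded[decoded[:, 4] > 0.3]   # keep rows above the objectness threshold
print(detections[:, :4])                    # decoded cx, cy, w, h in pixels
```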
Thank you so much for such a detailed reply... or even a report! I was doing exactly the same thing as you, detecting a single class of objects smaller than 10*10. What I did to the network is almost the same as you explained, except for changing SiLU to LeakyReLU (I just made it ReLU instead). I've tested yolov5n with backbone scale = 0.2 or 0.3, and it took about 8-10 ms for the network on DSP (Snapdragon 855, 640*640). I finally chose nanodet-plus with the same custom changes, for more convenient pre- and post-processing code (the nanodet repository contains official pre- and post-processing code for SNPE). But there is another problem. It took about 10ms for network inference, which is acceptable. When it comes to pre- and post-processing, things changed: I remember it was about 3ms for pre and 4ms for post, so the total time was over 17ms for a single image. I wonder whether you were bothered by this issue as well. Finally, great appreciation for your sharing!
Hey, thanks for the help. I am able to convert the way you mentioned. I am trying to demo using gstreamer and qtimlesnpe. I can see the model running, as the pipeline is taking some time, but there are no bounding boxes on the video. I have seen this behavior before, and going to a previous version of libqtioverlay.so solved the issue. Not sure how it solved it, but it worked. @jayer95 @Mohit-Ak Any idea how I can deal with this, or could you check which version of the libqtioverlay.so library was used on your end?
@saadwaraich1 The code of libqtioverlay.so and the other qtimlesnpe plugins needs to be rewritten and replaced. Are you a Qualcomm chip buyer? Please contact Qualcomm's customer support directly and ask Qualcomm's technical staff by raising a case.
Hello, how far did you get? Is there a specific part of the thread above that you don't understand?

On Thu, May 11, 2023, 11:06 AM teddy wrote:
> So, good news! Seems like yolov5 is now compatible with SNPE! Pull from the master branch, export to onnx, and convert to dlc without specifying out_node. Would appreciate any inputs on how to proceed from here in SNPE :)
> @hansoullee20 hi hansoul im also trying to run yolov5 on snpe-sdk may i email to you?
@wofvh Hello, and thank you for the information. As I understand it, yolov5 is now compatible with SNPE, which resolves the issue. I would appreciate it if you could share additional details on how to proceed. Thank you.
@hansoullee20 Thank you so much for the reply. Following the Qualcomm tutorial, I have gotten as far as converting the inceptionv3 model to dlc on Linux 18.04 x86_64 and quantizing it to reduce its size. I now have yolov5 .onnx weights, and I would like to know how to convert them to dlc with the snpe-sdk and run them on a Qualcomm RB5 rather than a Snapdragon phone! I would really appreciate your help.
@hansoullee20 Hello, that is good news. It appears that yolov5 is now compatible with SNPE, which resolves the issue. However, since I have not run it myself in detail, it is difficult for me to give exact guidance on how to proceed. If you need more information, analyzing the code or consulting the related documentation would be a good approach. Feel free to ask if you have more specific questions. Thank you.
I'm not sure whether it can run on the Qualcomm RB5. Since you've completed quantization, did you convert from 5D to 4D before generating the DLC and quantizing? Or are you asking about how to convert from onnx to dlc?

On Thu, May 11, 2023, 12:04 PM teddy wrote:
> @hansoullee20 Thank you so much for the reply. Following the Qualcomm tutorial, I've gotten as far as converting the inceptionv3 model to dlc on Linux 18.04 x86_64 and quantizing it to reduce its size. I have yolov5 .onnx weights and would like to know how to convert them to dlc with the snpe-sdk and run them on a Qualcomm RB5 rather than a Snapdragon phone!
@hansoullee20 Using the basic method described on the Qualcomm SDK page, I converted the inception_v3 model from 5D to 4D and quantized it, but I do not know how to convert YOLOv5 weights to a dlc file and then quantize them. I am quite new to this area, so I would appreciate it if you could explain it in simple terms. The onnx file has already been converted. Thank you. (Currently on Ubuntu 18.04, Python 3.6.9.)
Hello @wofvh, I have been following the Qualcomm SDK page and was able to convert the Inception_v3 model from 5D to 4D and perform quantization successfully. However, I am having difficulty with converting YOLOv5 weights to a dlc file and then performing quantization. As I am relatively new to this topic, I was hoping you could provide some guidance on how to proceed with these steps in a relatively easy-to-understand manner. Currently, the onnx file has been converted already. Thank you for your help. Best,
Hi Glenn, we managed to convert the network to dlc and run it after quantization, but when we quantize using real images the detection results degrade significantly. Do you have any suggestions?
@aleshem hi there, it's great to hear that you've managed to convert to dlc and run after quantization. Regarding the issues you're facing when quantizing the network with real images: it's not uncommon to encounter challenges during this process. Quantization can lead to degraded model performance, especially if the quantization process or the selection of calibration images is not optimal. A few general tips: make sure the calibration images closely represent your deployment data, try the quantizer's alternative quantization modes, and compare pre- and post-quantization outputs to locate where precision collapses.

Unfortunately, without more specific details, it's challenging to provide more targeted advice. I recommend reviewing the documentation and resources for the specific quantization tools you're using, as they might offer insights or best practices specific to their methodology. The community and forums dedicated to the specific tools or frameworks you're using can also be valuable resources for advice and troubleshooting.

Best of luck with your quantization efforts, and feel free to reach out if you have more specific questions or issues. Best regards.
Hello, thank you very much for your detailed report. Based on your work, I further tested the quantized YOLOv5 model running on the Qualcomm SNPE DSP. I found that in the final output tensor the object detection boxes were detected properly, but strangely their confidence scores (y[..., 4:6] in the code below) were very low. Combined with SNPE's quantization tool "snpe-dlc-quantize", I speculate that this is because SNPE adopts a rather basic post-training quantization method: Q = round((FP32 − Min) / ScaleFactor) + ZeroPoint.

```python
if self.inplace:
    y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
    y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
else:  # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
    xy = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
    wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
    y = torch.cat((xy, wh, y[..., 4:5]), -1)
```

Before this code block, xy, wh, and conf are tensors ranging from 0 to 1, so quantizing them does not cause much loss of accuracy. After this block, however, xy scales to a range roughly equal to the image size and wh scales to the pixel size of the targets, while only the confidence remains within 0 to 1. All of these variables (xy, wh, and conf) are then concatenated into one tensor, and in SNPE one tensor shares one set of quantization parameters. The quantization scale is therefore dictated by the component with the largest value range, which is xy. That scale causes a severe loss of resolution for the other components, and the confidence tends toward zero. This is the reason why there is no output after quantization. For example, for an image of size 640x640 with Int8 quantization, the scale factor is 640/256 = 2.5, which is even larger than the entire range of conf. This is the fundamental reason for the significant loss of quantization accuracy.

To address this issue, the strategy I employed is a very naive one: I multiply the confidence scores (conf) by a coefficient to scale them into the same range as xy and wh, which prevents excessive loss of precision during quantization. After the final model output, I divide by the same coefficient to obtain the normal detection confidence.

```python
if self.inplace:
    y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
    y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
else:  # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
    xy = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
    wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
    conf = y[..., 4:6] * 1280  # 1280 is the max length of the input size, e.g. the width of the input image
    y = torch.cat((xy, wh, conf), -1)
```

By using this method, adjustments only need to be made to the confidence scores at the end, without needing to add code elsewhere. Through testing, I found that the quantization accuracy loss resulting from this approach is acceptable.
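To illustrate the effect numerically, here is a small self-contained sketch of the shared-scale problem (not SNPE's actual quantizer; it just applies the affine formula above):

```python
import numpy as np

def fake_quant(x, lo, hi, levels=256):
    # Affine quantization Q = round((x - lo) / scale), dequantized back to float.
    scale = (hi - lo) / (levels - 1)
    return np.round((x - lo) / scale) * scale + lo

conf = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
print(fake_quant(conf, 0.0, 640.0))                 # shared scale with xy: everything collapses to 0
print(fake_quant(conf, 0.0, 1.0))                   # per-tensor scale: conf survives
print(fake_quant(conf * 1280, 0.0, 1280.0) / 1280)  # the rescaling workaround above
```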
Hello! Thanks for sharing your exploration and solution to the quantization accuracy loss issue when running YOLOv5 with SNPE. It's insightful to see how adjusting the range of the confidence scores can help mitigate precision loss due to quantization. Your approach of scaling the confidence scores in line with the other tensor ranges and then scaling back for the final output is a clever workaround. This strategy could be beneficial for others facing similar quantization challenges. It's always exciting to see community members contributing novel solutions to complex problems. Keep up the good work, and thank you for contributing to the broader knowledge base around YOLOv5 and SNPE!
BTW, I managed to convert yolov8 using SNPE 2.10 (opset=12 in export) without any major changes. In case your chip supports this version, it may save you a lot of time.
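For anyone trying to reproduce, the export step would be along these lines (assuming the ultralytics package CLI; the SNPE conversion then follows the same snpe-onnx-to-dlc route as above):

```
yolo export model=yolov8n.pt format=onnx opset=12
```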
Hey there! That's fantastic news! 🎉 It's always great to hear about smooth conversions, especially with newer versions like YOLOv8 and SNPE 2.10. Your sharing could indeed save a lot of time for many in the community working on similar projects. If there are specific steps or minor tweaks that helped you along the way, feel free to drop those details. Every bit helps! Thanks for sharing, and happy coding! 👍
The major problem at the moment is that quantization doesn't work well for yolov8-pose; for some reason it ruins the confidence.
Hello! Thanks for reaching out about the quantization issue with YOLOv8-pose. It's not uncommon for quantization to affect model confidence, as precision loss can significantly impact the network's output. 🤔 A potential approach is to experiment with different quantization techniques or tools that might offer better control over precision loss. Considering calibration datasets that closely represent your use case might also help mitigate this issue. It's all about finding the right balance for your specific scenario. If this doesn't resolve the problem, could you share more details about the quantization method you're using? This info might provide further insights for troubleshooting. Best regards!
❔Question
I am trying to convert a trained yolov5s model to an SNPE format in order to run it on a Snapdragon chip. Unfortunately, Qualcomm's ONNX-to-SNPE converter fails at the Detect layer.

I can imagine it may have something to do with the fact that SNPE currently supports 4D input data, where the first dimension is the batch (SNPE doc), while the yolov5 Detect layer performs a 5D reshape.

Would it be possible to modify the Detect layer so that no 5D reshape is performed?
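For reference, the 5D reshape in question is this line in Detect.forward of models/yolo.py (copied from the version discussed later in this thread):

```python
x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
```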