
Implementation of multi_proposal_target_layer #1

Open
dingjiansw101 opened this issue May 12, 2019 · 5 comments

Comments

@dingjiansw101
Hi, I noticed that in "MultiProposalTargetGPUOp.cu" you copy some intermediate results to the CPU, do part of the computation there, and then move the results back to the GPU. What is the purpose of this?

cudaMemcpy(gt_boxes, tgt_boxes.dptr_, 5 * sizeof(float) * num_images * 100, cudaMemcpyDeviceToHost);

@bharatsingh430 (Collaborator)

bharatsingh430 commented May 12, 2019

The later part of the code just does some padding, setting invalid labels up to the maximum number of proposals. That doesn't require much compute, so it's done in C++.
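A minimal sketch of the padding step being described, under the assumption that labels beyond the number of valid proposals are filled with a sentinel value. The names `pad_invalid_labels` and `kInvalidLabel` are illustrative, not taken from the repository:

```cpp
#include <vector>

// Assumed sentinel for "ignore this slot" in downstream loss computation.
constexpr float kInvalidLabel = -1.0f;

// After the valid proposals have been assigned labels, fill the remaining
// slots up to max_proposals with the invalid label. This is cheap, O(max
// proposals) per image, which is why it can stay on the CPU.
void pad_invalid_labels(std::vector<float>& labels, int num_valid, int max_proposals) {
    labels.resize(max_proposals);
    for (int i = num_valid; i < max_proposals; ++i) {
        labels[i] = kInvalidLabel;  // mark padded slots as invalid
    }
}
```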

@dingjiansw101 (Author)

How about the calculation of overlaps and targets (from line 481 to 572)?

@bharatsingh430 (Collaborator)

You are right, there is more than padding over there. The number of proposals is typically 500, so the compute is only about 500 × 500 × batch_size overlap evaluations. This wasn't a big issue when the code was profiled, but there could be use cases where it becomes one (when the number of proposals is much larger).
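For reference, the overlap being computed for each (proposal, ground-truth) pair is the intersection-over-union. A self-contained sketch, assuming the common (x1, y1, x2, y2) box layout (the actual layout in the operator may differ):

```cpp
#include <algorithm>
#include <cmath>

// Intersection-over-union of two axis-aligned boxes in (x1, y1, x2, y2)
// layout. Evaluating this for every proposal against every ground-truth box
// in every image gives the num_proposals x num_boxes x batch_size cost
// mentioned above.
float iou(const float a[4], const float b[4]) {
    float ix1 = std::max(a[0], b[0]), iy1 = std::max(a[1], b[1]);
    float ix2 = std::min(a[2], b[2]), iy2 = std::min(a[3], b[3]);
    float iw = std::max(0.0f, ix2 - ix1), ih = std::max(0.0f, iy2 - iy1);
    float inter = iw * ih;                              // intersection area
    float area_a = (a[2] - a[0]) * (a[3] - a[1]);
    float area_b = (b[2] - b[0]) * (b[3] - b[1]);
    return inter / (area_a + area_b - inter);           // union in denominator
}
```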

@dingjiansw101 (Author)

Thank you! Another question: the SNIPER repo says "2. NO PYTHON LAYERS (Every layer is optimized for large batch sizes in CUDA/C++)". How do you optimize it in CUDA/C++? I am really interested in this.

@bharatsingh430 (Collaborator)

You can write CUDA kernels differently for different batch sizes. For example, in the proposal generation layer, NMS is run on the top 6000/12000 proposals and is optimized for a batch size of 1, because people are concerned about latency as well (both blocks and threads are used to compute overlaps in NMS, which keeps the GPU underutilized). During training you want to maximize throughput, so you can write your kernels differently: give each image to a block (which goes to a separate multiprocessor) and do the overlap computation using threads (which execute in parallel on the cores inside an SM). Example: https://github.com/mahyarnajibi/SNIPER-mxnet/blob/master/src/operator/multi_proposal_target_mask.cu
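A hedged sketch of the block-per-image mapping described above (this is illustrative, not the SNIPER-mxnet code; all names, dimensions, and the (x1, y1, x2, y2) box layout are assumptions):

```cuda
#include <cuda_runtime.h>

// One thread block per image in the batch, so each image's work lands on its
// own SM; the threads of the block stride over all (proposal, ground-truth)
// pairs of that image and compute the overlaps in parallel.
__global__ void batch_overlap_kernel(const float* proposals,  // [B, P, 4]
                                     const float* gt_boxes,   // [B, G, 4]
                                     float* overlaps,         // [B, P, G]
                                     int P, int G) {
  int img = blockIdx.x;  // block index selects the image
  const float* props = proposals + img * P * 4;
  const float* gts   = gt_boxes  + img * G * 4;
  float* out         = overlaps  + img * P * G;
  // grid-stride loop over all pairs of this image
  for (int idx = threadIdx.x; idx < P * G; idx += blockDim.x) {
    int p = idx / G, g = idx % G;
    const float* a = props + p * 4;
    const float* b = gts   + g * 4;
    float ix1 = fmaxf(a[0], b[0]), iy1 = fmaxf(a[1], b[1]);
    float ix2 = fminf(a[2], b[2]), iy2 = fminf(a[3], b[3]);
    float iw = fmaxf(0.f, ix2 - ix1), ih = fmaxf(0.f, iy2 - iy1);
    float inter = iw * ih;
    float uni = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter;
    out[p * G + g] = inter / uni;  // IoU for this (proposal, gt) pair
  }
}

// Launched with one block per image, e.g.:
//   batch_overlap_kernel<<<B, 256>>>(d_props, d_gts, d_overlaps, P, G);
```

This throughput-oriented mapping trades the low-latency design (all blocks and threads working on a single image) for full SM occupancy across the batch.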
