Remaining BBox kernel perf optimizations #6896
  w_ratio = new_width / old_width
  h_ratio = new_height / old_height
  ratios = torch.tensor([w_ratio, h_ratio, w_ratio, h_ratio], device=bounding_box.device)
  return (
-     bounding_box.reshape(-1, 2, 2).mul(ratios).to(bounding_box.dtype).reshape(bounding_box.shape),
+     bounding_box.mul(ratios).to(bounding_box.dtype),
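The two return expressions in the hunk above should be interchangeable: scaling each box as two `(x, y)` points and reshaping back gives the same numbers as one broadcast multiply by `[w_ratio, h_ratio, w_ratio, h_ratio]`. Below is a minimal sketch that checks this on random XYXY boxes. The helper names `resize_old` and `resize_new` are hypothetical, and the two-element ratio tensor in `resize_old` is an assumption (it is what `reshape(-1, 2, 2)` would need in order to broadcast); neither is taken from the PR.

```python
import torch


def resize_old(boxes, w_ratio, h_ratio):
    # Assumed earlier form: view each box as two (x, y) points, scale, reshape back.
    ratios = torch.tensor([w_ratio, h_ratio], device=boxes.device)
    return boxes.reshape(-1, 2, 2).mul(ratios).to(boxes.dtype).reshape(boxes.shape)


def resize_new(boxes, w_ratio, h_ratio):
    # Form from the diff: one (4,) ratio tensor broadcast over the last dimension.
    ratios = torch.tensor([w_ratio, h_ratio, w_ratio, h_ratio], device=boxes.device)
    return boxes.mul(ratios).to(boxes.dtype)


torch.manual_seed(0)
float_boxes = torch.rand(128, 4) * 100
# Values stay below 100 so the scaled result still fits in uint8.
int_boxes = torch.randint(0, 100, (128, 4), dtype=torch.uint8)

for boxes in (float_boxes, int_boxes):
    old = resize_old(boxes, 1.5, 0.75)
    new = resize_new(boxes, 1.5, 0.75)
    assert old.dtype == new.dtype == boxes.dtype
    assert torch.equal(old, new)
```

The new form skips two reshapes per call, which is plausibly where the few microseconds come from on small inputs such as `(128, 4)`.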
Improvement:
[------------ resize cpu torch.float32 ------------]
                |      old       |      new
1 threads: -----------------------------------------
      (128, 4)  |  13 (+- 0) us  |   8 (+- 0) us
6 threads: -----------------------------------------
      (128, 4)  |  13 (+- 0) us  |   8 (+- 0) us

Times are in microseconds (us).

[----------- resize cuda torch.float32 ------------]
                |      old       |      new
1 threads: -----------------------------------------
      (128, 4)  |  37 (+- 0) us  |  31 (+- 0) us
6 threads: -----------------------------------------
      (128, 4)  |  37 (+- 0) us  |  31 (+- 0) us

Times are in microseconds (us).

[------------- resize cpu torch.uint8 -------------]
                |      old       |      new
1 threads: -----------------------------------------
      (128, 4)  |  19 (+- 0) us  |  13 (+- 0) us
6 threads: -----------------------------------------
      (128, 4)  |  19 (+- 0) us  |  13 (+- 0) us

Times are in microseconds (us).

[------------ resize cuda torch.uint8 -------------]
                |      old       |      new
1 threads: -----------------------------------------
      (128, 4)  |  45 (+- 0) us  |  39 (+- 0) us
6 threads: -----------------------------------------
      (128, 4)  |  45 (+- 0) us  |  39 (+- 1) us

Times are in microseconds (us).
Maybe we can merge this after #6879.
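The resize tables above have the layout that `torch.utils.benchmark.Compare` prints. For anyone wanting to reproduce a comparison of this kind on CPU, here is a rough sketch; it is not the script used for the numbers in this thread. The helpers `resize_old` and `resize_new` are hypothetical stand-ins for the two formulations, and the two-element ratio tensor in `resize_old` is an assumption.

```python
import torch
from torch.utils import benchmark


def resize_old(boxes, w_ratio, h_ratio):
    # Assumed earlier form: scale as (x, y) points, then reshape back.
    ratios = torch.tensor([w_ratio, h_ratio], device=boxes.device)
    return boxes.reshape(-1, 2, 2).mul(ratios).to(boxes.dtype).reshape(boxes.shape)


def resize_new(boxes, w_ratio, h_ratio):
    # Form from the diff: a single broadcast multiply.
    ratios = torch.tensor([w_ratio, h_ratio, w_ratio, h_ratio], device=boxes.device)
    return boxes.mul(ratios).to(boxes.dtype)


boxes = torch.rand(128, 4) * 100
results = []
for description, fn in (("old", resize_old), ("new", resize_new)):
    for num_threads in (1, 6):
        timer = benchmark.Timer(
            stmt="fn(boxes, 1.5, 0.75)",
            globals={"fn": fn, "boxes": boxes},
            label="resize cpu torch.float32",
            sub_label=str(tuple(boxes.shape)),
            description=description,
            num_threads=num_threads,
        )
        # blocked_autorange picks the number of runs per block itself.
        results.append(timer.blocked_autorange(min_run_time=0.1))

# Prints one table per label, with one column per description.
benchmark.Compare(results).print()
```

Absolute timings depend on the machine and PyTorch build, so only the relative gap between the two columns is meaningful.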
Nice optim for resize, thanks @datumbox
…astic_bounding_box`.
@vfdev-5 I just pushed a couple of untested opts. Could you check again which you think are safe? I'll do benchmarks after we confirm which ones we want in.
I'll cherry-pick the ones for elastic that make sense. Thanks for the pointers!
@@ -388,8 +389,7 @@ def _affine_bounding_box_xyxy(
     new_points = torch.matmul(points, transposed_affine_matrix)
     tr, _ = torch.min(new_points, dim=0, keepdim=True)
     # Translate bounding boxes
-    out_bboxes[:, 0::2] = out_bboxes[:, 0::2] - tr[:, 0]
-    out_bboxes[:, 1::2] = out_bboxes[:, 1::2] - tr[:, 1]
+    out_bboxes.sub_(tr.repeat((1, 2)))
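A small sketch of why the single in-place subtraction matches the two sliced assignments: `tr` has shape `(1, 2)` holding the minimum x and y, so `tr.repeat((1, 2))` has shape `(1, 4)` laid out as `[x, y, x, y]`, and it broadcasts across every box row. The tensors below are made-up sample data, not values from the PR.

```python
import torch

torch.manual_seed(0)
out_bboxes = torch.rand(16, 4) * 100   # sample XYXY boxes
new_points = torch.rand(64, 2) * 100   # sample transformed corner points
tr, _ = torch.min(new_points, dim=0, keepdim=True)  # shape (1, 2): [min_x, min_y]

# Earlier form: two strided slice assignments, one per coordinate axis.
expected = out_bboxes.clone()
expected[:, 0::2] = expected[:, 0::2] - tr[:, 0]
expected[:, 1::2] = expected[:, 1::2] - tr[:, 1]

# New form: one in-place subtraction of the tiled (1, 4) offset.
actual = out_bboxes.clone()
actual.sub_(tr.repeat((1, 2)))

assert tr.repeat((1, 2)).shape == (1, 4)
assert torch.equal(actual, expected)
```

Because `sub_` writes in place, it assumes `out_bboxes` already has a dtype that can hold the result; the sketch uses float32 throughout.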
Improvement for both changes:
[-------------------- bbox_rotate cpu -------------------]
                     |      False       |       True
1 threads: ----------------------------------------------
      torch.float32  |  265 (+- 40) us  |  225 (+- 2) us
      torch.float64  |  261 (+- 1) us   |  241 (+- 1) us
      torch.int32    |  258 (+- 1) us   |  239 (+- 2) us
      torch.int64    |  260 (+- 1) us   |  239 (+- 1) us
6 threads: ----------------------------------------------
      torch.float32  |  466 (+- 10) us  |  405 (+- 20) us
      torch.float64  |  483 (+- 10) us  |  422 (+- 55) us
      torch.int32    |  479 (+- 10) us  |  420 (+- 10) us
      torch.int64    |  482 (+- 18) us  |  422 (+- 10) us

Times are in microseconds (us).

[-------------------- bbox_rotate cpu -------------------]
                     |      False       |       True
1 threads: ----------------------------------------------
      torch.float32  |  498 (+- 46) us  |  432 (+- 0) us
      torch.float64  |  489 (+- 1) us   |  446 (+- 0) us
      torch.int32    |  503 (+- 0) us   |  459 (+- 3) us
      torch.int64    |  504 (+- 3) us   |  458 (+- 0) us
6 threads: ----------------------------------------------
      torch.float32  |  573 (+- 2) us   |  530 (+- 0) us
      torch.float64  |  600 (+- 20) us  |  554 (+- 20) us
      torch.int32    |  609 (+- 20) us  |  560 (+- 10) us
      torch.int64    |  598 (+- 58) us  |  563 (+- 10) us

Times are in microseconds (us).
LGTM, thanks @datumbox
Summary:
* Bbox resize optimization
* Other (untested) optimizations on `_affine_bounding_box_xyxy` and `elastic_bounding_box`.
* fix conflict
* Reverting changes on elastic
* revert one more change
* Further improvement

Reviewed By: datumbox
Differential Revision: D41020550
fbshipit-source-id: dfd1f2d91490b45176f1976bcec1fc99248f8587
Some of the optimizations highlighted in #6872.
cc @vfdev-5 @bjuncek @pmeier