Add warp_perspective operator #5542
Conversation
Signed-off-by: Rafal Banas <[email protected]>
Force-pushed from eacdc59 to a0f9683
matrix = AcquireTensorArgument(ws, scratchpad, matrix_arg_, TensorShape<1>(9),
                               nvcvop::GetDataType<float>(), "W");
I'm pretty sure we need to apply a fixup to the matrix to match WarpAffine. We can add OpenCV compatibility (here and in WarpAffine) as an option, but I think being self-consistent is far better than randomly matching a patchwork of common libraries.
Done + added tests for compatibility with warp_affine
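For reference, the half-pixel fixup under discussion can be written as plain matrix algebra: to go from a corner-origin convention to OpenCV's pixel-center convention, the warp matrix M is conjugated with +/-0.5 translations, M' = T(-0.5) * M * T(+0.5). Below is a minimal sketch in plain Python (illustrative only; `matmul3`, `translation`, and `adjust_to_pixel_center` are hypothetical names, not DALI's API):

```python
def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def translation(t):
    """Homogeneous 2D translation by (t, t)."""
    return [[1.0, 0.0, t], [0.0, 1.0, t], [0.0, 0.0, 1.0]]

def adjust_to_pixel_center(m):
    """Conjugate m with +/-0.5 shifts: M' = T(-0.5) * M * T(+0.5)."""
    return matmul3(translation(-0.5), matmul3(m, translation(0.5)))

# The identity warp is unchanged: the two shifts cancel exactly.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
assert adjust_to_pixel_center(identity) == identity
```

A non-trivial warp, e.g. scaling by 2, picks up a 0.5 translation after the fixup, because the scaling now happens about the pixel center rather than the corner.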
@@ -264,7 +264,7 @@ if (BUILD_CVCUDA)
   set(DALI_BUILD_PYTHON ${BUILD_PYTHON})
   set(BUILD_PYTHON OFF)
   # for now we use only median blur from CV-CUDA
-  set(CV_CUDA_SRC_PATERN medianblur median_blur morphology)
+  set(CV_CUDA_SRC_PATERN medianblur median_blur morphology warp)
Nitpick: I know it's not your change, but this should be PATTERN.
That will need to be fixed in CV-CUDA.
@@ -0,0 +1,372 @@
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
We do a lot of parameter handling with many bug-prone ifs. I think we need some negative tests to check that we handle invalid argument combinations properly.
I added a few tests that check that the expected errors are thrown.
… fixes Signed-off-by: Rafal Banas <[email protected]>
b7828eb
to
d0b1ac7
Compare
!build
CI MESSAGE: [16634050]: BUILD STARTED
CI MESSAGE: [16634050]: BUILD FAILED
Signed-off-by: Rafal Banas <[email protected]>
!build
CI MESSAGE: [16639288]: BUILD STARTED
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int matrix_id = tid / 4;
if (matrix_id >= batch_size) {
  return;
}
auto *data_ptr = wrap.ptr(matrix_id);
auto *matrix = reinterpret_cast<mat3*>(data_ptr);
int sub_tid = tid % 4;
if (sub_tid % 2 == 0) {
  // this modifies only the first two rows
  int row_id = sub_tid / 2;
  matrix->set_row(row_id, matrix->row(row_id) - matrix->row(2) * 0.5);
}
__syncthreads();
if (sub_tid < 4) {
  // this modifies only the third column
  int row_id = sub_tid;
  matrix->at(row_id, 2) = dot(matrix->row(row_id), vec3{0.5, 0.5, 1});
}
Isn't this overcomplicated? Even with a huge batch, we're talking about two 3x3 matrix multiplications per sample. The code is not very readable, and still, arguably the most costly part (loading the matrices from global memory) isn't optimized.
We can probably get away with sample-per-thread and just doing the whole thing, or we can go with a block of 32x9.
I benchmarked a few variants of the kernel and they made almost no difference in end-to-end operator performance, so I went with a single thread per matrix; the kernel is as simple as it can get.
Have you tested it on more powerful cards, like an SXM H100?
using MatricesWrap = nvcv::cuda::TensorWrap<float, 9 * sizeof(float), sizeof(float)>;

__global__ void adjustMatricesKernel(MatricesWrap wrap, int batch_size) {
It's worth mentioning that the same fixup can be applied for dst->src and src->dst mapping.
Done
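The point about both mapping directions can be checked algebraically. Since T(+0.5) and T(-0.5) are mutual inverses, the fixup F(M) = T(-0.5) * M * T(+0.5) satisfies F(A) * F(B) = F(A * B), so F(M)^-1 = F(M^-1): applying the same fixup to the src->dst matrix and to its dst->src inverse keeps them inverses of each other. A plain-Python sketch (illustrative only, not DALI or CV-CUDA code; all names here are hypothetical):

```python
def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def translation(t):
    """Homogeneous 2D translation by (t, t)."""
    return [[1.0, 0.0, t], [0.0, 1.0, t], [0.0, 0.0, 1.0]]

def fixup(m):
    """Conjugate m with the +/-0.5 pixel-center shifts."""
    return matmul3(translation(-0.5), matmul3(m, translation(0.5)))

# An invertible src->dst matrix and its dst->src inverse (m * m_inv == I).
m = [[2.0, 0.0, 3.0], [0.0, 4.0, 5.0], [0.0, 0.0, 1.0]]
m_inv = [[0.5, 0.0, -1.5], [0.0, 0.25, -1.25], [0.0, 0.0, 1.0]]

# Fixing up each direction independently still yields a mutually inverse pair.
product = matmul3(fixup(m), fixup(m_inv))
```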
CI MESSAGE: [16639288]: BUILD FAILED
dali/operators/nvcvop/nvcvop.cc
Outdated
nvcv::Tensor AsTensor(const Tensor<GPUBackend> &tensor, TensorLayout layout = "",
                      const std::optional<TensorShape<>> &reshape = {}) {
-nvcv::Tensor AsTensor(const Tensor<GPUBackend> &tensor, TensorLayout layout = "",
-                      const std::optional<TensorShape<>> &reshape = {}) {
+nvcv::Tensor AsTensor(const Tensor<GPUBackend> &tensor, TensorLayout layout,
+                      const std::optional<TensorShape<>> &reshape) {
Wouldn't it be better to assign the default arguments in the declaration (in nvcvop.h) instead of here?
Also, if you're using std::optional, I believe a better default value would be std::nullopt instead of the default constructor (I might be wrong, though).
Edit: on second thought, I understand this const optional & as "a reference which might not contain a value". Would this be just const TensorShape<> * = nullptr?
I moved the default values to the declaration and used std::nullopt instead of {}.
Regarding optional vs. a nullable pointer: I prefer optional; it's safer, IMO.
Signed-off-by: Rafal Banas <[email protected]>
!build
CI MESSAGE: [16674363]: BUILD STARTED
      channels, ". Number of values provided: ", fill_value_arg_.size(), "."));
  }
} else {
  DALI_FAIL("Only scalar fill_value can be provided when processing data in planar layout.");
Is that CV-CUDA limitation or ours?
Let's say CV-CUDA. The problem is that CV-CUDA doesn't support planar layouts at all (in this op), and it accepts only a single fill_value for the whole batch, so I cannot provide a different fill value for each plane.
We could launch the planes separately, but whatever.
@@ -1615,6 +1615,7 @@ def full_like_pipe():
    "experimental.median_blur",  # not supported for CPU
    "experimental.dilate",  # not supported for CPU
    "experimental.erode",  # not supported for CPU
    "experimental.warp_perspective",  # not supported for CPU
Why experimental?
All CV-CUDA ops so far are in experimental. We can think about moving them all out soon. My main concern is the not-so-clear situation around depending on CV-CUDA: right now we compile the sources, but at some point we're going to move to an actual dependency, and that could potentially affect those ops.
CI MESSAGE: [16674363]: BUILD FAILED
  {0., 0., 1.}
}};

*matrix = shift_back * *matrix * shift;
I'm not sure the compiler will be smart here. I recommend doing this:
auto m = *matrix; // this will make sure that the matrix is actually in registers
m = m * shift;
m(0, 2) -= 0.5f; // there's no point in running full matrix multiplication for the shift-back when all we need is 2 additions
m(1, 2) -= 0.5f;
*matrix = m;
Two additions are not all you need. I changed it to:
matrix = matrix * shift;
matrix.set_row(0, matrix.row(0) - matrix.row(2) * 0.5f);
matrix.set_row(1, matrix.row(1) - matrix.row(2) * 0.5f);
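The two forms above are algebraically equivalent: left-multiplying by T(-0.5) subtracts 0.5 times the third row from each of the first two rows. A plain-Python check (illustrative only, not the actual CUDA kernel; all names are hypothetical), using a matrix with a non-trivial perspective row:

```python
def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def translation(t):
    """Homogeneous 2D translation by (t, t)."""
    return [[1.0, 0.0, t], [0.0, 1.0, t], [0.0, 0.0, 1.0]]

def fixup_full(m):
    """Full conjugation: T(-0.5) * m * T(+0.5)."""
    return matmul3(translation(-0.5), matmul3(m, translation(0.5)))

def fixup_rows(m):
    """Same fixup, with the left multiplication folded into row updates."""
    m = matmul3(m, translation(0.5))
    for i in (0, 1):
        # row_i -= 0.5 * row_2, matching the set_row calls above
        m[i] = [m[i][j] - 0.5 * m[2][j] for j in range(3)]
    return m

# A warp matrix with a non-trivial perspective (bottom) row.
warp = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [0.5, 0.25, 1.0]]
assert fixup_full(warp) == fixup_rows(warp)
```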
Signed-off-by: Rafal Banas <[email protected]>
!build
CI MESSAGE: [16702897]: BUILD STARTED
CI MESSAGE: [16702897]: BUILD FAILED
Signed-off-by: Rafal Banas <[email protected]>
!build
CI MESSAGE: [16706360]: BUILD STARTED
CI MESSAGE: [16706360]: BUILD FAILED
@@ -264,7 +264,7 @@ if (BUILD_CVCUDA)
   set(DALI_BUILD_PYTHON ${BUILD_PYTHON})
   set(BUILD_PYTHON OFF)
   # for now we use only median blur from CV-CUDA
This comment is no longer up to date.
.InputDox(0, "input", "TensorList of uint8, uint16, int16 or float",
          "Input data. Must be images in HWC or CHW layout, or a sequence of those.")
.InputDox(1, "matrix_gpu", "1D TensorList of float",
          "Transformation matrix data. Should be used to pass the GPU data. "
"Transformation matrix data. Should be used to pass the GPU data. " | |
"Transformation matrix data, on GPU memory. " |
 * @brief Modifies (in-place) a tensor of perspective matrices to match
 * the OpenCV convention of pixel origin (center instead of corner).
 */
void adjustMatrices(nvcv::Tensor &matrices, cudaStream_t stream);
-void adjustMatrices(nvcv::Tensor &matrices, cudaStream_t stream);
+void adjustMatricesToPixelCenter(nvcv::Tensor &matrices, cudaStream_t stream);
Nitpick: or something more explicit.
CI MESSAGE: [16706360]: BUILD PASSED
Category:
New feature
Description:
Adds a new experimental.warp_perspective operator that uses a CV-CUDA operator as its implementation.
Additional information:
Affected modules and functionalities:
New operator
Key points relevant for the review:
Correctness of the parameter handling
Tests:
Checklist
Documentation
DALI team only
Requirements
REQ IDs: N/A
JIRA TASK: N/A