Add warp_perspective operator #5542


Merged
merged 6 commits into NVIDIA:main on Jul 18, 2024

Conversation

banasraf
Collaborator

Category:

New feature

Description:

It adds a new experimental.warp_perspective operator that uses a CV-CUDA operator as its implementation.
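For intuition about what the new operator computes, here is a small numpy sketch (illustration only, not the operator's code) of how a 3x3 perspective matrix maps a single point; the operator applies this mapping per pixel on the GPU:

```python
import numpy as np

def warp_point(matrix, x, y):
    """Map (x, y) through a 3x3 perspective (homography) matrix,
    including the perspective divide by the homogeneous coordinate."""
    hx, hy, hw = matrix @ np.array([x, y, 1.0])
    return hx / hw, hy / hw

# A pure translation expressed as a homography:
translate = np.array([[1.0, 0.0, 10.0],
                      [0.0, 1.0,  5.0],
                      [0.0, 0.0,  1.0]])
```

The identity matrix leaves points unchanged, while the translation above shifts them by (10, 5).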

Additional information:

Affected modules and functionalities:

New operator

Key points relevant for the review:

Correctness of the handling of the parameters

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

@banasraf banasraf force-pushed the add-warp-perspective-operator branch from eacdc59 to a0f9683 Compare June 28, 2024 10:57
elif dtype == np.int16 or dtype == np.uint16:
eps = 5

test_utils.compare_pipelines(pipe1, pipe2, batch_size=bs, N_iterations=10, eps=eps)

Check failure

Code scanning / CodeQL

Potentially uninitialized local variable Error test

Local variable 'eps' may be used before it is initialized.
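A minimal sketch of how the warning can be silenced: give `eps` a default before the dtype branches so it is initialized on every path. Only the int16/uint16 branch and its value come from the diff above; the default value and function name here are assumptions for illustration.

```python
import numpy as np

def tolerance_for(dtype):
    """Pick a comparison tolerance per dtype. Assigning a default first
    guarantees `eps` is initialized on every code path, which resolves
    the CodeQL 'potentially uninitialized local variable' alert."""
    eps = 1  # hypothetical default; only the branch below is from the diff
    if dtype == np.int16 or dtype == np.uint16:
        eps = 5
    return eps
```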
Comment on lines 179 to 180
matrix = AcquireTensorArgument(ws, scratchpad, matrix_arg_, TensorShape<1>(9),
nvcvop::GetDataType<float>(), "W");
Contributor

@mzient mzient Jul 2, 2024

I'm pretty sure we need to apply a fixup to the matrix to match WarpAffine. We can add OpenCV compatibility (here and in WarpAffine) as an option, but I think being self-consistent is far better than randomly matching a patchwork of common libraries.

Collaborator Author

Done + added tests for compatibility with warp_affine

@@ -264,7 +264,7 @@ if (BUILD_CVCUDA)
set(DALI_BUILD_PYTHON ${BUILD_PYTHON})
set(BUILD_PYTHON OFF)
# for now we use only median blur from CV-CUDA
set(CV_CUDA_SRC_PATERN medianblur median_blur morphology)
set(CV_CUDA_SRC_PATERN medianblur median_blur morphology warp)
Collaborator

Nitpick: I know it's not you but this should be PATTERN

Collaborator Author

That'll need to be fixed in cv-cuda

@@ -0,0 +1,372 @@
# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Collaborator

We do a lot of parameter handling with tons of bug-prone ifs. I think we need to have some negative tests to check if we handle invalid argument combinations properly.

Collaborator Author

I added a few tests that check throwing errors.

@banasraf banasraf force-pushed the add-warp-perspective-operator branch from b7828eb to d0b1ac7 Compare July 16, 2024 09:28
@banasraf
Collaborator Author

!build

@mzient mzient self-assigned this Jul 16, 2024
@dali-automaton
Collaborator

CI MESSAGE: [16634050]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [16634050]: BUILD FAILED

@banasraf
Collaborator Author

!build

@dali-automaton
Collaborator

CI MESSAGE: [16639288]: BUILD STARTED

Comment on lines 32 to 50
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int matrix_id = tid / 4;
if (matrix_id >= batch_size) {
return;
}
auto *data_ptr = wrap.ptr(matrix_id);
auto *matrix = reinterpret_cast<mat3*>(data_ptr);
int sub_tid = tid % 4;
if (sub_tid % 2 == 0) {
// this modifies only the first two rows
int row_id = sub_tid / 2;
matrix->set_row(row_id, matrix->row(row_id) - matrix->row(2) * 0.5);
}
__syncthreads();
if (sub_tid < 4) {
// this modifies only the third column
int row_id = sub_tid;
matrix->at(row_id, 2) = dot(matrix->row(row_id), vec3{0.5, 0.5, 1});
}
Contributor

Isn't it overcomplicated? Even with a huge batch, we're talking about two 3x3 matrix multiplications per sample. The code is not very readable, and still, arguably the most costly part - loading the matrices from global memory - isn't optimized.
We can probably get away with sample-per-thread and just doing the whole thing - or we can go with a block of 32x9.

Collaborator Author

I benchmarked a few variants of the kernel and they made almost no difference in end-to-end operator performance, so I went with a single thread per matrix; the kernel is as simple as it can get.

Contributor

Have you tested on more powerful cards like SXM H100?


using MatricesWrap = nvcv::cuda::TensorWrap<float, 9 * sizeof(float), sizeof(float)>;

__global__ void adjustMatricesKernel(MatricesWrap wrap, int batch_size) {
Contributor

It's worth mentioning that the same fixup can be applied for dst->src and src->dst mapping.

Collaborator Author

Done

@dali-automaton
Collaborator

CI MESSAGE: [16639288]: BUILD FAILED

Comment on lines 163 to 164
nvcv::Tensor AsTensor(const Tensor<GPUBackend> &tensor, TensorLayout layout = "",
const std::optional<TensorShape<>> &reshape = {}) {
Member

Suggested change
nvcv::Tensor AsTensor(const Tensor<GPUBackend> &tensor, TensorLayout layout = "",
const std::optional<TensorShape<>> &reshape = {}) {
nvcv::Tensor AsTensor(const Tensor<GPUBackend> &tensor, TensorLayout layout,
const std::optional<TensorShape<>> &reshape) {

Wouldn't it be better to assign default arguments in the definition (in nvcvop.h) instead of here?
Also, if you're using std::optional, I believe that better default value would be std::nullopt instead of default constructor (I might be wrong though).

Edit: on second thought, I understand this const optional & as "a reference which might not contain a value". Would this be just const TensorShape<> * = nullptr?

Collaborator Author

I moved the default values to the definition and used nullopt instead of {}.

Regarding the optional vs. nullable pointer: I prefer optional, it's safer imo.

Signed-off-by: Rafal Banas <[email protected]>
@banasraf
Collaborator Author

!build

@dali-automaton
Collaborator

CI MESSAGE: [16674363]: BUILD STARTED

channels, ". Number of values provided: ", fill_value_arg_.size(), "."));
}
} else {
DALI_FAIL("Only scalar fill_value can be provided when processing data in planar layout.");
Contributor

Is that CV-CUDA limitation or ours?

Collaborator Author

@banasraf banasraf Jul 18, 2024

Let's say CV-CUDA. The problem is that CV-CUDA doesn't support planar layouts at all (in this op), and it only accepts a single fill_value for the whole batch, so I cannot provide a different fill value for each plane.
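The restriction the diff above enforces can be sketched in a few lines of Python. This is a hypothetical mock of the validation logic, not the operator's actual code; the function name, layout check, and messages are assumptions modeled on the error strings in the excerpt:

```python
def validate_fill_value(fill_value, layout, channels):
    """Reject per-channel fill values for planar (channel-first) layouts,
    mirroring the C++ checks quoted in the diff above."""
    planar = layout.startswith("C") or layout.startswith("FC")
    if planar:
        if len(fill_value) > 1:
            raise ValueError(
                "Only scalar fill_value can be provided when processing "
                "data in planar layout.")
    elif len(fill_value) not in (1, channels):
        raise ValueError(
            f"fill_value must be a scalar or provide one value per channel "
            f"({channels}). Number of values provided: {len(fill_value)}.")
```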

Contributor

We could launch the planes separately, but whatever.

@@ -1615,6 +1615,7 @@ def full_like_pipe():
"experimental.median_blur", # not supported for CPU
"experimental.dilate", # not supported for CPU
"experimental.erode", # not supported for CPU
"experimental.warp_perspective", # not supported for CPU
Contributor

Why experimental?

Collaborator Author

All cv-cuda ops so far are in experimental. We can think of moving them all out soon. My major concern is the not-so-clear situation with depending on CV-CUDA. Right now we compile the sources, but at some point we're going to move to an actual dependency. That could potentially affect those ops.

@dali-automaton
Collaborator

CI MESSAGE: [16674363]: BUILD FAILED

{0., 0., 1.}
}};

*matrix = shift_back * *matrix * shift;
Contributor

I'm not sure if the compiler will be smart here. I recommend doing this:

  auto m = *matrix;  // this will make sure that the matrix is actually in registers
  m = m * shift;
  m(0, 2) -= 0.5f;  // there's no point in running full matrix multiplication for the shift-back when all we need is 2 additions
  m(1, 2) -= 0.5f;
  *matrix = m;

Collaborator Author

Two additions are not all you need. I changed it to:

  matrix = matrix * shift;
  matrix.set_row(0, matrix.row(0) - matrix.row(2) * 0.5f);
  matrix.set_row(1, matrix.row(1) - matrix.row(2) * 0.5f);
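As a sanity check of the simplification discussed above, here is a numpy sketch (not the DALI code) verifying that right-multiplying by `shift` and then subtracting half of the third row from the first two rows is identical to the full shift_back * M * shift product:

```python
import numpy as np

# Translation by +0.5 (corner-origin -> center-origin coordinates).
shift = np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, 0.5],
                  [0.0, 0.0, 1.0]])
# Translation by -0.5 applied on the left after the warp.
shift_back = np.array([[1.0, 0.0, -0.5],
                       [0.0, 1.0, -0.5],
                       [0.0, 0.0, 1.0]])

def adjust_full(m):
    """The straightforward two-multiplication version."""
    return shift_back @ m @ shift

def adjust_simplified(m):
    """Left-multiplying by shift_back only subtracts half of row 2
    from rows 0 and 1, so the second full multiply can be avoided."""
    m = m @ shift
    m[0] -= m[2] * 0.5
    m[1] -= m[2] * 0.5
    return m
```

The two agree for any matrix, because a left translation by (-0.5, -0.5) only mixes the third row into the first two.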

Signed-off-by: Rafal Banas <[email protected]>
@banasraf
Collaborator Author

!build

@dali-automaton
Collaborator

CI MESSAGE: [16702897]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [16702897]: BUILD FAILED

Signed-off-by: Rafal Banas <[email protected]>
@banasraf
Collaborator Author

!build

@dali-automaton
Collaborator

CI MESSAGE: [16706360]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [16706360]: BUILD FAILED

@@ -264,7 +264,7 @@ if (BUILD_CVCUDA)
set(DALI_BUILD_PYTHON ${BUILD_PYTHON})
set(BUILD_PYTHON OFF)
# for now we use only median blur from CV-CUDA
Contributor

the comment is no longer up to date

.InputDox(0, "input", "TensorList of uint8, uint16, int16 or float",
"Input data. Must be images in HWC or CHW layout, or a sequence of those.")
.InputDox(1, "matrix_gpu", "1D TensorList of float",
"Transformation matrix data. Should be used to pass the GPU data. "
Contributor

Suggested change
"Transformation matrix data. Should be used to pass the GPU data. "
"Transformation matrix data, on GPU memory. "

* @brief Modifies (in-place) tensor of perspective matrices to match
* the OpenCV convention of pixel origin (center instead of corner).
*/
void adjustMatrices(nvcv::Tensor &matrices, cudaStream_t stream);
Contributor

Suggested change
void adjustMatrices(nvcv::Tensor &matrices, cudaStream_t stream);
void adjustMatricesToPixelCenter(nvcv::Tensor &matrices, cudaStream_t stream);

nitpick: or something more explicit

@dali-automaton
Collaborator

CI MESSAGE: [16706360]: BUILD PASSED

@banasraf banasraf merged commit a473467 into NVIDIA:main Jul 18, 2024
7 checks passed
8 participants