
About reimplement training code #12

Closed
Crazylov3 opened this issue Apr 24, 2023 · 9 comments

@Crazylov3

Hi,
I am very interested in your beautiful work.
I want to reimplement the training code to match the requirements of my task, and I have some questions about technical details in order to reproduce your results. I would be grateful if you could help me.

1. In your paper, you give this pseudo-code:

   def loss(image_0, model):
       # get warped image and pixel correspondences
       image_1, corr_0, corr_1 = rand_homo(image_0)
       ...

What are corr_0 and corr_1 in this case? Are they floats (sub-pixel) or integers (pixel-level)? And how can I compute them?
2. Because the heatmap (of the detection head) has shape (H - 2×9, W - 2×9), how can I match it to the original image (in both training and inference)? My natural thought is to simply interpolate the heatmap to the original image size, but that does not seem to match your implementation.
3. The output of the model (in sparse mode) returns keypoint coordinates that are not integers (e.g. (20.5, 15.5)). What method did you use to extract keypoint coordinates in this format from the heatmap?

@gleize

gleize commented Apr 25, 2023

Hi @Crazylov3,

  1. What are corr_0 and corr_1 in this case? Are they floats (sub-pixel) or integers (pixel-level)? And how can I compute them?

The pseudo-code doesn't enforce any implementation decision. It just represents correspondences, which could essentially be either float or integer.

In our code however, they encode integer indices of pixels in flattened images. For example, corr_0[i] = j means that the i-th pixel of image 0 corresponds to the j-th pixel of image 1. The code is here.
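
For illustration, here is a minimal sketch of how such flattened-index correspondences could be built from a known homography. This is not the actual SiLK code; the function name, the -1 sentinel for unmatched pixels, and the row-major flattening are assumptions.

```python
import torch

def build_corr(H_mat, height, width):
    # Pixel-center coordinates of image 0 in homogeneous form, shape (H*W, 3).
    ys, xs = torch.meshgrid(
        torch.arange(height, dtype=torch.float32) + 0.5,
        torch.arange(width, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    pts = torch.stack((xs, ys, torch.ones_like(xs)), dim=-1).reshape(-1, 3)

    # Warp pixel centers of image 0 into image 1 and quantize to integer pixels.
    warped = pts @ H_mat.T
    warped = warped[:, :2] / warped[:, 2:3]
    x1 = warped[:, 0].floor().long()
    y1 = warped[:, 1].floor().long()

    # corr_0[i] = j: the i-th pixel of image 0 corresponds to the j-th pixel of
    # image 1 (flattened index); -1 marks pixels that fall outside image 1.
    inside = (x1 >= 0) & (x1 < width) & (y1 >= 0) & (y1 < height)
    corr_0 = torch.full((height * width,), -1, dtype=torch.long)
    corr_0[inside] = (y1 * width + x1)[inside]
    # corr_1 would be built the same way using the inverse homography.
    return corr_0
```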

  2. Because the heatmap (of the detection head) has shape (H - 2×9, W - 2×9), how can I match it to the original image (in both training and inference)? My natural thought is to simply interpolate the heatmap to the original image size, but that does not seem to match your implementation.

It cannot be done, and it would be incorrect to interpolate, as I explained here. Why do you need it to be the same size?

  3. The output of the model (in sparse mode) returns keypoint coordinates that are not integers (e.g. (20.5, 15.5)).

The 0.5 comes from enforcing the keypoints to be at the center of the pixels: (0.5, 0.5) is the first pixel, (1.5, 0.5) the second one, etc. This makes index discretization robust to small noise (e.g. int(1.5 + $\epsilon$) always gives 1 for small $\epsilon$, while int(1.0 + $\epsilon$) gives either 0 or 1 depending on the sign of the noise).
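
A quick illustration of that robustness (plain Python, not SiLK code):

```python
eps = 1e-6
print(int(1.5 + eps), int(1.5 - eps))  # 1 1 -> stable for pixel-centered coordinates
print(int(1.0 + eps), int(1.0 - eps))  # 1 0 -> flips depending on the sign of the noise
```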

What method did you use to extract keypoint coordinates in this format from the heatmap?

We select the positions with the top-k heatmap scores, then add 0.5 to their integer indices to center them.
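
As a rough sketch (not the SiLK API; the tensor layout and the (x, y) return convention are assumptions), the selection could look like:

```python
import torch

def topk_keypoints(heatmap, k):
    # heatmap: (H, W) tensor of detection scores.
    h, w = heatmap.shape
    scores, flat_idx = torch.topk(heatmap.flatten(), k)
    ys = torch.div(flat_idx, w, rounding_mode="floor").float() + 0.5
    xs = (flat_idx % w).float() + 0.5
    # Returned coordinates sit at pixel centers, e.g. (20.5, 15.5).
    return torch.stack((xs, ys), dim=-1), scores
```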

@javierttgg

@gleize just to confirm, do the sparse coordinates returned by the model (after adding 0.5) follow the convention of using (0, 0) as the top-left corner of the top-left pixel of the image?

@gleize

gleize commented Apr 26, 2023

@gleize just to confirm, do the sparse coordinates returned by the model (after adding 0.5) follow the convention of using (0, 0) as the top-left corner of the top-left pixel of the image?

Yes, that's correct.

@Crazylov3

@gleize Are the sparse descriptors returned by the model sampled (e.g. via grid_sample) using the coordinates after adding 0.5, or before adding 0.5?

@gleize

gleize commented Apr 28, 2023

@Crazylov3

SiLK doesn't use grid_sample to obtain sparse descriptors (only SuperPoint does in our codebase).
Unlike SuperPoint, we get pixel-level descriptors (instead of cell-level), and therefore do not need interpolation.

The sparsification of descriptors can be found here. Incoming positions already have the +0.5 added to them; we simply floor those values to obtain the descriptors' x, y indices.
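
In other words, something along these lines (a sketch under assumed tensor shapes and an assumed (x, y) ordering, not the exact SiLK implementation):

```python
import torch

def sparsify_descriptors(dense_desc, positions):
    # dense_desc: (C, H, W) dense descriptor map.
    # positions: (N, 2) keypoint positions as (x, y), already offset by +0.5.
    xs = positions[:, 0].floor().long()
    ys = positions[:, 1].floor().long()
    # Flooring the centered coordinates recovers the integer pixel indices.
    return dense_desc[:, ys, xs].T  # (N, C) sparse descriptors
```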

@Crazylov3

Crazylov3 commented May 7, 2023

Hi @gleize
As you mentioned here, you tried adding a down-sampling block to the backbone (e.g. ResFPN). I don't understand how that works with your proposed detection loss.
For example, suppose the input image has shape (480, 480) and the down-sampling factor is 8 (for both detection and description). The keypoint heatmap and descriptor map then have shape (60, 60) (ignoring the effect of no padding). In your implementation here, you define corners at descriptor resolution (60, 60), so each corner represents an 8×8 cell. In this case, how can you perform the detection loss based on successful matching? Do you simply apply it at heatmap resolution without regard to image resolution? If so, how do you make sure a keypoint can appear at every pixel (excluding those near the border due to no padding) at inference time?

In my custom setup, the keypoint heatmap has shape (H, W) and the descriptor map has shape (H/8, W/8). Is it possible for me to use your detection loss idea?

Thank you for your time and assistance.

@gleize

gleize commented May 7, 2023

Hi @Crazylov3,

So each corner represents an 8×8 cell. In this case, how can you perform the detection loss based on successful matching?

We treat those cells as "large" pixels in the loss. The random homography gives us the mapping (and therefore the correspondences) between the cells from image 1 to those in image 2. Once we have the correspondences, we can apply the detection loss.
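
Concretely, this reuses the same recipe as the earlier correspondence sketch, just at cell resolution (again a hypothetical sketch; the cell size, shapes, and helper name are assumptions):

```python
import torch

def cell_correspondences(H_mat, feat_h, feat_w, cell=8):
    # Centers of the 8x8 cells, expressed in input-image coordinates.
    ys, xs = torch.meshgrid(
        (torch.arange(feat_h, dtype=torch.float32) + 0.5) * cell,
        (torch.arange(feat_w, dtype=torch.float32) + 0.5) * cell,
        indexing="ij",
    )
    pts = torch.stack((xs, ys, torch.ones_like(xs)), dim=-1).reshape(-1, 3)

    # Warp cell centers into the second image, then quantize back to cell indices.
    warped = pts @ H_mat.T
    warped = warped[:, :2] / warped[:, 2:3]
    cx = (warped[:, 0] / cell).floor().long()
    cy = (warped[:, 1] / cell).floor().long()

    inside = (cx >= 0) & (cx < feat_w) & (cy >= 0) & (cy < feat_h)
    corr_0 = torch.arange(feat_h * feat_w)[inside]  # cell index in image 1
    corr_1 = (cy * feat_w + cx)[inside]             # matching cell in image 2
    return corr_0, corr_1
```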

[...] if you do so, how do you make sure a keypoint can appear at every pixel (excluding those near the border due to no padding) at inference time?

If the backbone down-samples the resolution (only ResFPN in our case), we cannot get a "pixel-accurate" (i.e. at the original resolution) keypoint detector. It's more like a "keypatch" model than a "keypoint" model at that point (pun intended).

If you feed a (480, 480) image to that detector, you will first get keypoint positions in feature resolution (60, 60). Then the call to from_feature_coords_to_image_coords converts those keypoints to input resolution (essentially placing them at the centers of the 8×8 cells).
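
For a pure down-sample-by-8 backbone, a naive approximation of that conversion would be the following (the real from_feature_coords_to_image_coords also accounts for each layer's exact stride and offset, so treat this only as a sketch):

```python
def feature_to_image_coords(feat_coords, stride=8):
    # feat_coords already sit at cell centers, e.g. (12.5, 3.5) on the (60, 60) grid,
    # so scaling by the stride places them at the center of the matching 8x8 cell.
    return feat_coords * stride
```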

If you want ResFPN to be pixel-accurate, you will have to add some up-sampling layer (either at the end of the shared backbone, or in both detector / descriptor heads).
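
As an illustration of that last option, one could append a learned up-sampling stage such as the one below. This is a hypothetical sketch, not part of the SiLK codebase; the channel count of 128 is an assumption.

```python
import torch.nn as nn

# Learned x8 up-sampling that could be appended to the shared backbone
# (or duplicated in the detector / descriptor heads) to restore input resolution.
upsample = nn.Sequential(
    nn.ConvTranspose2d(128, 128, kernel_size=4, stride=2, padding=1),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(128, 128, kernel_size=4, stride=2, padding=1),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(128, 128, kernel_size=4, stride=2, padding=1),
)
# features: (B, 128, 60, 60) -> (B, 128, 480, 480)
```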

@Crazylov3

Thanks for your answer @gleize.
Could you kindly provide the results of ResFPN? Given that a "pixel-accurate" outcome seems unattainable, I anticipate a notable decrease in performance, and I am curious about its extent. Is the performance still comparable to other cell-based methods?

@gleize

gleize commented May 8, 2023

Hi @Crazylov3,

We provided the results of ResFPN in the paper (cf. Table 7). There is indeed a noticeable decrease in performance.

There could be ways to improve the ResFPN architecture to make it work better, but we haven't explored that path.

gleize closed this as completed Jun 19, 2023