
About reimplement training code #12

Closed
Crazylov3 opened this issue Apr 24, 2023 · 9 comments

@Crazylov3

Hi,
I am very interested in your beautiful work.
I want to reimplement the training code to match the requirements of my task, and I have some questions about technical details in order to reproduce your results. I would be grateful if you could help me.

1. In your paper, you give this pseudo-code:

   def loss(image_0, model):
       # get warped image and pixel correspondences
       image_1, corr_0, corr_1 = rand_homo(image_0)
       ...

What are corr_0 and corr_1 in this case? Are they floats (sub-pixel) or integers (pixel-level)? And how can I compute them?
2. Because the heatmap (of the detection head) has shape (H - 2×9, W - 2×9), how can I match it to the original image (in both training and inference)? My natural thought is to simply interpolate the heatmap to the original image size, but that does not seem to match your implementation.
3. The output of the model (in sparse mode) returns keypoint coordinates that are not integers (e.g. (20.5, 15.5)). What method did you use to extract keypoint coordinates in this format from the heatmap?

@gleize

gleize commented Apr 25, 2023

Hi @Crazylov3,

  1. What are corr_0 and corr_1 in this case? Are they floats (sub-pixel) or integers (pixel-level)? And how can I compute them?

The pseudo-code doesn't enforce any implementation decision. It just represents correspondences, which could essentially be either float or integer.

In our code however, they encode integer indices of pixels in flattened images. For example, corr_0[i] = j means that the i-th pixel of image 0 corresponds to the j-th pixel of image 1. The code is here.
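
For illustration, here is a minimal sketch of how such flattened-index correspondences could be built from a known homography. This is not the actual SiLK code; the function name, the -1 sentinel for unmatched pixels, and the row-major flattening are assumptions.

```python
import torch

def build_corr(H_mat, height, width):
    # Pixel-center coordinates of image 0 in homogeneous form, shape (H*W, 3).
    ys, xs = torch.meshgrid(
        torch.arange(height, dtype=torch.float32) + 0.5,
        torch.arange(width, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    pts = torch.stack((xs, ys, torch.ones_like(xs)), dim=-1).reshape(-1, 3)

    # Warp pixel centers of image 0 into image 1 and quantize to integer pixels.
    warped = pts @ H_mat.T
    warped = warped[:, :2] / warped[:, 2:3]
    x1 = warped[:, 0].floor().long()
    y1 = warped[:, 1].floor().long()

    # corr_0[i] = j: the i-th pixel of image 0 corresponds to the j-th pixel of
    # image 1 (flattened index); -1 marks pixels that fall outside image 1.
    inside = (x1 >= 0) & (x1 < width) & (y1 >= 0) & (y1 < height)
    corr_0 = torch.full((height * width,), -1, dtype=torch.long)
    corr_0[inside] = (y1 * width + x1)[inside]
    # corr_1 would be built the same way using the inverse homography.
    return corr_0
```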

  2. Because the heatmap (of the detection head) has shape (H - 2×9, W - 2×9), how can I match it to the original image (in both training and inference)? My natural thought is to simply interpolate the heatmap to the original image size, but that does not seem to match your implementation.

It cannot be done, and it would be incorrect to interpolate, as I explained here. Why do you need it to be the same size?

  3. The output of the model (in sparse mode) returns keypoint coordinates that are not integers (e.g. (20.5, 15.5)).

The 0.5 comes from enforcing the keypoints to be at the center of the pixels: (0.5, 0.5) is the first pixel, (1.5, 0.5) the second one, etc. This makes index discretization robust to small noise (e.g. int(1.5 + $\epsilon$) always gives 1 for small $\epsilon$, while int(1.0 + $\epsilon$) gives either 0 or 1 depending on the sign of the noise).
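
A quick illustration of that robustness (plain Python, not SiLK code):

```python
eps = 1e-6
print(int(1.5 + eps), int(1.5 - eps))  # 1 1 -> stable for pixel-centered coordinates
print(int(1.0 + eps), int(1.0 - eps))  # 1 0 -> flips depending on the sign of the noise
```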

What method did you use to extract keypoint coordinates in this format from the heatmap?

We select the positions with the top-k heatmap scores, then add 0.5 to their integer indices to center them.
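
As a rough sketch (not the SiLK API; the tensor layout and the (x, y) return convention are assumptions), the selection could look like:

```python
import torch

def topk_keypoints(heatmap, k):
    # heatmap: (H, W) tensor of detection scores.
    h, w = heatmap.shape
    scores, flat_idx = torch.topk(heatmap.flatten(), k)
    ys = torch.div(flat_idx, w, rounding_mode="floor").float() + 0.5
    xs = (flat_idx % w).float() + 0.5
    # Returned coordinates sit at pixel centers, e.g. (20.5, 15.5).
    return torch.stack((xs, ys), dim=-1), scores
```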

@javierttgg

@gleize just to confirm, do the sparse coordinates returned by the model (after adding 0.5) follow the convention of using (0, 0) as the top-left corner of the top-left pixel of the image?

@gleize

gleize commented Apr 26, 2023

@gleize just to confirm, do the sparse coordinates returned by the model (after adding 0.5) follow the convention of using (0, 0) as the top-left corner of the top-left pixel of the image?

Yes, that's correct.

@Crazylov3

@gleize Are the sparse descriptors returned by the model sampled (e.g. via grid_sample) using the coordinates after adding 0.5, or before adding 0.5?

@gleize

gleize commented Apr 28, 2023

@Crazylov3

SiLK doesn't use grid_sample to obtain sparse descriptors (only SuperPoint does in our codebase).
Unlike SuperPoint, we get pixel-level descriptors (instead of cell-level), and therefore do not need interpolation.

The sparsification of descriptors can be found here. Incoming positions already have the +0.5 added to them; we simply floor those values to obtain the descriptors' x, y indices.
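
In other words, something along these lines (a sketch under assumed tensor shapes and an assumed (x, y) ordering, not the exact SiLK implementation):

```python
import torch

def sparsify_descriptors(dense_desc, positions):
    # dense_desc: (C, H, W) dense descriptor map.
    # positions: (N, 2) keypoint positions as (x, y), already offset by +0.5.
    xs = positions[:, 0].floor().long()
    ys = positions[:, 1].floor().long()
    # Flooring the centered coordinates recovers the integer pixel indices.
    return dense_desc[:, ys, xs].T  # (N, C) sparse descriptors
```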

@Crazylov3

Crazylov3 commented May 7, 2023

Hi @gleize
As you mentioned here, you tried adding a down-sampling block to the backbone (e.g. ResFPN). I don't understand how that works with your proposed detection loss.
For example, suppose the input image has shape (480, 480) and the down-sampling factor is 8 (for both detection and description). The keypoint heatmap and descriptor map then have shape (60, 60) (ignoring the effect of no padding). In your implementation here, you define corners at descriptor resolution (60, 60), so each corner represents an 8×8 cell. In this case, how can you perform the detection loss based on successful matching? Do you simply apply it at heatmap resolution without regard to image resolution? If so, how do you make sure a keypoint can appear at every pixel (excluding those near the border due to no padding) at inference time?

In my custom setup, the keypoint heatmap has shape (H, W) and the descriptor map has shape (H/8, W/8). Is it possible for me to use your detection loss idea?

Thank you for your time and assistance.

@gleize

gleize commented May 7, 2023

Hi @Crazylov3,

So each corner represents an 8×8 cell. In this case, how can you perform the detection loss based on successful matching?

We treat those cells as "large" pixels in the loss. The random homography gives us the mapping (and therefore the correspondences) between the cells from image 1 to those in image 2. Once we have the correspondences, we can apply the detection loss.
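
Concretely, this reuses the same recipe as the earlier correspondence sketch, just at cell resolution (again a hypothetical sketch; the cell size, shapes, and helper name are assumptions):

```python
import torch

def cell_correspondences(H_mat, feat_h, feat_w, cell=8):
    # Centers of the 8x8 cells, expressed in input-image coordinates.
    ys, xs = torch.meshgrid(
        (torch.arange(feat_h, dtype=torch.float32) + 0.5) * cell,
        (torch.arange(feat_w, dtype=torch.float32) + 0.5) * cell,
        indexing="ij",
    )
    pts = torch.stack((xs, ys, torch.ones_like(xs)), dim=-1).reshape(-1, 3)

    # Warp cell centers into the second image, then quantize back to cell indices.
    warped = pts @ H_mat.T
    warped = warped[:, :2] / warped[:, 2:3]
    cx = (warped[:, 0] / cell).floor().long()
    cy = (warped[:, 1] / cell).floor().long()

    inside = (cx >= 0) & (cx < feat_w) & (cy >= 0) & (cy < feat_h)
    corr_0 = torch.arange(feat_h * feat_w)[inside]  # cell index in image 1
    corr_1 = (cy * feat_w + cx)[inside]             # matching cell in image 2
    return corr_0, corr_1
```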

[...] if you do so, how do you make sure a keypoint can appear at every pixel (excluding those near the border due to no padding) at inference time?

If the backbone down-samples the resolution (only ResFPN in our case), we cannot get a "pixel-accurate" (i.e. at the original resolution) keypoint detector. It's more like a "keypatch" model than a "keypoint" model at that point (pun intended).

If you feed a (480, 480) image to that detector, you will first get keypoint positions in feature resolution (60, 60). Then the call to from_feature_coords_to_image_coords converts those keypoints to input resolution (essentially placing them at the centers of the 8×8 cells).
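
For a pure down-sample-by-8 backbone, a naive approximation of that conversion would be the following (the real from_feature_coords_to_image_coords also accounts for each layer's exact stride and offset, so treat this only as a sketch):

```python
def feature_to_image_coords(feat_coords, stride=8):
    # feat_coords already sit at cell centers, e.g. (12.5, 3.5) on the (60, 60) grid,
    # so scaling by the stride places them at the center of the matching 8x8 cell.
    return feat_coords * stride
```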

If you want ResFPN to be pixel-accurate, you will have to add some up-sampling layer (either at the end of the shared backbone, or in both detector / descriptor heads).
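
As an illustration of that last option, one could append a learned up-sampling stage such as the one below. This is a hypothetical sketch, not part of the SiLK codebase; the channel count of 128 is an assumption.

```python
import torch.nn as nn

# Learned x8 up-sampling that could be appended to the shared backbone
# (or duplicated in the detector / descriptor heads) to restore input resolution.
upsample = nn.Sequential(
    nn.ConvTranspose2d(128, 128, kernel_size=4, stride=2, padding=1),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(128, 128, kernel_size=4, stride=2, padding=1),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(128, 128, kernel_size=4, stride=2, padding=1),
)
# features: (B, 128, 60, 60) -> (B, 128, 480, 480)
```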

@Crazylov3

Thanks for your answer @gleize.
Could you kindly provide the results of ResFPN? Given that a "pixel-accurate" outcome seems unattainable, I anticipate a notable decrease in performance, and I am curious about its extent. Is the performance still comparable to other cell-based methods?

@gleize

gleize commented May 8, 2023

Hi @Crazylov3,

We provided the results of ResFPN in the paper (cf. Table 7). There is indeed a noticeable decrease in performance.

There could be ways to improve the ResFPN architecture to make it work better, but we haven't explored that path.

gleize closed this as completed Jun 19, 2023