Question about supervision signal for Keypoint Head #21
Comments
Hi @TuanTNG,
In short, we kind of redefine keypoints from first principle (similar to DISK / GLAMpoints). The initial goal of keypoints is to be distinct and robust to reasonable viewpoint / photometric changes, so that they can be tracked across multiple frames. The research has focused on corners (e.g. Harris, SuperPoint) for a long time since corners are known to have those properties. In our work however, we focus on learning keypoints to have those properties directly instead of relying on a proxy objective (i.e. learning "cornerness"). By measuring the round-trip success, we are essentially measuring the ability of a position to become a good keypoint (i.e. having the two properties mentioned above). Descriptors that are neither distinct nor robust are unlikely to match correctly, and therefore, this is a good signal to regress the keypoint score on. By extending the definition of keypoints, we observe that our model can not only capture corners, but also more complex patterns (e.g. curves, complex textures, ...).
In theory yes, but that doesn't happen in practice. For example, images often contain large areas of uniform colors, or repetitive patterns. Given the local nature of keypoint descriptors, they do not contain enough information to obtain perfect matching.
Yes, as you said, there are no successful matches initially. So the keypoint head essentially converges towards outputting 0 everywhere, and the keypoint loss decreases accordingly (i.e. it's doing a good job at predicting that every keypoint will fail to match). However, after a short while, the descriptors start to become more discriminative, successful matches become more frequent, and the keypoint loss starts to increase until it stabilizes (i.e. learning which keypoints are likely to match becomes a harder problem to solve).
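As a toy illustration of that dynamic, one can think of the keypoint head as being trained with a binary cross-entropy between its dense score and the per-position match-success labels; this is a sketch under that assumption, not necessarily the exact SiLK loss:

```python
import torch.nn.functional as F

def keypoint_loss(score_logits, match_success):
    """Sketch (assumed form, not necessarily the exact SiLK loss): binary
    cross-entropy between the predicted keypoint scores (logits) and the
    match-success labels. Early on, the labels are almost all 0, so predicting
    ~0 everywhere already yields a low loss; as matches start to succeed, the
    labels become mixed and the loss rises before stabilizing."""
    return F.binary_cross_entropy_with_logits(score_logits, match_success)
```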
The inference stage is fairly straightforward and follows the standard "detect-and-describe" pattern. We first get the dense keypoint score output, then we select the top-k positions (i.e. the best k keypoints) from that dense map. Once we know the keypoint positions, we can extract the associated descriptors at those positions (from the dense descriptor map).
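A minimal sketch of that detect-and-describe step, assuming a dense score map of shape (H, W) and a dense descriptor map of shape (D, H, W); function and argument names are illustrative rather than the exact SiLK API:

```python
import torch

def extract_keypoints(score_map, desc_map, k=500):
    """Illustrative sketch: select the top-k positions from the dense keypoint
    score map, then read out (and normalize) the descriptors at those positions.
    Shapes and names are assumptions, not the exact SiLK interface.

    score_map: (H, W) dense keypoint scores
    desc_map:  (D, H, W) dense descriptor map
    """
    H, W = score_map.shape
    scores, flat_idx = score_map.flatten().topk(k)      # best k positions
    ys, xs = flat_idx // W, flat_idx % W                 # back to 2D coordinates
    descriptors = desc_map[:, ys, xs].t()                # (k, D) descriptors
    descriptors = torch.nn.functional.normalize(descriptors, dim=1)
    keypoints = torch.stack([xs, ys], dim=1)             # (k, 2) as (x, y)
    return keypoints, scores, descriptors
```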
It's difficult to get a fair comparison of FPS across papers since multiple factors affect speed (implementation quality, hardware, ...). Our released FPS numbers are meant to be compared relative to each other, to give a sense of the relative speed of the backbones, but should not be taken as absolute, since speed is often a function of engineering effort (e.g. SiLK could be put on a chip and become orders of magnitude faster) and hardware.

That being said, I've just run a quick SiLK VGG-4 vs SuperPoint comparison to get some specific numbers. On 480x269 images (and a different machine than the one used in the paper), SuperPoint gets 83 FPS while SiLK gets 30 FPS. The large gap is explained by the lack of downsampling layers in SiLK. We don't consider that to be too bad, and it would likely benefit from further architectural investigation as future work.

Additionally, an interesting consequence of our architecture is the ability to become more accurate when using a smaller resolution (cf. supplementary Table 10), given an error threshold of 3. This shows that SiLK can still beat SuperPoint on the @3 metrics even when reducing the resolution by a factor of three. When doing so, SiLK gets an FPS of 68, which is a lot closer to SuperPoint's numbers.

I hope those answers help.
Hi @gleize, thank you for the information. I will close the issue.
Hi,
Thank you for your excellent work.
I have some questions related to your work.
First, in your paper, you wrote about the Keypoint Head that "it is trained to identify keypoints with successful round-trip matches (defined by mutual-nearest-neighbor) among all others (unsuccessful)". I have some questions as follows:
I am looking forward to hearing from you soon.
Best regards,
Tuan