code: https://github.com/aritra0593/Reinforced-Feature-Points

1 Intro

1.1 Why

  1. Training detector networks usually resorts to optimizing low-level matching scores, often pre-defining sets of image patches which should or should not match,
  2. Unfortunately, increased accuracy for these low-level matching scores does not necessarily translate to better performance in high-level vision tasks.
  3. ⇒ a new training methodology that trains the feature detector on a “higher-level” task.
    1. the task is relative pose estimation between a pair of images.
    2. ⇒ better performance for high-level tasks, such as pose estimation.
    3. actually fine-tunes a pretrained SuperPoint [14], which was supervised by low-level matching scores.
      1. it seems that the network cannot be trained from random initialization.

2 Method

2.1 Relative pose estimation

  1. find the essential matrix, E, which maximises the inlier count among all correspondences.
    1. (ql, qr) is an inlier if the distance of qr to the epipolar line defined by E ql is below a threshold
    2. qr' E ql = 0
      1. 9 Essential & fundamental matrices
    3. a robust estimator like RANSAC [17] with a 5-point solver [33] ⇒ Essential matrix
  2. Essential matrix decomposition
    1. E = R[T]x ⇒ the transformation T^ in the paper, i.e. rotation R and translation T (T is only recoverable up to scale).
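The epipolar relation above can be checked numerically. A minimal numpy sketch (the rotation, translation, and 3D point are made up for illustration; conventions for the decomposition vary, and this sketch uses E = [t]x R, which satisfies qr' E ql = 0 when the second camera sees Xr = R Xl + t):

```python
import numpy as np

def skew(t):
    # cross-product matrix [t]_x such that skew(t) @ v == np.cross(t, v)
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# made-up relative pose: Xr = R @ Xl + t
theta = 0.1
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([1.0, 0.2, 0.0])
E = skew(t) @ R  # essential matrix (one common convention)

# project a 3D point into both normalized cameras (homogeneous coordinates)
X = np.array([0.5, -0.3, 4.0])
ql = X / X[2]
Xr = R @ X + t
qr = Xr / Xr[2]

residual = qr @ E @ ql  # epipolar constraint: ~0 for a true correspondence

# inlier test: point-to-line distance of qr from the epipolar line l = E @ ql
l = E @ ql
dist = abs(qr @ l) / np.hypot(l[0], l[1])
is_inlier = dist < 1e-3  # threshold in normalized image coordinates
```

In practice E is estimated from noisy correspondences, e.g. with OpenCV's `cv2.findEssentialMat` (RANSAC + 5-point solver) and decomposed into R and T with `cv2.recoverPose`.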

Probabilistic matches as a black-box task

  1. from 1) and 2) ⇒ a set of matches M = {mij} between I and I′, defined by independent key point samples X and X′.
    1. a match mij = (xi, x′j) between two key points xi and x′j
  2. We treat the vision task as a (potentially non-differentiable) black box ⇒ T^.
    1. Supervised by GT camera transformation T*.
    2. The black box provides an error signal l(M, X, X′) = fun(T^, T*), which is used to reinforce the key point and matching probabilities.
      1. but the gradient of l(M, X, X’) is not needed.
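The error signal fun(T^, T*) is left abstract above. A common choice for pose estimation (an assumption here, not necessarily the paper's exact loss) compares the rotation angle error and the translation-direction angle error:

```python
import numpy as np

def rot_angle_deg(R):
    # geodesic distance of rotation matrix R from the identity, in degrees
    c = (np.trace(R) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def pose_error(R_hat, t_hat, R_gt, t_gt):
    # hypothetical error signal: the worse of rotation error and
    # translation-direction error (t is only known up to scale)
    e_rot = rot_angle_deg(R_hat.T @ R_gt)
    cos_t = t_hat @ t_gt / (np.linalg.norm(t_hat) * np.linalg.norm(t_gt))
    e_trans = np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))
    return max(e_rot, e_trans)
```

Because only the scalar value of this error is fed back, it does not matter that `max`, RANSAC, or the solver are non-differentiable.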

2.2 Reinforcement learning

2.2.1 Why

  1. we cannot directly propagate gradients of our estimated transformation T^ back to update the network weights, as in standard supervised learning.
  2. Components of our vision pipeline, like the robust estimator (e.g. RANSAC [17]) or the minimal solver (e.g. the 5-point solver [33]) might also be non-differentiable.
  3. To optimize the neural network parameters for our task, we apply principles from reinforcement learning [48].
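The key tool from reinforcement learning here is the score-function (REINFORCE) identity, ∇θ E[l] = E[l · ∇θ log p(·; θ)], which needs only the value of the loss, not its gradient. A toy numpy check over a small discrete (softmax) distribution, exact by enumerating all outcomes (the logits and per-outcome losses are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# distribution over K outcomes, parameterized by logits
logits = np.array([0.2, -0.5, 1.0])
losses = np.array([3.0, 1.0, 2.0])  # black-box loss per outcome (no gradient)

K = len(logits)
p = softmax(logits)
expected_loss = p @ losses

# score-function gradient, computed exactly by enumeration:
# d/dz E[l] = sum_k p_k * l_k * d(log p_k)/dz,  with d(log p_k)/dz = e_k - p
grad_score = np.zeros(K)
for k in range(K):
    grad_score += p[k] * losses[k] * (np.eye(K)[k] - p)

# finite-difference gradient of E[l] for comparison
eps = 1e-6
grad_num = np.zeros(K)
for j in range(K):
    z = logits.copy()
    z[j] += eps
    grad_num[j] = (softmax(z) @ losses - expected_loss) / eps
```

In training, the expectation is replaced by sampled key points and matches, so the network weights can be updated even though RANSAC and the 5-point solver sit between the samples and the loss.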

2.2.2 How