

Dhyper has a higher dimensionality than the VGG-19 features used by the authors, so a 1×1 convolution is applied to obtain D.
A policy is defined as a probability distribution over actions A, conditioned on the current state S and parameterized by θ:
$$ \pi_\theta(\mathcal{S}) = \mathbb{P}[\mathcal{A} \mid \mathcal{S}, \theta]. $$
The learning objective is then to maximize the expected cumulative reward R(τ) over a trajectory τ (a sequence of state, action, reward tuples):
$$ \max_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \right] $$
REINFORCE [45] provides an approximation of the policy gradient:
$$ \nabla_\theta J(\theta) \approx \hat{g} = \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau). $$
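As a concrete illustration, the estimator ĝ above can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: it assumes a toy linear-softmax policy π_θ(a|s) = softmax(θs), for which ∇_θ log π_θ(a|s) has the closed form (1{a} − π_θ(·|s)) ⊗ s.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """∇_θ log π_θ(a|s) for a linear-softmax policy π_θ(·|s) = softmax(θ @ s).
    theta: (num_actions, state_dim), s: (state_dim,), a: action index."""
    pi = softmax(theta @ s)
    onehot = np.zeros_like(pi)
    onehot[a] = 1.0
    # gradient of log-softmax: (1{a} - π(s)) outer-product s
    return np.outer(onehot - pi, s)

def reinforce_estimate(theta, trajectory):
    """ĝ = Σ_t ∇_θ log π_θ(a_t|s_t) · R(τ), where R(τ) sums the rewards
    along the trajectory of (state, action, reward) tuples."""
    R = sum(r for (_, _, r) in trajectory)
    g_hat = np.zeros_like(theta)
    for s, a, _ in trajectory:
        g_hat += grad_log_pi(theta, s, a) * R
    return g_hat
```

Note that the whole return R(τ) multiplies every per-step score-function term, exactly as in the sum above; variance-reduction tricks (baselines, reward-to-go) would modify only that scalar factor.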
The encoder-decoder network acts as a trainable policy: the input image I represents the state, and localizing a keypoint corresponds to an action:
$$ \pi_\theta(\mathbf{s}) = d_\theta(e_\theta(\mathbf{I})) = \mathbf{p}= \big[ \mathbb{P}_1[a_1 \mid I, \theta],\, \mathbb{P}_2[a_2 \mid I, \theta],\, \dots,\, \mathbb{P}_c[a_c \mid I, \theta] \big], $$
where p is a list of distributions, one for each cell c of the heatmap H.
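A minimal sketch of this output head, under assumed shapes (it is not the authors' network): the decoder logits are flattened and normalized into one categorical distribution per heatmap channel, and a localization action is sampled from each.

```python
import numpy as np

def heatmap_policy(logits):
    """Turn decoder logits of shape (num_maps, H, W) into per-map
    categorical distributions over the H*W heatmap cells."""
    k, h, w = logits.shape
    flat = logits.reshape(k, h * w)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(flat)
    p /= p.sum(axis=1, keepdims=True)
    return p  # each row is a distribution P_i[a_i | I, θ]

def sample_actions(p, rng):
    """Sample one cell index (the localization action) per distribution."""
    return np.array([rng.choice(p.shape[1], p=row) for row in p])
```

Sampling (rather than taking the argmax) is what makes the score-function estimator above applicable: the log-probability of each sampled cell is differentiable with respect to the network parameters θ.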
