

Dhyper has a higher dimensionality than the VGG-19 features used by the authors, so a 1×1 convolution is applied to obtain D.
A policy is defined as a probability distribution over actions A, conditioned on the current state S and parameterized by θ:
$$ \pi_\theta(\mathcal{S}) = \mathbb{P}[\mathcal{A} \mid \mathcal{S}, \theta]. $$
The learning objective is then to maximize the expected cumulative reward R(τ) over a trajectory τ (a sequence of state, action, reward tuples):
$$ \max_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \right] $$
REINFORCE [45] provides an approximation of the policy gradient:
$$ \nabla_\theta J(\theta) \approx \hat{g} = \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau). $$
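As a concrete illustration, the estimator ĝ above can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: it assumes a toy linear-softmax policy π_θ(a|s) = softmax(θs), for which ∇_θ log π_θ(a|s) has the closed form (1{a} − π_θ(·|s)) ⊗ s.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """∇_θ log π_θ(a|s) for a linear-softmax policy π_θ(·|s) = softmax(θ @ s).
    theta: (num_actions, state_dim), s: (state_dim,), a: action index."""
    pi = softmax(theta @ s)
    onehot = np.zeros_like(pi)
    onehot[a] = 1.0
    # gradient of log-softmax: (1{a} - π(s)) outer-product s
    return np.outer(onehot - pi, s)

def reinforce_estimate(theta, trajectory):
    """ĝ = Σ_t ∇_θ log π_θ(a_t|s_t) · R(τ), where R(τ) sums the rewards
    along the trajectory of (state, action, reward) tuples."""
    R = sum(r for (_, _, r) in trajectory)
    g_hat = np.zeros_like(theta)
    for s, a, _ in trajectory:
        g_hat += grad_log_pi(theta, s, a) * R
    return g_hat
```

Note that the whole return R(τ) multiplies every per-step score-function term, exactly as in the sum above; variance-reduction tricks (baselines, reward-to-go) would modify only that scalar factor.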
The encoder-decoder network acts as a trainable policy: the input image I represents the state, and localizing a keypoint corresponds to an action:
$$ \pi_\theta(\mathbf{s}) = d_\theta(e_\theta(\mathbf{I})) = \mathbf{p}= \big[ \mathbb{P}_1[a_1 \mid I, \theta],\, \mathbb{P}_2[a_2 \mid I, \theta],\, \dots,\, \mathbb{P}_c[a_c \mid I, \theta] \big], $$
where p is a list of distributions, one for each cell c of the heatmap H.
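A minimal sketch of this output head, under assumed shapes (it is not the authors' network): the decoder logits are flattened and normalized into one categorical distribution per heatmap channel, and a localization action is sampled from each.

```python
import numpy as np

def heatmap_policy(logits):
    """Turn decoder logits of shape (num_maps, H, W) into per-map
    categorical distributions over the H*W heatmap cells."""
    k, h, w = logits.shape
    flat = logits.reshape(k, h * w)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(flat)
    p /= p.sum(axis=1, keepdims=True)
    return p  # each row is a distribution P_i[a_i | I, θ]

def sample_actions(p, rng):
    """Sample one cell index (the localization action) per distribution."""
    return np.array([rng.choice(p.shape[1], p=row) for row in p])
```

Sampling (rather than taking the argmax) is what makes the score-function estimator above applicable: the log-probability of each sampled cell is differentiable with respect to the network parameters θ.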
