Transform t: perturb the input image with small pads, scales, and horizontal flips, then apply the model f to each transformed image to extract a collection of low-resolution feature maps.
1. These small image jitters allow us to observe tiny differences in the output features and provide sub-feature information to train the upsampler.
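A minimal sketch of the jitter step, assuming NumPy arrays of shape (H, W, C). The names `jitter` and `toy_model` are hypothetical: `toy_model` is only an average-pooling stand-in for the real backbone f, and scaling is left out to keep the example short.

```python
import numpy as np

def jitter(image, dx, dy, flip):
    """Shift an (H, W, C) image by a few pixels via reflect padding, then optionally flip it."""
    h, w = image.shape[:2]
    p = max(abs(dx), abs(dy))
    padded = np.pad(image, ((p, p), (p, p), (0, 0)), mode="reflect")
    out = padded[p + dy : p + dy + h, p + dx : p + dx + w]
    return out[:, ::-1] if flip else out

def toy_model(image, patch=4):
    """Hypothetical stand-in for the backbone f: average-pool each patch x patch block."""
    h, w, c = image.shape
    h, w = h - h % patch, w - w % patch
    blocks = image[:h, :w].reshape(h // patch, patch, w // patch, patch, c)
    return blocks.mean(axis=(1, 3))

image = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
jitters = [(0, 0, False), (1, 0, False), (0, -1, True)]  # (dx, dy, flip)
# One low-resolution feature map per jittered view.
features = [toy_model(jitter(image, dx, dy, flip)) for dx, dy, flip in jitters]
```

Each entry of `features` is a slightly different low-resolution view of the same image, which is the signal the upsampler is trained on.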
Fhr: a latent high-resolution feature map from Joint Bilateral Upsampling: Fhr = σ↑(f(x), x).
1. x is the input image.
2. How to generate Fhr?
Multi-view loss:
1. s = N(f(t(x))) is a spatially-varying adaptive uncertainty (Hamilton et al., 2020) parameterized by a small linear network N.
2. This extra flexibility allows the network to learn when certain outlier features fundamentally cannot be upsampled.
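A hedged sketch of how an adaptive uncertainty s can enter a reconstruction loss. It assumes the usual Gaussian negative log-likelihood form (squared error divided by 2s², plus log s). The notes only name s, so the exact form, the function name `multiview_loss`, and its arguments are assumptions.

```python
import numpy as np

def multiview_loss(f_lr, f_pred, s):
    """Uncertainty-weighted reconstruction loss (assumed Gaussian NLL form).

    f_lr:   (H, W, C) features from the model on a jittered image, f(t(x)).
    f_pred: (H, W, C) features predicted by downsampling the transformed Fhr.
    s:      (H, W) strictly positive per-pixel uncertainty.
    """
    sq_err = np.sum((f_lr - f_pred) ** 2, axis=-1)
    return float(np.mean(sq_err / (2.0 * s ** 2) + np.log(s)))

f_lr = np.zeros((2, 2, 3))
f_pred = np.ones((2, 2, 3))  # squared error of 3 at every pixel
loss_confident = multiview_loss(f_lr, f_pred, np.ones((2, 2)))
loss_uncertain = multiview_loss(f_lr, f_pred, np.full((2, 2), np.sqrt(3.0)))
```

Raising s where the error is large lowers the loss, while the log(s) term penalizes claiming high uncertainty everywhere. That is one way the network can down-weight outlier features that cannot be upsampled.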
TV loss
1. TV has a large effect, log(s) a small one: see Figure 9: Qualitative ablation study across both DINO and Resnet50 Backbones
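For reference, a minimal sketch of a total variation penalty on a feature map. It assumes the common squared-difference form applied to per-pixel feature magnitudes. Whether the paper uses exactly this variant is not stated in these notes, and `tv_loss` is a hypothetical name.

```python
import numpy as np

def tv_loss(f_hr):
    """Sum of squared differences between neighbouring feature magnitudes.

    f_hr: (H, W, C) high-resolution feature map.
    """
    mag = np.linalg.norm(f_hr, axis=-1)  # (H, W) per-pixel magnitude
    dy = mag[1:, :] - mag[:-1, :]        # vertical neighbour differences
    dx = mag[:, 1:] - mag[:, :-1]        # horizontal neighbour differences
    return float(np.sum(dy ** 2) + np.sum(dx ** 2))

flat = np.ones((2, 3, 2))
ramp = np.zeros((2, 3, 2))
ramp[..., 0] = np.arange(3)  # magnitude grows by 1 per column
```

A constant map has zero penalty, while `ramp` is penalized once for every horizontal step, so the term pushes the upsampled features toward spatial smoothness.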

Method

CHOOSING A DOWNSAMPLER

Two options: a fast and simple learned blur kernel, and a more flexible attention-based downsampler.
The blur-based downsampler is efficient, but it cannot capture dynamic receptive fields, object salience, or other nonlinear effects.
The attention-based downsampler spatially adapts the downsampling kernel:
1. It uses a 1x1 convolution, Conv(Fhr[…]), to predict a saliency map from Fhr.
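A simplified sketch of saliency-weighted downsampling. The 1x1 convolution is written as a per-pixel linear map. The non-overlapping k x k windows, the softmax normalization, and the name `attention_downsample` are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def attention_downsample(f_hr, w_sal, b_sal=0.0, k=2):
    """Pool each k x k window of f_hr with softmax weights from a saliency map.

    f_hr:  (H, W, C) high-resolution features.
    w_sal: (C,) weights of the 1x1 convolution that predicts saliency.
    """
    h, w, c = f_hr.shape
    h, w = h - h % k, w - w % k
    f_hr = f_hr[:h, :w]
    sal = f_hr @ w_sal + b_sal  # 1x1 conv: one scalar per pixel, shape (H, W)
    # Group pixels into non-overlapping k x k windows.
    f_win = f_hr.reshape(h // k, k, w // k, k, c).transpose(0, 2, 1, 3, 4)
    f_win = f_win.reshape(h // k, w // k, k * k, c)
    s_win = sal.reshape(h // k, k, w // k, k).transpose(0, 2, 1, 3)
    s_win = s_win.reshape(h // k, w // k, k * k)
    # Softmax over each window (shifted by the max for numerical stability).
    attn = np.exp(s_win - s_win.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return np.einsum("hwk,hwkc->hwc", attn, f_win)

rng = np.random.default_rng(0)
f_hr = rng.normal(size=(4, 4, 3))
pooled = attention_downsample(f_hr, np.zeros(3))  # zero saliency: plain averaging
```

With zero saliency weights the result reduces to average pooling. Non-zero weights let salient pixels dominate their window, which a fixed blur kernel cannot do.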

Two upsampler variants: “JBU” (Kopf et al., 2007) or Implicit; see Table 1.
Joint Bilateral Upsamplers (JBU)
1. This feedforward upsampler is a parameterized, MLP-based generalization of a Joint Bilateral Upsampling (JBU) filter (Kopf et al., 2007)
  1. Kopf et al., 2007 itself is just a traditional filter.
  2. ⇒ GT Fhr^, which may also be used in “Implicit”
2. each JBU is a two-layer GeLU (Hendrycks & Gimpel, 2016) MLP with 30-dimensional hidden and output vectors
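For orientation, a sketch of the classic, non-learned joint bilateral upsampling filter in the spirit of Kopf et al. (2007), not the learned MLP generalization described above. The function name, the parameter defaults, and the assumption that the guidance image lies in [0, 1] are illustrative choices.

```python
import numpy as np

def joint_bilateral_upsample(f_lr, guide, radius=1, sigma_spatial=1.0, sigma_range=0.1):
    """Upsample f_lr (h, w, C) to the resolution of guide (H, W, 3)."""
    h, w, c = f_lr.shape
    H, W = guide.shape[:2]
    sy, sx = H / h, W / w
    # Guidance colours sampled at the centre of each low-res pixel.
    cy = np.minimum(((np.arange(h) + 0.5) * sy).astype(int), H - 1)
    cx = np.minimum(((np.arange(w) + 0.5) * sx).astype(int), W - 1)
    guide_lr = guide[cy][:, cx]
    out = np.zeros((H, W, c))
    for y in range(H):
        for x in range(W):
            # Position of the high-res pixel in low-res coordinates.
            py, px = (y + 0.5) / sy - 0.5, (x + 0.5) / sx - 0.5
            qy0, qx0 = int(round(py)), int(round(px))
            acc, norm = np.zeros(c), 0.0
            for qy in range(qy0 - radius, qy0 + radius + 1):
                for qx in range(qx0 - radius, qx0 + radius + 1):
                    if not (0 <= qy < h and 0 <= qx < w):
                        continue
                    d2 = (qy - py) ** 2 + (qx - px) ** 2                   # spatial term
                    g2 = np.sum((guide[y, x] - guide_lr[qy, qx]) ** 2)      # range term
                    wgt = np.exp(-d2 / (2 * sigma_spatial ** 2)) * np.exp(-g2 / (2 * sigma_range ** 2))
                    acc += wgt * f_lr[qy, qx]
                    norm += wgt
            out[y, x] = acc / max(norm, 1e-12)
    return out

# Guidance image with a vertical edge: left half black, right half white.
guide = np.zeros((4, 4, 3))
guide[:, 2:] = 1.0
f_lr = np.zeros((2, 2, 1))
f_lr[:, 1] = 1.0
f_up = joint_bilateral_upsample(f_lr, guide)
```

The range term suppresses contributions from low-resolution neighbours whose guidance colour differs, so the upsampled features keep the sharp edge of the guidance image instead of blurring across it.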
Implicit
1. The component-wise discrete Fourier transform of an input signal z, with a vector of frequencies ω̂.
2. “:” represents concatenation.
3. x is not explained here; it may be the RGB or Fourier color features at position (ei, ej); see Fig. 9 of Sec. 6.4.
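A minimal sketch of component-wise Fourier features over a concatenated input. It assumes the transform means sin/cos of each component times each frequency, that x is the RGB value at (ei, ej), and that the concatenated vector is what gets transformed. The name `fourier_features`, the frequency values, and the grid size are hypothetical.

```python
import numpy as np

def fourier_features(z, freqs):
    """Component-wise sin/cos features.

    z:     (..., D) input signal.
    freqs: (K,) vector of frequencies.
    Returns (..., D * 2K): for each component, K sines followed by K cosines.
    """
    angles = z[..., None] * freqs  # (..., D, K)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*z.shape[:-1], -1)

H, W = 4, 4
freqs = np.array([1.0, 2.0, 4.0])  # assumed frequency vector
ei, ej = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
rgb = np.zeros((H, W, 3))          # placeholder colours for x
# ":" in the notes is concatenation: ei : ej : x
z = np.concatenate([ei[..., None], ej[..., None], rgb], axis=-1)  # (H, W, 5)
mlp_input = fourier_features(z, freqs)                            # (H, W, 30)
```

An implicit upsampler would then feed `mlp_input` to a small MLP that outputs the high-resolution feature at each pixel.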
