1 Introduction

  1. What is INP (Intrinsic Normal Prototype)?
    1. cluster centers represented by patch tokens
  2. https://github.com/luow23/INP-Former
    1. INP-Former++, 25 (https://arxiv.org/pdf/2506.03660)
      1. INP-Former + pseudo-anomaly generation & residual learning / segmentation network: anomaly heat map ⇒ refined anomaly map
      2. to prevent shortcut solutions, where most features are assigned to the same INP: INP Coherence Loss ⇒ Soft INP Coherence Loss
  3. key insight for AD: INPs are dynamically extracted just from the single test image ⇒ no misaligned normality between pre-stored prototypes (from training images) and the test image
    1. no misalignment ⇒ "Intrinsic"; but the word is not apt: the misalignment is solved by extracting INPs from the single test image
    2. ⇒ "on-site normal prototype" would be a more fitting name, since by construction it has no alignment problem.
    3. over the past two years, 2D AD papers have all been discussing misalignment, i.e., being pose- and illumination-independent
      1. diffusion: test image with anomaly ⇒ test image without anomaly, then compare them ⇒ pose + illumination independent
        1. compared with earlier diffusion-based AD, how thoroughly does One-to-Normal (nips25) do this?
          1. I keep feeling that perfectly and precisely removing the anomaly via diffusion is not reliable; it will more or less additionally destroy, introduce, or distort something.
      2. reconstruction
        1. earlier reconstruction-based AD relied only on a network trained from the training images; this paper adds guidance from INPs extracted from the current test image, and is therefore better.
    4. considering these two factors for the 3D AD problem ⇒ AD via 3DGS: learn the camera pose and illumination of the test image.
  4. General / Methodological significance: clustering by transformer

2 The Method: four key modules

2.1 fixed pre-trained Encoder Q

  1. Image I ⇒ L multi-scale latent feature maps f_Q, each with N patch tokens
  2. $f_Q=\{f_Q^1,...,f_Q^L\vert f_Q^l\in R^{N\times C}\}$, C: number of channels, N: number of patch tokens, feature scales from 1 to L.
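As a concrete sketch of 2.1, the toy PyTorch code below mimics collecting L per-scale feature maps from a frozen encoder; the `ToyEncoder` class, layer count, and dimensions are illustrative placeholders, not the paper's actual backbone (which is a fixed pre-trained ViT-style model).

```python
import torch
import torch.nn as nn

# Toy dimensions (illustrative, not the paper's): L scales, N patch tokens, C channels.
L_SCALES, N_TOKENS, C = 4, 49, 64

class ToyEncoder(nn.Module):
    """Stand-in for the fixed pre-trained encoder Q (e.g. a frozen ViT)."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(C, C) for _ in range(L_SCALES)])

    @torch.no_grad()  # the encoder stays frozen: no gradients flow into it
    def forward(self, tokens):                # tokens: (N, C) patch embeddings
        f_Q, x = [], tokens
        for blk in self.blocks:
            x = blk(x)
            f_Q.append(x)                     # one feature map f_Q^l per scale
        return f_Q                            # list of L tensors, each (N, C)

enc = ToyEncoder().eval()
f_Q = enc(torch.randn(N_TOKENS, C))
```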

2.2 INP Extractor E

[figure: INP Extractor E]

  1. N patch features (aggregated into F_Q) ⇒ M INPs $P=\{ P_1,...,P_M\vert P_i\in R^C \},P \in R^{M\times C}$

    1. M learnable tokens $T=\{ T_1,...,T_M\vert T_i\in R^C \}, T \in R^{M\times C}$

    2. allowing T to linearly aggregate F_Q into INPs P

      $$ F_Q = \textstyle\sum_{l=1}^{L} f^l_Q \\ Q = TW_Q,\quad K = F_QW_K,\quad V = F_QW_V \\ T' = \text{Attention}(Q, K, V) + T \\ P = \text{FFN}(T') + T' $$

    3. Linear Projections: Before the attention mechanism, Q, K, and V are formed by applying linear projections (matrix multiplications with learnable parameters W_Q, W_K, W_V) to T and F_Q respectively

    4. FFN: feed forward network

    5. INP coherence loss $\mathcal{L}_c$

      1. together with the final reconstruction loss $\mathcal{L}_{sm}$, it constrains the learning of P;
      2. essentially a standard clustering loss; is it robust?
  2. trained on normal training images only

  3. ⇒ INPs are extracted directly from the test image during the testing phase.

    1. relying only on the single test image, with no INPs from training images ⇒ M = 6 INPs works best.
    2. aligned normality for anomaly detection
  4. the above 2 factors ⇒ superior performance across multi-class, single-class, and few-shot AD tasks
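A minimal PyTorch sketch of the INP Extractor equations in 2.2, plus a hard-assignment version of the coherence loss; the single-head attention, FFN width, and the exact loss form are simplifying assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

M, C, N, L_SCALES = 6, 64, 49, 4              # toy sizes; M = 6 INPs as in the notes

class INPExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        self.T = nn.Parameter(torch.randn(M, C))          # M learnable tokens T
        self.W_Q = nn.Linear(C, C, bias=False)
        self.W_K = nn.Linear(C, C, bias=False)
        self.W_V = nn.Linear(C, C, bias=False)
        self.ffn = nn.Sequential(nn.Linear(C, 4 * C), nn.GELU(), nn.Linear(4 * C, C))

    def forward(self, F_Q):                               # F_Q: (N, C), sum over scales
        Q, K, V = self.W_Q(self.T), self.W_K(F_Q), self.W_V(F_Q)
        attn = torch.softmax(Q @ K.T / C ** 0.5, dim=-1)  # (M, N): T attends to features
        T_prime = attn @ V + self.T                       # T' = Attention(Q, K, V) + T
        return self.ffn(T_prime) + T_prime                # P = FFN(T') + T'

def coherence_loss(P, F_Q):
    """Hard-assignment clustering loss: pull each patch feature to its nearest INP."""
    d = torch.cdist(F_Q, P)                               # (N, M) pairwise distances
    return d.min(dim=1).values.mean()

F_Q = torch.stack([torch.randn(N, C) for _ in range(L_SCALES)]).sum(0)  # F_Q = sum of f_Q^l
P = INPExtractor()(F_Q)
loss = coherence_loss(P, F_Q)
```

The hard `min` over INPs is what makes the shortcut solution (all features mapped to one INP) cheap to fall into; the ++ version's Soft INP Coherence Loss softens this assignment.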

2.3 Bottleneck B: $f_B = B(f_Q)$

2.4 INP guided decoder D

[figure: INP-guided Decoder D]

  1. P-guided reconstruction from $f_B$ ⇒ $f_D$
    1. Transformer / attention: employ the extracted INPs as key-value pairs, ensuring that the output is a linear combination of normal INPs

      $$ Q_l = f^{l-1}_D W^l_Q,\quad K_l = P W^l_K,\quad V_l = P W^l_V \\ {f^{l-1}_D}' = A_l V_l,\quad A_l = \text{ReLU}(Q_l (K_l)^T) \\ f^l_D = \text{FFN}({f^{l-1}_D}') + {f^{l-1}_D}' $$

    2. Following the previous work [21], the ReLU activation is also applied to the attention weights ($Q_l K_l^T$) to mitigate the influence of weak correlations and noise on the attention maps.

    3. the first residual connection can directly introduce anomalous features into the subsequent reconstruction ⇒ removed

    4. A Unified Model for Multi-class Anomaly Detection (nips22) performs encoding and decoding via Neighbor Masked Attention, which may still suffer the identical-mapping issue; this paper avoids it via INPs.
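The decoder layer in 2.4 can be sketched as follows; note the ReLU in place of softmax and the absence of the first residual connection around attention. Layer stacking, FFN width, and normalization are omitted or simplified assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

M, C, N = 6, 64, 49                           # toy sizes: M INPs, C channels, N tokens

class INPGuidedLayer(nn.Module):
    """One decoder layer: queries from features, keys/values from the INPs P."""
    def __init__(self):
        super().__init__()
        self.W_Q = nn.Linear(C, C, bias=False)
        self.W_K = nn.Linear(C, C, bias=False)
        self.W_V = nn.Linear(C, C, bias=False)
        self.ffn = nn.Sequential(nn.Linear(C, 4 * C), nn.GELU(), nn.Linear(4 * C, C))

    def forward(self, f_prev, P):             # f_prev: (N, C), P: (M, C)
        Q = self.W_Q(f_prev)
        K, V = self.W_K(P), self.W_V(P)
        A = Fn.relu(Q @ K.T)                  # A_l = ReLU(Q_l K_l^T): no softmax
        f_mid = A @ V                         # NO "+ f_prev" residual here, so anomalous
                                              # input features cannot leak straight through
        return self.ffn(f_mid) + f_mid        # f^l_D = FFN(.) + .

f_B = torch.randn(N, C)                       # bottleneck output f_B = B(f_Q)
P = torch.randn(M, C)                         # INPs extracted from the same test image
f_D = INPGuidedLayer()(f_B, P)
```

Because `f_mid` is a (non-negative) weighted sum of the value projections of P, the attention output is by construction a linear combination of normal INPs, which is the property the notes highlight.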