
N patch features $F_Q$ ⇒ M INPs $P=\{ P_1,\dots,P_M \mid P_i\in R^C \},P \in R^{M\times C}$
M learnable tokens $T=\{ T_1,...,T_M\vert T_i\in R^C \}, T \in R^{M\times C}$
allowing T to linearly aggregate F_Q into INPs P
$$ F_Q = \text{sum}(\{f^1_Q, \dots, f^L_Q\})\\ Q = TW_Q, K = F_QW_K, V = F_QW_V\\ T' = \text{Attention}(Q, K, V) + T\\ P = \text{FFN}(T') + T' \quad $$
Linear projections: before the attention mechanism, $Q$ is formed by projecting $T$ with the learnable matrix $W_Q$, while $K$ and $V$ are formed by projecting $F_Q$ with $W_K$ and $W_V$
FFN: feed forward network
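The INP extractor above can be sketched as a single cross-attention layer in which learnable tokens query the aggregated encoder features. This is an illustrative reconstruction from the equations in these notes, not the authors' code; all class/variable names (`INPExtractor`, `num_inps`, the 4x FFN expansion, the scaled softmax) are assumptions.

```python
import torch
import torch.nn as nn

class INPExtractor(nn.Module):
    """Sketch: M learnable tokens T cross-attend over aggregated
    encoder features F_Q to produce M INPs P (names illustrative)."""
    def __init__(self, dim: int, num_inps: int):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_inps, dim))  # T: (M, C)
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        # FFN expansion ratio of 4 is a common choice, assumed here
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, feats):  # feats: list of L tensors, each (N, C)
        f_q = torch.stack(feats, dim=0).sum(dim=0)        # F_Q = sum of layers
        q = self.w_q(self.tokens)                         # Q = T W_Q
        k, v = self.w_k(f_q), self.w_v(f_q)               # K, V from F_Q
        attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
        t_prime = attn @ v + self.tokens                  # T' = Attn(Q,K,V) + T
        return self.ffn(t_prime) + t_prime                # P = FFN(T') + T'

# toy usage: L=3 encoder layers, N=196 patches, C=64 channels, M=6 INPs
feats = [torch.randn(196, 64) for _ in range(3)]
inps = INPExtractor(dim=64, num_inps=6)(feats)
print(inps.shape)  # torch.Size([6, 64])
```

Because the softmax rows sum to one, each output token is a convex (hence linear) combination of the projected patch features, matching the note that $T$ linearly aggregates $F_Q$ into $P$.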
INP coherence loss $L_c$
trained on normal training images only
⇒ INPs are extracted directly from the test image at test time.
the above 2 factors ⇒ superior performance across multi-class, single-class, and few-shot AD tasks
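The notes name the coherence loss $L_c$ but not its formula. One plausible nearest-prototype formulation, pulling every normal patch feature toward its closest INP, could look like the following; the function name and the exact distance/reduction are assumptions, not the paper's definition.

```python
import torch

def inp_coherence_loss(f_q: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of an INP coherence loss L_c: minimize the
    mean distance from each patch feature to its nearest INP, so the
    M INPs jointly cover the normal feature distribution.
    f_q: (N, C) patch features, p: (M, C) INPs."""
    d = torch.cdist(f_q, p)            # (N, M) pairwise L2 distances
    return d.min(dim=1).values.mean()  # mean nearest-INP distance

# toy usage with random features
f_q = torch.randn(196, 64)
p = torch.randn(6, 64)
loss = inp_coherence_loss(f_q, p)
print(loss.item())
```

Since training uses normal images only, minimizing such a loss would make the INPs compact summaries of normal appearance, which is what lets them be re-extracted from a single test image.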

Transformer / attention: employ the extracted INPs as key-value pairs, ensuring that the output is a linear combination of normal INPs
$$ Q_l = f^{l-1}_D W^l_Q ,\quad K_l = P W^l_K ,\quad V_l = P W^l_V \\ {f'}^{\,l-1}_D = A_l V_l ,\quad A_l = \text{ReLU}(Q_l (K_l)^T) \\ f^l_D = \text{FFN}({f'}^{\,l-1}_D) + {f'}^{\,l-1}_D $$
Following previous work [21], the ReLU activation is applied to the attention logits $Q_l (K_l)^T$ to mitigate the influence of weak correlations and noise on the attention maps.
the first residual connection (after attention) could directly reintroduce anomalous input features into the subsequent reconstruction ⇒ it is removed
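One INP-guided decoder layer, per the equations above, can be sketched as follows. This is a minimal reconstruction from the notes, not the authors' implementation; the class name, FFN shape, and dimension choices are assumptions.

```python
import torch
import torch.nn as nn

class INPGuidedDecoderLayer(nn.Module):
    """Sketch of one decoder layer: queries come from the previous
    decoder features, keys/values from the M INPs, so the output is a
    linear combination of normal prototypes. ReLU replaces softmax on
    the attention logits, and the post-attention residual is omitted
    so anomalous input features cannot leak into the reconstruction."""
    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, f_prev, inps):     # f_prev: (N, C), inps (P): (M, C)
        q = self.w_q(f_prev)             # Q_l = f^{l-1}_D W_Q
        k, v = self.w_k(inps), self.w_v(inps)  # K_l, V_l from INPs
        a = torch.relu(q @ k.T)          # A_l = ReLU(Q_l K_l^T), weak links zeroed
        f_mid = a @ v                    # note: no "+ f_prev" residual here
        return self.ffn(f_mid) + f_mid   # residual kept only around the FFN

# toy usage: reconstruct N=196 patch features from M=6 INPs
f_rec = INPGuidedDecoderLayer(64)(torch.randn(196, 64), torch.randn(6, 64))
print(f_rec.shape)  # torch.Size([196, 64])
```

Because `f_mid` depends only on the INPs (through `v`) and not additively on `f_prev`, an anomalous patch can influence only the mixing weights `a`, never inject its own feature vector into the output.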
"A Unified Model for Multi-class Anomaly Detection" (NeurIPS 2022) performs encoding/decoding via Neighbor Masked Attention, which may still suffer from the identical mapping issue; this paper avoids it via INPs.