Notes: the paper is not clearly written, and no code has been released yet.

Introduction


(a) Traditional IAD: produces anomaly scores without definite results or comprehensible descriptions.

(b) MiniGPT-4 cannot generate IAD-domain descriptions.

  1. MiniGPT-4 [38]: a large multimodal model built upon Vicuna with a pre-trained ViT backbone from EVA-CLIP [9].
    1. Vicuna: an open-source chatbot claimed to reach 90%* of ChatGPT quality, as judged by GPT-4.
  2. Knowledge about anomaly detection is absent from existing general LMMs.
    1. Training a dedicated LMM for anomaly detection would require a tremendous amount of annotated data and massive computation resources.

(c) AnomalyGPT [11] designs an LLM-based image decoder to generate anomaly maps and employs prompt embeddings to fine-tune the LMM.

But it still fails to utilize the vision comprehension capacity of large multimodal models.

Method


  1. Vision Expert: frozen ⇒ the paper only uses it to process the input image and produce the anomaly map.
  2. Vision Expert Tokenizer
    1. Several blocks, each consisting of a convolution with a 3×3 kernel, a ReLU activation, and a max pooling layer.
  3. Instructor: not specified; it appears to share the Tokenizer's structure, only generating tokens suited to the Q-Former.
  4. Pretrained query tokens: not described.
  5. Q-Former: the query transformer from pretrained BLIP-2 [15], a form of cross-attention.
  6. Myriad achieves better generalization in unseen scenarios than the Vision Expert alone.
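The paper does not give exact channel widths or the number of blocks for the Vision Expert Tokenizer, so the following is only a minimal NumPy sketch of one such block (3×3 convolution → ReLU → 2×2 max pooling); all shapes and channel counts here are assumptions for illustration.

```python
import numpy as np

def conv3x3(x, w, b):
    """'Same'-padded 3x3 convolution over an (H, W, C_in) feature map.
    w: (3, 3, C_in, C_out), b: (C_out,). Plain loops for clarity, not speed."""
    h, wd, _ = x.shape
    cout = w.shape[-1]
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))      # zero-pad spatial dims
    out = np.zeros((h, wd, cout))
    for i in range(h):
        for j in range(wd):
            patch = xp[i:i + 3, j:j + 3, :]        # (3, 3, C_in) window
            out[i, j] = np.tensordot(patch, w, axes=3) + b
    return out

def maxpool2x2(x):
    """Non-overlapping 2x2 max pooling; trims odd edges."""
    h, w, c = x.shape
    x = x[:h - h % 2, :w - w % 2]
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def tokenizer_block(x, w, b):
    """One assumed Vision Expert Tokenizer block: conv3x3 -> ReLU -> max pool."""
    return maxpool2x2(np.maximum(conv3x3(x, w, b), 0.0))
```

Stacking a few such blocks downsamples the expert's single-channel anomaly map (e.g. 32×32×1 → 16×16×8 after one block) into a small grid of embeddings that can be flattened into tokens; the real model presumably uses learned weights and larger maps.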

Experiments

Tab. 2: better than WinCLIP and AnomalyGPT in the 0-shot through 4-shot settings.

References

Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection, 2023.