(Reading note: the paper is not clearly written, and no code has been released yet.)
Introduction

(a) Traditional IAD methods output anomaly scores/maps without definite results or comprehensible descriptions.
(b) MiniGPT-4 cannot generate IAD-domain description.
- MiniGPT-4 [38]: large multimodal model which is built upon Vicuna with pre-trained ViT backbone from EVA-CLIP [9]
- Vicuna: an open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
- Knowledge about anomaly detection is absent in existing general LMMs,
- and training a specific LMM for anomaly detection would require a tremendous amount of annotated data and massive computational resources.
(c) AnomalyGPT [11] designs an LLM-based image decoder to generate anomaly maps and employs prompt embeddings to fine-tune the LMM.
But it still fails to utilize the vision comprehension capacity of large multimodal models.
Method

- Vision Expert: frozen ⇒ the paper only uses the LMM to interpret the input image and the expert's anomaly map.
- Vision Expert Tokenizer
- several blocks, each consisting of a convolution with a 3×3 kernel, a ReLU activation, and a max pooling layer.
- Instructor: not explained in the paper; its structure looks identical to the Tokenizer, except that it generates tokens suitable for the Q-Former.
- Pretrained query tokens: not described in the paper.
- Q-Former: a querying transformer from pretrained BLIP-2 [15]; essentially a cross-attention module in which learnable query tokens attend to vision features.
- Myriad achieves better generalization in unseen scenarios than the vision expert alone.
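Since no code is released, here is a minimal PyTorch sketch of the Vision Expert Tokenizer as described above (stacked 3×3 conv → ReLU → max pool blocks). The block count, channel widths, and the final projection to the LLM embedding width are all assumptions; the paper gives no exact configuration.

```python
import torch
import torch.nn as nn

class VisionExpertTokenizer(nn.Module):
    """Hypothetical sketch: turns an anomaly map into a token sequence.

    Assumed config: 3 blocks of (Conv 3x3 -> ReLU -> MaxPool 2x2), 64
    channels, projected to a 768-d embedding; none of these numbers are
    specified in the paper.
    """
    def __init__(self, in_ch=1, hidden=64, n_blocks=3, embed_dim=768):
        super().__init__()
        blocks, ch = [], in_ch
        for _ in range(n_blocks):
            blocks += [nn.Conv2d(ch, hidden, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]
            ch = hidden
        self.blocks = nn.Sequential(*blocks)
        self.proj = nn.Linear(hidden, embed_dim)  # map to LLM token width

    def forward(self, anomaly_map):               # (B, 1, H, W)
        feat = self.blocks(anomaly_map)           # (B, C, H/8, W/8)
        tokens = feat.flatten(2).transpose(1, 2)  # (B, N, C) with N = H/8 * W/8
        return self.proj(tokens)                  # (B, N, embed_dim)

tok = VisionExpertTokenizer()
out = tok(torch.randn(2, 1, 224, 224))
# 224 -> 112 -> 56 -> 28 after three pools, so N = 28*28 = 784
```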
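The Q-Former's core mechanism (learnable query tokens cross-attending to vision tokens) can be sketched as a single toy layer. This is not BLIP-2's actual implementation (which stacks transformer blocks with self-attention, layer norm, etc.); query count, width, and head count are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MiniQFormer(nn.Module):
    """Toy single-layer Q-Former: learnable queries attend to vision tokens."""
    def __init__(self, n_queries=32, dim=768, heads=8):
        super().__init__()
        # Learnable query tokens, shared across the batch.
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, vision_tokens):  # (B, N, dim) from the tokenizer/ViT
        q = self.queries.expand(vision_tokens.size(0), -1, -1)
        # Cross-attention: queries read from the vision tokens (keys/values).
        out, _ = self.attn(q, vision_tokens, vision_tokens)
        return out + self.ffn(out)     # (B, n_queries, dim)

qf = MiniQFormer()
queries = qf(torch.randn(2, 784, 768))
```

The fixed number of output queries is what lets the LMM consume images of any resolution as a constant-length token prefix.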
Experiments
Table 2: Myriad outperforms WinCLIP and AnomalyGPT in the 0-shot to 4-shot settings.
References
Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection, 2023.