1 Introduction
After reading the abstract, I used ChatGPT (free version) and Google Search to put together the following notes:
1.1 Background
- The speed and accuracy of YOLO detectors are negatively affected by NMS (non-maximum suppression).
- obvious, but important?
- Transformer-based detectors (DETRs) provide an alternative that eliminates NMS.
- Real-Time DEtection TRansformer (RT-DETR): described by the authors as the first real-time end-to-end object detector
- from Baidu Inc + Peking University
- 2 implementations: PaddlePaddle and PyTorch
- can be deployed on Baidu's hardware and cloud platforms
- surpasses YOLOv8 in both AP and speed
- RT-DETR-R50 outperforms DINO-Deformable-DETR-R50 by 2.2% AP (53.1% AP vs 50.9% AP) and by about 21 times in FPS (108 FPS vs 5 FPS)
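A quick sanity check of the comparison above, using only the figures quoted in these notes. Note that "2.2% AP" is an absolute difference in AP points, not a relative improvement. The variable names are my own.

```python
# Figures quoted in the notes above.
rt_detr_ap, dino_ap = 53.1, 50.9      # AP in percent
rt_detr_fps, dino_fps = 108, 5        # frames per second

ap_gain = round(rt_detr_ap - dino_ap, 1)   # absolute difference, in AP points
speedup = rt_detr_fps / dino_fps           # ratio of throughputs

print(f"AP gain: {ap_gain} points")   # AP gain: 2.2 points
print(f"Speedup: {speedup:.1f}x")     # Speedup: 21.6x
```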
1.2 How to speed up?
Two steps:
- efficient hybrid encoder: maintaining accuracy while improving speed
- process multi-scale features by decoupling intra-scale interaction and cross-scale fusion to improve speed
- uncertainty-minimal query selection: maintaining speed while improving accuracy
- to provide high-quality initial queries to the decoder, thereby improving accuracy
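To make "initial queries" concrete, here is a minimal sketch of generic top-K query selection: score every encoder feature and hand the K best to the decoder. This is only an illustration under my own assumptions. The helper name `select_initial_queries` and the toy scores are hypothetical, and the paper's actual uncertainty-based criterion is not reproduced here.

```python
from typing import List


def select_initial_queries(scores: List[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring encoder features.

    Hypothetical helper: a real detector would compute `scores` with a
    prediction head; here they are just given.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy per-feature scores (made up for illustration).
scores = [0.10, 0.85, 0.40, 0.92, 0.05, 0.77]
selected = select_initial_queries(scores, k=3)
print(selected)  # [3, 1, 5]
```

The quality of the selected queries depends entirely on how well the score reflects both classification and localization, which is the part the paper's selection scheme aims to improve.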
1.3 How slow is NMS? How important is it?
The speed and accuracy of YOLOs are negatively affected by NMS. Obvious, but how much does it matter?
- Table 1 (YOLOv8 on COCO val2017) and Fig. 2
- execution time of the EfficientNMS kernel increases as
- the confidence threshold decreases or
- the IoU threshold increases
- a high IoU threshold filters out fewer prediction boxes in each round of screening, so more boxes remain for later rounds
- AP drops: 52.9 → 51.2
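- NMS can account for 20~40% of total inference time
- In the TensorRT implementation of YOLOv8, NMS execution time reaches up to 2.46 ms per image (T4 GPU, FP16). That is a noticeable share, especially given that the model's own inference time is only 5~10 ms.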
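The threshold effects noted above can be reproduced with a plain greedy NMS. The sketch below is a sequential pure-Python version, not the EfficientNMS TensorRT kernel, and the toy boxes and scores are made up. It counts IoU evaluations as a rough proxy for work: lowering the confidence threshold admits more candidates, and raising the IoU threshold suppresses fewer boxes per round.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float],
        conf_thr: float, iou_thr: float) -> Tuple[List[int], int]:
    """Greedy NMS. Returns (kept indices, number of IoU evaluations)."""
    # Step 1: drop low-confidence boxes, sort the rest by score (descending).
    candidates = sorted((i for i, s in enumerate(scores) if s >= conf_thr),
                        key=lambda i: scores[i], reverse=True)
    keep, n_iou = [], 0
    # Step 2: each round keeps the best box and suppresses its overlaps.
    while candidates:
        best = candidates.pop(0)
        keep.append(best)
        survivors = []
        for i in candidates:
            n_iou += 1
            if iou(boxes[best], boxes[i]) <= iou_thr:
                survivors.append(i)
        candidates = survivors
    return keep, n_iou


# Toy data: three overlapping boxes, plus two overlapping boxes elsewhere.
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (3, 0, 13, 10),
         (20, 20, 30, 30), (21, 20, 31, 30)]
scores = [0.9, 0.8, 0.7, 0.6, 0.02]

for conf_thr, iou_thr in [(0.25, 0.5), (0.25, 0.7), (0.001, 0.5), (0.001, 0.9)]:
    keep, n_iou = nms(boxes, scores, conf_thr, iou_thr)
    print(f"conf={conf_thr}, iou={iou_thr}: keep={keep}, IoU evals={n_iou}")
```

On this toy input the number of IoU evaluations grows from 3 to 10 as the confidence threshold goes down and the IoU threshold goes up. Real timings depend on the kernel and hardware, so this only illustrates the trend.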
1.4 Low AP?
- RT-DETR-R101 achieves 56.2% AP on COCO. Isn't this AP too low? The top entry on Papers with Code was 64.5 box mAP on Jul 10 2025.
- https://paperswithcode.com/sota/object-detection-on-coco
- AP* metrics in this paper
- AP (i.e. AP@[0.50:0.95]) = box mAP on Papers with Code
- AP50
- AP75
- APS / APM / APL (for small/medium/large object sizes)
- This paper reports results on COCO val2017, not test-dev2017 (which Papers with Code uses).
- This paper uses 640×640 input resolution for a fair comparison with real-time YOLOs, not 800×1333 as in most SOTA results.
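For reference, the AP variants listed above relate to each other as sketched below: AP averages the per-threshold AP over IoU thresholds 0.50 to 0.95 in steps of 0.05, while AP50 and AP75 are single-threshold values. The per-threshold numbers are made up for illustration, and the function names are my own. The size buckets follow the COCO area convention (32² and 96² pixels), with boundary handling simplified.

```python
from typing import Dict

# COCO IoU thresholds in percent: 50, 55, ..., 95 (10 values).
IOU_THRESHOLDS_PCT = list(range(50, 100, 5))


def summarize_ap(ap_at_iou: Dict[int, float]) -> Dict[str, float]:
    """Combine per-threshold AP values into AP, AP50 and AP75."""
    ap = sum(ap_at_iou[t] for t in IOU_THRESHOLDS_PCT) / len(IOU_THRESHOLDS_PCT)
    return {"AP": ap, "AP50": ap_at_iou[50], "AP75": ap_at_iou[75]}


def size_bucket(area: float) -> str:
    """COCO object-size bucket by box area in pixels (boundaries simplified)."""
    if area < 32 ** 2:
        return "small"    # counted in APS
    if area < 96 ** 2:
        return "medium"   # counted in APM
    return "large"        # counted in APL


# Made-up per-threshold AP values, NOT results from the paper.
toy_values = [0.70, 0.68, 0.66, 0.63, 0.60, 0.56, 0.50, 0.42, 0.30, 0.15]
toy = dict(zip(IOU_THRESHOLDS_PCT, toy_values))

summary = summarize_ap(toy)
print({k: round(v, 2) for k, v in summary.items()})
print(size_bucket(20 * 20), size_bucket(50 * 50), size_bucket(200 * 200))
```

Because AP averages in the strict high-IoU thresholds, it is always at most AP50, which is worth keeping in mind when comparing numbers reported under different metrics or settings.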