1 Introduction

After reading the abstract, I used ChatGPT (free version) and Google Search to gather the following:

1.1 Background

  1. The speed and accuracy of YOLOs are negatively affected by NMS.
    1. Obvious, but is it important?
  2. Transformer-based detectors (DETRs) provide an alternative that eliminates NMS.
  3. Real-Time DEtection TRansformer (RT-DETR), the first real-time end-to-end object detector
    1. from Baidu Inc. and Peking University
    2. 2 implementations: PaddlePaddle and PyTorch
      1. can be used on Baidu's hardware and cloud platforms
      2. surpasses YOLOv8 in both AP and speed.
  4. RT-DETR-R50 outperforms DINO-Deformable-DETR-R50 by 2.2% AP (53.1% AP vs 50.9% AP) and by about 21 times in FPS (108 FPS vs 5 FPS)
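
As a quick sanity check on the comparison in item 4, the arithmetic can be reproduced from the four quoted figures. The variable names are my own, and the numbers are the ones quoted above, not new measurements.

```python
# Figures quoted in the notes above (RT-DETR-R50 vs DINO-Deformable-DETR-R50).
rtdetr_ap, dino_ap = 53.1, 50.9      # COCO AP, in percent
rtdetr_fps, dino_fps = 108, 5        # frames per second

ap_gain = round(rtdetr_ap - dino_ap, 1)   # difference in AP points
speedup = rtdetr_fps / dino_fps           # ratio of throughputs

print(f"AP gain: {ap_gain} points, speedup: {speedup:.1f}x")
```

The ratio comes out at 21.6, which matches "about 21 times". The 2.2 figure is a difference in AP points, not a relative improvement.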

1.2 How to speed up?

Two steps:

  1. efficient hybrid encoder: maintaining accuracy while improving speed
    1. process multi-scale features by decoupling intra-scale interaction and cross-scale fusion to improve speed
  2. uncertainty-minimal query selection: maintaining speed while improving accuracy
    1. to provide high-quality initial queries to the decoder, thereby improving accuracy
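
To make "initial queries" more concrete, here is a minimal sketch of plain score-based top-K query selection: keep the K encoder features whose best class score is highest and hand them to the decoder. This is only the generic selection step, not the paper's uncertainty-minimal variant, since these notes do not record how the uncertainty is measured. The function name, the toy features, and the scores are all hypothetical.

```python
def select_top_k_queries(features, class_scores, k):
    """Pick the k features with the highest best-class score.

    Hypothetical helper for illustration only; it ignores localization
    quality, which is what an uncertainty-aware scheme would also consider.
    """
    # Confidence of each feature = its highest class score.
    confidence = [max(scores) for scores in class_scores]
    # Indices sorted from most to least confident.
    order = sorted(range(len(features)), key=lambda i: confidence[i], reverse=True)
    top = order[:k]
    return [features[i] for i in top], top


# Toy example: four encoder features, two classes (made-up numbers).
features = ["f0", "f1", "f2", "f3"]
class_scores = [[0.1, 0.2], [0.9, 0.05], [0.3, 0.6], [0.05, 0.1]]
queries, indices = select_top_k_queries(features, class_scores, k=2)
print(queries, indices)
```

In this toy run the features at indices 1 and 2 are selected because their best class scores (0.9 and 0.6) are the two highest.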

1.3 How slow is NMS? How important is it?

The speed and accuracy of YOLOs are negatively affected by NMS: obvious, but is it important?

  1. Table 1 (YOLOv8 on COCO val2017) and Fig. 2
    1. execution time of the EfficientNMS kernel increases as
      1. the confidence threshold decreases or
      2. the IoU threshold increases
        1. high IoU threshold filters out fewer prediction boxes in each round of screening
    2. AP drops: 52.9 → 51.2
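  2. NMS can account for 20~40% of total inference time.
    1. In the TensorRT implementation of YOLOv8, NMS execution time reaches up to 2.46 ms per image (T4 GPU, FP16). That is a significant share, especially given that the model's own inference time is only 5~10 ms.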
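
To see why the two thresholds change the amount of work, here is a minimal sketch of textbook greedy NMS in pure Python that also counts IoU evaluations. It is not the EfficientNMS kernel, and the five toy boxes and their scores are made up for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, conf_thr, iou_thr):
    """Return (kept indices, number of IoU evaluations)."""
    # Step 1: drop boxes below the confidence threshold.
    candidates = [i for i, s in enumerate(scores) if s >= conf_thr]
    # Step 2: visit the rest from highest to lowest score.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept, iou_calls = [], 0
    while candidates:
        best = candidates.pop(0)
        kept.append(best)
        survivors = []
        for i in candidates:
            iou_calls += 1
            # Suppress only boxes overlapping the kept box by more than iou_thr.
            if iou(boxes[best], boxes[i]) <= iou_thr:
                survivors.append(i)
        candidates = survivors
    return kept, iou_calls


# Toy data: three overlapping boxes plus two overlapping boxes elsewhere.
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (3, 0, 13, 10),
         (20, 20, 30, 30), (21, 20, 31, 30)]
scores = [0.9, 0.8, 0.6, 0.3, 0.05]

baseline = greedy_nms(boxes, scores, conf_thr=0.25, iou_thr=0.5)
low_conf = greedy_nms(boxes, scores, conf_thr=0.001, iou_thr=0.5)
high_iou = greedy_nms(boxes, scores, conf_thr=0.25, iou_thr=0.7)
print(baseline, low_conf, high_iou)
```

On this toy input the baseline needs 3 IoU evaluations. Lowering the confidence threshold lets one more box in and raises that to 5. Raising the IoU threshold leaves one more box unsuppressed after the first round and raises it to 4. This is the same trend the table reports for the real kernel, only at a much smaller scale.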

1.4 Low AP

  1. RT-DETR-R101 achieves 56.2% AP on COCO. Is this AP too low? The Papers with Code leaderboard showed 64.5 box mAP on Jul 10, 2025.
    1. https://paperswithcode.com/sota/object-detection-on-coco
  2. AP* in this paper
  3. This paper reports results on COCO val2017, whereas Papers with Code uses test-dev2017.
  4. This paper uses a 640×640 input resolution for a fair comparison with real-time YOLOs, not 800×1333 as in most SOTA results.