MotionTrack: End-to-End Transformer-based Multi-Object Tracking with LiDAR-Camera Fusion (note: consider these together)

Abstract

Perspective-Aware Query Generation (PAQG) + Uncertainty-Aware Fusion (UAF)


  1. Use a standard image backbone (e.g., ResNet [18], V2-99 [28]) with an FPN [36].

  2. Use a standard 3D LiDAR backbone (e.g., VoxelNet [82]) with an FPN [36].

  3. Perspective-aware query generation ⇒ high-quality 3D queries with perspective priors

    1. Perspective detector: coupled 2D (e.g., FCOS [61]) and monocular-3D (e.g., FCOS3D [66]) sub-networks.
      1. 2D sub-network ⇒ 2D boxes with 2D properties such as confidence scores and category labels.
      2. Monocular-3D sub-network ⇒ raw 3D attributes, i.e., depth d, rotation angle, size, and velocity.
      3. Project the 2D box centers into 3D space using the corresponding camera parameters.
      4. Append the 3D attributes to the projected centers to form 3D query boxes.
    2. Since some objects may still be overlooked, Nr randomly initialized query boxes are preserved as well.
    3. (Question) This 3D info comes entirely from the image; how effective can it really be?
  4. ROI-Aware Sampling

  5. Uncertainty-Aware Fusion


    1. The refined query feature q̄_i is obtained via the uncertainty-aware fusion function f_UA, where U_cam and U_lid are the uncertainties of the two modalities.
    2. uncertainty: a function of the distance between the predicted and GT Boxes
      1. see eq11 & 12.
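The UAF step above can be sketched roughly as follows. This is only a plausible reading, assuming f_UA reduces to a softmax weighting over negated per-modality uncertainties; the paper's exact formulation is in its Eqs. 11 and 12, and all names here (`uncertainty_aware_fusion`, `q_cam`, `u_cam`, ...) are illustrative, not the authors' code.

```python
import numpy as np

def uncertainty_aware_fusion(q_cam, q_lid, u_cam, u_lid):
    """Fuse per-query camera/LiDAR features, down-weighting the
    more uncertain modality via a softmax over negated uncertainties."""
    w = np.exp(-np.stack([u_cam, u_lid]))        # (2, N) unnormalized weights
    w = w / w.sum(axis=0, keepdims=True)         # w_cam + w_lid = 1 per query
    return w[0][:, None] * q_cam + w[1][:, None] * q_lid

# toy example: 2 queries with 4-dim features
q_cam = np.ones((2, 4))
q_lid = np.zeros((2, 4))
u_cam = np.array([0.0, 10.0])   # query 1: camera branch is very uncertain
u_lid = np.array([0.0, 0.0])
q_bar = uncertainty_aware_fusion(q_cam, q_lid, u_cam, u_lid)
# query 0: equal weights -> 0.5; query 1: dominated by the LiDAR feature
```

With equal uncertainties both modalities contribute equally; a large U_cam pushes the fused feature towards the LiDAR branch, which matches the intent described above.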

Why Perspective-aware query generation?

  1. Recent works typically generate queries from randomly distributed reference points [7, 71], anchor boxes [37], or pillars [40] in 3D space and optimize them as network parameters, regardless of the input data.

  2. DETR3D [68] randomly generates queries.

  3. However, it has already been shown in 2D detection [76] that such input-independent queries take extra effort to learn to move the query proposals towards the ground-truth object targets.

  4. A 2D detector usually exhibits excellent perception of distant and small objects.


  5. ⇒ use the predicted 2D boxes as queries.
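The 2D-to-3D lifting behind this query generation (steps 3.1.3 and 3.1.4 in the pipeline above) can be sketched as follows, assuming a standard pinhole camera model; the box layout and helper names are illustrative, not the paper's implementation.

```python
import numpy as np

def unproject_center(center_2d, depth, K, cam_to_ego):
    """Lift a 2D box center to a 3D point using the predicted depth and
    camera intrinsics K (3x3) / extrinsics cam_to_ego (4x4)."""
    u, v = center_2d
    # pixel -> camera frame: back-project the homogeneous pixel, scaled by depth
    pt_cam = np.linalg.inv(K) @ (depth * np.array([u, v, 1.0]))
    pt_ego = cam_to_ego @ np.append(pt_cam, 1.0)   # camera -> ego frame
    return pt_ego[:3]

def make_query_box(center_2d, depth, yaw, size, velocity, K, cam_to_ego):
    """Assemble a 3D query box: projected center + raw monocular attributes."""
    cx, cy, cz = unproject_center(center_2d, depth, K, cam_to_ego)
    # (x, y, z, w, l, h, yaw, vx, vy) -- one common 3D box layout (an assumption)
    return np.array([cx, cy, cz, *size, yaw, *velocity])

# toy check: the principal point at depth 10 m lands on the optical axis
K = np.array([[1000.0, 0.0, 800.0],
              [0.0, 1000.0, 450.0],
              [0.0, 0.0, 1.0]])
cam_to_ego = np.eye(4)
p = unproject_center((800.0, 450.0), 10.0, K, cam_to_ego)
# -> [0., 0., 10.]
```

The quality of these queries hinges entirely on the predicted depth, which is why the reader's question above (how effective is image-only 3D info?) is a fair one; the Nr random queries act as a fallback for missed objects.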

Experiments

  1. As of 2024/03/08, it ranks 1st on both the validation set and the test benchmark of nuScenes, outperforming all state-of-the-art 3D object detectors by a notable margin.
    1. (Note) Fully Sparse Fusion for 3D Object Detection (TPAMI 2024) is much higher.
      1. mAP / NDS: 74 / 77 vs. 70 / 74 vs. 77 / 78 (**EA-LSS**)
    2. (Note) No comparisons on other benchmarks.

References