MotionTrack: End-to-End Transformer-based Multi-Object Tracking with LiDAR-Camera Fusion (consider LiDAR and camera jointly)
Perspective-Aware Query Generation (PAQG) + Uncertainty-Aware Fusion (UAF)

Uses a common image backbone (e.g., ResNet [18], V2-99 [28]) with FPN [36]
Uses a common 3D LiDAR backbone (e.g., VoxelNet [82]) with FPN [36]
Perspective-aware query generation ⇒ high-quality 3D queries with perspective priors
ROI-Aware Sampling
Uncertainty-Aware Fusion
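The notes name Uncertainty-Aware Fusion but do not spell out its form. A minimal sketch, assuming it amounts to inverse-variance weighting of per-query features from the two modalities (the function name, shapes, and weighting scheme are illustrative assumptions, not the paper's definition):

```python
import numpy as np

def uncertainty_aware_fusion(feat_img, var_img, feat_lidar, var_lidar):
    """Fuse per-query features from two modalities by inverse-variance weighting.

    feat_*: (num_queries, dim) modality features
    var_*:  (num_queries, 1) predicted uncertainties (variances, > 0)
    """
    # Lower predicted variance -> higher weight for that modality.
    w_img = 1.0 / var_img
    w_lidar = 1.0 / var_lidar
    return (w_img * feat_img + w_lidar * feat_lidar) / (w_img + w_lidar)

# A query whose image feature is very uncertain leans on the LiDAR feature:
feat_img = np.array([[1.0, 1.0]])
feat_lidar = np.array([[3.0, 3.0]])
fused = uncertainty_aware_fusion(feat_img, np.array([[4.0]]),
                                 feat_lidar, np.array([[1.0]]))
# weights 0.25 vs 1.0 -> fused = (0.25*1 + 1.0*3) / 1.25 = 2.6 per dim
```

The point of the toy example: when one modality degrades (e.g., camera at night), its variance grows and the fused feature falls back on the other modality.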

Recent works typically generate queries from randomly distributed reference points [7, 71], anchor boxes [37], or pillars [40] in 3D space, optimized as network parameters independent of the input data.
DETR3D [68] likewise generates queries from randomly initialized reference points.
However, 2D detection work [76] has already shown that such input-independent queries require extra learning effort to move the query proposals toward the ground-truth objects.
In contrast, 2D detectors usually exhibit excellent perception of distant and small objects.

⇒ use predicted 2D boxes as queries.
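One way to realize this idea is to lift each predicted 2D box center to a 3D reference point via the camera intrinsics. A minimal sketch under assumed details (per-box depth hypotheses, e.g. from a depth head, and a pinhole camera model; the paper's actual query construction may differ):

```python
import numpy as np

def boxes_to_3d_queries(boxes_2d, depths, K):
    """Lift predicted 2D box centers to 3D reference points in the camera frame.

    boxes_2d: (N, 4) boxes [x1, y1, x2, y2] in pixels
    depths:   (N,) depth hypothesis per box (illustrative assumption)
    K:        (3, 3) camera intrinsic matrix
    """
    centers = 0.5 * (boxes_2d[:, :2] + boxes_2d[:, 2:])    # (N, 2) pixel centers
    ones = np.ones((centers.shape[0], 1))
    homo = np.concatenate([centers, ones], axis=1)         # (N, 3) homogeneous pixels
    rays = homo @ np.linalg.inv(K).T                       # back-project to unit-depth rays
    return rays * depths[:, None]                          # scale by depth -> 3D points

K = np.array([[1000.0,    0.0, 800.0],
              [   0.0, 1000.0, 450.0],
              [   0.0,    0.0,   1.0]])
boxes = np.array([[790.0, 440.0, 810.0, 460.0]])  # box centered at the principal point
pts = boxes_to_3d_queries(boxes, np.array([30.0]), K)
# center (800, 450) = principal point -> ray (0, 0, 1) -> point (0, 0, 30)
```

These 3D points then serve as perspective-aware initial reference points for the transformer queries, so query proposals start near image evidence instead of random locations.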