One sentence: predict one box for each voxel that scores highly for a class; the center-missing issue can simply be sidestepped by sparse networks with large receptive fields.

Abstract

For the first time, we show that a fully sparse voxel-based representation works decently for LIDAR 3D object detection and tracking
1. Because of inherent sparsity and the many background points, only a small fraction of points have responses, e.g., less than 1% for the Car class on average over the nuScenes validation set.
Without bells and whistles, our model outperforms all existing LIDAR methods on the nuScenes tracking test benchmark
1. better than FSD V1, nips22
2. VoxelNeXt achieves leading performance with high efficiency on 3D object detection across three benchmarks: nuScenes [3], Waymo [44], and Argoverse2
no need for sparse-to-dense conversion or NMS postprocessing
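
A minimal sketch of what "one box per high-scoring voxel" could look like, with no dense feature map and no NMS step. The function name `predict_from_voxels`, the dictionary layout, the 0.3 threshold, and the toy scores are all assumptions made for illustration; any duplicate-suppression or selection step the real model uses is omitted here.

```python
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]  # integer voxel coordinates (x, y, z)
Detection = Tuple[Voxel, int, float, List[float]]


def predict_from_voxels(
    scores: Dict[Voxel, List[float]],
    boxes: Dict[Voxel, List[float]],
    threshold: float = 0.3,  # hypothetical score threshold
) -> List[Detection]:
    """Return (voxel, class_id, score, box) for each high-scoring voxel/class pair."""
    detections: List[Detection] = []
    # Iterate over occupied voxels only; empty space is never visited.
    for voxel, class_scores in scores.items():
        for class_id, score in enumerate(class_scores):
            if score >= threshold:
                detections.append((voxel, class_id, score, boxes[voxel]))
    # Highest confidence first.
    detections.sort(key=lambda d: d[2], reverse=True)
    return detections


# Toy input: three occupied voxels, two classes, 7-value boxes
# (x, y, z, length, width, height, yaw). All values are made up.
scores = {
    (10, 4, 1): [0.92, 0.05],
    (11, 4, 1): [0.10, 0.02],
    (40, 7, 0): [0.01, 0.65],
}
boxes = {
    (10, 4, 1): [5.1, 2.0, 0.4, 4.5, 1.9, 1.6, 0.0],
    (11, 4, 1): [5.6, 2.0, 0.4, 4.4, 1.9, 1.6, 0.0],
    (40, 7, 0): [20.2, 3.5, 0.1, 0.8, 0.7, 1.7, 1.57],
}
dets = predict_from_voxels(scores, boxes)
for voxel, class_id, score, box in dets:
    print(voxel, class_id, score, box)
```

The cost of this loop scales with the number of occupied voxels rather than with the size of the scene grid, which is the point of keeping the pipeline fully sparse.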

CenterPoint, cvpr21 predicts one center per position of a dense feature map, i.e. it uses a dense prediction head
1. ⇒ wastes much computation on empty positions
2. ⇒ requires NMS post-processing, which keeps the detector from being elegant.
on nuScenes
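
To make the "wasted computation" point concrete, the sketch below counts how many positions a dense head would evaluate compared with a head that only visits occupied cells. The 180 x 180 grid size and the synthetic occupancy pattern are assumptions chosen for illustration, not figures from the paper.

```python
from typing import Set, Tuple


def head_positions(
    grid_w: int, grid_h: int, occupied: Set[Tuple[int, int]]
) -> Tuple[int, int, float]:
    """Count positions evaluated by a dense head versus a sparse head."""
    dense = grid_w * grid_h   # dense head: every cell of the feature map
    sparse = len(occupied)    # sparse head: occupied cells only
    return dense, sparse, sparse / dense


# Hypothetical occupancy: a regular scatter of 300 occupied cells.
occupied = {(x, y) for x in range(0, 180, 9) for y in range(0, 180, 12)}
dense, sparse, ratio = head_positions(180, 180, occupied)
print(f"dense={dense} sparse={sparse} ratio={ratio:.2%}")
```

With these made-up numbers the sparse head touches under 1% of the positions a dense head would, which is the kind of gap the notes above refer to.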

FSD V1, nips22 is complicated by its heavy reliance on object centers.
1. voting process inevitably introduces bias or error
FSD is promising on the long-range Argoverse2 benchmark, but its efficiency is inferior to VoxelNeXt's.
1. VoxelNeXt's mAP is much worse than that of FSD V2 (2023), but runtime was not compared.

2 Related work

Sparse CNN

Sparse CNNs have become the mainstream backbone networks in 3D deep learning [10, 11, 23, 41] because of their efficiency. But their representation ability is limited for prediction.
To remedy it, 3D detectors of [12, 41, 49, 53] rely on dense convolutional heads for feature enhancement.
Recent methods [6, 32] make convolutional modifications upon sparse CNNs.
Approaches of [21, 35] even substitute it with transformers for large receptive fields.
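
As a reminder of why sparse convolution is efficient, here is a toy 2D sketch in the submanifold style: outputs are computed only at active input sites, and only active neighbours contribute. The function `submanifold_conv2d`, the dictionary representation, and the all-ones kernel are illustrative assumptions; real backbones use optimized 3D libraries rather than anything like this.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int]


def submanifold_conv2d(
    features: Dict[Coord, float], kernel: Dict[Coord, float]
) -> Dict[Coord, float]:
    """Convolve only at active sites, summing over active neighbours."""
    out: Dict[Coord, float] = {}
    for (x, y) in features:                  # visit active sites only
        total = 0.0
        for (dx, dy), weight in kernel.items():
            neighbour = (x + dx, y + dy)
            if neighbour in features:        # inactive sites contribute nothing
                total += weight * features[neighbour]
        out[(x, y)] = total
    return out


# 3x3 kernel of ones, three active sites (two adjacent, one isolated).
kernel = {(dx, dy): 1.0 for dx in (-1, 0, 1) for dy in (-1, 0, 1)}
features = {(0, 0): 1.0, (0, 1): 2.0, (5, 5): 3.0}
out = submanifold_conv2d(features, kernel)
print(out)
```

The isolated site at (5, 5) only ever sees itself, which hints at why a plain sparse backbone has a limited effective receptive field unless it is deepened or downsampled further.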

Sparse detector

Methods of [16, 45, 46] avoid dense detection heads and instead introduce other complicated pipelines.
1. RSN [46] performs foreground segmentation on range images and then detects 3D objects on the remaining sparse data.
2. SWFormer [45] proposes a sparse transformer with delicate window splitting and multiple heads with feature pyramids.
  1. SST, cvpr22: what is the difference between the two?
A plain sparse CNN backbone network has been widely used in 3D object detectors [12, 40, 57]