One sentence: VoxelNeXt predicts one box for each voxel with a high score in each class; the center-missing issue can also be simply skipped through sparse networks that have large receptive fields.
Abstract
- For the first time, we show that a fully sparse voxel-based representation works decently for LIDAR 3D object detection and tracking
- Due to the inherent sparsity and the many background points, only a small number of points have responses, i.e., less than 1% for the Car class on average on the nuScenes validation set.
- Without bells and whistles, our model outperforms all existing LIDAR methods on the nuScenes tracking test benchmark
- better than FSD V1, nips22
- VoxelNeXt achieves leading performance with high efficiency on 3D object detection on three benchmarks: nuScenes [3], Waymo [44], and Argoverse2
- no need for sparse-to-dense conversion or NMS postprocessing
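
The voxel-wise prediction idea summarized above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Voxel` layout, the `predict_from_voxels` helper, the 7-number box tuple, and the `score_thresh` value are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Voxel:
    coord: tuple   # (x, y) integer voxel index; hypothetical layout
    scores: list   # one score per class, each in [0, 1]
    box: tuple     # box parameters regressed at this voxel (placeholder values)


def predict_from_voxels(voxels, score_thresh=0.3):
    """Return (class_id, score, box) for every voxel/class pair above the threshold.

    Only occupied voxels are visited, so no dense feature map is built.
    """
    detections = []
    for v in voxels:
        for class_id, score in enumerate(v.scores):
            if score >= score_thresh:
                detections.append((class_id, score, v.box))
    # Highest-scoring detections first.
    detections.sort(key=lambda d: d[1], reverse=True)
    return detections


# Three occupied voxels, two classes (illustrative values only).
voxels = [
    Voxel((10, 4), [0.9, 0.1], (10.2, 4.1, 0.5, 4.0, 1.8, 1.5, 0.0)),
    Voxel((11, 4), [0.2, 0.05], (10.9, 4.0, 0.5, 4.1, 1.8, 1.5, 0.0)),
    Voxel((30, 7), [0.1, 0.6], (30.1, 7.3, 0.9, 0.7, 0.7, 1.7, 1.2)),
]
dets = predict_from_voxels(voxels)
```

Here each surviving voxel contributes its own box directly. How the paper suppresses duplicate boxes from neighbouring voxels without NMS is not covered by these notes, so the sketch leaves that step out.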

- CenterPoint, cvpr21 predicts one center per position of a dense feature map, i.e. it uses a dense prediction head
- ⇒ wastes much computation
- ⇒ requires NMS, preventing the detector from being elegant.
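
To make the "wasted computation" point concrete, here is a small back-of-the-envelope sketch comparing how many positions a dense head evaluates against how many a sparse head would. The grid size and the occupancy pattern are assumptions picked for illustration, not measurements from the paper, and `head_positions` is a hypothetical helper.

```python
def head_positions(height, width, occupied):
    """Count positions evaluated by a dense head vs. a sparse head.

    A dense head runs on every cell of the height x width map;
    a sparse head runs only on the occupied cells.
    """
    dense = height * width
    sparse = len(occupied)
    return dense, sparse, sparse / dense


# Assumed 180 x 180 BEV map with one occupied cell every 10 rows and columns.
occupied = {(r, c) for r in range(0, 180, 10) for c in range(0, 180, 10)}
dense, sparse, ratio = head_positions(180, 180, occupied)
print(f"dense={dense} sparse={sparse} ratio={ratio:.2%}")
```

With these assumed numbers the sparse head touches about 1% of the positions the dense head does, which is the order of magnitude the abstract notes quote for the Car class.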

on nuScenes

- FSD V1, nips22 is complicated by its heavy belief in object centers.
- voting process inevitably introduces bias or error
- it is promising at the large-range Argoverse2, while its efficiency is inferior to ours.
- mAP is much worse than FSD V2, 23, but runtime was not compared
2 Related work
Sparse CNN
- Sparse CNNs have become the mainstream backbone networks in 3D deep learning [10, 11, 23, 41] for their efficiency, but their representation ability is limited for prediction.
- To remedy it, 3D detectors of [12, 41, 49, 53] rely on dense convolutional heads for feature enhancement.
- Recent methods [6, 32] make convolutional modifications upon sparse CNNs.
- Approaches of [21, 35] even substitute it with transformers for large receptive fields.
Sparse detector
- Methods of [16, 45, 46] avoid dense detection heads and instead introduce other complicated pipelines.
- RSN [46] performs foreground segmentation on range images and then detects 3D objects on the remaining sparse data.
- SWFormer [45] proposes a sparse transformer with delicate window splitting and multiple heads with feature pyramids.
- SST, cvpr22: what is the difference between the two?
- The plain sparse CNN backbone network has been widely used in 3D object detectors [12, 40, 57]