https://github.com/TuSimple/SST
0 Abstract
-
merits
-
FSD achieves SOTA performance and is 2.4× faster than the dense counterpart.
-
tested on Argoverse 2 Dataset (200m) & Waymo Open Dataset (75m)

Fig 2: Short-range point clouds (red, from KITTI [2]) vs. long-range point clouds (blue, from Argoverse 2 [4]). The radius of the red circle is 75 meters. The sparsity quickly increases as the range extends.
-
backgrounds / why sparse?
- Mainstream 3D object detectors usually build dense feature maps. Both dense detectors and semi-dense detectors are hard to scale up to long-range scenarios because their detection heads require dense feature maps.
- the size of dense BEV feature maps grows quadratically with the perception range [this should be referring to the BEV feature map]
- the sparsity of LiDAR point clouds also increases as the perception range extends (see Fig. 2), so computation on the unoccupied area is essentially wasted
- ⇒ cost roughly linear in the number of points and independent of the perception range
- [this is a benefit of sparsity itself, not something unique to FSD]
- ⇒ the need of sparse detection
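The quadratic-vs-linear argument above can be made concrete with a back-of-the-envelope sketch (my own illustration, not from the paper; the 0.2 m voxel size is an assumed typical value):

```python
# Sketch: memory of a dense BEV feature map grows quadratically with
# perception range, while a sparse representation stores only occupied
# voxels, roughly linear in the number of points.

def dense_bev_cells(range_m, voxel_m=0.2):
    """Number of cells in a square BEV grid covering [-range_m, range_m]^2."""
    side = int(2 * range_m / voxel_m)
    return side * side

# Doubling the range from 75 m (Waymo) to 150 m quadruples the dense grid,
# even though the extra area is mostly empty at long range.
assert dense_bev_cells(150.0) == 4 * dense_bev_cells(75.0)
```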
-
Components:
- sparse voxel encoder + clustering based on CCL + a novel sparse instance recognition (SIR) module
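To fix the idea of the clustering component, here is a toy connected-component labeling (CCL) pass over occupied voxels (a minimal 2D stand-in; FSD's actual clustering operates on voted centers in 3D, and this sketch is my own illustration):

```python
from collections import deque

def ccl_cluster(voxels):
    """Group occupied 2D voxel coordinates into connected components
    (4-neighborhood), via BFS. `voxels` is a set of (x, y) int tuples."""
    seen, clusters = set(), []
    for start in voxels:
        if start in seen:
            continue
        seen.add(start)
        comp, queue = set(), deque([start])
        while queue:
            x, y = queue.popleft()
            comp.add((x, y))
            for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if nb in voxels and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append(comp)
    return clusters

# Two spatially separated blobs -> two instance candidates.
assert len(ccl_cluster({(0, 0), (0, 1), (5, 5)})) == 2
```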
1 Introduction
1.1 Categories
- Point-based sparse detectors: [slow: time-consuming neighborhood query]
- PointRCNN [27] 3DSSD [40]
- VoteNet (ICCV19 oral, Best Paper Nominee) [23] first performs center voting and then generates proposals from the voted centers, achieving better precision.
- its point feature extraction network is PointNet++
- SIR avoids the time-consuming neighbor queries in previous point-based methods by grouping points into instances [the authors count grouping as part of SIR here; later in the paper, SIR starts after grouping]
- [VoteNet is also instance-level, so it does not have this problem either]
- Although many methods have tried to accelerate point-based methods, the time-consuming neighborhood query remains unaffordable for large-scale point clouds (more than 100k points per scene).
- So current benchmarks [31, 2] with large-scale point clouds are dominated by voxel-based dense/semi-dense detectors [11, 30, 15].
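The cost of the neighborhood query mentioned above can be seen in a naive ball query, which compares every query against every point (O(N·M) distance checks; this brute-force sketch is my own illustration, real systems use spatial indexing or GPU kernels):

```python
# Brute-force ball query: for each query point, find indices of all points
# within `radius`. On 100k+ points this O(N*M) scan is what makes
# point-based detectors slow at scale.

def ball_query_bruteforce(points, queries, radius):
    r2 = radius * radius
    out = []
    for qx, qy, qz in queries:
        neighbors = [i for i, (px, py, pz) in enumerate(points)
                     if (px - qx) ** 2 + (py - qy) ** 2 + (pz - qz) ** 2 <= r2]
        out.append(neighbors)
    return out

pts = [(0, 0, 0), (0.5, 0, 0), (10, 0, 0)]
assert ball_query_bruteforce(pts, [(0, 0, 0)], 1.0) == [[0, 1]]
```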
- Voxel-based dense detectors: sparse point cloud → dense feature maps [high cost in memory]
- detectors utilizing dense feature maps, produced by dense convolution, are referred to as dense detectors
- The pioneering work VoxelNet (CVPR18) [45] applies dense convolution to the 3D voxel representation via its Convolutional Middle Layers.
- PIXOR [39] and PointPillars [13] adopt 2D dense convolution on the Bird's Eye View (BEV) feature map, achieving significant efficiency improvements.
- Voxel-based semi-dense detectors: incorporate both sparse features and dense features.
- SECOND [38] adopts sparse convolution to extract sparse voxel features in 3D space, which are then converted to dense BEV feature maps to enlarge the receptive field.
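The sparse-to-dense conversion in semi-dense detectors like SECOND boils down to scattering per-voxel feature vectors into a dense zero-filled BEV grid (a minimal pure-Python sketch with my own naming; real implementations use tensor scatter ops):

```python
# Scatter sparse voxel features into a dense (H, W, C) BEV grid so that
# ordinary 2D convolutions can run on it. Unoccupied cells stay zero --
# this densification is exactly what fully sparse detectors avoid.

def scatter_to_bev(coords, feats, grid_h, grid_w, channels):
    """coords: list of (row, col); feats: list of length-`channels` vectors."""
    bev = [[[0.0] * channels for _ in range(grid_w)] for _ in range(grid_h)]
    for (r, c), f in zip(coords, feats):
        bev[r][c] = list(f)
    return bev

bev = scatter_to_bev([(0, 1)], [[1.0, 2.0]], grid_h=2, grid_w=2, channels=2)
assert bev[0][1] == [1.0, 2.0] and bev[0][0] == [0.0, 0.0]
```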
1.2 Challenges of sparse detectors: CFM
1.2.1 Center feature is important
Almost all popular voxel- or pillar-based detectors [28, 5, 42, 30, 38] in this field adopt center-based or anchor-based assignment, since the center feature is the best representation of the whole object.
[certainly better than features that drift away from the object center]
1.2.2 Center Feature Missing (CFM)
sparse feature map ⇒ a challenge: center feature missing (CFM)
todo: draw a 2D illustration with samples on a curve, compared with a 2D illustration with samples on a projected 3D surface.

Illustration of CFM and feature diffusion on dense feature maps from Bird's Eye View: the empty instance center (red dot) is filled by features diffused from occupied voxels (those containing LiDAR points) after several convolutions.
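CFM can be reproduced in a toy 2D example (my own illustration, not from the paper): LiDAR only hits an object's surface, so after voxelization the object's center voxel can be empty in a strictly sparse feature map.

```python
import math

def occupied_voxels(points, voxel=1.0):
    """Map 2D points to the set of occupied integer voxel coordinates."""
    return {(int(x // voxel), int(y // voxel)) for x, y in points}

# Points on a circle of radius 5 around (0, 0) -- like LiDAR hits on an
# object's outline in BEV. The object's center voxel receives no point.
surface = [(5 * math.cos(t / 10), 5 * math.sin(t / 10)) for t in range(63)]
occ = occupied_voxels(surface)

assert (0, 0) not in occ  # center feature missing without feature diffusion
```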