https://github.com/tusen-ai/sst
Understanding this paper requires familiarity with the details of VoteNet and FSD v1.
Abstract
- use of virtual voxels as an alternative to clustering in FSD v1
- ⇒ eliminating the inductive bias ⇒ better general applicability.
- ⇒ simplify FSD v1: a more elegant and streamlined approach
- FSD v1: sparse CNN on voxels + point cloud network
- FSD v2: operate only on voxels?
- No: it still has point-wise classification and center voting.
- SOTA performance on Waymo Open, Argoverse 2 and nuScenes
Introduction
FSD v1 employs an instance-level representation (clusters, SIR), which introduces a strong inductive bias and impedes general applicability
- Point Feature Extraction via sparse CNN on voxels
- point-wise classification and center voting based on MLP
- foreground points ⇒ voted centers
- Clustering.
- Connected Component Labeling (CCL) is applied to the voted centers to cluster points into instances.
- SIR: Instance feature extraction and box prediction via “PointNet”
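
The voting and clustering steps above can be sketched as follows. This is a minimal illustration, not the FSD v1 implementation: the function name `cluster_voted_centers`, the distance threshold, and the brute-force union-find are assumptions made for the example, and the offsets are hand-written stand-ins for what the MLP would predict.

```python
import numpy as np

def cluster_voted_centers(centers, dist_thresh):
    """Connected-component labeling on voted centers (hypothetical helper).

    Two centers are connected if they are closer than dist_thresh;
    each connected component is treated as one instance.
    Brute-force O(n^2), fine for a toy example only.
    """
    n = len(centers)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    diff = centers[:, None, :] - centers[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Visit each close pair (i < j) once and merge their components.
    for i, j in zip(*np.nonzero(np.triu(dist < dist_thresh, k=1))):
        ri, rj = find(int(i)), find(int(j))
        if ri != rj:
            parent[rj] = ri

    roots = np.array([find(i) for i in range(n)])
    # Relabel roots to consecutive instance ids.
    _, labels = np.unique(roots, return_inverse=True)
    return labels.reshape(-1)

# Foreground points of two objects (made-up coordinates).
points = np.array([[-0.4, 0.0, 0.0], [0.5, 0.0, 0.0],
                   [4.5, 5.0, 0.0], [5.6, 5.0, 0.0], [5.0, 5.7, 0.0]])
# Stand-in for MLP-predicted offsets toward each object's center.
offsets = np.array([[0.4, 0.0, 0.0], [-0.4, 0.0, 0.0],
                    [0.5, 0.0, 0.0], [-0.5, 0.0, 0.0], [0.1, -0.6, 0.0]])

voted = points + offsets            # center voting
labels = cluster_voted_centers(voted, dist_thresh=0.5)
```

In this toy input the raw points of each object are farther apart than the threshold, but their voted centers are not, which is why clustering is done on the voted centers rather than on the raw points.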
Treatments to Center Feature Missing

FSDv2 replaces clusters in FSDv1 with virtual voxels (red voxels) from the voted centers (red points)
- virtual voxels are derived by voxelizing the voted centers.
- virtual: because the voted centers are artificial and not the real points obtained by sensors
- discarding its instance-level representation, pursuing better general applicability
- virtual voxel ⇒ box? no
- a virtual voxel may only contain a partial set of the voted centers
- ⇒ a light-weight sparse Virtual Voxel Mixer (VVM)
- aggregate the features of different virtual voxels belonging to a specific object, resulting in better features covering the whole instance
- VVM intuitively mimics the behavior of SIR in FSD v1, but does not depend on explicitly generated instances
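
Deriving virtual voxels by voxelizing the voted centers can be sketched like this. It is a rough illustration under stated assumptions: the function name `voxelize_voted_centers`, the voxel size, and mean pooling of per-center features are choices made for the example and are not taken from the paper.

```python
import numpy as np

def voxelize_voted_centers(centers, feats, voxel_size):
    """Group voted centers into virtual voxels (hypothetical helper).

    centers: (N, 3) voted center coordinates
    feats:   (N, C) per-center features
    Returns the integer voxel coordinates, one pooled feature per
    virtual voxel, and the voxel index of every center.
    """
    coords = np.floor(centers / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # shape differs across NumPy versions

    # Mean-pool the features of all centers in the same voxel
    # (assumption: the paper may use a different voxel encoder).
    sums = np.zeros((len(uniq), feats.shape[1]))
    np.add.at(sums, inverse, feats)
    counts = np.bincount(inverse, minlength=len(uniq))
    return uniq, sums / counts[:, None], inverse

# Three voted centers; the first two fall into the same 0.5 m voxel.
centers = np.array([[0.1, 0.1, 0.1], [0.2, 0.3, 0.1], [1.6, 0.1, 0.1]])
feats = np.array([[1.0, 0.0], [3.0, 0.0], [5.0, 2.0]])

vox_coords, vox_feats, center_to_voxel = voxelize_voted_centers(
    centers, feats, voxel_size=0.5)
```

Because the grouping depends only on the voxel grid, the voted centers of one object can land in several virtual voxels, which is the gap the VVM is meant to close by mixing features across them.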