Understanding these notes requires familiarity with the details of VoteNet and FSDv1.

Abstract

Introduction

FSDv1 employs an instance-level representation (clusters, SIR), which introduces a strong inductive bias and impedes general applicability

Point feature extraction via a sparse CNN on voxels
Point-wise classification and center voting with an MLP
1. foreground points ⇒ voted centers
Clustering.
1. Connected Component Labeling (CCL) is applied to the voted centers to cluster points into instances.
SIR (Sparse Instance Recognition): instance feature extraction and box prediction via “PointNet”
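
The clustering step above can be sketched as follows. This is only a minimal illustration, not the FSDv1 implementation: it assumes two voted centers are connected when their Euclidean distance is below a threshold, and the function name `cluster_voted_centers` and the parameter `radius` are hypothetical. It uses a brute-force O(n²) distance matrix with union-find, which is fine for a toy example but not for a full LiDAR scan.

```python
import numpy as np


def cluster_voted_centers(centers: np.ndarray, radius: float) -> np.ndarray:
    """Assign a connected-component label to each voted center.

    Assumption: two centers are connected if they are closer than `radius`.
    """
    n = len(centers)
    parent = list(range(n))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise distances between all voted centers, shape (n, n).
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)

    # Union every pair (i < j) that lies within the connection radius.
    for i, j in zip(*np.nonzero(np.triu(dist < radius, k=1))):
        ri, rj = find(int(i)), find(int(j))
        if ri != rj:
            parent[rj] = ri

    # Map root indices to consecutive labels 0..K-1.
    roots = np.array([find(i) for i in range(n)])
    _, labels = np.unique(roots, return_inverse=True)
    return labels


centers = np.array([
    [0.0, 0.0, 0.0], [0.1, 0.0, 0.0],                    # votes for object A
    [5.0, 5.0, 0.0], [5.1, 5.0, 0.0], [5.05, 5.1, 0.0],  # votes for object B
])
labels = cluster_voted_centers(centers, radius=0.5)
```

Each resulting label corresponds to one instance, i.e. the group of points that SIR would then process.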

FSDv2 replaces clusters in FSDv1 with virtual voxels (red voxels) from the voted centers (red points)

virtual voxels are derived by voxelizing the voted centers.
1. virtual: because the voted centers are artificial and not the real points obtained by sensors
FSDv2 thereby discards the instance-level representation of FSDv1, pursuing better general applicability
can a box be predicted directly from a single virtual voxel? no
1. a virtual voxel may contain only a subset of an object's voted centers
⇒ a light-weight sparse Virtual Voxel Mixer (VVM)
1. aggregate the features of different virtual voxels belonging to a specific object, resulting in better features covering the whole instance
2. VVM intuitively mimics the behavior of SIR in FSDv1, but does not depend on explicitly generated instances
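
To make the notion of virtual voxels concrete, here is a rough sketch of voxelizing voted centers. It is an illustration under stated assumptions, not the FSDv2 code: the function name `virtual_voxels`, the parameter `voxel_size`, and the use of mean pooling for the voxel feature are all hypothetical choices.

```python
import numpy as np


def virtual_voxels(voted_centers: np.ndarray, features: np.ndarray, voxel_size: float):
    """Group voted centers into voxels and mean-pool their features.

    Returns (voxel_coords, voxel_features, point_to_voxel).
    Assumption: mean pooling stands in for whatever voxel encoder is used.
    """
    # Integer grid coordinate of the voxel containing each voted center.
    coords = np.floor(voted_centers / voxel_size).astype(np.int64)

    # Unique occupied voxels, plus the voxel index of every voted center.
    voxel_coords, point_to_voxel = np.unique(coords, axis=0, return_inverse=True)
    point_to_voxel = point_to_voxel.reshape(-1)  # keep 1-D across NumPy versions

    # Sum the features per voxel, then divide by the number of votes in it.
    sums = np.zeros((len(voxel_coords), features.shape[1]))
    np.add.at(sums, point_to_voxel, features)
    counts = np.bincount(point_to_voxel, minlength=len(voxel_coords))
    return voxel_coords, sums / counts[:, None], point_to_voxel


# Three votes for the same object; the last one falls into a neighboring voxel.
voted = np.array([[0.1, 0.1, 0.1], [0.2, 0.3, 0.1], [0.6, 0.1, 0.1]])
feats = np.array([[1.0], [3.0], [5.0]])
vox_coords, vox_feats, idx = virtual_voxels(voted, feats, voxel_size=0.5)
```

In this toy input the votes of a single object end up in two different virtual voxels, so neither voxel sees the whole instance. That is the gap the VVM is meant to close by mixing features across neighboring virtual voxels.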