https://github.com/TuSimple/SST

No feature map downsampling! Why?


The ratio of object size to scene size is significantly smaller than in 2D detection cases.

  1. Distribution of the relative object size Srel in COCO dataset [28] and Waymo Open Dataset (WOD).
  2. Srel is defined as sqrt(Ao/As), where Ao denotes the area of 2D objects (COCO) and the BEV area of 3D objects (WOD).
  3. As is the image area in COCO, and 150m × 150m in WOD.
  4. 73.03% of objects in COCO have an Srel larger than 0.04, while only 0.54% of objects in WOD do.
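The Srel gap is easy to see with a quick back-of-the-envelope computation. The numbers below are hypothetical but typical: a medium-sized COCO object versus a pedestrian's BEV footprint in the 150m × 150m WOD scene.

```python
import math

def s_rel(object_area: float, scene_area: float) -> float:
    """Relative object size: sqrt of the area ratio (the paper's S_rel)."""
    return math.sqrt(object_area / scene_area)

# COCO-style 2D case (hypothetical numbers): a 120x80-pixel object in a 640x480 image.
coco = s_rel(120 * 80, 640 * 480)   # ~0.18, comfortably above 0.04

# WOD BEV case (hypothetical): a pedestrian footprint of ~0.8m x 0.8m
# in the 150m x 150m scene used for As.
wod = s_rel(0.8 * 0.8, 150 * 150)   # ~0.005, far below 0.04

print(round(coco, 4), round(wod, 4))
```

Even a fairly large object in the BEV scene lands more than an order of magnitude below the 0.04 threshold that most COCO objects exceed.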


Multi-stride 3D detectors vs. our Single-stride Sparse Transformer (SST)

  1. Overlooking this difference, many 3D detectors directly follow the common practice of 2D detectors and downsample their feature maps ⇒ this brings few advantages and leads to inevitable information loss.
  2. SST: maintain the original resolution
    1. addresses the problem of insufficient receptive field in single-stride architectures.
    2. It also cooperates well with the sparsity of point clouds and naturally avoids expensive computation.
    3. achieves strong performance on small-object (pedestrian) detection (83.8 LEVEL 1 AP on the Waymo validation split, 75m range) thanks to the single-stride design.
  3. The SST backbone shows higher accuracy but lower speed than a sparse-convolution-based U-Net on voxels, as shown in FSD v1 (NeurIPS 2022).
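Why downsampling hurts small objects can be shown with simple arithmetic. With illustrative (not paper-reported) numbers, a pedestrian's BEV footprint shrinks below one feature-map cell after only a couple of stride steps, while a single-stride network keeps the full resolution:

```python
# Illustrative numbers: how many BEV cells a pedestrian footprint spans
# at each stride of a typical multi-stride backbone.
voxel_size = 0.32   # meters per BEV cell (assumed, a common setting)
ped_extent = 0.8    # pedestrian footprint side length in meters (assumed)

for stride in (1, 2, 4, 8):
    cells = ped_extent / (voxel_size * stride)
    print(f"stride {stride}: ~{cells:.2f} cells per side")
```

At stride 4 and beyond the object occupies less than one cell per side, so its features are mixed with background; a single-stride backbone avoids this loss entirely.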


Architecture overview for SST


  1. Voxelizes the point cloud and extracts voxel features following prior work [20, 62, 70].
  2. SST treats each voxel and its feature as a “token.”
  3. Regional Grouping: partitions the voxelized 3D space into fixed-size, non-overlapping regions.
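The regional grouping step can be sketched as follows. This is a minimal, hypothetical illustration (the region size and coordinates are made up, not taken from the SST code): each non-empty voxel's integer coordinates are floor-divided by the region size, and tokens sharing the same region key are grouped so attention is computed only within a region.

```python
import numpy as np

# Assumed window side length, in voxels.
region_size = 4

# Fake sparse BEV voxel coordinates (x, y) for 6 non-empty voxels (tokens).
coords = np.array([[0, 0], [1, 3], [4, 4], [5, 7], [8, 1], [9, 2]])

# Integer window coordinates: which fixed-size region each token falls in.
region_ids = coords // region_size

# Flatten the 2D region coordinate into a single hashable group key.
keys = region_ids[:, 0] * 1000 + region_ids[:, 1]

groups: dict[int, list[int]] = {}
for tok, key in enumerate(keys):
    groups.setdefault(int(key), []).append(tok)

print(groups)  # tokens in the same list would attend to each other
```

Because only non-empty voxels are stored, the cost of in-region attention scales with the number of actual tokens, which is how the design "cooperates with the sparsity of point clouds."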