https://github.com/TuSimple/SST
No feature map downsampling! Why?

In 3D detection, the ratio of object size to scene size is significantly smaller than in 2D detection.
- Distribution of the relative object size Srel in the COCO dataset [28] and the Waymo Open Dataset (WOD):
- Srel is defined as sqrt(Ao/As), where Ao denotes the area of 2D objects (COCO)
and the BEV area of 3D objects (WOD).
- As is the image area in COCO, and 150m × 150m in WOD.
- 73.03% of objects in COCO have an Srel larger than 0.04, while only 0.54% of objects in WOD do.
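The Srel definition above is easy to sanity-check numerically. A minimal sketch, where the pedestrian footprint and image/object sizes are illustrative values I chose, not numbers from the paper:

```python
import math

def relative_object_size(object_area: float, scene_area: float) -> float:
    """S_rel = sqrt(A_o / A_s), the relative object size as defined above."""
    return math.sqrt(object_area / scene_area)

# A typical pedestrian BEV footprint (~0.9 m x 0.8 m, assumed) in the
# 150 m x 150 m WOD scene area used above:
ped = relative_object_size(0.9 * 0.8, 150.0 * 150.0)

# A medium-sized COCO object (~96 x 96 px, assumed) in a 640 x 480 image:
coco = relative_object_size(96 * 96, 640 * 480)

print(f"WOD pedestrian S_rel ~ {ped:.4f}")   # well below the 0.04 threshold
print(f"COCO object    S_rel ~ {coco:.4f}")  # well above the 0.04 threshold
```

With these rough numbers, the WOD pedestrian lands around 0.006 and the COCO object around 0.17, matching the gap the statistics above describe.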

multi-stride 3D detectors vs our Single-stride Sparse Transformer (SST)
- Overlooking this difference, many 3D detectors directly follow the common practice of 2D detectors and downsample their feature maps ⇒ this brings few advantages and leads to inevitable information loss.
- SST: maintains the original resolution throughout the backbone
- addresses the problem of insufficient receptive field in single-stride architectures.
- It also cooperates well with the sparsity of point clouds and naturally avoids expensive computation.
- achieves strong performance on small-object (pedestrian) detection (83.8 LEVEL_1 AP on the validation split) thanks to the single-stride design. [Waymo Open Dataset, 75 m range.]
- The SST backbone shows higher accuracy but lower speed than a sparse-convolution-based voxel U-Net, as shown in FSD v1 (NeurIPS 2022).
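The information-loss argument above is just stride arithmetic: after a few downsampling steps, a pedestrian-sized object occupies less than one BEV cell. A quick sketch, where the 0.32 m voxel size and 0.9 m pedestrian extent are assumptions (common WOD-style settings), not values from the notes:

```python
# How many BEV cells a small object spans at different strides.
voxel = 0.32        # metres per BEV cell at stride 1 (assumed)
ped_extent = 0.9    # typical pedestrian BEV extent in metres (assumed)

for stride in (1, 2, 4, 8):
    cell = voxel * stride           # effective cell size after downsampling
    cells = ped_extent / cell       # cells spanned by the pedestrian
    print(f"stride {stride}: cell {cell:.2f} m, pedestrian spans {cells:.2f} cells")
```

At stride 8 the effective cell is 2.56 m, so a pedestrian covers about a third of a single cell; this is why downsampling that is harmless for large COCO objects destroys small 3D objects.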

Architecture overview for SST

- voxelizes the point cloud and extracts voxel features following prior work [20, 62, 70]
- SST treats each voxel and its feature as a "token."
- Regional Grouping: partitions the voxelized 3D space into fixed-size, non-overlapping regions
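The regional grouping step above can be sketched with integer floor division on BEV voxel indices. This is a minimal NumPy illustration under my own assumptions (the `region_size`, the flattened-key scheme, and the dict output are illustrative choices, not SST's actual implementation):

```python
import numpy as np

def group_voxels_into_regions(voxel_coords: np.ndarray, region_size: int):
    """Assign each non-empty voxel to a fixed-size, non-overlapping BEV region.

    voxel_coords: (N, 2) integer BEV indices (x, y) of non-empty voxels.
    region_size:  region edge length in voxels.
    Returns a dict mapping a scalar region key -> row indices of its voxels.
    """
    region_ids = voxel_coords // region_size              # (N, 2) region grid coords
    keys = region_ids[:, 0] * 10_000 + region_ids[:, 1]   # flatten (x, y) to a scalar
    groups: dict[int, list[int]] = {}
    for i, k in enumerate(keys):
        groups.setdefault(int(k), []).append(i)
    return {k: np.array(v) for k, v in groups.items()}

# Toy example: 5 non-empty voxels, grouped into 4x4-voxel regions.
coords = np.array([[0, 1], [3, 2], [4, 4], [5, 7], [9, 9]])
groups = group_voxels_into_regions(coords, region_size=4)
# voxels 0 and 1 fall in region (0, 0); 2 and 3 in (1, 1); 4 in (2, 2)
```

Because only non-empty voxels are grouped, the cost scales with the number of occupied voxels rather than the full dense grid, which is what lets SST exploit point-cloud sparsity.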