Introduction
- Several methods project point clouds into a perspective view and apply image-based feature extraction techniques [28, 15, 22].
- Other approaches rasterize point clouds into a 3D voxel grid and encode each voxel with handcrafted features [41, 9, 37, 38, 21, 5]; see [5] (MV, CVPR 2017).
- A major breakthrough in recognition [20] and detection [13] tasks on images was due to moving from hand-crafted features to machine-learned features.
- PointNet (CVPR 2017) and PointNet++ usually operate on ∼1k points.
- LiDAR ⇒ ∼100k points; training on them directly ⇒ high computational and memory requirements.
Method

- Feature learning network ⇒ initial voxel feature from points inside each voxel.
- Grouping: group the points according to the voxel they reside in. Note that the points are non-uniformly distributed across voxels.
- Random sampling: sample at most T points from each voxel:
- (1) computational savings;
- (2) it decreases the point-count imbalance between voxels, which reduces sampling bias and adds more variation to training.
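The grouping and random-sampling steps can be sketched in NumPy. This is a toy illustration, not the paper's implementation: the dict-based grouping, the single scalar `voxel_size`, and the function name are assumptions.

```python
import numpy as np

def group_and_sample(points, voxel_size, T, rng=None):
    """Group points by the voxel they fall into, then keep at most T
    points per voxel via uniform random sampling (a toy sketch)."""
    rng = rng or np.random.default_rng(0)
    voxel_idx = np.floor(points / voxel_size).astype(np.int64)
    groups = {}
    # grouping: bucket point indices by voxel coordinate
    for i, key in enumerate(map(tuple, voxel_idx)):
        groups.setdefault(key, []).append(i)
    # random sampling: cap each voxel at T points
    sampled = {}
    for key, idxs in groups.items():
        idxs = np.asarray(idxs)
        if len(idxs) > T:
            idxs = rng.choice(idxs, size=T, replace=False)
        sampled[key] = points[idxs]
    return sampled
```

This runs once per scan before voxel feature encoding; voxels with fewer than T points are kept unchanged, which is where the imbalance reduction comes from.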
- VFE Layer-1: essentially a PointNet applied to the points inside a voxel, run in parallel over all non-empty voxels.
- gray: MaxPooled feature of the points inside a voxel.
- color: point wise feature.
- stacking several VFE layers ⇒ allows learning complex features that characterize local 3D shape information.
- Still, this only exploits inter-point interactions within a voxel.
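A single VFE layer can be sketched as a per-point linear map, a voxel-wise max-pool, and a concatenation of the pooled vector back onto every point. This is a minimal NumPy sketch: batch norm is omitted, and `W`, `b` are hypothetical layer parameters.

```python
import numpy as np

def vfe_layer(points_feat, W, b):
    """One VFE layer sketched in NumPy (batch norm omitted).

    points_feat: (T, C_in) features of the points in one voxel.
    W, b: hypothetical per-point linear parameters, C_in -> C_out // 2.
    """
    pw = np.maximum(points_feat @ W + b, 0.0)        # point-wise FC + ReLU
    pooled = pw.max(axis=0, keepdims=True)           # voxel-wise max-pool ("gray")
    pooled = np.repeat(pooled, pw.shape[0], axis=0)  # broadcast back to points
    return np.concatenate([pw, pooled], axis=1)      # point-wise ("color") + aggregated
```

Because the output is again per-point, these layers stack directly; the final layer's max-pooled vector is the voxel feature.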
- Convolutional middle layers ⇒ refine voxel features by aggregating context across voxels.
- input: Sparse Tensor Representation
- ∼100k points, but more than 90% of voxels typically are empty
- so non-empty voxel features are represented as a sparse tensor of shape C × D′ × H′ × W′ (C channels; D′, H′, W′ for depth, height, width)
- Apply dense Conv3D on the sparse tensor ⇒ better voxel features (regular dense Conv3D is used; the sparse tensor serves only as a storage format).
- In practice this is done on either the 3D or the bird's-eye-view representation; see Tables 1 & 2. The bird's-eye-view variant follows [5] and is equivalent to a volume grid.
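Scattering the non-empty voxel features into a dense 4D tensor, so that ordinary Conv3D can be applied on top, can be sketched as follows; the `to_dense` name and the shapes are illustrative, not the paper's code.

```python
import numpy as np

def to_dense(voxel_feats, coords, grid_shape):
    """Scatter K non-empty voxel features into a dense (C, D', H', W')
    tensor; empty voxels stay zero, so ordinary dense Conv3D can follow."""
    K, C = voxel_feats.shape
    dense = np.zeros((C,) + tuple(grid_shape), dtype=voxel_feats.dtype)
    d, h, w = coords.T                 # integer voxel coordinates, one per voxel
    dense[:, d, h, w] = voxel_feats.T  # one fancy-indexed scatter
    return dense
```

Since typically over 90% of voxels are empty, the dense tensor is mostly zeros; the sparse (feature, coordinate) pair is only the compact storage form.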
- Region proposal network (RPN)
Efficient implementation (before the sparse tensor stage)

Voxel input feature buffer: a K × T × 7 tensor, where K is the maximum number of non-empty voxels, T is the maximum number of points per voxel, and 7 is the input encoding dimension of each point.
This layout is convenient both for the per-voxel PointNet and for building the subsequent sparse tensor.
Voxel coordinate buffer: records the voxel coordinate of each row in the feature buffer.
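Building both buffers in one pass over the points can be sketched as follows, assuming the paper's 7-dim point encoding (x, y, z, reflectance, plus each point's offset to its voxel centroid). The Python dict stands in for the hash table from voxel coordinate to buffer row; points beyond K voxels or T points per voxel are simply dropped, which is an assumption about the overflow policy.

```python
import numpy as np

def build_buffers(points, voxel_size, K, T):
    """One pass over (x, y, z, reflectance) points into a (K, T, 7)
    voxel input feature buffer and a (K, 3) voxel coordinate buffer."""
    feat_buf = np.zeros((K, T, 7), dtype=np.float32)
    coord_buf = np.zeros((K, 3), dtype=np.int64)
    counts = np.zeros(K, dtype=np.int64)
    table = {}  # hash table: voxel coordinate -> buffer row
    for p in points:
        key = tuple(np.floor(p[:3] / voxel_size).astype(np.int64))
        row = table.get(key)
        if row is None:
            if len(table) == K:   # voxel buffer full: drop the point
                continue
            row = len(table)
            table[key] = row
            coord_buf[row] = key
        t = counts[row]
        if t < T:                 # voxel full: ignore extra points
            feat_buf[row, t, :4] = p
            counts[row] += 1
    # second pass over voxels: fill dims 4:7 with offsets to the centroid
    for row, t in enumerate(counts):
        if t > 0:
            centroid = feat_buf[row, :t, :3].mean(axis=0)
            feat_buf[row, :t, 4:7] = feat_buf[row, :t, :3] - centroid
    return feat_buf, coord_buf, counts
```

The coordinate buffer is exactly what the sparse-tensor scatter step needs, which is why the two buffers are built together.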
Experiments

- BV: bird view; FV: front view.
- hand-crafted baseline (HC-baseline): the VoxelNet architecture but with hand-crafted input features
- used to test whether end-to-end feature learning outperforms hand-crafted features
- HC-baseline uses the bird’s eye view features described in [5]
References
VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection, CVPR 2018.