Introduction

  1. Several methods project point clouds into a perspective view and apply image-based feature extraction techniques [28, 15, 22].
  2. Other approaches rasterize point clouds into a 3D voxel grid and encode each voxel with handcrafted features [41, 9, 37, 38, 21, 5]; see MV3D [5], CVPR 2017.
    1. A major breakthrough in recognition [20] and detection [13] tasks on images was due to moving from hand-crafted features to machine-learned features.
  3. PointNet (CVPR 2017) and PointNet++ usually use ∼1k points as input.
    1. LiDAR ⇒ ∼100k points, training with them ⇒ high computational and memory requirements.

Method


  1. Feature learning network ⇒ initial voxel feature from points inside each voxel.
    1. Grouping: group the points according to the voxel they reside in. Note that the points are non-uniformly distributed across the voxels.
    2. Random sampling: sample T points for each voxel:
      1. (1) computational savings;
      2. (2) decreases the imbalance of points between voxels, which reduces sampling bias and adds more variation to training.
    3. VFE Layer-1: essentially PointNet applied to the points inside a voxel; runs in parallel over all non-empty voxels.
      1. gray: max-pooled feature of the points inside a voxel.
      2. color: point-wise feature.
      3. stack several VFE layers ⇒ allows learning complex features for characterizing local 3D shape information.
      4. Still, this only captures inter-point interaction within a voxel, not across voxels.
  2. Convolutional middle layers ⇒ aggregate voxel features across voxels
    1. input: Sparse Tensor Representation
      1. ∼100k points, but typically more than 90% of the voxels are empty
      2. so represent the non-empty voxel features as a sparse tensor of shape C × D′ × H′ × W′ (C channels; D′, H′, W′ for depth, height, width)
    2. Apply dense Conv3D on the sparse tensor ⇒ better voxel features (ordinary dense Conv3D is used; the sparse tensor serves only as a storage format)
    3. In practice this is done on two representations, 3D and bird's eye view; see Tables 1 & 2. The bird's-eye-view variant follows [5] and amounts to a volume grid.
  3. Region proposal network (RPN)
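
The VFE layer described in step 1 can be sketched as follows — a minimal single-voxel NumPy version, with random weights and BatchNorm omitted (the function name `vfe_layer` is my own):

```python
import numpy as np

def vfe_layer(points, W, b):
    """One Voxel Feature Encoding (VFE) layer, sketched in NumPy.

    points: (T, c_in) point-wise features for one voxel
    W, b:   weights of the shared per-point FCN (c_in -> c_out // 2)
    Returns (T, c_out) point-wise features: each point's transformed
    feature concatenated with the max-pooled voxel aggregate.
    """
    # Shared fully connected layer + ReLU (BatchNorm omitted in this sketch)
    pointwise = np.maximum(points @ W + b, 0.0)        # (T, c_out/2), "color"
    # Element-wise max pooling over the points in the voxel ("gray" feature)
    aggregated = pointwise.max(axis=0, keepdims=True)  # (1, c_out/2)
    # Tile the voxel-level feature and concatenate with each point's feature
    return np.concatenate(
        [pointwise, np.repeat(aggregated, len(points), axis=0)], axis=1
    )

rng = np.random.default_rng(0)
pts = rng.normal(size=(35, 7))          # T=35 points, 7-dim input encoding
W1, b1 = rng.normal(size=(7, 16)), np.zeros(16)
out = vfe_layer(pts, W1, b1)
print(out.shape)  # (35, 32)
```

Stacking several such layers just feeds the (T, c_out) output back in as the next layer's input; only the final layer's max-pooled vector is kept as the voxel feature.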

Efficient implementation before sparse tensor


Voxel input feature buffer: K is the maximum number of non-empty voxels, T is the maximum number of points per voxel, and 7 is the input encoding dimension for each point: (x, y, z, r) plus the offsets (x−cx, y−cy, z−cz) to the centroid of the points in the voxel.

This layout is convenient both for the per-voxel PointNet and for the subsequent sparse tensor construction.

Voxel coordinate buffer: stores the voxel coordinate of each non-empty voxel, so the learned voxel features can be scattered back into the sparse tensor.
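
A minimal sketch of how these two buffers could be built (hypothetical helper `build_voxel_buffers`; the paper samples T points at random, this sketch deterministically keeps the first T):

```python
import numpy as np

def build_voxel_buffers(points, voxel_size, K, T):
    """Sketch of the voxel input feature buffer and voxel coordinate buffer.

    points: (N, 4) raw LiDAR points (x, y, z, reflectance)
    Returns:
      features: (K, T, 7) -- up to T points per kept voxel, each encoded as
                (x, y, z, r, x-cx, y-cy, z-cz), with (cx, cy, cz) the
                centroid of the kept points in that voxel
      coords:   (K, 3) integer voxel coordinates of each kept voxel
    """
    voxels = {}
    for p in points:
        key = tuple(np.floor(p[:3] / voxel_size).astype(int))
        buf = voxels.setdefault(key, [])
        if len(buf) < T:              # cap at T points per voxel
            buf.append(p)
    features = np.zeros((K, T, 7), dtype=np.float32)
    coords = np.zeros((K, 3), dtype=np.int64)
    # cap at K non-empty voxels (extra voxels are dropped in this sketch)
    for k, (key, pts) in enumerate(list(voxels.items())[:K]):
        pts = np.stack(pts)                         # (t, 4), t <= T
        centroid = pts[:, :3].mean(axis=0)
        features[k, :len(pts), :4] = pts            # (x, y, z, r)
        features[k, :len(pts), 4:] = pts[:, :3] - centroid  # centroid offsets
        coords[k] = key
    return features, coords

rng = np.random.default_rng(1)
raw = rng.uniform(0, 2, size=(100, 4))
feats, coords = build_voxel_buffers(raw, voxel_size=0.5, K=64, T=35)
print(feats.shape, coords.shape)  # (64, 35, 7) (64, 3)
```

Empty slots stay zero-padded, which is what makes the buffer directly usable as a batched input to the per-voxel PointNet and easy to scatter into the sparse tensor via the coordinate buffer.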

Experiments


  1. BV: bird view; FV: front view.
  2. hand-crafted baseline (HC-baseline): VoxelNet architecture but using hand-crafted features
    1. Purpose: to show that end-to-end feature learning outperforms hand-crafted features.
    2. HC-baseline uses the bird’s eye view features described in [5]
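
For reference, bird's-eye-view encodings in the spirit of [5] compute per-cell statistics such as height, density, and intensity maps. A toy sketch (hypothetical helper `bev_handcrafted`, simplified to three channels and assuming z ≥ 0; [5] uses multiple height slices and a normalized density):

```python
import numpy as np

def bev_handcrafted(points, grid=(8, 8), cell=0.5):
    """Toy hand-crafted bird's-eye-view features: one (3, H, W) map with,
    per cell, the max height, raw point count, and reflectance of the
    highest point. points: (N, 4) LiDAR points (x, y, z, reflectance).
    Assumes x, y, z >= 0 for simplicity.
    """
    H, W = grid
    out = np.zeros((3, H, W), dtype=np.float32)
    ix = np.clip((points[:, 0] / cell).astype(int), 0, W - 1)
    iy = np.clip((points[:, 1] / cell).astype(int), 0, H - 1)
    for (x, y, z, r), i, j in zip(points, ix, iy):
        out[1, j, i] += 1.0              # density (raw count in this sketch)
        if z > out[0, j, i]:
            out[0, j, i] = z             # max height
            out[2, j, i] = r             # reflectance of the highest point
    return out

pts = np.array([[0.1, 0.1, 1.0, 0.5],
                [0.2, 0.2, 2.0, 0.9],
                [1.6, 0.1, 0.3, 0.2]])
bev = bev_handcrafted(pts)
print(bev[1, 0, 0])  # 2.0 -- two points fell into cell (0, 0)
```

Such maps are fixed, per-cell summaries; the comparison against HC-baseline is exactly about replacing them with learned VFE features.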

References

Zhou, Y. and Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. CVPR 2018.