Introduction
- Several methods project point clouds into a perspective view and apply image-based feature extraction techniques [28, 15, 22].
- Other approaches rasterize point clouds into a 3D voxel grid and encode each voxel with handcrafted features [41, 9, 37, 38, 21, 5]; see [5] (MV, CVPR 2017).
- A major breakthrough in recognition [20] and detection [13] tasks on images was due to moving from hand-crafted features to machine-learned features.
- PointNet (CVPR 2017) and PointNet++ usually operate on ∼1k points.
- LiDAR ⇒ ∼100k points; training on them directly ⇒ high computational and memory requirements.
Method

- Feature learning network ⇒ initial voxel feature from points inside each voxel.
- Grouping: group the points according to the voxel they reside in. Note that the points are non-uniformly distributed across voxels.
- Random sampling: sample at most T points from each voxel:
- (1) computational savings;
- (2) it decreases the point-count imbalance between voxels, which reduces sampling bias and adds more variation to training.
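The grouping and random-sampling steps can be sketched in NumPy. This is a toy illustration, not the paper's implementation: the dict-based grouping, the single scalar `voxel_size`, and the function name are assumptions.

```python
import numpy as np

def group_and_sample(points, voxel_size, T, rng=None):
    """Group points by the voxel they fall into, then keep at most T
    points per voxel via uniform random sampling (a toy sketch)."""
    rng = rng or np.random.default_rng(0)
    voxel_idx = np.floor(points / voxel_size).astype(np.int64)
    groups = {}
    # grouping: bucket point indices by voxel coordinate
    for i, key in enumerate(map(tuple, voxel_idx)):
        groups.setdefault(key, []).append(i)
    # random sampling: cap each voxel at T points
    sampled = {}
    for key, idxs in groups.items():
        idxs = np.asarray(idxs)
        if len(idxs) > T:
            idxs = rng.choice(idxs, size=T, replace=False)
        sampled[key] = points[idxs]
    return sampled
```

This runs once per scan before voxel feature encoding; voxels with fewer than T points are kept unchanged, which is where the imbalance reduction comes from.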
- VFE Layer-1: essentially a PointNet applied to the points inside a voxel, run in parallel over all non-empty voxels.
- gray: MaxPooled feature of the points inside a voxel.
- color: point wise feature.
- stacking several VFE layers ⇒ allows learning complex features that characterize local 3D shape information.
- Still, this only exploits inter-point interactions within a voxel.
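A single VFE layer can be sketched as a per-point linear map, a voxel-wise max-pool, and a concatenation of the pooled vector back onto every point. This is a minimal NumPy sketch: batch norm is omitted, and `W`, `b` are hypothetical layer parameters.

```python
import numpy as np

def vfe_layer(points_feat, W, b):
    """One VFE layer sketched in NumPy (batch norm omitted).

    points_feat: (T, C_in) features of the points in one voxel.
    W, b: hypothetical per-point linear parameters, C_in -> C_out // 2.
    """
    pw = np.maximum(points_feat @ W + b, 0.0)        # point-wise FC + ReLU
    pooled = pw.max(axis=0, keepdims=True)           # voxel-wise max-pool ("gray")
    pooled = np.repeat(pooled, pw.shape[0], axis=0)  # broadcast back to points
    return np.concatenate([pw, pooled], axis=1)      # point-wise ("color") + aggregated
```

Because the output is again per-point, these layers stack directly; the final layer's max-pooled vector is the voxel feature.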
- Convolutional middle layers ⇒ refine voxel features by aggregating context across voxels.
- input: Sparse Tensor Representation
- ∼100k points, but more than 90% of voxels typically are empty
- so non-empty voxel features are represented as a sparse tensor of shape C × D′ × H′ × W′ (C channels; D′, H′, W′ for depth, height, width)
- Apply dense Conv3D on the sparse tensor ⇒ better voxel features (regular dense Conv3D is used; the sparse tensor serves only as a storage format).
- In practice this is done on either the 3D or the bird's-eye-view representation; see Tables 1 & 2. The bird's-eye-view variant follows [5] and is equivalent to a volume grid.
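Scattering the non-empty voxel features into a dense 4D tensor, so that ordinary Conv3D can be applied on top, can be sketched as follows; the `to_dense` name and the shapes are illustrative, not the paper's code.

```python
import numpy as np

def to_dense(voxel_feats, coords, grid_shape):
    """Scatter K non-empty voxel features into a dense (C, D', H', W')
    tensor; empty voxels stay zero, so ordinary dense Conv3D can follow."""
    K, C = voxel_feats.shape
    dense = np.zeros((C,) + tuple(grid_shape), dtype=voxel_feats.dtype)
    d, h, w = coords.T                 # integer voxel coordinates, one per voxel
    dense[:, d, h, w] = voxel_feats.T  # one fancy-indexed scatter
    return dense
```

Since typically over 90% of voxels are empty, the dense tensor is mostly zeros; the sparse (feature, coordinate) pair is only the compact storage form.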
- Region proposal network (RPN)
Efficient implementation (before the sparse tensor stage)

Voxel input feature buffer: a K × T × 7 tensor, where K is the maximum number of non-empty voxels, T is the maximum number of points per voxel, and 7 is the input encoding dimension of each point.
This layout is convenient both for the per-voxel PointNet and for building the subsequent sparse tensor.
Voxel coordinate buffer: records the voxel coordinate of each row in the feature buffer.
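Building both buffers in one pass over the points can be sketched as follows, assuming the paper's 7-dim point encoding (x, y, z, reflectance, plus each point's offset to its voxel centroid). The Python dict stands in for the hash table from voxel coordinate to buffer row; points beyond K voxels or T points per voxel are simply dropped, which is an assumption about the overflow policy.

```python
import numpy as np

def build_buffers(points, voxel_size, K, T):
    """One pass over (x, y, z, reflectance) points into a (K, T, 7)
    voxel input feature buffer and a (K, 3) voxel coordinate buffer."""
    feat_buf = np.zeros((K, T, 7), dtype=np.float32)
    coord_buf = np.zeros((K, 3), dtype=np.int64)
    counts = np.zeros(K, dtype=np.int64)
    table = {}  # hash table: voxel coordinate -> buffer row
    for p in points:
        key = tuple(np.floor(p[:3] / voxel_size).astype(np.int64))
        row = table.get(key)
        if row is None:
            if len(table) == K:   # voxel buffer full: drop the point
                continue
            row = len(table)
            table[key] = row
            coord_buf[row] = key
        t = counts[row]
        if t < T:                 # voxel full: ignore extra points
            feat_buf[row, t, :4] = p
            counts[row] += 1
    # second pass over voxels: fill dims 4:7 with offsets to the centroid
    for row, t in enumerate(counts):
        if t > 0:
            centroid = feat_buf[row, :t, :3].mean(axis=0)
            feat_buf[row, :t, 4:7] = feat_buf[row, :t, :3] - centroid
    return feat_buf, coord_buf, counts
```

The coordinate buffer is exactly what the sparse-tensor scatter step needs, which is why the two buffers are built together.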
Experiments

- BV: bird view; FV: front view.
- hand-crafted baseline (HC-baseline): the VoxelNet architecture but with hand-crafted input features
- used to test whether end-to-end feature learning outperforms hand-crafted features
- HC-baseline uses the bird’s eye view features described in [5]
References
VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection, CVPR 2018.