1 Voxelization and Feature Encoding
1.1 Hard vs Dynamic Voxelization
- VoxelNet (CVPR 2018) assigns N points to a buffer of size K × T × F
- fixed point capacity T: the maximum number of points in a voxel
- K: the maximum number of voxels
- F: the feature dimension
- It formulates voxelization as a two-stage process: grouping and sampling.
- grouping: points {pi} are assigned to voxels {vj} based on their spatial coordinates
- sampling: sub-samples a fixed number of T points from each voxel
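The two-stage process above can be sketched in NumPy as follows. This is a minimal illustration, not VoxelNet's actual implementation: the function name and the buffer layout are my own, and for simplicity it keeps the first T points per voxel where VoxelNet samples randomly.

```python
import numpy as np

def hard_voxelize(points, voxel_size, K, T):
    """Two-stage hard voxelization: group points into voxels by spatial
    coordinates, then keep at most T points per voxel and at most K voxels."""
    # grouping: integer voxel coordinates for each point
    coords = np.floor(points[:, :3] / voxel_size).astype(np.int64)
    _, inverse = np.unique(coords, axis=0, return_inverse=True)

    # fixed K x T x F buffer, regardless of how many points actually exist
    buffer = np.zeros((K, T, points.shape[1]), dtype=points.dtype)
    counts = np.zeros(K, dtype=np.int64)
    voxel_rows = {}  # voxel id -> row in buffer
    for p, v in zip(points, inverse):
        if v not in voxel_rows:
            if len(voxel_rows) == K:   # voxel budget K exhausted: voxel is missed
                continue
            voxel_rows[v] = len(voxel_rows)
        row = voxel_rows[v]
        if counts[row] < T:            # capacity T reached: extra points are dropped
            buffer[row, counts[row]] = p
            counts[row] += 1
    return buffer, counts
```

Note how both failure modes of hard voxelization appear explicitly: points beyond capacity T are dropped, and voxels beyond budget K are missed entirely.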

In the paper's example, hard voxelization uses 15F memory (the full K × T × F buffer) versus 13F for dynamic voxelization (only the points that actually exist). Hard voxelization also drops one point in v1 (over the capacity T) and misses v2 entirely (over the voxel budget K).
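By contrast, dynamic voxelization keeps every point and only records the point-to-voxel assignment, so memory scales with the actual number of points (N × F) rather than the fixed K × T × F buffer. A minimal sketch (function name is my own):

```python
import numpy as np

def dynamic_voxelize(points, voxel_size):
    """Dynamic voxelization: no fixed buffer, no dropped points, no
    missed voxels; just a mapping from each point to its voxel."""
    coords = np.floor(points[:, :3] / voxel_size).astype(np.int64)
    voxels, point_to_voxel = np.unique(coords, axis=0, return_inverse=True)
    return voxels, point_to_voxel
```

Per-voxel features (e.g. a mean over points) can then be computed with a scatter over `point_to_voxel`, e.g. `np.add.at`, instead of indexing into a padded buffer.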
1.2 Feature Encoding

We only consider the bird's-eye view here.
- drawbacks of the bird's-eye view
- the point cloud becomes highly sparse at longer ranges
- the perspective view can represent the LiDAR range image densely, and has a corresponding tiling of the scene in the spherical coordinate system
- shortcomings of the perspective view: object shapes are not distance-invariant, and objects can overlap heavily with each other in a cluttered scene
- Therefore, it is desirable to utilize the complementary information from both views.
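The spherical tiling behind the perspective view is just a change of coordinates: each point maps to a range plus two angles, which index into the dense range image. A small sketch of that mapping (my own helper, assuming points are already in the sensor frame):

```python
import numpy as np

def cartesian_to_spherical(points):
    """Map LiDAR points (x, y, z) to (range, azimuth, inclination),
    the natural coordinates for a perspective (range-image) view."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)   # range: distance from the sensor
    azimuth = np.arctan2(y, x)        # horizontal angle
    inclination = np.arcsin(z / r)    # vertical angle
    return np.stack([r, azimuth, inclination], axis=1)
```

Binning azimuth and inclination into a fixed grid yields the dense range image; the distance-dependent shape distortion mentioned above comes from the fact that one angular bin covers more physical area at larger r.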

dense convolution
References
End-to-End Multi-View Fusion for 3D Object Detection in LiDAR Point Clouds, PMLR 2020