1 Voxelization and Feature Encoding

1.1 Hard vs Dynamic Voxelization

  1. VoxelNet (CVPR18) assigns N points to a buffer of size K × T × F, where
    1. K is the maximum number of voxels,
    2. T is the fixed point capacity, i.e. the maximum number of points in a voxel,
    3. F is the feature dimension.
  2. It formulates voxelization as a two-stage process: grouping and sampling.
    1. grouping: points {pi} are assigned to voxels {vj} based on their spatial coordinates.
    2. sampling: a fixed number T of points is sub-sampled from each voxel.
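
The grouping and sampling stages above can be sketched in NumPy as follows. This is a minimal illustration, not VoxelNet's actual implementation: the function name `hard_voxelize`, the `voxel_size` parameter, the random sub-sampling, and the rule for which voxels are kept when more than K are occupied (here, the first K in sorted order) are all assumptions made for the example.

```python
import numpy as np

def hard_voxelize(points, voxel_size, K, T, seed=0):
    """Fill a fixed (K, T, F) buffer from an (N, F) point array (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    F = points.shape[1]

    # Grouping: map each point to integer voxel coordinates from its xyz columns.
    coords = np.floor(points[:, :3] / voxel_size).astype(np.int64)
    voxel_coords, point_to_voxel = np.unique(coords, axis=0, return_inverse=True)
    point_to_voxel = point_to_voxel.reshape(-1)

    # Sampling: keep at most K voxels and at most T points per voxel.
    num_kept = min(K, len(voxel_coords))
    buffer = np.zeros((K, T, F), dtype=points.dtype)
    counts = np.zeros(K, dtype=np.int64)
    for k in range(num_kept):
        idx = np.flatnonzero(point_to_voxel == k)
        if len(idx) > T:
            idx = rng.choice(idx, size=T, replace=False)  # extra points are dropped
        buffer[k, :len(idx)] = points[idx]
        counts[k] = len(idx)
    return buffer, voxel_coords[:num_kept], counts
```

Unused slots stay zero-padded, so the buffer always holds K × T × F values no matter how many points the scene actually contains.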

Hard voxelization (15F memory usage) vs dynamic voxelization (13F memory usage): hard voxelization drops one point in v1 and misses v2 entirely, whereas dynamic voxelization keeps every point.
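
The memory figures in the caption can be reproduced with a small count. The sketch below assumes a toy scene of 13 points spread over four voxels (6, 4, 2 and 1 points) with K = 3 and T = 5. These numbers are assumptions chosen to be consistent with the 15F and 13F figures, and `dynamic_voxelize` is a hypothetical helper name. Dynamic voxelization is modeled here as grouping without the sampling stage, so it stores only the points plus a point-to-voxel index.

```python
import numpy as np

def dynamic_voxelize(points, voxel_size):
    """Return voxel coordinates and a point-to-voxel index; no point is dropped."""
    coords = np.floor(points[:, :3] / voxel_size).astype(np.int64)
    voxel_coords, point_to_voxel = np.unique(coords, axis=0, return_inverse=True)
    return voxel_coords, point_to_voxel.reshape(-1)

# Toy scene (assumed): four voxels holding 6, 4, 2 and 1 points, with F = 3.
per_voxel = [6, 4, 2, 1]
points = np.concatenate(
    [np.full((n, 3), i + 0.5) for i, n in enumerate(per_voxel)]
)

K, T = 3, 5                      # assumed hard-voxelization limits
N, F = points.shape

voxel_coords, point_to_voxel = dynamic_voxelize(points, voxel_size=1.0)
points_per_voxel = np.bincount(point_to_voxel, minlength=len(voxel_coords))

hard_memory = K * T * F          # fixed buffer, zero-padded where unused
dynamic_memory = N * F           # one row per point, nothing padded

# Points that survive hard voxelization if only the first K voxels are kept.
hard_kept = int(np.minimum(points_per_voxel[:K], T).sum())

print(hard_memory // F, dynamic_memory // F)   # memory in multiples of F
print(N - hard_kept, "points lost by hard voxelization")
```

Which voxel is missed under hard voxelization depends on the order in which voxels are selected; keeping the first K is only a convenience for this example.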

1.2 Feature Encoding

We only care about the bird's-eye view.

  1. Drawbacks of the bird's-eye view
    1. The point cloud becomes highly sparse at longer ranges.
  2. The perspective view represents the LiDAR range image densely and has a corresponding tiling of the scene in the spherical coordinate system.
    1. Shortcoming of the perspective view: object shapes are not distance-invariant, and objects can overlap heavily with each other in a cluttered scene.
  3. Therefore, it is desirable to use the complementary information from both views.
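
To make the two views concrete, the sketch below assigns points to bird's-eye-view cells in Cartesian coordinates and to perspective-view cells in spherical coordinates. The function names, the cell sizes, and the angle convention (azimuth from `arctan2(y, x)`, inclination measured from the x-y plane) are assumptions for illustration and may differ from the convention used in the referenced paper.

```python
import numpy as np

def bev_cells(xyz, cell_size=0.5):
    """Bird's-eye view: tile the x-y plane with square cells."""
    return np.floor(xyz[:, :2] / cell_size).astype(np.int64)

def perspective_cells(xyz, az_res=np.deg2rad(1.0), inc_res=np.deg2rad(1.0)):
    """Perspective view: tile the azimuth and inclination angles."""
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    azimuth = np.arctan2(y, x)
    inclination = np.arctan2(z, np.hypot(x, y))
    angles = np.stack([azimuth, inclination], axis=1)
    return np.floor(angles / np.array([az_res, inc_res])).astype(np.int64)

def span(cells, axis):
    """Number of cell steps between the extreme cells along one axis."""
    return int(cells[:, axis].max() - cells[:, axis].min())

# Left and right edges of a 2 m wide object at 10 m and at 40 m range.
near = np.array([[10.0, -1.0, 0.0], [10.0, 1.0, 0.0]])
far = np.array([[40.0, -1.0, 0.0], [40.0, 1.0, 0.0]])

print(span(bev_cells(near), 1), span(bev_cells(far), 1))
print(span(perspective_cells(near), 0), span(perspective_cells(far), 0))
```

In this toy case the object covers the same number of bird's-eye-view cells at both ranges, while its azimuth extent in the perspective view shrinks with distance, which is the distance-dependence noted in the list above.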

dense convolution

References

End-to-end multi-view fusion for 3d object detection in lidar point clouds, PMLR20