1 Voxelization and Feature Encoding
1.1 Hard vs Dynamic Voxelization
- VoxelNet (CVPR 2018) assigns N points to a buffer of size K × T × F
- fixed point capacity T: the maximum number of points in a voxel
- K: the maximum number of voxels
- F: the feature dimension
- It formulates voxelization as a two-stage process: grouping and sampling.
- grouping: points {pi} are assigned to voxels {vj} based on their spatial coordinates
- sampling: sub-samples a fixed number of T points from each voxel
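The two-stage process above can be sketched in NumPy as follows. This is a minimal illustration, not VoxelNet's actual implementation: the function name and the buffer layout are my own, and for simplicity it keeps the first T points per voxel where VoxelNet samples randomly.

```python
import numpy as np

def hard_voxelize(points, voxel_size, K, T):
    """Two-stage hard voxelization: group points into voxels by spatial
    coordinates, then keep at most T points per voxel and at most K voxels."""
    # grouping: integer voxel coordinates for each point
    coords = np.floor(points[:, :3] / voxel_size).astype(np.int64)
    _, inverse = np.unique(coords, axis=0, return_inverse=True)

    # fixed K x T x F buffer, regardless of how many points actually exist
    buffer = np.zeros((K, T, points.shape[1]), dtype=points.dtype)
    counts = np.zeros(K, dtype=np.int64)
    voxel_rows = {}  # voxel id -> row in buffer
    for p, v in zip(points, inverse):
        if v not in voxel_rows:
            if len(voxel_rows) == K:   # voxel budget K exhausted: voxel is missed
                continue
            voxel_rows[v] = len(voxel_rows)
        row = voxel_rows[v]
        if counts[row] < T:            # capacity T reached: extra points are dropped
            buffer[row, counts[row]] = p
            counts[row] += 1
    return buffer, counts
```

Note how both failure modes of hard voxelization appear explicitly: points beyond capacity T are dropped, and voxels beyond budget K are missed entirely.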

In the paper's example, hard voxelization uses 15F memory (the full K × T × F buffer) versus 13F for dynamic voxelization (only the points that actually exist). Hard voxelization also drops one point in v1 (over the capacity T) and misses v2 entirely (over the voxel budget K).
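By contrast, dynamic voxelization keeps every point and only records the point-to-voxel assignment, so memory scales with the actual number of points (N × F) rather than the fixed K × T × F buffer. A minimal sketch (function name is my own):

```python
import numpy as np

def dynamic_voxelize(points, voxel_size):
    """Dynamic voxelization: no fixed buffer, no dropped points, no
    missed voxels; just a mapping from each point to its voxel."""
    coords = np.floor(points[:, :3] / voxel_size).astype(np.int64)
    voxels, point_to_voxel = np.unique(coords, axis=0, return_inverse=True)
    return voxels, point_to_voxel
```

Per-voxel features (e.g. a mean over points) can then be computed with a scatter over `point_to_voxel`, e.g. `np.add.at`, instead of indexing into a padded buffer.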
1.2 Feature Encoding

We only consider the bird's-eye view here.
- drawbacks of the bird's-eye view
- the point cloud becomes highly sparse at longer ranges
- the perspective view can represent the LiDAR range image densely, and has a corresponding tiling of the scene in the spherical coordinate system
- shortcomings of the perspective view: object shapes are not distance-invariant, and objects can overlap heavily with each other in a cluttered scene
- Therefore, it is desirable to utilize the complementary information from both views.
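The spherical tiling behind the perspective view is just a change of coordinates: each point maps to a range plus two angles, which index into the dense range image. A small sketch of that mapping (my own helper, assuming points are already in the sensor frame):

```python
import numpy as np

def cartesian_to_spherical(points):
    """Map LiDAR points (x, y, z) to (range, azimuth, inclination),
    the natural coordinates for a perspective (range-image) view."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)   # range: distance from the sensor
    azimuth = np.arctan2(y, x)        # horizontal angle
    inclination = np.arcsin(z / r)    # vertical angle
    return np.stack([r, azimuth, inclination], axis=1)
```

Binning azimuth and inclination into a fixed grid yields the dense range image; the distance-dependent shape distortion mentioned above comes from the fact that one angular bin covers more physical area at larger r.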

dense convolution
References
End-to-End Multi-View Fusion for 3D Object Detection in LiDAR Point Clouds, PMLR 2020