Charles R. Qi1, Or Litany1, Kaiming He1, Leonidas J. Guibas1,2
1:Facebook AI Research, 2:Stanford University
Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. CVPR, 2017.
Background
2019: Current 3D object detection methods are heavily influenced by 2D detectors. In order to leverage architectures in 2D detectors, they often convert 3D point clouds to regular grids (i.e., to voxel grids or to bird’s eye view images), or rely on detection in 2D images to propose 3D boxes. Few works have attempted to directly detect objects in point clouds.
2024: ??

Figure 1. 3D object detection in point clouds with a deep Hough voting model.
- A major challenge: a 3D object centroid can be far from any surface point and is thus hard to regress accurately in one step.
- In images there often exists a pixel near the object center, but this is often not the case in point clouds.
- Point-based networks therefore have difficulty aggregating scene context in the vicinity of object centers.
- Simply increasing the receptive field does not solve the problem: as the network captures larger context, it also includes more nearby objects and clutter.
- By voting we essentially generate new points that lie close to object centers, which can be grouped and aggregated to generate box proposals.
Related work
- [Dense] [42, 12] extend 2D detection frameworks such as Faster/Mask R-CNN [37, 11] to 3D:
  - voxelize the irregular point clouds into regular 3D grids and apply 3D CNN detectors
  - fails to leverage the sparsity of the data ⇒ high computation cost
- [4, 55] project points to regular 2D bird's-eye-view images and then apply 2D detectors to localize objects. [Note: aren't these still 3D features and 3D detectors?]
  - sacrifices geometric details, which may be critical in cluttered indoor environments
- [20, 34] proposed a cascaded two-step pipeline: first detect objects in front-view images, then localize them in frustum point clouds extruded from the 2D boxes,
  - which, however, is strictly dependent on the 2D detector and will miss an object entirely if it is not detected in 2D.
Methods

N points ⇒ M seeds ⇒ M votes ⇒ K clusters ⇒ K boxes ⇒ K′ boxes via 3D NMS.
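The final step of the pipeline, 3D NMS, can be sketched as follows. This is a minimal greedy NMS over axis-aligned boxes — a simplifying assumption, since VoteNet predicts oriented boxes — using a hypothetical (cx, cy, cz, dx, dy, dz, score) box format:

```python
def iou_3d(a, b):
    """Intersection-over-union of two axis-aligned 3D boxes
    given as (cx, cy, cz, dx, dy, dz, score)."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i] - a[i + 3] / 2, b[i] - b[i + 3] / 2)
        hi = min(a[i] + a[i + 3] / 2, b[i] + b[i + 3] / 2)
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = a[3] * a[4] * a[5]
    vol_b = b[3] * b[4] * b[5]
    return inter / (vol_a + vol_b - inter)

def nms_3d(boxes, iou_thresh=0.25):
    """Greedy NMS: keep boxes in decreasing score order,
    dropping any box that overlaps an already-kept one."""
    order = sorted(boxes, key=lambda b: b[6], reverse=True)
    kept = []
    for box in order:
        if all(iou_3d(box, k) < iou_thresh for k in kept):
            kept.append(box)
    return kept
```

The 0.25 IoU threshold is an illustrative default, not a value from the paper.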
- Point cloud feature learning backbone: PointNet++
  - N points ⇒ M seeds
- Vote: a shared voting module (an MLP)
  - input: seed si = [xi; fi], with coordinate xi ∈ R^3 and feature fi ∈ R^C
  - output: vote vi = [yi; gi], where yi = xi + ∆xi and gi = fi + ∆fi
  - only seeds on an object surface are used
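The voting step above can be sketched as a shared map applied to each seed independently. Here a single random linear layer stands in for the paper's deeper MLP, and the feature dimension C = 4 is an assumed toy value:

```python
import random

C = 4  # toy feature dimension (assumption; the paper uses larger features)
random.seed(0)
# One shared weight matrix mapping a C-dim feature to 3 + C outputs:
# the first 3 are the spatial offset dx, the rest the feature residual df.
W = [[random.uniform(-0.1, 0.1) for _ in range(C)] for _ in range(3 + C)]

def vote(seed):
    """seed = (x, f): x is a 3-vector coordinate, f a C-vector feature.
    Returns the vote (y, g) with y = x + dx and g = f + df."""
    x, f = seed
    out = [sum(W[r][c] * f[c] for c in range(C)) for r in range(3 + C)]
    dx, df = out[:3], out[3:]
    y = [x[i] + dx[i] for i in range(3)]   # vote position: y = x + dx
    g = [f[i] + df[i] for i in range(C)]   # vote feature:  g = f + df
    return y, g
```

Because the module is shared, the same `W` is applied to every seed — the same weight-sharing the paper's voting MLP relies on.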
- Vote clustering through sampling and grouping
  - farthest point sampling: M votes ⇒ K clusters; sample a subset of K votes using farthest point sampling based on {yi} in 3D Euclidean space ⇒ cluster centers {vi_k}
  - grouping: form a cluster around each sampled vote by collecting the votes whose positions yi lie within a radius of it
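The sampling-and-grouping step can be sketched as follows — a pure-Python farthest point sampling plus radius grouping; the starting index and the radius value are illustrative assumptions:

```python
import math

def farthest_point_sampling(points, k):
    """Greedily pick k point indices, each maximizing the minimum
    distance to the points already chosen (start index is a convention)."""
    chosen = [0]
    d = [math.dist(points[0], p) for p in points]  # min dist to chosen set
    while len(chosen) < k:
        i = max(range(len(points)), key=lambda j: d[j])
        chosen.append(i)
        d = [min(d[j], math.dist(points[i], points[j]))
             for j in range(len(points))]
    return chosen

def group_by_radius(points, center_ids, radius):
    """For each sampled center, collect the indices of all points
    within `radius` of it — one cluster per center."""
    return [[j for j, p in enumerate(points)
             if math.dist(points[c], p) <= radius]
            for c in center_ids]
```

In VoteNet the same idea runs on the vote positions {yi}, so each cluster gathers votes that agree on a nearby object center.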
