https://github.com/BraveGroup/FullySparseFusion
Abstract
- no feature fusion
- point features are used for image instances.
- no box fusion
- fuses two modalities at the instance level
- LiDAR instances are not derived from image instances; the 2 branches stay almost separate, interacting only through instance-level feature interaction (self-attention) before box prediction.
- How is the conflict between the 2 kinds of boxes handled? e.g., Fig 2?
- Done implicitly?
- Could image features help 3d box regression? or just help instance identification?
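The notes above say the two branches only interact through instance-level self-attention over the pooled camera and LiDAR instance features before box prediction. A minimal numpy sketch of that interaction, with identity projections and illustrative dimensions (none of the names are from the paper's code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def instance_self_attention(cam_feats, lidar_feats):
    """Mix camera and LiDAR instance features with one self-attention step.

    cam_feats: (m_C, d), lidar_feats: (m_L, d). Q/K/V projection weights are
    omitted (identity) to keep the sketch minimal.
    """
    x = np.concatenate([cam_feats, lidar_feats], axis=0)  # (m_C + m_L, d)
    d = x.shape[-1]
    attn = softmax(x @ x.T / np.sqrt(d))  # (m, m) cross-modal attention weights
    return x + attn @ x                   # residual, as in a transformer layer

rng = np.random.default_rng(0)
fused = instance_self_attention(rng.normal(size=(5, 16)),   # m_C = 5 camera instances
                                rng.normal(size=(7, 16)))   # m_L = 7 LiDAR instances
# fused.shape == (12, 16): one mixed feature per instance, both modalities
```

Each of the m_C + m_L instances attends to all others, so a camera instance can absorb LiDAR evidence (and vice versa) without any dense feature-map fusion.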
1 Introduction
1.1 Why sparse fusion?

Fig 2
2 Method
2.1 Bi-modal Instance Generation


- Camera instance Pj, |{Pj}|=m^C
- ⇒ {Mj} instance masks by [63]
- the 3D point cloud P is projected onto the 2D image plane via the camera matrix to obtain 2D points U
- Uj (2d points in Mj)
- may contain noisy background points
- LiDAR instance Fi, |{Fi}|=m^L
- by Connected Components Labeling (CCL), as in FSD v1 (NeurIPS 2022)
- may miss some foreground
- note: there are m = m^C + m^L instances in total.
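The camera branch above (project P, then collect the 2D points U_j inside each mask M_j) can be sketched as follows; the function names, and the split of the camera matrix into intrinsics K and a LiDAR-to-camera transform T, are my assumptions, not the paper's code:

```python
import numpy as np

def project_points(points_3d, K, T):
    """Project LiDAR points into the image plane.

    points_3d: (N, 3) in the LiDAR frame; T: (4, 4) LiDAR-to-camera transform;
    K: (3, 3) camera intrinsics. Returns (N, 2) pixel coordinates and a bool
    mask of points in front of the camera.
    """
    homog = np.concatenate([points_3d, np.ones((len(points_3d), 1))], axis=1)
    cam = (T @ homog.T).T[:, :3]      # points in the camera frame
    in_front = cam[:, 2] > 0
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]       # perspective divide
    return uv, in_front

def points_in_mask(uv, valid, mask):
    """Indices of projected points falling inside a binary instance mask M_j.

    As noted above, this set may still contain background points that happen
    to project inside the mask (e.g., at occlusion boundaries).
    """
    h, w = mask.shape
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    ok = valid & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    idx = np.where(ok)[0]
    return idx[mask[v[idx], u[idx]]]
```

Running `points_in_mask` once per predicted mask yields the m^C camera-generated instances, each a subset of the raw point cloud.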
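For the LiDAR branch, FSD v1's CCL groups points whose voxels touch into one instance. A simplified stand-in (voxel size and the 26-neighborhood connectivity are illustrative assumptions, not FSD's exact settings):

```python
from collections import deque

import numpy as np

def ccl_cluster(points, voxel=0.5):
    """Label points by connected components over occupied voxels.

    points: (N, 3). Points whose voxels are 26-connected share an instance id.
    Returns an (N,) int array of labels. A crude sketch of CCL, not FSD's code.
    """
    keys = np.floor(points / voxel).astype(int)
    vox = {}                                  # voxel key -> point indices
    for i, k in enumerate(map(tuple, keys)):
        vox.setdefault(k, []).append(i)
    labels = np.full(len(points), -1)
    nbrs = [(dx, dy, dz) for dx in (-1, 0, 1)
            for dy in (-1, 0, 1) for dz in (-1, 0, 1)]
    cur = 0
    for seed in vox:
        if labels[vox[seed][0]] != -1:        # voxel already labeled
            continue
        q = deque([seed])                     # BFS flood fill over voxels
        while q:
            k = q.popleft()
            if labels[vox[k][0]] != -1:
                continue
            for i in vox[k]:
                labels[i] = cur
            for d in nbrs:
                nk = (k[0] + d[0], k[1] + d[1], k[2] + d[2])
                if nk in vox and labels[vox[nk][0]] == -1:
                    q.append(nk)
        cur += 1
    return labels
```

Because CCL only connects nearby points, sparsely scanned or distant foreground can fail to form a component, which is exactly the "may miss some foreground" weakness the camera branch compensates for.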
2.2 Bi-modal Instance-based Prediction
