https://github.com/BraveGroup/FullySparseFusion

Abstract

  1. no feature fusion
    1. point features are used for image instances.
  2. no box fusion
  3. fuses the two modalities at the instance level
    1. LiDAR instances are not derived from image instances; the two branches are almost separate, except for the instance-level feature interaction (self-attention) before box prediction.
  4. How is the conflict between the two kinds of boxes handled, e.g. the case in Fig 2?
    1. Resolved implicitly?
  5. Could image features help 3D box regression, or do they only help instance identification?
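The one cross-modal interaction noted above (instance-level self-attention over the pooled LiDAR and camera instance features, before box prediction) can be sketched minimally. Everything here is an assumption for illustration: the feature dimension, instance counts, single-head attention, and the residual connection are not specified in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                       # instance feature dim (assumed)
m_L, m_C = 5, 3              # LiDAR / camera instance counts (toy values)
feats = np.concatenate([rng.normal(size=(m_L, d)),    # LiDAR instance features
                        rng.normal(size=(m_C, d))])   # camera instance features

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention over all m = m_L + m_C instances:
    the only point where the two branches exchange information."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)          # row-wise softmax
    return x + w @ v                           # residual keeps per-instance identity

Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
out = self_attention(feats, Wq, Wk, Wv)
print(out.shape)  # → (8, 16): m fused instance features, one per instance
```

Each fused row then goes to a box-prediction head; because the interaction happens on pooled instance features rather than dense feature maps, the pipeline stays fully sparse.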

1 Introduction

1.1 Why sparse fusion?

Fig 2

2 Method

2.1 Bi-modal Instance Generation


  1. Camera instances Pj, |{Pj}| = m^C
    1. instance masks {Mj} are predicted by a 2D instance segmentor [63]
    2. the 3D point cloud P is projected onto the 2D image plane via the camera matrix to obtain 2D points U
    3. Uj: the 2D points that fall inside mask Mj
    4. may contain background noise
  2. LiDAR instances Fi, |{Fi}| = m^L
    1. generated by Connected Components Labeling (CCL), as in FSD V1 (NeurIPS 2022)
    2. may miss some foreground
  3. note: there are m = m^C + m^L instances in total.
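A minimal numpy sketch of the camera-instance step above: project 3D points with a pinhole camera matrix, then gather the points whose pixel lands inside one mask Mj. The intrinsics K, the box-shaped mask, and the random point cloud are all made up for illustration; real code would use the dataset's calibration and a predicted segmentation mask.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy LiDAR point cloud P, shifted so all points lie in front of the camera.
P = rng.uniform(-10, 10, size=(1000, 3)) + np.array([0.0, 0.0, 15.0])

# Hypothetical pinhole intrinsics (fx, fy, cx, cy).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Project: u = K p / z, keeping only points with positive depth.
front = P[:, 2] > 0
uvw = P[front] @ K.T
U = uvw[:, :2] / uvw[:, 2:3]          # (N, 2) pixel coordinates

# Toy instance mask Mj (a box region standing in for a predicted 2D mask).
H, W = 480, 640
M_j = np.zeros((H, W), dtype=bool)
M_j[100:300, 200:400] = True

# Gather Uj: projected points whose pixel falls inside Mj.
u, v = U[:, 0].astype(int), U[:, 1].astype(int)
inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
in_mask = np.zeros(len(U), dtype=bool)
in_mask[inside] = M_j[v[inside], u[inside]]
P_j = P[front][in_mask]               # 3D points of camera instance j
print(P_j.shape[0], "points fall in mask Mj")
```

This also makes the noted failure mode concrete: any background point whose projection happens to land inside Mj is gathered too, since the mask carries no depth information.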

2.2 Bi-modal Instance-based Prediction
