Have a look at CMT (Cross Modal Transformer):
https://github.com/junjie18/CMT
- without explicit view transformation
- generates queries from randomly distributed 3D reference points


(b) TransFusion first generates queries from the high-response regions of the LiDAR features (actually from the fused LiDAR-camera BEV). The object queries then interact with the point-cloud features and the image features separately. (c) In CMT, the object queries interact with the multi-modal features simultaneously; position encoding (PE) is added to the multi-modal features for alignment.
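The (c) setup can be sketched as a single cross-attention over the concatenation of both token sets, with each modality's PE added before attending. This is a toy single-head sketch, not CMT's actual implementation; all shapes and the `cross_attend` helper are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, img_tokens, img_pe, pc_tokens, pc_pe):
    """Single-head cross-attention sketch: object queries attend to the
    concatenation of image and LiDAR tokens; PE is added for alignment."""
    tokens = np.concatenate([img_tokens + img_pe, pc_tokens + pc_pe], axis=0)
    scores = queries @ tokens.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ tokens

rng = np.random.default_rng(0)
d = 16
q = rng.standard_normal((5, d))  # 5 object queries (toy count)
out = cross_attend(q,
                   rng.standard_normal((100, d)), rng.standard_normal((100, d)),
                   rng.standard_normal((64, d)),  rng.standard_normal((64, d)))
print(out.shape)  # (5, 16): one updated feature per query
```

The point of the sketch: unlike TransFusion's two separate interactions, there is one attention over one fused token sequence.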

- Image, LiDAR features ⇒ tokens
- Position-guided Query Generator
- initialize the queries with n randomly placed 3D anchor points A
- project A into each modality and encode the corresponding point sets with the coordinates encoding module (CEM)
- Image PE: an MLP over a set of 3D positions (sample d points along the camera ray of each pixel)
- Point-cloud PE: an MLP over a set of 3D positions (sample h points along the height dimension of each LiDAR BEV pixel)
- queries are used to interact with the multi-modal tokens in Transformer Decoder
- (note) The write-up's ordering is confusing; it makes it sound as if the tokens were determined by the anchor projections.
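The position-guided query generation above can be sketched as: sample random anchors, project them into each modality's coordinate frame, and run a small MLP (standing in for CEM) over the coordinates, summing the per-modality encodings into the query PE. The projections here are toy stand-ins (dropping a coordinate), not real camera/BEV projections; all weights and dimensions are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    # Tiny 2-layer ReLU MLP standing in for the coordinates encoding module (CEM).
    return np.maximum(x @ w1 + b1, 0) @ w2 + b2

n, d = 4, 8
anchors = rng.uniform(0, 1, size=(n, 3))  # n random 3D anchor points in [0, 1]^3

# Hypothetical projections: to LiDAR BEV (drop z) and to the image plane (drop x).
bev_pts = anchors[:, :2]
img_pts = anchors[:, 1:]

# Random CEM weights, for illustration only.
w1, b1 = rng.standard_normal((2, d)), np.zeros(d)
w2, b2 = rng.standard_normal((d, d)), np.zeros(d)

# Query positional embedding = sum of per-modality coordinate encodings.
q_pe = mlp(bev_pts, w1, b1, w2, b2) + mlp(img_pts, w1, b1, w2, b2)
print(q_pe.shape)  # (4, 8): one PE vector per anchor-initialized query
```

Note the tokens themselves come from the image/LiDAR backbones; only the query PE is derived from the anchor projections.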
Experiments
Masked-Modal Training for Robustness
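The masked-modal idea can be sketched as randomly zeroing out one modality's tokens during training so the detector learns to work from the remaining modality alone. The function name and the drop probability `p` are assumptions, not CMT's exact recipe.

```python
import numpy as np

def masked_modal_dropout(img_tokens, pc_tokens, p=0.25, rng=None):
    """With probability p, zero out one randomly chosen modality's tokens.
    A sketch of masked-modal training for robustness; p=0.25 is an assumed value."""
    rng = rng or np.random.default_rng()
    if rng.random() < p:
        if rng.random() < 0.5:
            img_tokens = np.zeros_like(img_tokens)  # simulate missing cameras
        else:
            pc_tokens = np.zeros_like(pc_tokens)    # simulate missing LiDAR
    return img_tokens, pc_tokens

img = np.ones((100, 16))
pc = np.ones((64, 16))
img_m, pc_m = masked_modal_dropout(img, pc, p=1.0, rng=np.random.default_rng(0))
```

With `p=1.0` exactly one modality is always masked, which makes the behavior easy to unit-test; at inference time no masking is applied.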
