Have a look at CMT (Cross Modal Transformer):
https://github.com/junjie18/CMT
- without explicit view transformation
- generates queries from randomly distributed 3D reference points


(b) TransFusion first generates queries from the high-response regions of the LiDAR features (actually from the fused LiDAR-camera BEV). The object queries then interact with the point-cloud features and the image features separately. (c) In CMT, the object queries interact with the multi-modal features simultaneously; position encoding (PE) is added to the multi-modal features for alignment.
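The (c) setup can be sketched as a single cross-attention over the concatenation of both token sets, with each modality's PE added before attending. This is a toy single-head sketch, not CMT's actual implementation; all shapes and the `cross_attend` helper are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, img_tokens, img_pe, pc_tokens, pc_pe):
    """Single-head cross-attention sketch: object queries attend to the
    concatenation of image and LiDAR tokens; PE is added for alignment."""
    tokens = np.concatenate([img_tokens + img_pe, pc_tokens + pc_pe], axis=0)
    scores = queries @ tokens.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ tokens

rng = np.random.default_rng(0)
d = 16
q = rng.standard_normal((5, d))  # 5 object queries (toy count)
out = cross_attend(q,
                   rng.standard_normal((100, d)), rng.standard_normal((100, d)),
                   rng.standard_normal((64, d)),  rng.standard_normal((64, d)))
print(out.shape)  # (5, 16): one updated feature per query
```

The point of the sketch: unlike TransFusion's two separate interactions, there is one attention over one fused token sequence.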

- Image, LiDAR features ⇒ tokens
- Position-guided Query Generator
- initialize the queries with n randomly placed 3D anchor points A
- project A into each modality and encode the corresponding point sets with the coordinates encoding module (CEM)
- Image PE: an MLP over a set of 3D positions (sample d points along the camera ray of each pixel)
- Point-cloud PE: an MLP over a set of 3D positions (sample h points along the height dimension of each LiDAR BEV pixel)
- queries are used to interact with the multi-modal tokens in Transformer Decoder
- (note) The write-up's ordering is confusing; it makes it sound as if the tokens were determined by the anchor projections.
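The position-guided query generation above can be sketched as: sample random anchors, project them into each modality's coordinate frame, and run a small MLP (standing in for CEM) over the coordinates, summing the per-modality encodings into the query PE. The projections here are toy stand-ins (dropping a coordinate), not real camera/BEV projections; all weights and dimensions are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    # Tiny 2-layer ReLU MLP standing in for the coordinates encoding module (CEM).
    return np.maximum(x @ w1 + b1, 0) @ w2 + b2

n, d = 4, 8
anchors = rng.uniform(0, 1, size=(n, 3))  # n random 3D anchor points in [0, 1]^3

# Hypothetical projections: to LiDAR BEV (drop z) and to the image plane (drop x).
bev_pts = anchors[:, :2]
img_pts = anchors[:, 1:]

# Random CEM weights, for illustration only.
w1, b1 = rng.standard_normal((2, d)), np.zeros(d)
w2, b2 = rng.standard_normal((d, d)), np.zeros(d)

# Query positional embedding = sum of per-modality coordinate encodings.
q_pe = mlp(bev_pts, w1, b1, w2, b2) + mlp(img_pts, w1, b1, w2, b2)
print(q_pe.shape)  # (4, 8): one PE vector per anchor-initialized query
```

Note the tokens themselves come from the image/LiDAR backbones; only the query PE is derived from the anchor projections.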
Experiments
Masked-Modal Training for Robustness
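The masked-modal idea can be sketched as randomly zeroing out one modality's tokens during training so the detector learns to work from the remaining modality alone. The function name and the drop probability `p` are assumptions, not CMT's exact recipe.

```python
import numpy as np

def masked_modal_dropout(img_tokens, pc_tokens, p=0.25, rng=None):
    """With probability p, zero out one randomly chosen modality's tokens.
    A sketch of masked-modal training for robustness; p=0.25 is an assumed value."""
    rng = rng or np.random.default_rng()
    if rng.random() < p:
        if rng.random() < 0.5:
            img_tokens = np.zeros_like(img_tokens)  # simulate missing cameras
        else:
            pc_tokens = np.zeros_like(pc_tokens)    # simulate missing LiDAR
    return img_tokens, pc_tokens

img = np.ones((100, 16))
pc = np.ones((64, 16))
img_m, pc_m = masked_modal_dropout(img, pc, p=1.0, rng=np.random.default_rng(0))
```

With `p=1.0` exactly one modality is always masked, which makes the behavior easy to unit-test; at inference time no masking is applied.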
