vt (t indicates the iteration time step).
Pt: one object query corresponds to a set of PoIs derived from the center and corner points
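
A minimal sketch of how such a PoI set could be derived from one box, assuming a box parameterized as (cx, cy, cz, l, w, h, yaw) with yaw about the z-axis. The function name `box_to_pois`, the box layout, and the point ordering are all assumptions for illustration, not taken from the paper, and the paper's actual PoI set may differ.

```python
import numpy as np

def box_to_pois(box):
    """Return 9 points (center + 8 corners) for a box (cx, cy, cz, l, w, h, yaw).

    The box layout and point ordering are hypothetical, chosen for illustration.
    """
    cx, cy, cz, l, w, h, yaw = box
    # All sign combinations of the half-extents, shape (8, 3).
    signs = np.array(
        [[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
        dtype=float,
    )
    local = signs * np.array([l, w, h], dtype=float) / 2.0
    # Rotate about the z-axis by yaw, then translate to the box center.
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    center = np.array([cx, cy, cz], dtype=float)
    corners = local @ rot.T + center
    # Row 0 is the center, rows 1..8 are the corners.
    return np.vstack([center, corners])

pois = box_to_pois((0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0))
print(pois.shape)
```

Each of these 3D points could then be projected into the image and LiDAR feature spaces to sample features per point.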
Features are not sampled directly from the box. Why?
PoIFusion: Multi-Modal 3D Object Detection via Fusion at Points of Interest, 24