Memory mechanism to utilize past information
Localization in a coarse-to-fine scheme using Box Priors given in the first frame.
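
To make the memory idea concrete, here is a minimal sketch, assuming a fixed-size FIFO memory that stores per-point features and targetness masks from past frames and reads them out with a plain softmax cross-attention. The class name `FrameMemory`, the `max_frames` parameter, and the attention readout are illustrative assumptions, not the architecture used in MBPTrack.

```python
from collections import deque

import numpy as np


class FrameMemory:
    """Hypothetical FIFO memory of past-frame point features and target masks."""

    def __init__(self, max_frames=3):
        # deque drops the oldest frame automatically once max_frames is reached
        self.frames = deque(maxlen=max_frames)

    def write(self, feats, mask):
        # feats: (N, C) point features; mask: (N,) targetness scores in [0, 1]
        self.frames.append((np.asarray(feats, dtype=float),
                            np.asarray(mask, dtype=float)))

    def read(self, query):
        # query: (Q, C) features of the current frame
        if not self.frames:
            raise ValueError("memory is empty")
        keys = np.concatenate([f for f, _ in self.frames], axis=0)  # (M, C)
        vals = np.concatenate([m for _, m in self.frames], axis=0)  # (M,)
        logits = query @ keys.T / np.sqrt(keys.shape[1])            # (Q, M)
        logits -= logits.max(axis=1, keepdims=True)                 # numerical stability
        weights = np.exp(logits)
        weights /= weights.sum(axis=1, keepdims=True)               # softmax over memory points
        # Each current point receives a weighted average of past targetness
        return weights @ vals                                       # (Q,)
```

Because the readout attends over every stored frame, target cues are no longer propagated from the latest frame alone.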

without cropping or sampling?

appearance of the target + spatial contextual information + motion-centric paradigm that explicitly models the target’s motion between two adjacent frames.
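
A minimal sketch of the motion-centric idea, assuming the box is parameterized as (x, y, z, yaw) and the predicted relative motion (dx, dy, dz, dyaw) is expressed in the previous box's own frame. The function name and this parameterization are assumptions for illustration, not taken from a specific tracker.

```python
import math


def apply_relative_motion(box, motion):
    """Update the previous box with a predicted frame-to-frame motion.

    box    = (x, y, z, yaw) in world coordinates
    motion = (dx, dy, dz, dyaw) in the previous box's frame (assumption)
    """
    x, y, z, yaw = box
    dx, dy, dz, dyaw = motion
    c, s = math.cos(yaw), math.sin(yaw)
    # Rotate the planar offset into world coordinates, then translate
    return (x + c * dx - s * dy,
            y + s * dx + c * dy,
            z + dz,
            yaw + dyaw)
```

For example, a box facing along +y (yaw = pi/2) that moves 1 m "forward" ends up 1 m further along the world y axis.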

Existing methods propagate target cues solely from the latest frame to the current frame, thereby neglecting rich information contained in other past frames. This limitation makes 3D SOT a challenging task, especially in cases of large appearance variation or target disappearance caused by occlusion.
1. Conversely, neglecting information in the latest frame (as in TAT [16], which relies on templates sampled from historical frames) could cause the network to miss lasting appearance changes, such as the gradual sparsification of point clouds as the tracked target moves further away.
Substantial differences in size and geometry across the various categories of tracked targets also pose challenges for 3D SOT, which previous works have overlooked. Existing localization heads follow two paradigms: point-based [36, 31, 23] and voxel-based [10].
1. With voxel-based heads such as V2B [10], tracked targets with simple shapes and large sizes, such as vehicles, fit well in voxels, leading to more precise localization than point-based heads such as X-RPN [31]. However, for categories such as pedestrians, which have complex geometries and small sizes, voxelization leads to considerable information loss, thereby degrading tracking performance. As mentioned in V2B [10], the choice of voxel size can significantly impact tracking performance.
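
The information loss from voxelization can be illustrated with a small sketch that counts how many distinct voxels a point set occupies. The object extents (a car-like box and a pedestrian-like box), the point count, and the uniform random sampling are made-up assumptions for illustration, not data from any benchmark.

```python
import numpy as np


def occupied_voxels(points, voxel_size):
    """Number of distinct voxels hit by an (N, 3) point set."""
    idx = np.floor(np.asarray(points) / voxel_size).astype(np.int64)
    return np.unique(idx, axis=0).shape[0]


rng = np.random.default_rng(0)
# Hypothetical extents in metres: (length, width, height)
car = rng.uniform([0, 0, 0], [4.5, 1.8, 1.5], size=(500, 3))
pedestrian = rng.uniform([0, 0, 0], [0.6, 0.6, 1.7], size=(500, 3))

for voxel_size in (0.1, 0.3):
    print(voxel_size,
          occupied_voxels(car, voxel_size),
          occupied_voxels(pedestrian, voxel_size))
```

With a 0.3 m voxel, the pedestrian-sized box can occupy at most 2 x 2 x 6 = 24 voxels, so hundreds of points collapse into a handful of cells, while the car-sized box still spans several hundred cells. This is one way to see why the voxel size matters more for small, geometrically complex targets.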

References

MBPTrack: Improving 3D Point Cloud Tracking with Memory Networks and Box Priors, 2023
TAT [16]: Temporal-aware Siamese tracker: Integrate temporal context for 3D object tracking, ACCV 2022
CXTrack [31]: CXTrack: Improving 3D point cloud tracking with contextual information, 2022