https://github.com/Ghostish/Open3DSOT
Appearance vs motion

Because of the Siamese paradigm, previous methods have to transform the target template from the world coordinate system into its own object coordinate system. This transformation breaks the motion connection between consecutive frames.


M_{t−1,t} ∈ R^4 is the target motion between frames t−1 and t.

Segment the target points from their surroundings.
Similar to [45], [57], we construct a spatial-temporal point cloud P_{t−1,t} ∈ R^{(N_{t−1}+N_t)×4} from P_{t−1} and P_t by adding a temporal channel to each point and then merging them together.
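A minimal sketch of this construction, assuming each frame is an (N, 3) array of xyz coordinates. The function name `build_spatial_temporal_cloud` and the temporal stamp values (0 for frame t−1, 1 for frame t) are illustrative assumptions, not taken from the paper or the Open3DSOT code.

```python
import numpy as np


def build_spatial_temporal_cloud(prev_points, curr_points,
                                 prev_stamp=0.0, curr_stamp=1.0):
    """Append a temporal channel to each frame, then stack the two frames.

    prev_points: (N_prev, 3) xyz coordinates of frame t-1.
    curr_points: (N_curr, 3) xyz coordinates of frame t.
    Returns an array of shape (N_prev + N_curr, 4).
    The stamp values are assumptions chosen only to tell the frames apart.
    """
    prev_points = np.asarray(prev_points, dtype=np.float32)
    curr_points = np.asarray(curr_points, dtype=np.float32)

    # One extra column per frame holding that frame's temporal stamp.
    prev_t = np.full((prev_points.shape[0], 1), prev_stamp, dtype=np.float32)
    curr_t = np.full((curr_points.shape[0], 1), curr_stamp, dtype=np.float32)

    # Merge: rows of frame t-1 first, then rows of frame t.
    return np.concatenate(
        [np.hstack([prev_points, prev_t]), np.hstack([curr_points, curr_t])],
        axis=0,
    )


rng = np.random.default_rng(0)
prev_frame = rng.normal(size=(5, 3))   # N_{t-1} = 5
curr_frame = rng.normal(size=(7, 3))   # N_t = 7
merged = build_spatial_temporal_cloud(prev_frame, curr_frame)
print(merged.shape)  # (12, 4)
```

Keeping the frame identity as a fourth channel lets a single point-based network see both frames at once while still knowing which frame each point came from.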
A prior-targetness map S_{t−1,t} ∈ R^{(N_{t−1}+N_t)×4}.

PointNet ⇒ spatial-temporal target point cloud \tilde{P}_{t−1,t} ∈ R^{(M_{t−1}+M_t)×4}, where M_t is the number of target points in frame t.
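The selection step can be illustrated as below. This is only a sketch: the per-point foreground scores stand in for the output of the PointNet segmentation, which is not reproduced here, and the function name `select_target_points`, the 0.5 threshold, and the 0/1 temporal stamps are all assumptions made for the example.

```python
import numpy as np


def select_target_points(st_cloud, foreground_scores, threshold=0.5,
                         prev_stamp=0.0, curr_stamp=1.0):
    """Keep the points whose (hypothetical) foreground score exceeds threshold.

    st_cloud: (N_prev + N_curr, 4) array, columns are x, y, z, temporal stamp.
    foreground_scores: (N_prev + N_curr,) scores in [0, 1]; a placeholder for
        the segmentation network's per-point output.
    Returns the target cloud of shape (M_prev + M_curr, 4), M_prev and M_curr.
    """
    st_cloud = np.asarray(st_cloud, dtype=np.float32)
    scores = np.asarray(foreground_scores, dtype=np.float32)

    keep = scores > threshold          # boolean mask over all points
    target = st_cloud[keep]            # rows keep their temporal channel

    # Count surviving points per frame using the temporal channel.
    m_prev = int(np.sum(target[:, 3] == prev_stamp))
    m_curr = int(np.sum(target[:, 3] == curr_stamp))
    return target, m_prev, m_curr


# Two points from frame t-1 (stamp 0) and two from frame t (stamp 1).
cloud = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [5.0, 5.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 1.0],
    [0.2, 0.1, 0.0, 1.0],
])
scores = np.array([0.9, 0.1, 0.8, 0.7])
target, m_prev, m_curr = select_target_points(cloud, scores)
print(target.shape, m_prev, m_curr)  # (3, 4) 1 2
```

Because M_{t−1} and M_t depend on the segmentation result, the size of the target cloud varies from one frame pair to the next.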