1 Introduction
The goal is to continuously estimate the position and orientation of the object,
even in the presence of occlusions, camera motion, and changing lighting conditions.

1.1 Two approaches
- Separate trackers — tracking by detection: an object detector runs first, and its outputs are then associated frame by frame.
- Joint trackers — detection and 3D tracking are performed jointly by feeding two consecutive images (or point clouds) to a single deep learning model.
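The separate-tracker (tracking-by-detection) idea can be sketched with a greedy IoU association step; this is a minimal illustration, not a method from the text, and the helper names are made up for the example:

```python
# Minimal tracking-by-detection sketch (illustrative): detections from the
# current frame are greedily associated to existing tracks by the IoU of
# their axis-aligned boxes (x1, y1, x2, y2).

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections, iou_thresh=0.3):
    """Greedy IoU matching: returns {track_index: detection_index}."""
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True,
    )
    matches, used_t, used_d = {}, set(), set()
    for score, ti, di in pairs:
        if score < iou_thresh:
            break
        if ti not in used_t and di not in used_d:
            matches[ti] = di
            used_t.add(ti)
            used_d.add(di)
    return matches
```

In practice the track boxes are usually first propagated by a motion model (e.g. a Kalman filter) before matching, and unmatched detections spawn new tracks.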
1.2 Three related tasks
- Feature tracking
- Multi-object tracking (2D or 3D)
- Optical flow
1.3 Two settings
- Online tracking — causal: only past and current frames are available.
- Auto-labeling — offline: future frames can also be used to refine trajectories.
1.4 Two paradigms for multi-object tracking
Matching-based vs. motion-based methods:
- Matching-based
  - Extract template and search-proposal features in the same embedding space, then predict the target state by measuring feature similarity.
  - Siamese paradigm: takes as input the target template cropped from the previous frame and a search area in the current frame.
- Motion-based
  - Explicitly model the relative motion between the template and the search point cloud.
  - Motion cues act as a reference, enhancing current features with past features for prediction.
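The core of the matching-based paradigm — scoring proposals by feature similarity to the template — can be sketched with cosine similarity in a shared embedding space. The embeddings below are random stand-ins for real network features:

```python
import numpy as np

# Matching-based sketch: the template and each search proposal are embedded
# in the same feature space; the proposal most similar to the template is
# predicted as the target. Embeddings here are synthetic stand-ins.

def cosine_similarity(template, proposals):
    """Cosine similarity between one template vector and N proposal vectors."""
    t = template / np.linalg.norm(template)
    p = proposals / np.linalg.norm(proposals, axis=1, keepdims=True)
    return p @ t

rng = np.random.default_rng(0)
template = rng.normal(size=8)
proposals = rng.normal(size=(5, 8))
proposals[3] = template + 0.05 * rng.normal(size=8)  # plant a near-duplicate

scores = cosine_similarity(template, proposals)
best = int(np.argmax(scores))  # index of the proposal predicted as the target
```

A real Siamese tracker replaces the random vectors with learned features and typically regresses the target box from the best-matching region rather than just picking an index.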
1.5 2D tracking pipeline
Given detections at two consecutive timesteps...
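One common instantiation of this step (an assumption, since the text leaves the pipeline unfinished) builds a pairwise IoU cost matrix between the two detection sets and solves the assignment optimally with the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Sketch of a common 2D association step: cost = 1 - IoU between detections
# at time t-1 and time t, solved with the Hungarian algorithm.

def iou_matrix(boxes_a, boxes_b):
    """Pairwise IoU between two sets of (x1, y1, x2, y2) boxes."""
    ious = np.zeros((len(boxes_a), len(boxes_b)))
    for i, a in enumerate(boxes_a):
        for j, b in enumerate(boxes_b):
            x1, y1 = max(a[0], b[0]), max(a[1], b[1])
            x2, y2 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            union = ((a[2] - a[0]) * (a[3] - a[1])
                     + (b[2] - b[0]) * (b[3] - b[1]) - inter)
            ious[i, j] = inter / union
    return ious

prev = [(0, 0, 10, 10), (50, 50, 60, 60)]   # detections at t-1
curr = [(51, 51, 61, 61), (1, 0, 11, 10)]   # detections at t
cost = 1.0 - iou_matrix(prev, curr)
rows, cols = linear_sum_assignment(cost)     # optimal one-to-one assignment
matches = list(zip(rows.tolist(), cols.tolist()))
```

Unlike the greedy variant, the Hungarian solver minimizes the total cost over all pairs; a gating threshold on IoU is usually applied afterwards to reject low-overlap matches.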