ProTracker: Probabilistic Integration for Robust and Accurate Point Tracking

Tingyang Zhang1,2, Chen Wang1, Zhiyang Dou1,3, Qingzhe Gao4,
Jiahui Lei1, Baoquan Chen2, Lingjie Liu1

1University of Pennsylvania   2Peking University
3The University of Hong Kong   4Shandong University

Paper · Code

Our Results

Sparse trajectories produced by ProTracker. Our method robustly generates accurate and smooth trajectories.


 


Abstract

In this paper, we propose ProTracker, a novel framework for robust and accurate long-term dense tracking of arbitrary points in videos. The key to our method is incorporating probabilistic integration to refine multiple predictions from both optical flow and semantic features for robust short-term and long-term tracking. Specifically, we integrate optical flow estimations in a probabilistic manner, producing smooth and accurate trajectories by maximizing the likelihood of each prediction. To effectively re-localize challenging points that disappear and reappear due to occlusion, we further incorporate long-term feature correspondence into our flow predictions for continuous trajectory generation. Extensive experiments show that ProTracker achieves state-of-the-art performance among unsupervised and self-supervised approaches, and even outperforms supervised methods on several benchmarks.

 


Pipeline


Pipeline overview of our proposed method. (1) Sample & Chain: Keypoints are initially sampled and linked through optical flow chaining to produce preliminary trajectory predictions. (2) Long-term Correspondence: Keypoints are re-localized over longer time spans to maintain continuity, even for points that temporarily disappear. (3) Dual-Stage Filter: Masks and feature filters are applied to remove incorrect predictions, reducing noise for subsequent steps. (4) Probabilistic Integration: Filtered flow predictions across frames are first integrated and then combined with long-term keypoints to produce the final prediction, yielding smoother and more consistent trajectories.
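As a rough illustration of step (4), the sketch below fuses several flow-chained predictions for one query point by inverse-variance (maximum-likelihood) weighting under an independent-Gaussian assumption, then combines the result with a long-term keypoint estimate in the same way. The function and variable names (`fuse_gaussian`, `flow_preds`, `kp_pred`) are our own placeholders rather than the released API.

```python
import numpy as np

def fuse_gaussian(means, sigmas):
    """Maximum-likelihood fusion of independent 2D Gaussian predictions.

    means  : (N, 2) predicted positions for one query point
    sigmas : (N,)   per-prediction standard deviations (uncertainty)
    Returns the fused position and its standard deviation.
    """
    w = 1.0 / np.square(sigmas)             # inverse-variance weights
    fused_mean = (w[:, None] * means).sum(0) / w.sum()
    fused_sigma = np.sqrt(1.0 / w.sum())    # uncertainty of the fused estimate
    return fused_mean, fused_sigma

# Hypothetical predictions for one point in one target frame:
flow_preds  = np.array([[101.2, 54.8], [100.7, 55.3], [102.0, 54.1]])  # chained flow
flow_sigmas = np.array([1.5, 1.0, 2.5])
kp_pred     = np.array([100.5, 55.0])       # long-term correspondence
kp_sigma    = 2.0

# Step 4a: integrate the (already filtered) flow predictions.
flow_mean, flow_sigma = fuse_gaussian(flow_preds, flow_sigmas)
# Step 4b: combine with the long-term keypoint for the final prediction.
final_mean, final_sigma = fuse_gaussian(
    np.stack([flow_mean, kp_pred]), np.array([flow_sigma, kp_sigma]))
print(final_mean, final_sigma)
```

Because the fused variance shrinks as more predictions agree, this kind of integration yields both a smoother trajectory and an explicit confidence for each tracked position.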

 


TAP-Vid-DAVIS Comparisons

Qualitative comparisons to DINO-Tracker [1], CaDex++ [2] and LocoTrack [3] on TAP-Vid-DAVIS [7].
Our method is able to capture finer details and recover the full trajectory of less distinctive points.



Qualitative comparisons to TAPTR [4], SpaTrack [5] and Co-Tracker [6] on TAP-Vid-DAVIS [7].
While these sliding-window-based trackers are prone to drift and vulnerable to occlusions, our method reliably maintains accurate tracking of the same point.



 


Comparisons on Challenging Videos

To further illustrate our method's robustness, we conduct experiments on challenging videos from the web. Some previous methods rely on computing a heatmap between the query point and the target frame. However, the per-frame heatmap lacks temporal awareness and may confuse different objects. We address this issue by leveraging masks and combining the heatmap with optical flow. Comparing our results with DINO-Tracker [1] and TAPIR [8], we show that although our method also relies on a per-frame heatmap to extract keypoints, it has strong temporal awareness and is able to distinguish between similar objects.
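To make the issue concrete, the sketch below contrasts taking a raw per-frame heatmap argmax with restricting the argmax to an object mask, which keeps a per-frame correspondence from jumping to a look-alike object elsewhere in the frame. The shapes, the mask source, and the function name `heatmap_argmax` are illustrative assumptions, not the exact interfaces used in ProTracker.

```python
import numpy as np

def heatmap_argmax(heatmap, mask=None):
    """Return the (x, y) peak of a similarity heatmap, optionally restricted
    to an object mask so a look-alike object elsewhere in the frame cannot win."""
    scores = np.where(mask, heatmap, -np.inf) if mask is not None else heatmap
    y, x = np.unravel_index(np.argmax(scores), scores.shape)
    return np.array([x, y], dtype=float)

H, W = 128, 128
heatmap = np.random.rand(H, W)              # per-frame query-to-frame similarity
heatmap[30, 40] = 5.0                        # peak on the true object
heatmap[90, 100] = 6.0                       # stronger peak on a similar-looking distractor

object_mask = np.zeros((H, W), dtype=bool)   # e.g. from a video segmentation model
object_mask[20:50, 30:60] = True             # region containing the tracked object

print(heatmap_argmax(heatmap))               # jumps to the distractor: [100., 90.]
print(heatmap_argmax(heatmap, object_mask))  # stays on the object:     [40., 30.]
```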




To further demonstrate the robustness of our method, we conduct experiments on extended videos from TAP-Vid-DAVIS, simulating high frame-rate videos by repeating each frame three times. In contrast to typical sliding-window or flow-based trackers (such as TAPTR [4], SpatialTracker [5], and Co-Tracker [6]), which tend to accumulate errors and drift over time, our integration of long-term keypoints with short-term optical flow enables continuous, drift-free tracking of the same point through occlusions. Experiments are conducted at full resolution.
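The extended-video setup can be reproduced with a simple frame-repetition step; a minimal sketch assuming the video is already decoded into a list of frames (the variable names are placeholders):

```python
import numpy as np

# Dummy stand-in for a decoded video clip: a list of H x W x 3 frames.
frames = [np.zeros((256, 256, 3), dtype=np.uint8) for _ in range(10)]

# Repeat each frame three times to simulate a higher frame rate, so trackers
# must propagate through three times as many steps per real frame.
extended_frames = [f for f in frames for _ in range(3)]
print(len(frames), len(extended_frames))  # 10 30
```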



 


Ablations

We conduct an ablation study on different components of our method. w/o key indicates directly using the results from the flow integration as output, without the joint integration with long-term keypoints. w/o geo removes the filtering by the geometry-aware feature. w/o mask uses the rough flow prediction without object-level filtering. w/o pro replaces the probabilistic integration with simply choosing the prediction with the lowest σ as the final result (the difference between the two is sketched in the code below). We visualize the results on libby, parkour, horsejump-high, shooting, and car-roundabout, respectively.
The results show that without long-term keypoints, the method cannot re-locate some points when they reappear after occlusion (e.g. libby, parkour);
without the geometry-aware feature, the method may drift to other parts (e.g. car-roundabout, shooting);
without masks, the method may confuse different objects (e.g. parkour, shooting);
without probabilistic integration, the method is less accurate (e.g. car-roundabout, horsejump-high).
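To make the w/o pro ablation concrete, the snippet below contrasts picking the single prediction with the lowest σ against the inverse-variance (maximum-likelihood) fusion used in the full model. This is a minimal sketch under an independent-Gaussian assumption; the arrays and names are illustrative placeholders, not the released implementation.

```python
import numpy as np

# Hypothetical per-source predictions for one point (e.g. several chained flows).
preds  = np.array([[200.4, 88.1], [201.0, 87.5], [199.2, 88.9]])
sigmas = np.array([1.2, 0.9, 1.4])   # per-prediction uncertainty

# "w/o pro": keep only the single most confident prediction, discard the rest.
lowest_sigma_pred = preds[np.argmin(sigmas)]

# Full model: inverse-variance (maximum-likelihood) fusion of all predictions,
# which averages out the noise of individual sources instead of discarding them.
w = 1.0 / np.square(sigmas)
fused_pred = (w[:, None] * preds).sum(axis=0) / w.sum()

print(lowest_sigma_pred, fused_pred)
```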




 

Acknowledgements

We would like to thank the authors of DINO-Tracker and the authors of TAPTR for sharing their evaluation data on TAP-Vid-Kinetics. Our code is mainly built upon DINO-Tracker and MFT, and this webpage template is built upon DINO-Tracker. We thank the authors for their brilliant work.


 

Bibtex

  

@article{zhang2024protracker,
  title={ProTracker: Probabilistic Integration for Robust and Accurate Point Tracking},
  author={Tingyang Zhang and Chen Wang and Zhiyang Dou and Qingzhe Gao and Jiahui Lei and Baoquan Chen and Lingjie Liu},
  journal={arXiv preprint arXiv:2501.03220},
  year={2025}
}


 

References

[1] Narek Tumanyan, Assaf Singer, Shai Bagon, and Tali Dekel. DINO-Tracker: Taming DINO for self-supervised point tracking in a single video. ECCV, 2024
[2] Yunzhou Song, Jiahui Lei, Ziyun Wang, Lingjie Liu, and Kostas Daniilidis. Track everything everywhere fast and robustly. ECCV, 2024
[3] Seokju Cho, Jiahui Huang, Jisu Nam, Honggyu An, Seungryong Kim, and Joon-Young Lee. Local all-pair correspondence for point tracking. ECCV, 2024
[4] Hongyang Li, Hao Zhang, Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, and Lei Zhang. TAPTR: Tracking any point with transformers as detection. ECCV, 2024
[5] Yuxi Xiao, Qianqian Wang, Shangzhan Zhang, Nan Xue, Sida Peng, Yujun Shen, and Xiaowei Zhou. SpatialTracker: Tracking any 2D pixels in 3D space. CVPR, 2024
[6] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Co-Tracker: It is better to track together. arXiv, 2023
[7] Carl Doersch, Ankush Gupta, Larisa Markeeva, Adria Recasens, Lucas Smaira, Yusuf Aytar, Joao Carreira, Andrew Zisserman, and Yi Yang. TAP-Vid: A benchmark for tracking any point in a video. NeurIPS, 2022
[8] Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. TAPIR: Tracking any point with per-frame initialization and temporal refinement. ICCV, 2023