TAPNext: Tracking Any Point (TAP) as Next Token Prediction
Abstract
Tracking Any Point (TAP) in a video is a challenging computer vision problem with many demonstrated applications in robotics, video editing, and 3D reconstruction. Existing methods for TAP rely heavily on complex tracking-specific inductive biases and heuristics, limiting their generality and potential for scaling. To address these challenges, we present TAPNext, a new approach that casts TAP as sequential masked token decoding. Our model is causal, tracks in a purely online fashion, and removes tracking-specific inductive biases. This enables TAPNext to run with minimal latency and removes the temporal windowing required by many existing state-of-the-art trackers. Despite its simplicity, TAPNext achieves new state-of-the-art tracking performance among both online and offline trackers. Finally, we present evidence that many widely used tracking heuristics emerge naturally in TAPNext through end-to-end training.
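To make the core idea concrete, below is a minimal sketch of how point tracking can be framed as per-frame token decoding with a causal, online model. The class name `TinyOnlinePointTracker`, the layer sizes, and the patch/mask-token design are illustrative assumptions for exposition only; they are not the TAPNext architecture described in the paper.

```python
# Illustrative sketch only: tracking as per-frame (masked) token decoding
# with a causal, online model. All names and shapes are assumptions, not
# the authors' implementation.
import torch
import torch.nn as nn


class TinyOnlinePointTracker(nn.Module):
    """Toy tracker: per frame, image tokens plus one token per query point
    are jointly encoded, and each point token is decoded into (x, y)."""

    def __init__(self, dim=128, patch=16, num_layers=2):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))  # "position unknown" token
        self.query_embed = nn.Linear(2, dim)       # embeds the initial (x, y) query
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.coord_head = nn.Linear(dim, 2)        # decodes a point token into (x, y)

    def forward(self, frame, point_state):
        # frame: (B, 3, H, W); point_state: (B, N, dim) carried across frames,
        # so the tracker is causal/online and never sees future frames.
        img_tokens = self.patch_embed(frame).flatten(2).transpose(1, 2)  # (B, P, dim)
        tokens = torch.cat([img_tokens, point_state], dim=1)
        tokens = self.encoder(tokens)
        new_state = tokens[:, img_tokens.shape[1]:]   # updated per-point tokens
        coords = self.coord_head(new_state)           # (B, N, 2) predictions for this frame
        return coords, new_state


# Usage: initialise point tokens from the query coordinates, then stream frames.
model = TinyOnlinePointTracker()
queries = torch.rand(1, 4, 2)                         # 4 query points in [0, 1]^2
state = model.query_embed(queries) + model.mask_token
for frame in torch.rand(8, 1, 3, 64, 64):             # 8 streamed frames
    coords, state = model(frame, state)               # one prediction per frame, online
```

The property this sketch shares with the approach described in the abstract is that per-point state is updated one frame at a time, so tracking proceeds online without temporal windowing.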
Community
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better (2025)
- Online Dense Point Tracking with Streaming Memory (2025)
- SPMTrack: Spatio-Temporal Parameter-Efficient Fine-Tuning with Mixture of Experts for Scalable Visual Tracking (2025)
- AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos (2025)
- UncTrack: Reliable Visual Object Tracking with Uncertainty-Aware Prototype Memory Network (2025)
- OVTR: End-to-End Open-Vocabulary Multiple Object Tracking with Transformer (2025)
- SplatVoxel: History-Aware Novel View Streaming without Temporal Training (2025)