
Abstract
Accurate depth estimation from monocular videos remains challenging due to ambiguities inherent in single-view geometry, as crucial depth cues like stereopsis are absent. However, humans often perceive relative depth intuitively by observing variations in the size and spacing of objects as they move. Inspired by this, we propose a novel method that infers relative depth by examining the spatial relationships and temporal evolution of a set of tracked 2D trajectories. Specifically, we use off-the-shelf point tracking models to capture 2D trajectories. Then, our approach employs spatial and temporal transformers to process these trajectories and directly infer depth changes over time. Evaluated on the TAPVid-3D benchmark, our method demonstrates robust zero-shot performance, generalizing effectively from synthetic to real-world datasets. Results indicate that our approach achieves temporally smooth, high-accuracy depth predictions across diverse domains.
Motivation
Can you tell whether the object is moving closer or farther away just by looking at the trajectory?
By looking only at the tracked points (right), we can easily perceive that the object (here, a car) is moving away.
If we can, so can our model!
Method

- Use an off-the-shelf point tracker to extract 2D trajectories of query points and a dense supporting grid.
- Process these trajectories with a temporal transformer and a spatial transformer in two separate branches.
- Inject motion information encoded by the supporting branch into the query branch via cross-attention.
- Finally, use two regression heads to output the ratio depths of both the supporting and query trajectories (see the sketch after this list).
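A minimal PyTorch sketch of this two-branch design follows. Everything here is an illustrative assumption rather than the released implementation: the module name `TwoBranchDepthHead`, the token dimension, the layer counts, and the reading of "ratio depth" as per-frame depth relative to the first frame are placeholders chosen only to make the pipeline concrete.

```python
import torch
import torch.nn as nn

class TwoBranchDepthHead(nn.Module):
    """Hypothetical sketch: temporal + spatial transformers over 2D tracks,
    cross-attention from the supporting branch into the query branch, and
    per-trajectory regression heads. Names and dimensions are assumptions,
    not the authors' released code."""
    def __init__(self, dim=256, heads=8, layers=4):
        super().__init__()
        self.embed = nn.Linear(2, dim)  # one (x, y) position per frame -> token
        def enc():
            layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=layers)
        self.temporal_q, self.spatial_q = enc(), enc()  # query branch
        self.temporal_s, self.spatial_s = enc(), enc()  # supporting branch
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Ratio depth is assumed here to mean depth relative to the first frame.
        self.head_q = nn.Linear(dim, 1)  # per-frame ratio depth, query tracks
        self.head_s = nn.Linear(dim, 1)  # per-frame ratio depth, support grid

    def forward(self, query_tracks, support_tracks):
        # query_tracks: (B, Nq, T, 2); support_tracks: (B, Ns, T, 2)
        B, Nq, T, _ = query_tracks.shape
        Ns = support_tracks.shape[1]
        q = self.embed(query_tracks)    # (B, Nq, T, D)
        s = self.embed(support_tracks)  # (B, Ns, T, D)

        # Temporal attention: each track attends across time.
        q = self.temporal_q(q.flatten(0, 1)).unflatten(0, (B, Nq))
        s = self.temporal_s(s.flatten(0, 1)).unflatten(0, (B, Ns))

        # Spatial attention: tracks attend to each other within each frame.
        q = self.spatial_q(q.transpose(1, 2).flatten(0, 1)).unflatten(0, (B, T)).transpose(1, 2)
        s = self.spatial_s(s.transpose(1, 2).flatten(0, 1)).unflatten(0, (B, T)).transpose(1, 2)

        # Inject supporting-branch motion into the query branch, per frame.
        qf = q.transpose(1, 2).flatten(0, 1)  # (B*T, Nq, D)
        sf = s.transpose(1, 2).flatten(0, 1)  # (B*T, Ns, D)
        qf, _ = self.cross(qf, sf, sf)
        q = qf.unflatten(0, (B, T)).transpose(1, 2)

        return self.head_q(q).squeeze(-1), self.head_s(s).squeeze(-1)
```

In use, the input trajectories would come from an off-the-shelf tracker such as CoTracker or LocoTrack; any model that outputs per-frame (x, y) positions for a set of points fits this interface.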
Visualization on the DAVIS Dataset
In this visualization, the depths of the tracks after the first frame are predicted solely from their 2D trajectories, without using any RGB images or pre-trained depth models.
Performance

Quantitative results of affine-invariant video depth on the TAPVid-3D minival split
- Our model significantly surpasses both strong video depth models and point-tracker baselines.
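For context on the metric: affine-invariant depth evaluation typically aligns predictions to ground truth with a per-video least-squares scale and shift before computing error. A minimal NumPy sketch follows; the helper name `affine_invariant_abs_rel` and the choice of absolute relative error are assumptions, and the benchmark's exact protocol may differ.

```python
import numpy as np

def affine_invariant_abs_rel(pred, gt, valid):
    """Hypothetical sketch: fit one scale/shift (least squares) aligning
    pred to gt over valid entries, then report absolute relative error.
    pred, gt: depth arrays of equal shape; valid: boolean mask.
    Assumes ground-truth depths are positive."""
    p, g = pred[valid], gt[valid]
    # Closed-form least squares for s, t minimizing ||s * p + t - g||^2.
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    aligned = s * pred + t
    return np.mean(np.abs(aligned[valid] - g) / g)
```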
Reference
- TAPVid-3D is a benchmark for evaluating 3D point tracking, consisting of the Aria, DriveTrack, and Panoptic Studio subsets.
- LocoTrack is a lightweight and precise point tracker built on local 4D correlation.
- CoTracker is a strong point tracker that tracks points jointly.
- DepthCrafter and ChronoDepth are video depth estimators with diffusion priors.
Citation
Acknowledgements
The website template was borrowed from Michaël Gharbi.