Accurate motion estimation is a core element of controllable video generation and robotics,
as evidenced by works like
Motion-I2V,
Motion-Prompting, Gen2Act, and RoboTap.
Developing self-supervised techniques that scale primarily by training on more videos is essential for progress in these application domains.
Heuristics and synthetic data are inherently limited.
Heuristics like the popular photometric consistency loss
[1,
2]
are noisy proxies for motion in the dynamic real world and hold only in a limited range of settings.
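To make the limitation concrete, the sketch below shows a standard brightness-constancy formulation of such a loss; the tensor shapes and forward-flow warping convention are our own illustrative assumptions, not the exact losses of [1, 2]. The objective is only meaningful where a pixel stays visible and keeps its appearance, which fails under occlusion, specular reflections, and lighting changes.

```python
import torch
import torch.nn.functional as F

def photometric_loss(frame1, frame2, flow):
    """Generic photometric consistency sketch: warp frame2 back to frame1 with the
    estimated forward flow and penalize the residual. Assumes brightness constancy,
    which breaks under occlusion, specularities, and lighting changes."""
    B, _, H, W = frame1.shape
    # Build a sampling grid: pixel coordinates shifted by the predicted flow.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(flow.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                             # (B, 2, H, W)
    # Normalize coordinates to [-1, 1] for grid_sample.
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)       # (B, H, W, 2)
    warped = F.grid_sample(frame2, sample_grid, align_corners=True)
    return (frame1 - warped).abs().mean()
```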
Synthetic data rendering systems are imperfect
[3,
4],
and beyond photorealism, modeling the complex dynamics of objects like cloth accurately and efficiently remains an open problem.
No human-labeled datasets of sufficient scale exist.
The recently proposed
Counterfactual World Modeling (CWM) paradigm
is an approach for self-supervised learning and extraction of perceptual properties such as
optical flow, object segments, and depth maps.
At the core of counterfactual world modeling is a base model in the form of a
sparse RGB-conditioned next-frame predictor
\( \boldsymbol{\Psi}^{\texttt{RGB}} \) -- a two-frame masked autoencoder.
Useful visual structures like motion, occlusion, depth, and object segments can be extracted from
\( \boldsymbol{\Psi}^{\texttt{RGB}} \) using simple procedures we refer to as
counterfactual probes or programs. Counterfactual programs work by performing interventions on the
predictor's inputs, such as placing a visual marker, and analyzing the predictor's response.
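As a concrete illustration, a hand-designed probe for flow might stamp a salient marker (e.g. a green square) at a query pixel in the first frame, let \( \boldsymbol{\Psi}^{\texttt{RGB}} \) complete the second frame, and read off where the marker reappears. The sketch below is a hypothetical rendering of this procedure; `psi_rgb`, the patch size, and the readout are placeholder choices, not the exact probe used in the CWM papers.

```python
import torch

def counterfactual_flow_probe(psi_rgb, frame1, frame2_sparse, query_xy, patch_size=4):
    """Hand-designed counterfactual probe (illustrative sketch, not the paper's API).

    psi_rgb       : frozen two-frame masked autoencoder; predicts frame 2 from
                    frame 1 plus a few visible frame-2 patches.
    frame1        : (3, H, W) first frame.
    frame2_sparse : sparsely revealed second frame fed to the predictor.
    query_xy      : (x, y) pixel whose motion we want to estimate.
    """
    x, y = query_xy
    clean = psi_rgb(frame1, frame2_sparse)               # counterfactual baseline

    # Intervention: stamp a salient marker (e.g. a green square) at the query point.
    perturbed1 = frame1.clone()
    marker = frame1.new_tensor([0.0, 1.0, 0.0]).view(3, 1, 1)
    perturbed1[:, y:y + patch_size, x:x + patch_size] = marker
    perturbed = psi_rgb(perturbed1, frame2_sparse)        # predictor propagates the marker

    # Readout: the marker should reappear wherever the tracked surface moved to.
    response = (perturbed - clean).abs().sum(dim=0)       # (H, W) response map
    dy, dx = divmod(int(response.argmax()), response.shape[1])
    return dx - x, dy - y                                 # estimated flow at the query pixel
```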
Fixed hand-designed perturbations illustrate the conceptual promise of CWMs, but they do not enable motion extraction
with state-of-the-art accuracy. In this work, we identify that such markers are out of domain
for the predictor and are not tailored to the appearance of the region we wish to track, leading to unreliable reconstructions and spurious predictions.
Instead, since hand-designed perturbations like green squares are suboptimal, we take the intuitive next step and predict perturbations with a parameterized function, learned without any explicit supervision.
The challenge is how to learn this function's parameters in a principled yet unsupervised manner,
without resorting to labeled data or heuristics like the photometric loss.
Our solution uses a bootstrapping technique that generalizes the underlying asymmetric sparse prediction problem from RGB to flow. Specifically, the parameterized function \(\texttt{FLOW}_\theta \) extracts flow from a frozen \( \boldsymbol{\Psi}^{\texttt{RGB}} \). These putative flows are then used as input to bootstrap a sparse flow-conditioned predictor \(\boldsymbol{\Psi}^{\texttt{flow}}_\eta\) that -- just like the base model itself -- solves a next-frame prediction problem. The parameters \(\theta\) and \(\eta\) are trained jointly. This creates an information bottleneck that ensures \(\texttt{FLOW}_\theta \) generates good perturbations.
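A minimal sketch of one joint optimization step is given below, under our own naming assumptions: `flow_net` stands in for \(\texttt{FLOW}_\theta \), `psi_flow` for \(\boldsymbol{\Psi}^{\texttt{flow}}_\eta\), and `sparsify` is a placeholder that keeps only a few flow patches. It is meant to convey the information flow and the bottleneck, not to reproduce the exact architectures or losses.

```python
import torch

def joint_training_step(psi_rgb, flow_net, psi_flow, sparsify, optimizer, frame1, frame2):
    """One step of the bootstrapped objective (illustrative sketch).

    psi_rgb   : frozen sparse RGB-conditioned next-frame predictor (base CWM model).
    flow_net  : FLOW_theta, extracts putative flow from psi_rgb via learned perturbations.
    psi_flow  : Psi^flow_eta, predicts frame 2 from frame 1 plus a *sparse* flow field.
    sparsify  : keeps only a small fraction of the flow field (the bottleneck).
    optimizer : over the parameters of flow_net (theta) and psi_flow (eta) only.
    """
    # 1. Extract putative flow from the frozen base predictor; gradients reach theta.
    flow = flow_net(psi_rgb, frame1, frame2)          # (B, 2, H, W)

    # 2. Keep only a sparse subset of the flow: the information bottleneck.
    sparse_flow, mask = sparsify(flow)

    # 3. Ask the flow-conditioned predictor to reconstruct the next frame.
    pred_frame2 = psi_flow(frame1, sparse_flow, mask)

    # 4. Next-frame prediction loss trains theta and eta jointly; good perturbations
    #    are those whose extracted flow lets psi_flow explain the future frame.
    loss = (pred_frame2 - frame2).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because \(\boldsymbol{\Psi}^{\texttt{flow}}_\eta\) sees only a sparse subset of the extracted flow, the next-frame loss can be low only when that flow faithfully summarizes the scene motion, which is what pushes \(\texttt{FLOW}_\theta \) toward good perturbations.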
Using our proposed parameterization and learning algorithm to optimize the perturbation appearance, we achieve state-of-the-art self-supervised point tracking on TAP-Vid and are competitive with supervised techniques; see our paper for the complete quantitative evaluation. The approach of using an additional predictor to create an information bottleneck and optimize counterfactuals is itself generic, and extending it to higher-level structures like segments and shape is an exciting direction for future work.