Accurate motion estimation is a core element of controllable video generation and robotics,
as evidenced by works like
Motion-I2V,
Motion-Prompting, Gen2Act, and RoboTap.
Developing self-supervised techniques that scale primarily by training on more videos is essential for progress in these application domains.
Heuristics and synthetic data are inherently limited.
Heuristics like the popular photometric consistency loss
[1,
2]
are noisy proxies for motion in the dynamic real world and hold only in a limited range of settings.
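To make the limitation concrete, the sketch below shows a standard brightness-constancy formulation of such a loss; the tensor shapes and forward-flow warping convention are our own illustrative assumptions, not the exact losses of [1, 2]. The objective is only meaningful where a pixel stays visible and keeps its appearance, which fails under occlusion, specular reflections, and lighting changes.

```python
import torch
import torch.nn.functional as F

def photometric_loss(frame1, frame2, flow):
    """Generic photometric consistency sketch: warp frame2 back to frame1 with the
    estimated forward flow and penalize the residual. Assumes brightness constancy,
    which breaks under occlusion, specularities, and lighting changes."""
    B, _, H, W = frame1.shape
    # Build a sampling grid: pixel coordinates shifted by the predicted flow.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(flow.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                             # (B, 2, H, W)
    # Normalize coordinates to [-1, 1] for grid_sample.
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)       # (B, H, W, 2)
    warped = F.grid_sample(frame2, sample_grid, align_corners=True)
    return (frame1 - warped).abs().mean()
```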
Synthetic data rendering systems are imperfect
[3,
4],
and beyond photorealism, modeling the complex dynamics of objects like cloth accurately and efficiently remains an open problem.
No human-labeled datasets of sufficient scale exist.
The recently proposed
Counterfactual World Modeling (CWM) paradigm
is an approach for self-supervised learning and extraction of perceptual properties such as
optical flow, object segments, and depth maps.
At the core of counterfactual world modeling is a base model in the form of a
sparse RGB-conditioned next-frame predictor
\( \boldsymbol{\Psi}^{\texttt{RGB}} \) -- a two-frame masked autoencoder.
Useful visual structures like motion, occlusion, depth, and object segments can be extracted from
\( \boldsymbol{\Psi}^{\texttt{RGB}} \) using simple procedures we refer to as
counterfactual probes or programs. Counterfactual programs work by performing interventions on the
predictor's inputs, such as placing a visual marker, and analyzing the predictor's response.
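As a concrete illustration, a hand-designed probe for flow might stamp a salient marker (e.g. a green square) at a query pixel in the first frame, let \( \boldsymbol{\Psi}^{\texttt{RGB}} \) complete the second frame, and read off where the marker reappears. The sketch below is a hypothetical rendering of this procedure; `psi_rgb`, the patch size, and the readout are placeholder choices, not the exact probe used in the CWM papers.

```python
import torch

def counterfactual_flow_probe(psi_rgb, frame1, frame2_sparse, query_xy, patch_size=4):
    """Hand-designed counterfactual probe (illustrative sketch, not the paper's API).

    psi_rgb       : frozen two-frame masked autoencoder; predicts frame 2 from
                    frame 1 plus a few visible frame-2 patches.
    frame1        : (3, H, W) first frame.
    frame2_sparse : sparsely revealed second frame fed to the predictor.
    query_xy      : (x, y) pixel whose motion we want to estimate.
    """
    x, y = query_xy
    clean = psi_rgb(frame1, frame2_sparse)               # counterfactual baseline

    # Intervention: stamp a salient marker (e.g. a green square) at the query point.
    perturbed1 = frame1.clone()
    marker = frame1.new_tensor([0.0, 1.0, 0.0]).view(3, 1, 1)
    perturbed1[:, y:y + patch_size, x:x + patch_size] = marker
    perturbed = psi_rgb(perturbed1, frame2_sparse)        # predictor propagates the marker

    # Readout: the marker should reappear wherever the tracked surface moved to.
    response = (perturbed - clean).abs().sum(dim=0)       # (H, W) response map
    dy, dx = divmod(int(response.argmax()), response.shape[1])
    return dx - x, dy - y                                 # estimated flow at the query pixel
```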
Fixed hand-designed perturbations illustrate the conceptual promise of CWMs, but they do not enable motion extraction
with state-of-the-art accuracy. In this work, we identify that such markers are out of domain
for the predictor and are not tailored to the appearance of the region we wish to track, leading to unreliable reconstructions and spurious predictions.
Instead, since hand-designed perturbations like green squares are suboptimal, we take the intuitive next step and predict perturbations with a parameterized function, learned without any explicit supervision.
The challenge is how to learn this function's parameters in a principled yet unsupervised manner,
without resorting to labeled data or heuristics like the photometric loss.
Our solution uses a bootstrapping technique that generalizes the underlying asymmetric sparse prediction problem from RGB to flow. Specifically, the parameterized function \(\texttt{FLOW}_\theta \) extracts flow from a frozen \( \boldsymbol{\Psi}^{\texttt{RGB}} \). These putative flows are then used as input to bootstrap a sparse flow-conditioned predictor \(\boldsymbol{\Psi}^{\texttt{flow}}_\eta\) that -- just like the base model itself -- solves a next-frame prediction problem. The parameters \(\theta\) and \(\eta\) are trained jointly. This creates an information bottleneck that ensures \(\texttt{FLOW}_\theta \) generates good perturbations.
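A minimal sketch of one joint optimization step is given below, under our own naming assumptions: `flow_net` stands in for \(\texttt{FLOW}_\theta \), `psi_flow` for \(\boldsymbol{\Psi}^{\texttt{flow}}_\eta\), and `sparsify` is a placeholder that keeps only a few flow patches. It is meant to convey the information flow and the bottleneck, not to reproduce the exact architectures or losses.

```python
import torch

def joint_training_step(psi_rgb, flow_net, psi_flow, sparsify, optimizer, frame1, frame2):
    """One step of the bootstrapped objective (illustrative sketch).

    psi_rgb   : frozen sparse RGB-conditioned next-frame predictor (base CWM model).
    flow_net  : FLOW_theta, extracts putative flow from psi_rgb via learned perturbations.
    psi_flow  : Psi^flow_eta, predicts frame 2 from frame 1 plus a *sparse* flow field.
    sparsify  : keeps only a small fraction of the flow field (the bottleneck).
    optimizer : over the parameters of flow_net (theta) and psi_flow (eta) only.
    """
    # 1. Extract putative flow from the frozen base predictor; gradients reach theta.
    flow = flow_net(psi_rgb, frame1, frame2)          # (B, 2, H, W)

    # 2. Keep only a sparse subset of the flow: the information bottleneck.
    sparse_flow, mask = sparsify(flow)

    # 3. Ask the flow-conditioned predictor to reconstruct the next frame.
    pred_frame2 = psi_flow(frame1, sparse_flow, mask)

    # 4. Next-frame prediction loss trains theta and eta jointly; good perturbations
    #    are those whose extracted flow lets psi_flow explain the future frame.
    loss = (pred_frame2 - frame2).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because \(\boldsymbol{\Psi}^{\texttt{flow}}_\eta\) sees only a sparse subset of the extracted flow, the next-frame loss can be low only when that flow faithfully summarizes the scene motion, which is what pushes \(\texttt{FLOW}_\theta \) toward good perturbations.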
Using our proposed parameterization and learning algorithm to optimize the perturbation appearance, we achieve state-of-the-art self-supervised point tracking on TAP-Vid and are competitive with supervised techniques; see our paper for the complete quantitative evaluation. The approach of using an additional predictor to create an information bottleneck and optimize counterfactuals is itself generic, and extending it to higher-level structures like segments and shape is an exciting direction for future work.