Extracting Optical Flow from Generative Video Models

Inspired by successes across vision and language, where large general-purpose models outperform smaller task-specific ones, we explore extracting optical flow from large-scale generative video models. Our probing mechanism follows the "perturb-and-track" approach first proposed for zero-shot flow extraction in the Counterfactual World Model (CWM) framework: a small "tracer" perturbation is injected into the source frame, and flow is obtained by tracking how a pretrained next-frame predictor propagates the tracer to the target frame. We implement this interface for several off-the-shelf generative video models (SVD, Cosmos, and LRAS) to leverage their strong scene understanding without any additional training. Refer to Section 4 of the paper for details.
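A minimal sketch of the perturb-and-track interface described above; `predict_next` is a hypothetical placeholder for any pretrained next-frame predictor, and the tracer amplitude and patch size are illustrative, not the paper's exact settings:

```python
import torch

def perturb_and_track(predict_next, frame0, y, x, patch=8, amp=0.2):
    """Zero-shot flow at source location (y, x) via perturb-and-track.

    `predict_next` is a hypothetical wrapper around a pretrained
    next-frame predictor: it maps a (C, H, W) source frame to a
    (C, H, W) predicted target frame.
    """
    # Inject a small "tracer" perturbation into the source frame.
    perturbed = frame0.clone()
    perturbed[:, y:y + patch, x:x + patch] += amp

    # Predict the target frame with and without the tracer.
    pred_clean = predict_next(frame0)
    pred_traced = predict_next(perturbed)

    # Track the tracer: the largest per-pixel difference marks where
    # the predictor propagated it in the target frame.
    diff = (pred_traced - pred_clean).abs().sum(dim=0)  # (H, W)
    y2, x2 = divmod(diff.argmax().item(), diff.shape[1])

    return float(x2 - x), float(y2 - y)  # flow (u, v) at (y, x)
```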
Local Random Access Sequence (LRAS) model
We find that flow extraction benefits when the generative model has three key properties: (1) distributional prediction of future frames; (2) a local tokenizer that treats each spatio-temporal patch independently; and (3) a random-access decoding order that lets the model condition on any subset of second-frame patches. We identify Local Random Access Sequence (LRAS), a recent family of autoregressive visual foundation world models that has all three properties and can be zero-shot prompted to perform many tasks, e.g., 3D object manipulation, novel view synthesis, and depth estimation.
KL-tracing

Building on LRAS, we design a novel test-time inference procedure, KL-tracing, for extracting optical flow from models that predict a probability distribution at every patch of the target frame. Instead of tracking the perturbation through the RGB difference of clean and perturbed predictions, we compare the patch-wise logits of the two predictions: for each patch we compute the KL divergence between the perturbed and clean predicted distributions, then take the peak of the resulting KL-divergence map. The peak should occur at the patch to which the tracer is carried, since that is where the perturbed conditioning most shifts the model's predicted distribution.
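A minimal sketch of this KL-tracing step, assuming the model exposes an (N patches × V codebook) logit tensor for the target frame; shapes and names are illustrative, not the exact LRAS interface:

```python
import torch
import torch.nn.functional as F

def kl_trace(logits_clean, logits_perturbed, src_patch, grid_w, patch_size=16):
    """Locate the tracer in the target frame from patch-wise logits.

    `logits_*` have shape (N, V): a categorical distribution over a
    codebook of V tokens for each of the N target-frame patches, laid
    out row-major on a grid `grid_w` patches wide.
    """
    log_p = F.log_softmax(logits_perturbed, dim=-1)
    log_q = F.log_softmax(logits_clean, dim=-1)

    # Patch-wise KL(perturbed || clean): large where the tracer shifts
    # the model's predicted distribution for the target frame.
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)  # (N,)

    # The peak of the KL map is the patch the tracer was carried to.
    y2, x2 = divmod(kl.argmax().item(), grid_w)

    y1, x1 = src_patch
    return (x2 - x1) * patch_size, (y2 - y1) * patch_size  # flow (u, v)
```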


We find that using the KL divergence of the prediction logits bypasses the noisy differences that sampling randomness introduces in RGB space. One way to suppress this sampling noise is to draw multiple RGB generations and average them; however, directly using the predicted distribution is more efficient, as evidenced by the figure above. This is a key benefit of autoregressive models: they expose the predicted probability distribution directly.
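For contrast, a sketch of that sampling-and-averaging baseline, using a hypothetical stochastic sampler `sample_next`; it needs many generations per condition, where KL-tracing needs a single logit readout:

```python
import torch

def rgb_trace_averaged(sample_next, frame0, perturbed, n=8):
    """Baseline: suppress sampling noise by averaging n RGB samples.

    `sample_next` is a hypothetical stochastic sampler returning one
    (C, H, W) next-frame generation per call. KL-tracing replaces the
    2*n forward passes here with one logits pass per condition.
    """
    clean = torch.stack([sample_next(frame0) for _ in range(n)]).mean(0)
    traced = torch.stack([sample_next(perturbed) for _ in range(n)]).mean(0)
    diff = (traced - clean).abs().sum(dim=0)  # (H, W) difference map
    return divmod(diff.argmax().item(), diff.shape[1])  # peak (y, x)
```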
KL-tracing captures challenging real-world dynamics


Conclusions
Overall, our results indicate that prompting controllable, self-supervised world models is a scalable and effective alternative to supervised and photometric-loss approaches for extracting high-quality optical flow.