Taming generative video models
for zero-shot optical flow extraction

Seungwoo Kim*1 · Khai Loong Aw*1 · Klemen Kotar*1
Cristobal Eyzaguirre1 · Wanhee Lee1 · Yunong Liu1 · Jared Watrous1
Stefan Stojanov1 · Juan Carlos Niebles1 · Jiajun Wu1 · Daniel L. K. Yamins1

1Stanford
(* equal contribution)


We introduce KL-tracing, a novel test-time inference procedure that uses the Kullback-Leibler (KL) divergence of prediction logits to extract optical flow zero-shot from a generative video model, without any task-specific fine-tuning. Combined with the Local Random Access Sequence (LRAS) model, KL-tracing achieves state-of-the-art point tracking / optical flow results.

Extracting Optical Flow from Generative Video Models

General perturb-and-track method for extracting zero-shot flow from generative video models.
"Perturb-and-track" zero-shot optical flow extraction for generative video models. Zero-shot flow is obtained by injecting a small perturbation into the source frame and tracing it through the predicted next frame: the difference between the next-frame predictions for clean and perturbed inputs should appear at the location to which the perturbation was carried.

Inspired by successes across vision and language, where large general-purpose models outperform smaller task-specific ones, we explore extracting optical flow from large-scale generative video models. Our probing mechanism follows the "perturb-and-track" method first proposed for zero-shot flow extraction in the Counterfactual World Model (CWM) framework. In this approach, a small "tracer" perturbation is injected into the source frame, and flow is obtained by tracking how a pretrained next-frame predictor propagates the tracer to the target frame. We implement this interface for several off-the-shelf generative video models (SVD, Cosmos, and LRAS) to leverage their strong scene understanding without any additional training. Refer to Section 4 of the paper for details.
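A minimal sketch of the perturb-and-track procedure is below. The wrapper `predict_next_frame` is a hypothetical stand-in for any pretrained next-frame predictor (not an actual API of SVD, Cosmos, or LRAS), and the perturbation shape and peak-finding are simplified for illustration.

```python
import torch

def perturb_and_track(frame0, frame1, query_xy, predict_next_frame, amp=0.1):
    """Estimate where the point at query_xy in frame0 moves to in frame1.

    frame0, frame1: (C, H, W) tensors. predict_next_frame is assumed to
    condition on the (possibly perturbed) source frame plus context from the
    target frame, and to return a predicted target frame of the same shape.
    """
    clean = predict_next_frame(frame0, frame1)

    # Inject a small local "tracer" perturbation at the query location.
    x, y = query_xy
    perturbed0 = frame0.clone()
    perturbed0[:, y - 1:y + 2, x - 1:x + 2] += amp
    pert = predict_next_frame(perturbed0, frame1)

    # The predictor should carry the tracer to the corresponding location in
    # the target frame; the peak of the difference map gives the match.
    diff = (pert - clean).abs().sum(dim=0)          # (H, W) response map
    ty, tx = divmod(diff.argmax().item(), diff.shape[1])
    return tx - x, ty - y                           # flow vector at the query
```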

Local Random Access Sequence (LRAS) model

We find that flow extraction benefits from generative models with three key properties: (1) distributional prediction of future frames; (2) a local tokenizer that treats each spatio-temporal patch independently; and (3) a random-access decoding order that allows the model to condition on any subset of the second frame's patches. We identify Local Random Access Sequence (LRAS), a recent family of autoregressive visual foundation world models that has all three properties and can be zero-shot prompted to perform many tasks, e.g., 3D object manipulation, novel view synthesis, and depth estimation.

KL-tracing

KL divergence method for extracting zero-shot flow from LRAS.
KL-tracing: tracking the perturbation response in logit space instead of RGB space. We extend the zero-shot procedure introduced above to exploit the distributional prediction property of LRAS.

Building on LRAS, we design a novel test-time inference procedure, KL-tracing, for extracting optical flow from models that predict a probability distribution at every patch of the target frame. Instead of tracking the perturbation through the RGB difference between clean and perturbed predictions, we use the patch-wise logits of the respective predictions. For each patch, we compute the KL divergence between the two predicted distributions and take the peak of the resulting KL-divergence map. The peak divergence should occur at the patch to which the tracer is carried, reflecting the change in the model's predicted distribution under the perturbed conditioning.
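A minimal sketch of the KL-tracing step, assuming the model exposes per-patch logits over a tokenizer codebook for the predicted target frame; tensor names and shapes here are illustrative, not LRAS's actual interface.

```python
import torch
import torch.nn.functional as F

def kl_trace(clean_logits, pert_logits, grid_hw):
    """Find the target patch via the peak of the patch-wise KL map.

    clean_logits, pert_logits: (num_patches, vocab_size) logits over the
    tokenizer codebook for the predicted frame; grid_hw: (H_p, W_p) patch grid.
    """
    log_p = F.log_softmax(pert_logits, dim=-1)    # perturbed distribution
    log_q = F.log_softmax(clean_logits, dim=-1)   # clean distribution
    # KL(perturbed || clean) per patch: large where the tracer shifted the
    # model's predicted token distribution.
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)   # (num_patches,)
    kl_map = kl.reshape(grid_hw)                       # (H_p, W_p) map
    row, col = divmod(kl.argmax().item(), grid_hw[1])
    return (row, col), kl_map
```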

KL-tracing advantage
Our method (D), KL-tracing, exploits the unique properties of LRAS to localize the effect of the perturbation to the query point. The resulting difference between perturbed and clean predictions is much more sharply focused at the target location.
Comparing KL vs RGB
KL-tracing bypasses the noisy RGB differences that result from sampling randomness. Computing the KL divergence of the clean and perturbed prediction logits (last column) is more efficient, yet functionally similar to averaging the RGB difference over many samples (second-to-last column).

We find that using the KL divergence of the prediction logits bypasses noisy differences that can appear in RGB space due to sampling randomness. One way to suppress the noise from random sampling is to draw multiple RGB generations and average them. However, as the figure above shows, it is more efficient to use the predicted distribution directly. This is an important benefit of autoregressive models, which directly expose the predicted probability distribution.
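For contrast, here is a sketch of the sample-averaging baseline from the figure above, with a hypothetical `sample_next_frame` that draws one stochastic RGB decoding per call; KL-tracing replaces the entire loop with a single clean/perturbed pair of logit comparisons.

```python
import torch

def averaged_rgb_response(sample_next_frame, frame0, perturbed0, n=32):
    """Monte-Carlo estimate of the perturbation response in RGB space."""
    diffs = []
    for _ in range(n):
        clean = sample_next_frame(frame0)        # one stochastic decoding
        pert = sample_next_frame(perturbed0)     # another stochastic decoding
        diffs.append((pert - clean).abs().sum(dim=0))
    # Sampling noise only averages out as n grows, at a cost of 2*n forward
    # passes; the prediction logits expose the same signal in a single pair.
    return torch.stack(diffs).mean(dim=0)        # (H, W) averaged response map
```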

KL-tracing captures challenging real-world dynamics

Table comparing to baselines
TAP-Vid evaluation results. Our final method, combining KL-tracing with LRAS, achieves state-of-the-art optical flow accuracy on the challenging real-world TAP-Vid DAVIS benchmark, as well as on TAP-Vid Kubric (synthetic).

Comparing to baselines
Generalization to in-the-wild YouTube videos. Further, KL-tracing with LRAS generalizes to challenging, in-the-wild YouTube videos featuring long-tail real-world phenomena such as homogeneous objects, motion blur, partial occlusion, and rapid camera motion.

Conclusions

Overall, our results indicate that prompting controllable, self-supervised world models is a scalable and effective alternative to supervised or photometric-loss approaches for high-quality optical flow.