Overview

The ability to understand physical dynamics is critical for agents to act in the world. Here, we use Counterfactual World Modeling (CWM) to extract vision structures for dynamics understanding. CWM uses a temporally-factored masking policy for masked prediction of video data without annotations. This policy enables highly effective "counterfactual prompting" of the predictor, allowing a spectrum of visual structures to be extracted from a single pre-trained predictor in a zero-shot manner. We demonstrate that these structures are useful for physical dynamics understanding, allowing CWM to achieve state-of-the-art performance on the Physion benchmark.

A simple and powerful masking policy for pre-training

Given a pair of frames as input, we train an autoencoder that takes in dense visible patches from the first frame and only a sparse subset of patches from the second frame. The model learns to predict the masked patches in the second frame. This policy encourages the model to learn disentangled representations of appearance and dynamics: because the second frame is so heavily masked, the predictor must concentrate the scene transformation into the embeddings of the few visible second-frame patches. This enables meaningful control over the predictions via patch-level modifications.
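
A minimal sketch of this masking policy (our illustration, not the exact training code; `num_patches`, the keep fraction, and the `model` interface in the comments are assumptions):

import torch

def temporally_factored_mask(num_patches, frame2_keep_frac=0.01):
    # Frame 1 is fully visible to the predictor.
    frame1_visible = torch.ones(num_patches, dtype=torch.bool)
    # Frame 2 reveals only a few randomly chosen patches; the rest must
    # be predicted, so the transformation between frames gets packed
    # into the embeddings of these few visible tokens.
    num_keep = max(1, int(frame2_keep_frac * num_patches))
    frame2_visible = torch.zeros(num_patches, dtype=torch.bool)
    frame2_visible[torch.randperm(num_patches)[:num_keep]] = True
    return frame1_visible, frame2_visible

# Schematic training step: reconstruct only the masked frame-2 patches.
# pred = model(frame1_patches, frame2_patches[frame2_visible])
# loss = ((pred - frame2_patches)[~frame2_visible] ** 2).mean()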

[Animation: pre-training via temporally-factored masking.]

The masking policy enables scene manipulations via patch-level prompting

Once the model is trained with temporally-factored masking, we can construct patch-level prompts that meaningfully control scene dynamics and steer the model's predictions.
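
Concretely, a prompt can be as simple as revealing one second-frame patch whose content is copied from the first frame but placed at a displaced location; the predictor then completes a frame in which that content has moved. A minimal sketch under our assumptions (`predictor` is a hypothetical wrapper around the pre-trained model that takes frame 1 plus a sparse mapping of visible frame-2 patches):

import torch

def counterfactual_prompt(predictor, frame1, src_yx, dst_yx, patch=8):
    # Copy the appearance at src_yx in frame 1...
    sy, sx = src_yx
    dy, dx = dst_yx
    prompt = frame1[:, sy:sy + patch, sx:sx + patch]
    # ...and reveal it at dst_yx in an otherwise fully masked second
    # frame. The predictor fills in the rest, producing a counterfactual
    # next frame in which that content has moved.
    visible_frame2 = {(dy // patch, dx // patch): prompt}
    return predictor(frame1, visible_frame2)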

Try it yourself!

Adjust the sliders to move the green patches and see the resulting counterfactual motions. You can evaluate more examples in real time in our Hugging Face demo.

[Interactive demo: an input image with movable green patches (vertical and horizontal sliders) alongside the predicted counterfactual motion. Click one of the example images to try a different input.]


Patch-level prompting allows for zero-shot extraction of vision structures

The ability to exert control over the scene using patch-level modifications allows vision structures to be extracted in a zero-shot manner using a pre-trained CWM model.
  • Keypoints are naturally defined as those tokens that, when made visible to the predictor, yield the lowest reconstruction error on the rest of the scene.
  • Optical flow can be queried by adding a small “tracer” to the first frame and localizing the response to the perturbation in the counterfactual prediction of the next frame (see the sketch after this list).
  • Object segmentations can be queried by sampling counterfactual motion at a pixel location, predicting the next frame, and grouping together parts of the image that move together.
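
The flow query, for instance, reduces to two forward passes. A minimal sketch under our assumptions: `predictor(frame1, frame2)` is a hypothetical wrapper that runs the pre-trained model with only a sparse set of frame-2 patches revealed and returns the completed second frame, and frames are C x H x W tensors:

import torch

def flow_from_tracer(predictor, frame1, frame2, query_yx, patch=8, amp=0.5):
    y, x = query_yx
    traced = frame1.clone()
    traced[:, y:y + patch, x:x + patch] += amp  # inject a small tracer

    # Predict frame 2 with and without the perturbation; the tracer
    # reappears wherever the content underneath it moved.
    clean = predictor(frame1, frame2)
    perturbed = predictor(traced, frame2)

    diff = (perturbed - clean).abs().sum(dim=0)  # H x W response map
    row, col = divmod(int(diff.argmax()), diff.shape[1])
    return row - y, col - x  # flow (dy, dx) at the query location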

[Visualizations: keypoints, optical flow, and object segmentations extracted zero-shot with CWM.]

State-of-the-art results on the Physion benchmark for physics understanding

We study the usefulness of the extracted structures for dynamics understanding by evaluating on Physion, a benchmark that assesses physical scene understanding by testing how well models predict whether two objects will come into contact given a context video (object contact prediction).
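
The evaluation reduces to a binary readout on top of the extracted structures. A minimal sketch of such a probe (our assumption about the readout interface, not the exact benchmark protocol; the features are per-video vectors derived from CWM's extracted structures, and the labels indicate whether the objects contact):

from sklearn.linear_model import LogisticRegression

def contact_readout(train_feats, train_labels, test_feats, test_labels):
    # Fit a linear probe on features from the context videos and report
    # accuracy at predicting object contact on held-out videos.
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_feats, train_labels)
    return probe.score(test_feats, test_labels)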

[Animation: quantitative performance on the Physion benchmark.]

BibTeX

@article{venkatesh2023understanding,
  title={Understanding Physical Dynamics with Counterfactual World Modeling},
  author={Venkatesh, Rahul and Chen, Honglin and Feigelis, Kevin and Bear, Daniel M and Jedoui, Khaled and Kotar, Klemen and Binder, Felix and Lee, Wanhee and Liu, Sherry and Smith, Kevin A and others},
  journal={arXiv preprint arXiv:2312.06721},
  year={2023}
}