Unified 3D Scene Understanding Through Physical World Modeling

Wanhee Lee*¹, Klemen Kotar*¹, Rahul Mysore Venkatesh*¹, Jared Watrous*¹, Honglin Chen*², Khai Loong Aw¹, Daniel L. K. Yamins¹

* These authors contributed equally to this work.

¹Stanford NeuroAI Lab, Stanford University, USA
²OpenAI, USA


A Foundation Model For 3D Understanding

Understanding 3D scenes requires flexibly combining visual reasoning tasks such as depth estimation, novel view synthesis, and object manipulation. Existing approaches typically address each task in isolation, so they share no common representation across tasks. In this work, we present 3WM, a physical world model for unified 3D understanding and interaction, formulated as a probabilistic graphical model in which nodes represent multimodal scene elements such as RGB, optical flow, and camera pose. Diverse 3D tasks emerge zero-shot from different inference pathways through the graph, without task-specific training.

3WM: A Probabilistic Graphical Model for 3D Understanding

Local Random Access Sequence Modeling makes the PGM tractable. We treat visual data as nodes in a probabilistic graphical model (PGM), but learning such a PGM directly is intractable. To make it practical and scalable, we implement it as a GPT-style next-token predictor with three key components: (a) a local quantizer (a small convolutional autoencoder that produces patch codes with strict patch independence); (b) a pointer-content representation that interleaves pointer and value tokens, letting the model condition on, query, and update arbitrary spatial regions in any order; and (c) an LLM-like autoregressive transformer trained on random traversals of the graph. Together, these let us phrase a wide range of 3D tasks as prompts, without task-specific heads, losses, or datasets, while keeping precise patch-level control.

Flexible Inference Pathways

Optical flow as a control surface. Optical-flow patches act as the control mechanism of the graphical model: flow is an intermediate action space in which each patch specifies what moves, and by how much, at its spatial location. With this design, the same model supports several inference pathways:

Novel view synthesis: RGB + dense flow → RGB
Localized edits: RGB + sparse flow → dense flow → RGB
Depth from motion: RGB + camera → flow

Each capability is a different conditional query over the same joint distribution.

Application: Novel View Synthesis

Novel View Synthesis. 3WM performs controllable novel view synthesis by conditioning on 2D optical-flow fields that encode how pixels move under a desired camera transformation. We estimate depth with an off-the-shelf model, unproject into a 3D point cloud, apply the rigid camera transformation, reproject, and use the resulting flow as the conditioning. The model generates the novel view while preserving object identity and scene consistency across camera motions.
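The unproject, transform, and reproject steps described above can be sketched as follows. This is a minimal illustration assuming a pinhole camera with known intrinsics `K` and a depth map already obtained from some external estimator; the function name `flow_from_depth` and the convention that `R` and `t` map source-camera coordinates to target-camera coordinates are assumptions of this sketch, not details taken from 3WM.

```python
import numpy as np

def flow_from_depth(depth, K, R, t):
    """Per-pixel 2D flow induced by a rigid camera motion.

    depth: (H, W) positive depth map.
    K:     (3, 3) pinhole intrinsics.
    R, t:  rotation (3, 3) and translation (3,) with X_target = R @ X_source + t.
    Points that end up behind the target camera are not handled here.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # homogeneous pixels (H, W, 3)
    rays = pix @ np.linalg.inv(K).T                   # unproject to rays with z = 1
    pts = rays * depth[..., None]                     # 3D point cloud in source frame
    pts_new = pts @ R.T + t                           # apply the rigid transformation
    proj = pts_new @ K.T                              # reproject into the target view
    uv_new = proj[..., :2] / proj[..., 2:3]
    return uv_new - pix[..., :2]                      # (H, W, 2) flow in pixels

# Example: constant depth of 2 and a sideways shift of 0.1 in camera coordinates.
K = np.array([[100.0, 0.0, 16.0],
              [0.0, 100.0, 12.0],
              [0.0, 0.0, 1.0]])
depth = np.full((24, 32), 2.0)
flow = flow_from_depth(depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
```

With a focal length of 100 pixels, depth 2, and a shift of 0.1, every pixel moves by 100 × 0.1 / 2 = 5 pixels horizontally, which matches the inverse relationship between flow magnitude and depth under in-plane translation.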

Application: 3D Object Manipulation

3D Object Manipulation. 3WM performs 3D object translation and rotation while preserving the background. We construct a flow field in which the flow on the object surface encodes the 3D transformation and the background flow is set to zero, using Segment Anything to isolate the object mask. The same pathway supports sparse-to-dense flow completion, so a sparse motion prompt can be turned into a dense object-motion flow and applied to generate the edited image.
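As a rough illustration of the flow construction described above, the sketch below zeroes the flow outside an object mask and then samples a few object patches as a sparse motion prompt. Everything here is an assumption made for illustration: the helper names `compose_edit_flow` and `sparse_flow_prompt`, the patch size of 8, and the choice to summarize each patch by its mean flow are hypothetical, and the mask is a hand-made array standing in for the output of a segmentation model.

```python
import numpy as np

def compose_edit_flow(object_flow, mask):
    """Keep the flow on the object and set the background flow to zero."""
    mask = np.asarray(mask, dtype=bool)
    return np.where(mask[..., None], object_flow, 0.0)

def sparse_flow_prompt(flow, mask, patch=8, k=4, rng=None):
    """Pick up to k patches fully inside the mask.

    Returns (patch_ids, mean_flow), where patch_ids index the row-major patch grid.
    """
    rng = np.random.default_rng() if rng is None else rng
    H, W = mask.shape
    gh, gw = H // patch, W // patch
    m = np.asarray(mask, dtype=bool)[:gh * patch, :gw * patch]
    inside = m.reshape(gh, patch, gw, patch).all(axis=(1, 3)).reshape(-1)
    candidates = np.flatnonzero(inside)
    if candidates.size == 0:
        return candidates, np.zeros((0, 2))
    chosen = rng.choice(candidates, size=min(k, candidates.size), replace=False)
    f = flow[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, 2)
    patch_flow = f.mean(axis=(1, 3)).reshape(-1, 2)   # one mean flow per patch
    return chosen, patch_flow[chosen]

# Example: a square object moving by (3, -1) pixels on a 32x32 image.
mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True
dense = np.broadcast_to(np.array([3.0, -1.0]), (32, 32, 2))
edit_flow = compose_edit_flow(dense, mask)
ids, values = sparse_flow_prompt(edit_flow, mask, patch=8, k=2,
                                 rng=np.random.default_rng(0))
```

In this sketch the sparse prompt is only the input side of the pathway; the sparse-to-dense completion itself would be produced by the model, not by this code.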

Application: Self-Supervised Depth Estimation

Self-Supervised Depth Estimation. 3WM estimates depth by prompting the camera-conditioned flow pathway: given an in-plane camera translation, the model predicts the induced optical flow, and depth follows from D ∝ 1/|F|. A simple downward camera translation is sufficient, and performance can be further improved by aggregating over multiple seeds. This approach learns depth cues from optical flow without any depth supervision during training.
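The relation D ∝ 1/|F| follows from pinhole geometry: under a camera translation of magnitude |t| parallel to the image plane and focal length f, a point at depth D moves by |F| = f|t|/D pixels. A minimal sketch of the conversion, including aggregation over several sampled flow fields, is shown below. The function name `depth_from_flow`, the use of a median across seeds, and the `eps` clamp are assumptions of this sketch rather than details from the source, and the result is a relative depth with unknown scale.

```python
import numpy as np

def depth_from_flow(flows, eps=1e-6):
    """Relative depth from flow induced by an in-plane camera translation.

    flows: a single (H, W, 2) flow field, or (S, H, W, 2) samples from S seeds.
    Returns an (H, W) map proportional to 1 / |F|.
    """
    flows = np.asarray(flows, dtype=np.float64)
    if flows.ndim == 3:
        flows = flows[None]                  # treat a single field as one seed
    mag = np.linalg.norm(flows, axis=-1)     # (S, H, W) flow magnitudes
    mag = np.median(mag, axis=0)             # aggregate across seeds (assumed: median)
    return 1.0 / np.maximum(mag, eps)        # clamp to avoid division by zero

# Example: three seeds disagree slightly on the flow magnitude at each pixel.
samples = np.zeros((3, 2, 2, 2))
samples[0, ..., 1] = 2.0
samples[1, ..., 1] = 2.0
samples[2, ..., 1] = 8.0   # outlier seed
rel_depth = depth_from_flow(samples)
```

Pixels with near-zero flow map to very large depth values here, so in practice one would likely clip or mask them before comparing against a reference.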