3D Scene Understanding Through Local Random Access Sequence Modeling (LRAS)

Wanhee Lee*1, Klemen Kotar*1, Rahul Mysore Venkatesh*1, Jared Watrous*1, Honglin Chen*2, Khai Loong Aw1, Daniel L. K. Yamins1

* These authors contributed equally to this work.

1Stanford NeuroAI Lab, Stanford University, USA
2OpenAI, USA

Teaser Image

A Foundation Model for 3D Understanding

Geometric scene understanding from a single image involves at least three key components: depth estimation, novel view synthesis, and object manipulation. In this work, we present a single foundation model for 3D understanding that achieves all of these tasks in a self-supervised manner, within a unified, simple, and scalable framework. To achieve this goal, we introduce LRAS, a novel architecture capable of generating high-quality images in a diffusion-free autoregressive fashion.

LRAS: Making Autoregressive Visual Modeling Work

Architecture

Local Random Access Sequence (LRAS) Modeling Architecture. The LRAS architecture has three key components: (a) a local patch quantizer based on a small convolutional autoencoder; (b) a video serialization process based on a "pointer-content representation", which allows arbitrary ordering of the patches during training and generation; and (c) an LLM-like autoregressive transformer, trained in random sequence order, that predicts the content of the next patch.
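
To make the serialization concrete, here is a minimal sketch in Python/NumPy of the pointer-content encoding with random patch ordering. The grid size, code vocabulary, token-id layout, and function name are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def serialize_pointer_content(content_codes, rng):
    """Serialize a grid of quantized patch codes in a random order.

    Each patch is emitted as a (pointer, content) pair: the pointer token
    encodes the patch's spatial location, the content token its quantized
    code. Random ordering at training time is what enables local random
    access at generation time.

    content_codes: (H, W) array of patch codes from the local quantizer.
    Illustrative token-id layout: pointers occupy [0, H*W); content codes
    are offset by H*W so the two vocabularies do not collide.
    """
    h, w = content_codes.shape
    num_patches = h * w
    order = rng.permutation(num_patches)                     # random patch order
    sequence = []
    for idx in order:
        sequence.append(idx)                                 # pointer token (location)
        sequence.append(num_patches + content_codes.flat[idx])  # content token
    return np.array(sequence)

rng = np.random.default_rng(0)
codes = rng.integers(0, 1024, size=(16, 16))                 # toy 16x16 grid of codes
tokens = serialize_pointer_content(codes, rng)               # length 2 * 256
```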

"LRASing" An Optical Flow Intermediate

Optical Flow Intermediate

Solving 3D Vision Tasks Using an Optical Flow Intermediate. We apply the LRAS procedure to both RGB data and optical flow representations to enable 3D scene understanding from a single image. Using optical flow tokens as conditioning, we can perform image editing by generating the next RGB image. Conversely, using optical flow tokens as the autoregressive prediction target, we can perform depth estimation by predicting the next optical flow from a single RGB image and an in-plane camera motion input.
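
As a rough illustration, the two task directions differ only in which token stream serves as the prompt and which is the prediction target. The sketch below uses hypothetical flat token lists and function names; the actual models operate on pointer-content sequences as described above:

```python
# Hypothetical flat-token-list layouts for the two task directions
# (illustrative only; the real LRAS models use pointer-content pairs).

def rgb_generation_sequence(rgb0_tokens, flow_tokens, rgb1_tokens):
    """Image editing / view synthesis: frame 0 plus a flow field form the
    prompt; the model learns to continue with frame 1's RGB tokens."""
    return rgb0_tokens + flow_tokens + rgb1_tokens

def flow_generation_sequence(rgb0_tokens, camera_tokens, flow_tokens):
    """Depth estimation: frame 0 plus an in-plane camera motion form the
    prompt; the model learns to continue with the induced flow tokens."""
    return rgb0_tokens + camera_tokens + flow_tokens
```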

Application: Novel View Synthesis

NVS Method

Novel View Synthesis. Our LRAS-based method achieves high-quality novel view synthesis, maintaining high fidelity and preserving object identity and scene consistency. We perform novel view synthesis by using a depth map to lift pixels into a partial point cloud, transforming it by the desired camera motion, and projecting it back onto the image plane to generate the next view.
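
The geometric half of this pipeline can be sketched as follows, assuming a standard pinhole camera with intrinsics K and relative pose (R, t); occlusion filling and final image synthesis are handled by the LRAS generator itself, so this sketch only computes where each source pixel lands in the new view:

```python
import numpy as np

def reproject(depth, K, R, t):
    """Warp source-view pixels into a novel view via a depth map.

    A minimal pinhole-camera sketch (assumed intrinsics K and relative
    pose R, t). Returns the (u, v) coordinates of each source pixel in
    the new view, from which a partial image or flow field can be built.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)   # unproject to 3D
    pts_new = R @ pts + t.reshape(3, 1)                   # apply camera motion
    proj = K @ pts_new                                    # project back
    uv_new = proj[:2] / np.clip(proj[2:], 1e-6, None)     # perspective divide
    return uv_new.reshape(2, h, w)
```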

Novel View Synthesis Results

Application: 3D Object Manipulation

Object Manipulation Method

3D Object Manipulation. LRAS enables precise and realistic 3D object manipulation, outperforming diffusion-based methods. We perform 3D object manipulation by constructing a partial point cloud from the depth map, moving it in 3D space, and projecting it back onto the image plane. A segmentation mask isolates the object, ensuring that motion is applied only to the masked region.
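
The same reprojection idea, restricted to the masked object, can be sketched as below (again assuming a pinhole model; R_obj, t_obj, and the function name are illustrative, denoting the user-specified rigid motion). The resulting flow field is what conditions LRAS to synthesize the moved object and the disoccluded background:

```python
import numpy as np

def move_object(depth, mask, K, R_obj, t_obj):
    """Apply a rigid 3D transform to the masked object's points only.

    Unprojects every pixel, transforms only the points selected by the
    segmentation mask, and reprojects. Returns a per-pixel flow field.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)    # 3D points
    m = mask.reshape(-1).astype(bool)
    pts[:, m] = R_obj @ pts[:, m] + t_obj.reshape(3, 1)    # move object only
    proj = K @ pts
    uv_new = proj[:2] / np.clip(proj[2:], 1e-6, None)
    flow = uv_new - np.stack([u, v]).reshape(2, -1)        # per-pixel flow
    return flow.reshape(2, h, w)
```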

Application: Self-Supervised Depth Estimation

Depth Estimation Method

Self-Supervised Depth Estimation. LRAS effectively estimates depth maps from single images by leveraging optical flow signals. We provide an in-plane camera motion as input to LRASFLOW and predict the optical flow induced by that camera motion. The magnitude of the predicted flow gives the disparity, which is inversely proportional to depth. This approach allows us to estimate depth maps in a self-supervised manner, without requiring any ground-truth depth data for training.
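
For a pure in-plane camera translation, the flow-to-depth conversion reduces to the familiar stereo disparity relation Z = f * b / d, where f is the focal length, b the translation magnitude, and d the flow (disparity) magnitude. A minimal sketch with illustrative parameter names:

```python
import numpy as np

def depth_from_flow(flow, focal_length, baseline):
    """Recover depth from flow induced by an in-plane camera translation.

    flow: (2, H, W) flow field predicted by the model.
    Depth follows the stereo relation Z = f * b / |flow|; if the true
    translation magnitude is unknown, depth is recovered only up to a
    global scale factor.
    """
    disparity = np.linalg.norm(flow, axis=0)                   # |flow| per pixel
    return focal_length * baseline / np.clip(disparity, 1e-6, None)
```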
