Local Random Access Sequence (LRAS) Modeling Architecture. The LRAS architecture has three key components: (a) a local patch quantizer based on a small convolutional autoencoder; (b) a video serialization process based on a "pointer-content representation", which allows arbitrary ordering of the patches during training and generation; and (c) an LLM-like autoregressive transformer that predicts the contents of the next patch, trained in random sequence order.
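As an illustration of the pointer-content representation, here is a minimal sketch of the serialization step, assuming patch tokens have already been produced by the quantizer (the function name, the pointer-id offset scheme, and the vocabulary handling are illustrative assumptions, not the exact implementation):

```python
import numpy as np

def serialize_random_order(patch_tokens: np.ndarray, seed: int = 0) -> np.ndarray:
    """Interleave pointer tokens (patch locations) with content tokens.

    patch_tokens: (num_patches, tokens_per_patch) integer codes from the
    local patch quantizer. Pointer ids are offset past the content
    vocabulary so the two token types never collide.
    """
    num_patches, _ = patch_tokens.shape
    content_vocab = int(patch_tokens.max()) + 1  # assumed content vocabulary size
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_patches)  # arbitrary patch order per sequence
    sequence = []
    for idx in order:
        sequence.append(content_vocab + idx)         # pointer: where the patch lives
        sequence.extend(patch_tokens[idx].tolist())  # content: what the patch holds
    return np.array(sequence)

tokens = np.arange(12).reshape(4, 3)   # toy example: 4 patches, 3 codes each
print(serialize_random_order(tokens))  # pointer id, then its 3 content codes, repeated
```

Because every patch carries its own pointer, the transformer can be trained and sampled in any patch order rather than a fixed raster scan.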
Solving 3D Vision Tasks Using an Optical Flow Intermediate. We apply the LRAS procedure to both RGB data and optical flow representations, enabling 3D scene understanding from a single image. Using optical flow tokens as conditioning, we can perform image editing by generating the next RGB image. Conversely, using optical flow tokens as the autoregressive prediction target, we can perform depth estimation by predicting the optical flow from a single RGB image and an in-plane camera motion input.
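A sketch of how one decoder can serve both directions by swapping which modality conditions and which is predicted; the helper and the separator token are hypothetical stand-ins for the actual sequence layout:

```python
import numpy as np

def build_sequence(condition_tokens: np.ndarray,
                   target_tokens: np.ndarray,
                   sep_id: int = 0) -> np.ndarray:
    """Condition tokens are given in full; target tokens are predicted
    autoregressively after the separator."""
    return np.concatenate([condition_tokens, np.array([sep_id]), target_tokens])

# Image editing: condition on frame-1 RGB tokens + flow tokens, predict frame-2 RGB:
#   seq = build_sequence(np.concatenate([rgb1_tokens, flow_tokens]), rgb2_tokens)
# Depth estimation: condition on RGB tokens + camera-motion tokens, predict flow:
#   seq = build_sequence(np.concatenate([rgb_tokens, cam_tokens]), flow_tokens)
```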
Novel View Synthesis. Our LRAS-based method achieves high-quality novel view synthesis, maintaining high fidelity and preserving object identity and scene consistency. We perform novel view synthesis by using a depth map to create a partial point cloud and projecting it back onto the image plane to generate the next view.
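The unproject/transform/reproject step is standard pinhole geometry; a minimal sketch follows, assuming known intrinsics K and a relative camera pose (R, t) as inputs (this is plain geometry, not the full rendering pipeline):

```python
import numpy as np

def render_next_view(depth: np.ndarray, K: np.ndarray,
                     R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Lift a depth map to a point cloud, move the camera, and project the
    points back, returning per-pixel target coordinates in the new view."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    cam_pts = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)  # unproject to 3D
    new_pts = (R @ cam_pts.T).T + t                                # move to new camera frame
    proj = (K @ new_pts.T).T
    uv = proj[:, :2] / np.clip(proj[:, 2:3], 1e-6, None)           # perspective divide
    return uv.reshape(h, w, 2)
```

The reprojected image is only partial (occlusions and disoccluded regions leave holes), which is exactly what the autoregressive generator is suited to complete.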
3D Object Manipulation. LRAS enables precise and realistic 3D object manipulation, outperforming diffusion-based methods. We perform 3D object manipulation by constructing a partial point cloud from the depth map, moving it in 3D space, and projecting it back onto the image plane. A segmentation mask isolates the object, ensuring that motion is applied only to the masked region.
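A minimal sketch of the masked manipulation step, assuming the point cloud and segmentation mask as inputs and a user-supplied rigid transform (R_obj, t_obj); rotating about the object centroid is our own illustrative choice:

```python
import numpy as np

def move_object(points: np.ndarray, mask: np.ndarray,
                R_obj: np.ndarray, t_obj: np.ndarray) -> np.ndarray:
    """points: (H, W, 3) point cloud lifted from the depth map.
    mask: (H, W) boolean segmentation of the target object.
    Only masked points receive the rigid transform; the rest stay fixed."""
    moved = points.copy()
    obj = points[mask]
    center = obj.mean(axis=0)  # rotate about the object's centroid
    moved[mask] = (R_obj @ (obj - center).T).T + center + t_obj
    return moved
```

The transformed points are then reprojected as in the novel-view case, and the generator fills in the revealed background and resolves occlusions.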
Self-Supervised Depth Estimation. LRAS effectively estimates depth maps from single images by leveraging optical flow signals. We provide in-plane camera motion as input to LRASFLOW and predict the optical flow induced by that camera motion. We then take the magnitude of the predicted flow as the disparity, which is inversely proportional to depth. This approach lets us estimate depth maps in a self-supervised manner, without requiring any ground-truth depth data for training.
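For a pure in-plane camera translation of magnitude b with focal length f, the induced flow magnitude at a point of depth Z is f·b/Z, so depth follows directly from the predicted flow. A minimal sketch of this conversion, assuming f and b are known (the function name is illustrative):

```python
import numpy as np

def depth_from_flow(flow: np.ndarray, focal: float, baseline: float,
                    eps: float = 1e-6) -> np.ndarray:
    """flow: (H, W, 2) optical flow in pixels predicted for an in-plane
    camera translation of magnitude `baseline`. Returns a (H, W) depth map."""
    disparity = np.linalg.norm(flow, axis=-1)        # |flow| is inversely proportional to depth
    return focal * baseline / np.clip(disparity, eps, None)
```

If f or b is unknown, the same computation still yields depth up to a global scale, which suffices for the relative depth maps used here.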