Physical Understanding of Visual Scenes
When we look at a scene we see distinct objects imbued with physical properties: color, texture, location, shape, motion, and so on. We also infer complex, dynamic relationships within the scene: a cup may be balanced precariously on a table’s edge, a piece of debris may be rolling in front of a truck, or a dog may have temporarily disappeared behind a tree. All of these judgments involve physical understanding of our visual observations.
Current computer vision algorithms lack physical scene understanding
Computer vision researchers have not yet developed algorithms that capture human physical understanding of scenes. This is despite outstanding progress in capturing other visual abilities: in particular, convolutional neural networks (CNNs) supervised with human-labeled data have excelled at categorizing scenes and objects. Yet when CNNs are optimized to perform tasks of physical understanding, such as predicting how a scene will evolve over time, they make unrealistic predictions – of objects dissolving, disappearing, or merging. Being able to tell a cat from a dog is not the same as knowing that a dog will still exist and keep its form when it runs out of sight.
Representing scenes as Physical Scene Graphs
The need to understand physical objects suggests a different type of scene representation than the image-like layers of features found in CNNs. Instead, different objects might be represented by distinct entities with associated attributes and interrelationships – in other words, a graph representation. We formalize this idea as a Physical Scene Graph (PSG): a hierarchical graph in which nodes represent objects or their parts, edges represent relationships, and a vector of physically meaningful attributes is bound to each node.
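To make the structure concrete, here is a minimal Python sketch of a PSG as a hierarchy of node levels, each carrying attribute vectors, within-level edges, and parent assignments to the next (coarser) level. The class and field names are illustrative, not the data structures from our codebase.

```python
# A minimal sketch of a Physical Scene Graph (PSG) data structure.
# Class and field names are illustrative, not the paper's API.
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

@dataclass
class PSGLevel:
    """One level of the hierarchy: nodes, their attributes, and edges."""
    # attributes[i] is the physically meaningful attribute vector of node i
    # (e.g. centroid position, mean color, surface normal, 2D shape moments).
    attributes: np.ndarray                       # shape: [num_nodes, attr_dim]
    # within-level relationships between nodes
    edges: List[Tuple[int, int]] = field(default_factory=list)
    # parents[i] = index of the node at the next (coarser) level that node i
    # belongs to; this is what makes the graph hierarchical.
    parents: List[int] = field(default_factory=list)

@dataclass
class PhysicalSceneGraph:
    """A hierarchy of levels, from fine visual elements up to whole objects."""
    levels: List[PSGLevel]

    def segment_map(self, base_shape: Tuple[int, int]) -> np.ndarray:
        """Recover which top-level node each base element (pixel) belongs to."""
        assignment = np.arange(self.levels[0].attributes.shape[0])
        for level in self.levels[:-1]:
            assignment = np.asarray(level.parents)[assignment]
        return assignment.reshape(base_shape)

# Toy usage: a 4x4 grid of visual elements grouped into two "objects"
# (left two columns vs. right two columns).
base = PSGLevel(attributes=np.random.rand(16, 8), parents=[0, 0, 1, 1] * 4)
objects = PSGLevel(attributes=np.random.rand(2, 8))
psg = PhysicalSceneGraph(levels=[base, objects])
print(psg.segment_map((4, 4)))                   # unsupervised segmentation
```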
Under such a representation, many aspects of physical understanding become natural to encode: for instance, “object permanence” corresponds to the property that graph nodes do not suddenly disappear, and “shape constancy” to the property that a node’s shape attributes usually don’t change. These desirable qualities therefore raise the challenge of inferring PSG representations from visual input.
PSGNets: building graph representations from visual input
We break the problem of inferring PSGs into three subproblems, each inspired by a critical process in human physical scene understanding:
- First, the extraction of retinotopic features – especially geometric ones like depth, surface normals, and motion – from visual input.
- Second, the perceptual grouping of extracted scene elements into distinct physical objects.
- Third, the binding of features to different objects, such that multiple, distributed properties (e.g. color, shape, and motion) are affixed to each object.
Both infants and non-human animals gain these abilities early in life, before or without the help of externally given labels. In creating algorithms for building graph representations – called PSGNets – we therefore designed new neural network modules to implement each of these processes and to learn from self-supervision. In the following sections, we describe how PSGNets address each of the three subproblems.
Extracting physically meaningful features with ConvRNNs
Although CNNs struggle to infer scene structure, they are already good at extracting geometric features – they just do not group these features into discrete objects. However, different types of geometric information are typically found in different CNN layers: early layers can encode sharp object boundaries, whereas later layers better encode broad-scale information about depth and surface shape. To combine both types of information in a single set of features, we developed a recurrent CNN (ConvRNN) that learns to combine early and late layers through locally recurrent “cells” and long-range feedback connections. As expected, this recurrent feature extraction architecture is crucial mainly for learning to represent geometric properties; lower-level visual properties, like color and texture, are largely unaffected by instead using a feedforward CNN feature extractor.
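As a rough illustration of the idea, the PyTorch sketch below combines a fine-grained early layer with a coarser later layer through top-down feedback over a few recurrent timesteps. The layer sizes, recurrent cell, and feedback rule are simplified stand-ins, not the ConvRNN architecture we actually use.

```python
# A minimal sketch of a recurrent CNN (ConvRNN) feature extractor: an early
# layer keeps fine spatial detail, a later layer computes coarser geometric
# context, and long-range feedback lets the two be combined over several
# timesteps. Sizes and the feedback rule are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyConvRNN(nn.Module):
    def __init__(self, in_ch=3, early_ch=32, late_ch=64, steps=3):
        super().__init__()
        self.steps = steps
        self.early = nn.Conv2d(in_ch, early_ch, 3, padding=1)
        self.late = nn.Conv2d(early_ch, late_ch, 3, stride=2, padding=1)
        # feedback projects coarse, late features back into the early layer
        self.feedback = nn.Conv2d(late_ch, early_ch, 1)
        # a simple local recurrent "cell": mix the feedback with the old state
        self.gate = nn.Conv2d(2 * early_ch, early_ch, 1)

    def forward(self, x):
        h_early = torch.relu(self.early(x))
        for _ in range(self.steps):
            h_late = torch.relu(self.late(h_early))
            fb = F.interpolate(self.feedback(h_late), size=h_early.shape[-2:],
                               mode='bilinear', align_corners=False)
            # combine bottom-up drive and top-down feedback into the new state
            h_early = torch.relu(self.gate(torch.cat([h_early, fb], dim=1)))
        # return both: sharp early features plus coarse geometric context
        return h_early, h_late

fine, coarse = TinyConvRNN()(torch.randn(1, 3, 64, 64))
print(fine.shape, coarse.shape)   # [1, 32, 64, 64] and [1, 64, 32, 32]
```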
Perceptually grouping visual elements into objects with Graph Pooling
Given a set of visual elements – starting with the features extracted from a ConvRNN – the PSGNets’ Graph Pooling modules learn to group these elements into coarser entities – objects or parts of objects. For example, a PSGNet may learn that when two nearby elements have similar texture, they usually belong to the same physical entity. Graph Pooling modules implement perceptual grouping by first predicting pairwise graph edges between visual elements, then clustering the resulting graph: the clusters become a new set of nodes and form a coarser, higher level of the PSG. This also automatically creates a map of which pixels in the input are represented by each node – an unsupervised segmentation of the scene.
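The sketch below illustrates this two-step recipe on a toy grid of features: predict pairwise affinities between neighboring elements, threshold them into edges, and take connected components of the resulting graph as the new, coarser nodes. The cosine-similarity affinity and the thresholding rule are stand-ins for the learned Graph Pooling modules.

```python
# A minimal sketch of perceptual grouping by graph pooling: predict pairwise
# affinities between neighboring visual elements, keep the strong ones as
# edges, and cluster the graph; the clusters become the next level's nodes
# and double as an unsupervised segmentation. The affinity function here is
# a fixed stand-in for the learned edge predictor.
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def graph_pool(features, grid_shape, threshold=0.9):
    """features: [H*W, D] array of per-element feature vectors."""
    H, W = grid_shape
    idx = np.arange(H * W).reshape(H, W)
    rows, cols = [], []
    # consider right and down neighbors on the grid
    for (di, dj) in [(0, 1), (1, 0)]:
        a = idx[: H - di, : W - dj].ravel()
        b = idx[di:, dj:].ravel()
        # stand-in "edge predictor": cosine similarity of the feature vectors
        fa, fb = features[a], features[b]
        sim = np.sum(fa * fb, axis=1) / (
            np.linalg.norm(fa, axis=1) * np.linalg.norm(fb, axis=1) + 1e-8)
        keep = sim > threshold
        rows.append(a[keep]); cols.append(b[keep])
    rows, cols = np.concatenate(rows), np.concatenate(cols)
    adj = coo_matrix((np.ones_like(rows), (rows, cols)), shape=(H * W, H * W))
    # connected components of the thresholded graph become the coarser nodes
    n_clusters, labels = connected_components(adj, directed=False)
    return n_clusters, labels.reshape(H, W)

# Toy usage: two flat "objects" with different feature vectors
feats = np.zeros((16, 4)); feats[:8, 0] = 1.0; feats[8:, 1] = 1.0
print(graph_pool(feats, (4, 4)))   # 2 clusters: top half vs. bottom half
```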
Binding explicit properties to objects with Graph Vectorization
The third process – binding properties to each “object” represented by a graph node – itself has two parts: first, binding any properties to a graph node, and second, ensuring that these properties explicitly represent physical aspects of an object. Given that each node has a corresponding visual segment of extracted features, the straightforward way to compute an “attribute vector” for each node is to summarize those features. PSGNets do this by taking simple statistics over the segmented features – means, variances, spatial moments – as well as along the boundary contours of each segment, which carry information about each object’s 2D silhouette. We call this process “Vectorization,” because it encodes visual features distributed across segments of a feature map as a single data vector per graph node.
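Below is a minimal sketch of this summarization step: for each segment, it concatenates per-channel means and variances with the segment’s centroid and area. The exact statistics in our models differ, and the boundary-contour features are omitted here; this is only meant to show the shape of the computation.

```python
# A minimal sketch of "Vectorization": summarize the feature map within each
# node's segment as a single attribute vector of simple statistics (here,
# per-channel means and variances plus the segment's centroid and area).
import numpy as np

def vectorize(feature_map, segment_map):
    """feature_map: [H, W, D]; segment_map: [H, W] of integer node ids."""
    H, W, _ = feature_map.shape
    ys, xs = np.mgrid[0:H, 0:W]
    attributes = {}
    for node_id in np.unique(segment_map):
        mask = segment_map == node_id
        feats = feature_map[mask]                        # [n_pixels, D]
        attributes[node_id] = np.concatenate([
            feats.mean(axis=0),                          # per-channel means
            feats.var(axis=0),                           # per-channel variances
            [ys[mask].mean() / H, xs[mask].mean() / W],  # normalized centroid
            [mask.mean()],                               # fraction of image area
        ])
    return attributes

# Toy usage: a 4x4 feature map segmented into two nodes (top and bottom half)
seg = np.repeat(np.array([[0], [0], [1], [1]]), 4, axis=1)
attrs = vectorize(np.random.rand(4, 4, 3), seg)
print({k: v.shape for k, v in attrs.items()})            # one 9-dim vector per node
```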
Encouraging attributes to represent physical properties explicitly with Graph Rendering
These “raw” attribute vectors are unlikely to represent objects’ physical properties explicitly. The final pieces of a PSGNet – Graph Rendering modules – encourage physical properties to be easily decodable. They do this by rendering PSG nodes and attributes into image-level predictions of physical attributes – color, depth, surface normals, shapes – without using any learnable parameters. Since the node attributes must directly predict these scene properties and minimize error against (self-)supervision signals, the property representations cannot be distributed through the weights of a decoder network; they must instead remain bound to the nodes themselves. We demonstrate that this procedure works by manually editing PSG nodes and attribute vectors and observing that the rendered scene changes exactly as expected: deleting a node removes its object from the scene, changing its “position” attribute changes only its position, and so on.
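The toy sketch below shows the spirit of parameter-free rendering: each node’s attribute vector is painted onto its segment, so editing one node’s attributes changes only that object in the rendered output. The attribute layout (color in the first three entries) is an assumption for illustration.

```python
# A minimal sketch of parameter-free graph rendering: paint each node's
# predicted attributes (here, just a color) back onto its pixels, so the
# attribute vector itself must carry the scene property in order to minimize
# reconstruction error. No learnable decoder weights are involved.
import numpy as np

def render(attributes, segment_map, channels=3):
    """attributes: {node_id: vector whose first `channels` entries are color}."""
    H, W = segment_map.shape
    image = np.zeros((H, W, channels))
    for node_id, attr in attributes.items():
        image[segment_map == node_id] = attr[:channels]   # constant per segment
    return image

# Toy usage: render, then edit one node's color attribute and re-render
seg = np.repeat(np.array([[0], [0], [1], [1]]), 4, axis=1)
attrs = {0: np.array([1.0, 0.0, 0.0]), 1: np.array([0.0, 0.0, 1.0])}
before = render(attrs, seg)
attrs[1] = np.array([0.0, 1.0, 0.0])    # edit: node 1 turns from blue to green
after = render(attrs, seg)
print(np.abs(after - before).sum(axis=-1))   # only node 1's pixels changed
```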
Progress toward physical scene understanding with PSGNets
We now explain three ways in which PSGNets make substantial progress on learning to represent scene structure and physical properties.
- First, PSGNets dramatically outperform CNN-based models on the visual task of unsupervised scene segmentation – that is, decomposing scenes into objects without human labels that indicate how to do so. While the difference between PSGNets and baseline methods is slight on simple synthetic scenes, PSGNets are more than twice as accurate as the baselines when trained and tested on either synthetic scenes with realistic objects or images of real, complex scenes. Such a large performance gap between models implies that the PSGNet architecture is much better suited than CNNs for inferring scene structure.
- Second, PSGNets learn to segment scenes far more efficiently than baselines, and their performance generalizes far better to visual inputs unlike those seen in training. PSGNets can segment complex images after seeing each training scene once, whereas CNNs reach their (lower) peak performance only after hundreds of views of a scene. Moreover, PSGNets segment scenes from entirely unseen datasets nearly as well as ones they were trained on. In contrast, CNN-based models are narrowly fit to their training set. These findings imply that the PSGNet architecture has better inductive biases for scene structure inference, likely because it is easier to learn common patterns of visual feature combinations than to learn what entire object segments look like, as CNNs must.
- Third, PSGNets can take advantage of motion during training to perceptually group static scenes better during evaluation. The clearest indication that something should be treated as a physical object is that it moves independently of other scene elements. However, most visual elements that we ought to perceive as physical objects are not moving at any given moment. This creates an ideal self-supervised learning problem: a PSGNet first learns to segment scenes based on object motion, then uses these motion-derived segmentations as a learning signal for grouping the same objects from their static appearance. This allows grouping of static object parts that might otherwise look like they belong to distinct objects (e.g. because they have very different textures).
Future directions: performing physical tasks from visual input
Our results show that PSGNets can efficiently learn general routines for segmenting complex objects and scenes, and they especially benefit from observing object motion. They do this without supervision of scene structure – instead learning patterns of which visual and physical properties tend to belong to the same object.
The key next step is to use the physical graph representation for tasks that flummox other computer vision algorithms – tasks that require physical understanding more than categorical knowledge. Whether an object will slide or roll, how soon two things will collide, and where to look for something hidden are problems that depend on just the sort of scene structure and physical properties that PSGNets encode. By learning to build structured, physical representations of scenes, we hope that PSGNets will begin to bridge the critical gap between visual perception and physical understanding.
If you are interested in using PSGNets or their components in your work, please check out our paper and code repository. If you have any questions or simply would like to discuss physical understanding of scenes, do not hesitate to get in touch!
To cite our work, please use:
@article{bear2020learning,
  title={Learning physical graph representations from visual scenes},
  author={Bear, Daniel and Fan, Chaofei and Mrowca, Damian and Li, Yunzhu and Alter, Seth and Nayebi, Aran and Schwartz, Jeremy and Fei-Fei, Li F and Wu, Jiajun and Tenenbaum, Josh and others},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}