Segmenting real-world scenes without supervision is a critical, unsolved problem in computer vision. Here, we introduce Spelke Object Inference, an unsupervised method for perceptually grouping objects in static images by predicting which parts of a scene would move as cohesive wholes. (We call these wholes “Spelke objects” after the cognitive scientist Elizabeth Spelke, whose work inspired this idea.)

In principle, it is straightforward to use dynamic signals to learn which scene elements form cohesive objects. However, standard neural network architectures lack grouping primitives that can truly take advantage of Spelke Object Inference. Our main contribution is therefore the design of a neural grouping operator and a differentiable model built on it, the Excitatory-Inhibitory Segment Extraction Network (EISEN). EISEN learns to segment static, real-world scenes from motion training signals, substantially improving the state of the art in this challenging domain.

Spelke Object Inference

Our method is based on a key idea from studies of human development. Spelke and colleagues showed that infants initially group scene elements based on coherent motion, but as they mature begin to use static cues, such as the classical Gestalt properties [1,2]. Infants may be learning static cues that predict which elements move as a unit – a natural case of self-supervision that could inform segmentation algorithms.

However, actual implementations of this process face two challenges. First, most objects are not moving at any given moment, so learning signals applied to static scene elements may be misleading; and second, many objects move only when they are moved by an agent – in which case the object’s motion must be disentangled from the agent’s.

EISEN’s architecture and training procedure are designed to address these challenges.

Two Challenges of Spelke Object Inference
Two challenges arise in applying Spelke Object Inference to real-world scenes. First, most Spelke objects do not move at any given time (e.g., the carrot in cyan). Second, it is often hard to separate the motion of an agent from the motion of the object it is acting on (e.g., the yellow robot arm and pink pot lid).

The Architecture of EISEN

Perceptual grouping is fundamentally a form of relational inference: for each element of a scene, which other elements are part of the same object? EISEN therefore predicts, from a static image, a graph of (mostly local) pairwise affinities that represent physical connectivity between scene elements.

This architectural choice resolves the first challenge above: pairs of elements moving together are trained to have high affinity, pairs moving independently to have low affinity, and pairs of static scene elements are not trained at all, because motion provides no signal about their affinity.
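As a concrete sketch, this training signal can be written as a masked pairwise loss. (The sketch below is our illustration, not EISEN’s actual code; the names, shapes, and use of binary cross-entropy are assumptions, with motion targets supplied by the teacher model described below.)

import torch

def masked_affinity_loss(pred_affinity, moving, connected):
    """Supervise pairwise affinities only where motion makes them observable.

    pred_affinity: [N, N] predicted affinities in [0, 1] for N scene elements
    moving:        [N]    bool, True where the motion teacher says an element moves
    connected:     [N, N] bool, True where two elements move as a single unit
    """
    # A pair is supervised only if at least one of its elements is moving;
    # static-static pairs carry no information about physical connectivity.
    supervised = moving[:, None] | moving[None, :]
    target = connected.float()  # 1 = moves together, 0 = moves independently
    per_pair = torch.nn.functional.binary_cross_entropy(
        pred_affinity, target, reduction="none")
    return (per_pair * supervised).sum() / supervised.sum().clamp(min=1)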

[Figure: EISEN architecture]
EISEN learns pairwise affinities based on motion signals from a pretrained RAFT teacher model. Because physical connectivity cannot be inferred from static scene elements, only pairs with at least one moving element are trained (in the first training round; see below).

A Neural Grouping Primitive: Kaleidoscopic Propagation + Competition

But pairwise affinities are not yet objects – they need to be grouped. EISEN therefore applies a pair of neural grouping modules, Kaleidoscopic Propagation (KProp) and Competition, to extract objects from a scene’s affinity graph.

Kaleidoscopic Propagation (KProp) first randomly initializes a retinotopic array of high-dimensional vector nodes, which represent scene elements. It then recurrently propagates two types of messages on the affinity graph:

  1. excitatory messages, which pull the vectors of high-affinity node pairs toward each other;
  2. inhibitory messages, which push the vectors of low-affinity node pairs apart.

Because physical connectivity is transitive, over many iterations this propagation produces “plateaus” of nodes with nearly the same vector and sharp edges where the scene has an object border.
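A toy version of this propagation, written from the description above, looks as follows. (The update rule, normalization, and iteration count are illustrative assumptions, not EISEN’s exact dynamics.)

import torch

def kprop(affinity, n_iters=50, dim=32):
    """Toy kaleidoscopic propagation over an [N, N] affinity matrix in [0, 1].

    Returns [N, dim] node vectors that should form "plateaus": nearly identical
    within a connected object, sharply different across object borders.
    """
    n = affinity.shape[0]
    h = torch.randn(n, dim)  # random retinotopic initialization
    # Row-normalize so each node averages over its (dis)connected neighbors.
    excite = affinity / affinity.sum(1, keepdim=True).clamp(min=1e-6)
    repel = 1.0 - affinity
    repel = repel / repel.sum(1, keepdim=True).clamp(min=1e-6)
    for _ in range(n_iters):
        pull = excite @ h  # excitatory: move toward high-affinity neighbors
        push = repel @ h   # inhibitory: move away from low-affinity nodes
        h = torch.nn.functional.normalize(h + pull - push, dim=1)
    return h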

Competition then picks out well-formed node clusters simply by dropping “object pointers” on this plateau map. Each pointer yields an object segment, corresponding to all the nodes that have roughly the same vector as the one at the pointer’s position. Because some segments may be redundant, several rounds of winner-take-all dynamics deactivate pointers whose territory overlaps with that of another pointer.
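A greedy stand-in for this procedure is sketched below; EISEN’s actual competition runs recurrent winner-take-all rounds, and the pointer count and thresholds here are illustrative assumptions.

import torch

def compete(h, n_pointers=16, sim_thresh=0.9, overlap_thresh=0.5):
    """Toy competition over [N, dim] unit-norm plateau vectors from KProp.

    Drops random "object pointers" and keeps one segment per surviving pointer.
    """
    n = h.shape[0]
    pointers = torch.randint(0, n, (n_pointers,))
    # Each pointer claims all nodes whose vector matches its own.
    masks = [(h @ h[p]) > sim_thresh for p in pointers]
    winners = []
    for m in sorted(masks, key=lambda m: -int(m.sum())):  # larger segments first
        overlaps = [(m & w).sum() / (m | w).sum().clamp(min=1) for w in winners]
        # Deactivate pointers whose territory overlaps an existing winner's.
        if not overlaps or max(overlaps) < overlap_thresh:
            winners.append(m)
    return winners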

[Figure: Affinity prediction, KProp, and Competition]
KProp and Competition have no trainable parameters. They thus work as a general-purpose grouping primitive for introducing “objects” into any visual representation that looks like an affinity graph.

Learning to Segment Simulated and Real-world Images

EISEN learns from motion to segment objects in single, static frames from three challenging datasets:

  1. ThreeDWorld Playroom, with simulated scenes containing multiple complex, realistically rendered objects;
  2. DAVIS 2016, with real scenes of objects moving against complex backgrounds (evaluated on single frames);
  3. Bridge Data, a real-world dataset with a human-controlled robot arm interacting with objects in a cluttered “kitchen.”

ThreeDWorld Playroom Dataset

[Figure: EISEN vs. baseline segmentations on ThreeDWorld Playroom]
Baseline methods frequently lump objects together, miss them entirely, or distort their shapes. EISEN captures fine details (e.g., the chair and giraffe legs) and segments closely spaced objects of similar appearance (e.g., the zebras).

DAVIS: Densely Annotated VIdeo Segmentation 2016

[Figure: EISEN segmentations on DAVIS 2016]
EISEN can segment static objects that were not seen during training. Even though the motorcycle and rider are separate physical objects, it is not surprising that EISEN groups them together since it has not observed them moving independently during training.

Real-World Robotics Dataset (“Bridge” data)

[Figure: EISEN segmentations of Bridge data]
EISEN learns to segment Spelke objects – collections of physical stuff that move coherently – in real-world images. Baseline models struggle to detect and accurately group most of the objects.

In Bridge, most objects move only when the robot arm moves them. To decouple agent from object motion during training – the second challenge of applying Spelke Object Inference – we introduce a form of top-down inference into EISEN’s learning curriculum. Because EISEN sees many cases of the moving agent (e.g., Bridge’s robot arm) early in training, it naturally learns to segment this first. Thus, later in training, the agent can be “explained away” from raw motion signals, isolating inanimate objects for affinity learning. We implement this as a simple, round-based curriculum in which EISEN automatically learns to segment more and more objects from static cues alone.
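Schematically, the later-round supervision mask amounts to subtracting the already-segmented agent from the teacher’s motion map. (A minimal sketch with hypothetical names; EISEN’s actual curriculum is defined over its own confident segments.)

import torch

def motion_supervision_mask(flow_mag, agent_mask, round_idx, motion_thresh=0.5):
    """Toy round-based curriculum for explaining away agent motion.

    flow_mag:   [H, W] optical-flow magnitude from the motion teacher
    agent_mask: [H, W] bool, the model's current segment for the agent (robot arm)
    """
    moving = flow_mag > motion_thresh
    if round_idx == 0:
        return moving  # first round: all motion, mostly the agent's
    # Later rounds: remove the agent, so residual motion (e.g. a lifted pot
    # lid) supervises affinities for the inanimate objects it belongs to.
    return moving & ~agent_mask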

Learning general grouping priors. EISEN can accurately segment static Spelke objects in the Playroom and DAVIS2016 datasets that it has not seen at all, let alone seen moving, during training. It may do this by capturing, in its pairwise affinities, fairly generic aspects of real-world “objectness” like spatial proximity and textural similarity. If this is so, EISEN could serve as a mechanistic model of how these static cues come to characterize perceptual grouping in adults, whereas coherent motion predominates in infants [1].

References

To cite our work, please use:

@InProceedings{chen2022unsupervised,
author = {Chen, Honglin and Venkatesh, Rahul and Friedman, Yoni and Wu, Jiajun and Tenenbaum, Joshua B and Yamins, Daniel LK and Bear, Daniel M},
title = {Unsupervised Segmentation in Real-World Images via Spelke Object Inference},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2022}
}

[1] Spelke, E.S., 1990. Principles of object perception. Cognitive science, 14(1), pp.29-56.
[2] Carey, S. and Xu, F., 2001. Infants’ knowledge of objects: Beyond object files and object tracking. Cognition, 80(1-2), pp.179-213.