Discovering and using Spelke segments

Overview of SpelkeNet's capabilities. On the left, our model first predicts a probability-of-motion map, indicating regions likely to move if an external force is applied. We sample a point from this map, apply a virtual poke, and have the model complete a flow field revealing which other pixels will move as a result of the poke. From this, we can naturally extract the group of pixels that move together, i.e., a Spelke segment. On the right, we illustrate how these discovered segments enable more physically plausible object manipulation than SAM segments.

Existing definitions of segmentation may be insufficient for physical reasoning tasks

An ongoing challenge in image segmentation is defining what should count as a segment: existing segmentation datasets like COCO and ADE20K rely on semantic labels (e.g., car, tree, sky) to delineate segments. Although these labels are useful for recognition tasks, the resulting masks often do not reflect how objects move or interact in the real world, limiting their utility in robotics tasks like object manipulation, which require physical reasoning about which parts of a scene move together.

Spelke segments based on motion could address this limitation

We draw from developmental psychology the notion of Spelke objects: groupings of physical entities that reliably move together under applied forces, a concept first established by Elizabeth Spelke in Principles of Object Perception. Unlike existing segmentation definitions, Spelke objects are defined by category-agnostic causal motion relationships, making them inherently better suited to robotics tasks such as manipulation and planning.

Benchmarking Spelke segments

We first benchmark the Spelke segment concept by introducing SpelkeBench: a 500-image evaluation dataset designed to assess whether segmentation algorithms can identify Spelke segments. As shown below, annotations from existing methods like SAM and EntitySeg frequently contain segments that do not represent units that would intuitively move together in the real world (e.g., camera subcomponents and bottle labels), demonstrating that current benchmarks fail to capture the Spelke notion. We curate the dataset from two complementary sources: EntitySeg, featuring high-resolution internet imagery with dense segmentation annotations, and Open X-Embodiment, consisting of real-world egocentric robot interactions. This contrast enables evaluation across both unconstrained natural image domains and physically grounded robotics environments.

Comparison between datasets.
Example segments from SpelkeBench. We compare these with SAM and EntitySeg segments to show that SpelkeBench contains annotated segments that align more closely with the Spelke notion.

SpelkeNet: Operationalizing Spelke segmentation

We build SpelkeNet, a model that learns to complete flow fields and thereby implicitly captures how objects move together in the physical world. SpelkeNet is an instance of Local Random Access Sequence Modeling (LRAS), a sequence modeling framework inspired by LLMs that causally predicts locally quantized image (i.e. RGB) and optical flow patches. SpelkeNet leverages LRAS's flexible autoregressive sequence design to apply sparse, localized interventions: a virtual poke is expressed simply by appending a flow token to the input sequence, after which the model discovers Spelke objects by completing the flow field that indicates how the rest of the scene will respond. These kinds of localized interventions have proven difficult with other classes of world models, such as those built on diffusion, which require dense global conditioning. The LRAS sequence modeling framework thus provides an ideal foundation for Spelke object discovery.
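
To make the sequence mechanics concrete, below is a minimal sketch of how a virtual poke might be appended to an LRAS-style token sequence. The patch size, pointer-token layout, and helper names are illustrative assumptions, not the released LRAS interface.

import torch

# NOTE: the token-id conventions here are assumptions for illustration only;
# the real LRAS vocabulary, patch size, and codebooks come from the model.
PATCH = 16  # assumed patch size in pixels

def pointer_token(row, col, grid_w=32):
    """Map a patch location to a pointer-token id (assumed row-major layout)."""
    return row * grid_w + col

def build_poke_sequence(rgb_tokens, poke_xy, poke_flow_token, grid_w=32):
    """Assemble an LRAS-style input: the tokenized RGB frame, then a single
    (pointer, flow) pair encoding the virtual poke at pixel poke_xy."""
    row, col = poke_xy[1] // PATCH, poke_xy[0] // PATCH
    seq = list(rgb_tokens)  # assumed interleaved [ptr_0, rgb_0, ptr_1, rgb_1, ...]
    seq += [pointer_token(row, col, grid_w), poke_flow_token]  # append the poke
    return torch.tensor(seq)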

LRAS Architecture
SpelkeNet architecture. On the left we illustrate SpelkeNet—an instance of the LRAS framework applied to optical-flow completion. The input is a tokenized RGB image and a sparse virtual poke, with each token being paired with a pointer token indicating spatial location. The model predicts a categorical distribution \( \mathcal{D}[i_j] \) over flow tokens for every spatial location \( i_j \). The right panel shows autoregressive sampling yielding a complete flow field—at each step we select an undecoded location, sample a flow token from the predicted distribution at that location, and append it to the input sequence to be fed back into the model to generate an updated distribution. This process repeats, progressively completing the flow field, which represents how the scene responds to the virtual poke. This enables Spelke segment discovery through analyzing the motion correlation patterns of the resulting flow fields.
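
As a hedged sketch of this sampling loop (assuming `model` is a causal transformer that maps a token sequence to next-token logits over the flow vocabulary, and that `undecoded_locations` holds pointer-token ids as in the previous sketch; both are interface assumptions):

import torch

@torch.no_grad()
def complete_flow_field(model, seq, undecoded_locations, temperature=1.0):
    """Autoregressively decode one flow token per undecoded spatial location."""
    flow_tokens = {}
    for loc in undecoded_locations:
        seq = torch.cat([seq, torch.tensor([loc])])   # append pointer token
        logits = model(seq.unsqueeze(0))[0, -1]       # distribution D[i_j]
        probs = torch.softmax(logits / temperature, dim=-1)
        tok = torch.multinomial(probs, 1)             # sample a flow token
        seq = torch.cat([seq, tok])                   # feed it back in
        flow_tokens[loc] = tok.item()
    return flow_tokens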

SpelkeNet enables the extraction of two key structures to discover Spelke objects

Motion affordance maps. To discover Spelke objects, we must first identify which pixels correspond to candidate movable entities within a scene, so that virtual pokes can be applied at meaningful locations. Motion affordance maps prove especially valuable in robotics applications for identifying regions that respond to interaction independent of camera motion (e.g., cups or plates) while excluding regions (e.g., sky or ground) that do not typically move under external forces. We refer to this structure as the probability of motion affordance map, denoted \( p_{\text{motion}} \), and extract it from SpelkeNet by summing, at each spatial location, the predicted flow-token distribution over all tokens representing non-zero motion, yielding a 2D heatmap of regions likely to move under external forces.
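
As a concrete sketch of this reduction (assuming per-location logits over the flow-token vocabulary and a known set of zero-motion token ids; both are assumptions about the interface, not the released code):

import torch

def probability_of_motion(flow_logits, zero_motion_ids):
    """flow_logits: (H, W, V) per-location logits over flow tokens.
    zero_motion_ids: ids of tokens whose decoded flow vector is (0, 0).
    Returns an (H, W) map of total probability mass on non-zero motion."""
    probs = torch.softmax(flow_logits, dim=-1)
    moving = torch.ones(probs.shape[-1], dtype=torch.bool)
    moving[zero_motion_ids] = False
    return probs[..., moving].sum(dim=-1)  # p_motion heatmap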

Motion affordance maps. Here we illustrate input images and their corresponding probability of motion heatmaps showing regions likely to exhibit motion under externally applied forces.

Expected displacement maps. Once we apply a virtual poke at a suitable location, the model predicts a distribution over flow tokens at every spatial location; taking the probability-weighted average of the corresponding flow vectors yields a dense 2D vector field over spatial locations. We call this the expected displacement map, denoted \( \mathbb{E}_{\text{disp}} \): an estimate of the likely flow at each location conditioned on the virtual poke. This map indicates how objects might move if interacted with, enabling robots to predict interaction outcomes without executing actions in the physical world. This proves especially valuable in robotics applications where understanding the effects of actions before physical contact is crucial.
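
A minimal sketch of this expectation, again assuming per-location logits and a codebook that decodes each flow-token id to a (dx, dy) vector:

import torch

def expected_displacement(flow_logits, codebook):
    """flow_logits: (H, W, V) logits over flow tokens after a virtual poke.
    codebook: (V, 2) float tensor of flow vectors decoded from token ids.
    Returns the (H, W, 2) probability-weighted average flow field."""
    probs = torch.softmax(flow_logits, dim=-1)           # (H, W, V)
    return torch.einsum('hwv,vc->hwc', probs, codebook)  # E_disp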

Expected displacement maps. Dense vector fields that represent how each pixel in a scene would move in response to a virtual poke applied at a specific location.

Statistical counterfactual probing for Spelke object discovery

Using these two extracted structures, we first sample a location that is likely to move from \( p_{\text{motion}} \) and apply several virtual pokes at this location in order to identify regions that consistently move together. We then compute the expected motion correlation by averaging, across pokes, the dot product between each poke vector and the associated \( \mathbb{E}_{\text{disp}} \). Finally, applying Otsu's threshold to the averaged dot-product map yields the desired Spelke segment.
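
Putting the pieces together, here is a hedged sketch of the probing loop. `run_poke` is a caller-supplied stand-in that conditions SpelkeNet on a virtual poke at a location and returns the expected displacement map (as in the sketch above); the eight unit-direction pokes are an illustrative choice, not a stated detail of the method.

import math
import torch
from skimage.filters import threshold_otsu

def spelke_segment(run_poke, p_motion, num_dirs=8):
    """Discover one Spelke segment via statistical counterfactual probing."""
    # 1. Sample a likely-to-move poke location from p_motion.
    idx = torch.multinomial(p_motion.flatten(), 1).item()
    loc = (idx // p_motion.shape[1], idx % p_motion.shape[1])

    # 2. Average, over poke directions, the dot product between the poke
    #    vector and the resulting expected displacement map.
    corr = torch.zeros_like(p_motion)
    for k in range(num_dirs):
        angle = 2 * math.pi * k / num_dirs
        poke_vec = torch.tensor([math.cos(angle), math.sin(angle)])
        e_disp = run_poke(loc, poke_vec)  # (H, W, 2) expected displacement
        corr += torch.einsum('hwc,c->hw', e_disp, poke_vec)
    corr /= num_dirs

    # 3. Otsu's threshold on the averaged correlation map gives the segment.
    return (corr > threshold_otsu(corr.numpy())).numpy()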

Spelke object discovery algorithm. Our approach can discover multiple objects within a scene and produces segments that align more closely with the Spelke concept than SAM segments do.

SpelkeNet achieves state-of-the-art performance on SpelkeBench. To evaluate our model's ability to discover Spelke objects, we formalize the task as point-prompted segmentation: given a single point on an object, the goal is to predict the Spelke segment associated with that point. We find that SpelkeNet outperforms both self-supervised baselines (DINO, CWM) and the supervised SAM2 model on SpelkeBench.

Metric   SAM2     DINOv1-B/8   DINOv2-L/14   DINOv2-G/14   CWM      SpelkeNet
AR       0.4816   0.2708       0.2524        0.2254        0.3271   0.5411
mIoU     0.6225   0.4990       0.4931        0.4553        0.4807   0.6811
Evaluation of point-prompted segmentation accuracy. We report Average Recall (AR) and mean Intersection over Union (mIoU) for various segmentation methods.
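
For reference, a minimal sketch of how these two metrics can be computed from predicted and ground-truth masks; the COCO-style 0.50:0.05:0.95 threshold grid for AR is our assumption, not a detail stated above:

import numpy as np

def iou(pred, gt):
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 0.0

def benchmark(preds, gts, thresholds=np.arange(0.5, 1.0, 0.05)):
    """mIoU: mean IoU over examples. AR: recall averaged over IoU thresholds."""
    ious = np.array([iou(p, g) for p, g in zip(preds, gts)])
    ar = np.mean([(ious >= t).mean() for t in thresholds])
    return ar, ious.mean()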

Using Spelke segments for physically plausible object manipulation

We consider the task of object-centric scene editing, where a user clicks a point on an object and provides an edit prompt specifying a 2D or 3D transformation. A segmentation model generates an object edit mask from the clicked point, and an object manipulation method then produces an edited image by applying the specified transformation to the masked region. We show that SpelkeNet segments enable more physically plausible object editing than existing segmentation methods like SAM, which often split up or combine objects in ways that are inconsistent with how they move.
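
The pipeline reduces to simple glue code; the callables below are placeholders for any point-prompted segmenter (e.g., SpelkeNet) and any mask-conditioned editor (e.g., LRAS-3D), not released APIs:

def edit_image(image, click_xy, transform, segment_fn, edit_fn):
    """Point-prompted editing: segment the clicked object, then apply the
    user-specified 2D/3D transform to the masked region."""
    mask = segment_fn(image, click_xy)      # e.g., a SpelkeNet point-prompted segment
    return edit_fn(image, mask, transform)  # e.g., an LRAS-3D object manipulation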

Comparison of SpelkeNet and SAM in an object manipulation pipeline. To perform object manipulation, we provide a segment map and a 3D edit prompt to LRAS-3D, a state-of-the-art object manipulation model. Here, we show that SpelkeNet segments yield more physically plausible edits than SAM edits.

Across various object manipulation pipelines, Spelke segments yield more physically plausible object edits. To evaluate the utility of SpelkeNet segments for object manipulation, we replace SAM with SpelkeNet in the pipelines of existing object editing models. Specifically, we evaluate on 3DEditBench, a recently introduced benchmark for physically plausible object manipulation. As the table below shows, SpelkeNet segments significantly improve the performance of each object manipulation method we test.

Method               Segment     MSE ↓   PSNR ↑   LPIPS ↓   SSIM ↑   EA ↑
LRAS-3D              SpelkeNet   0.009   21.64    0.213     0.698    0.776
LRAS-3D              SAM         0.013   20.17    0.255     0.685    0.633
LightningDrag        SpelkeNet   0.017   19.16    0.195     0.672    0.679
LightningDrag        SAM         0.020   18.18    0.241     0.658    0.536
Diffusion Handles    SpelkeNet   0.024   17.42    0.364     0.555    0.576
Diffusion Handles    SAM         0.031   16.15    0.419     0.526    0.495
Diffusion as Shader  SpelkeNet   0.015   19.29    0.194     0.707    0.640
Diffusion as Shader  SAM         0.019   18.20    0.253     0.682    0.503
Evaluation of edit quality across segmentation methods and editing pipelines. We report results for edits generated using SAM versus SpelkeNet segments across four editing methods. Lower ↓ is better, higher ↑ is better.

Emergent properties of SpelkeNet

We have demonstrated SpelkeNet's utility in discovering Spelke segments and shown how these segments enable more realistic scene edits in object manipulation pipelines. Beyond these applications, SpelkeNet also exhibits emergent properties that reveal a deeper understanding of physical scene structure.

Material Understanding
Material property understanding. The generated \( p_{\text{motion}} \) maps can be used to infer physical attributes such as rigidity or material type. Rigid objects like laptops and cardboard boxes tend to exhibit a uniform probability across the segment, while deformable objects such as cloth and plastic covers often show more localized motion responses near the poke point.
Support Relationships
Support relationships understanding. When applying a virtual poke to an object within a stack (e.g., the saucer), the extracted Spelke segment includes both the poked object and all the objects it physically supports, implying an understanding of support hierarchies within a scene.
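
One illustrative way to operationalize the rigidity observation above (a heuristic of ours, not a metric from the results): score how uniform \( p_{\text{motion}} \) is within a discovered segment.

import torch

def rigidity_score(p_motion, segment_mask):
    """Heuristic sketch: rigid objects tend to show near-uniform p_motion over
    their segment, so lower variance maps to a higher rigidity score."""
    vals = p_motion[segment_mask]  # p_motion values inside the segment
    return 1.0 / (1.0 + vals.var(unbiased=False).item())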

BibTeX

@misc{venkatesh2025discoveringandusingsegments,
  title        = {Discovering and using Spelke segments}, 
  author       = {Rahul Venkatesh and Klemen Kotar and Lilian Naing Chen and Seungwoo Kim and Luca Thomas Wheeler and Jared Watrous and Ashley Xu and Gia Ancone and Wanhee Lee and Honglin Chen and Daniel Bear and Stefan Stojanov and Daniel Yamins},
  year         = {2025},
  eprint       = {TODO},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url          = {TODO}, 
}