Existing definitions of segmentation may be insufficient for physical reasoning tasks
An ongoing challenge in image segmentation is defining what counts as a segment independently of a fixed set of categories. Existing segmentation datasets such as COCO and ADE20K define segments through semantic labels (e.g., car, tree, sky). Although these labels are useful for recognition tasks, the resulting masks often do not reflect how objects move or interact in the real world. This limits their utility in robotics tasks such as object manipulation, which require physical reasoning about which parts of a scene move together.