SORNet

Spatial Object-Centric Representations for Sequential Manipulation


Wentao Yuan1
Chris Paxton2
Karthik Desingh1
Dieter Fox1,2
1University of Washington
2NVIDIA

Overview

Sequential manipulation tasks require a robot to constantly reason about spatial relationships among entities in the scene. Prior approaches that rely on explicit state estimation or end-to-end learning struggle to generalize to novel objects or novel tasks. We therefore propose SORNet (Spatial Object-Centric Representation Network), which enables zero-shot generalization to unseen objects on spatial reasoning tasks.

Spatial Object-Centric Network

SORNet consists of two parts. The embedding network extracts object-centric embeddings from an RGB image, conditioned on canonical views of the objects of interest. The readout network takes the embedding vectors and predicts discrete or continuous spatial relations among entities in the scene. Note that the object queries (canonical object views) can be captured in conditions different from the input image, e.g. under different lighting or from a different camera viewpoint.
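
As a rough illustration of this two-part design, the sketch below pairs a ViT-style embedding network, which appends the object queries as extra tokens alongside image patch tokens, with a readout network that classifies a relation for every ordered pair of object embeddings. Layer sizes, token dimensions, and names are illustrative assumptions, not the released implementation.

    import torch
    import torch.nn as nn


    class EmbeddingNet(nn.Module):
        """Turns an RGB frame plus object queries into one embedding per object."""

        def __init__(self, patch=16, query_size=32, dim=256, depth=4, heads=8):
            super().__init__()
            # Patch tokens from the input image (positional encodings omitted for brevity).
            self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
            # Object queries are canonical object views, flattened and projected.
            self.query_proj = nn.Linear(3 * query_size * query_size, dim)
            layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, depth)

        def forward(self, image, queries):
            # image: (B, 3, H, W); queries: (B, N, 3, query_size, query_size)
            n_obj = queries.shape[1]
            patches = self.patchify(image).flatten(2).transpose(1, 2)   # (B, P, dim)
            q_tokens = self.query_proj(queries.flatten(2))              # (B, N, dim)
            tokens = self.encoder(torch.cat([q_tokens, patches], dim=1))
            return tokens[:, :n_obj]                                    # (B, N, dim)


    class ReadoutNet(nn.Module):
        """Predicts relation logits for every ordered pair of object embeddings."""

        def __init__(self, dim=256, n_relations=4):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, n_relations))

        def forward(self, emb):
            # emb: (B, N, dim) -> logits of shape (B, N, N, n_relations)
            b, n, d = emb.shape
            a = emb.unsqueeze(2).expand(b, n, n, d)
            c = emb.unsqueeze(1).expand(b, n, n, d)
            return self.mlp(torch.cat([a, c], dim=-1))


    # Example: 5 object queries against one frame -> 5x5 pairwise relation logits.
    net, readout = EmbeddingNet(), ReadoutNet()
    img = torch.randn(1, 3, 224, 224)
    queries = torch.randn(1, 5, 3, 32, 32)   # canonical views of 5 objects
    logits = readout(net(img, queries))      # (1, 5, 5, 4)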

Downstream Tasks

The object-centric embeddings produced by SORNet enable zero-shot generalization to unseen objects on a variety of downstream tasks, including predicting spatial relationships, classifying skill preconditions, and regressing the relative direction from the end-effector to an object center.
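
As a loose sketch of how the same embeddings could serve both discrete and continuous readouts, the snippet below pairs a hypothetical skill-precondition classifier with a direction-regression head. The layer sizes and head structure are assumptions for illustration, not the released implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    dim = 256  # assumed embedding size, matching the sketch above

    # Discrete readout: does a skill's precondition hold for this object pair?
    precondition_head = nn.Sequential(
        nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    # Continuous readout: unit 3D direction from the end-effector to an object.
    direction_head = nn.Sequential(
        nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    ee_emb, obj_emb = torch.randn(1, dim), torch.randn(1, dim)  # placeholder embeddings
    pair = torch.cat([ee_emb, obj_emb], dim=-1)
    precondition_logits = precondition_head(pair)               # (1, 2)
    direction = F.normalize(direction_head(pair), dim=-1)       # (1, 3) unit vector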

Spatial Relationship Prediction on CLEVR-CoGenT

Skill Precondition Classification in a real-world tabletop manipulation scene

Visual Servoing using the predicted 3D direction from the end-effector to the object center
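
A minimal closed-loop servoing sketch following the predicted direction is given below. Here get_rgb, get_queries, predict_direction, move_ee_delta, and reached are placeholder interfaces, not part of the SORNet release.

    import numpy as np

    def servo_to_object(get_rgb, get_queries, predict_direction, move_ee_delta,
                        reached, step=0.01, max_iters=200):
        """Step the end-effector toward a target object using predicted unit directions."""
        for _ in range(max_iters):
            if reached():  # e.g. a contact check or a precondition classifier
                return True
            # Unit 3D vector from the end-effector toward the object center.
            direction = np.asarray(predict_direction(get_rgb(), get_queries()))
            move_ee_delta(step * direction)  # send a small Cartesian displacement
        return False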

Paper & Code

Citation

@inproceedings{yuan2021sornet,
    title = {SORNet: Spatial Object-Centric Representations for Sequential Manipulation},
    author = {Wentao Yuan and Chris Paxton and Karthik Desingh and Dieter Fox},
    booktitle = {5th Annual Conference on Robot Learning},
    pages = {148--157},
    year = {2021},
    organization = {PMLR}
}