Recasting Generic Pretrained Vision Transformers As Object-Centric Scene Encoders For Manipulation Policies

Jan 25

Jianing Qian

Anastasios Panagopoulos

Dinesh Jayaraman


Abstract

Generic, reusable pre-trained image representation encoders have become a standard component of methods for many computer vision tasks. As visual representations for robots, however, their utility has been limited, leading to a recent wave of efforts to pre-train robotics-specific image encoders that are better suited to robotic tasks than their generic counterparts. We propose SOFT, a wrapper around pre-trained vision transformer (PVT) models that bridges this gap without any further training. Rather than constructing representations from only the final-layer activations, SOFT individuates and locates object-like entities from PVT attentions and describes them with PVT activations, producing an object-centric representation. Across standard choices of generic PVTs, we demonstrate in each case that policies trained on SOFT(PVT) far outstrip standard PVT representations for manipulation tasks in simulated and real settings, approaching state-of-the-art robotics-aware representations.
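
The following is a rough illustrative sketch of the general idea described in the abstract, not the authors' implementation. The abstract does not say how entities are individuated, so plain k-means over per-patch attention-derived features is used here as a stand-in. The function names, the `attn_feats` and `activations` inputs, the number of slots, and the token layout (location followed by a mean-pooled descriptor) are all assumptions made for illustration.

```python
import numpy as np


def kmeans_labels(x, k, iters=10):
    """Plain k-means with deterministic farthest-point initialisation.

    Stand-in grouping step (an assumption, not the paper's method).
    """
    centers = [x[0]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen center.
        d = np.min([np.sum((x - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(x[np.argmax(d)])
    centers = np.stack(centers).astype(float)

    def assign(c):
        return ((x[:, None, :] - c[None, :, :]) ** 2).sum(-1).argmin(axis=1)

    for _ in range(iters):
        labels = assign(centers)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return assign(centers)


def object_centric_tokens(attn_feats, activations, grid_hw, num_slots=4, iters=10):
    """Group patches by attention features, then describe each group.

    attn_feats:  (H*W, Da) hypothetical per-patch attention-derived features
    activations: (H*W, D)  hypothetical per-patch activations
    Returns (num_slots, 2 + D) tokens: [row, col, pooled activations],
    plus the per-patch group labels.
    """
    h, w = grid_hw
    assert attn_feats.shape[0] == activations.shape[0] == h * w
    labels = kmeans_labels(attn_feats, num_slots, iters)

    # Normalised patch-center coordinates in [0, 1].
    rows, cols = np.divmod(np.arange(h * w), w)
    coords = np.stack([(rows + 0.5) / h, (cols + 0.5) / w], axis=1)

    tokens = np.zeros((num_slots, 2 + activations.shape[1]))
    for j in range(num_slots):
        mask = labels == j
        if mask.any():
            tokens[j, :2] = coords[mask].mean(axis=0)       # "where"
            tokens[j, 2:] = activations[mask].mean(axis=0)  # "what"
    return tokens, labels
```

In this sketch, a downstream policy would consume the flattened `tokens` array in place of a single global feature vector. With a real vision transformer, the two inputs would have to be extracted from the model's attention layers and patch activations, which is outside the scope of this example.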

Publication

ICRA

Last updated on Jan 25
