Abstract

SlotVLA introduces object-relation representations for robotic manipulation. It combines a slot-based visual tokenizer with relation-centric decoding to produce compact, object-centric representations for multitask manipulation policies.
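
To make the idea of a slot-based visual tokenizer concrete, the following is a minimal sketch of generic slot-attention-style grouping, not SlotVLA's actual implementation. It assumes patch features arrive as an (N, D) array and compresses them into a handful of slot vectors. The function name `slot_tokenize`, the fixed random projections standing in for learned weights, and the direct weighted-mean slot update (in place of the learned recurrent update used in published slot attention) are all simplifying assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slot_tokenize(features, num_slots=4, iters=3, seed=0):
    """Group N patch features of width D into num_slots slot vectors.

    Hypothetical helper for illustration only. Returns the slots with
    shape (num_slots, D) and the attention map with shape (N, num_slots).
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape

    # Assumption: random projections stand in for learned q/k/v weights.
    scale = d ** -0.5
    w_q = rng.normal(scale=scale, size=(d, d))
    w_k = rng.normal(scale=scale, size=(d, d))
    w_v = rng.normal(scale=scale, size=(d, d))

    slots = rng.normal(size=(num_slots, d))
    k = features @ w_k
    v = features @ w_v

    for _ in range(iters):
        q = slots @ w_q
        logits = (k @ q.T) / np.sqrt(d)          # (N, num_slots)
        # Softmax over the slot axis: slots compete for each patch.
        attn = softmax(logits, axis=1)
        # Normalise per slot so each slot is a weighted mean of values.
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        # Simplified update: overwrite slots with the weighted mean.
        slots = weights.T @ v                    # (num_slots, D)

    return slots, attn


if __name__ == "__main__":
    patches = np.random.default_rng(1).normal(size=(196, 32))
    slots, attn = slot_tokenize(patches)
    print(slots.shape, attn.shape)
```

In this sketch, 196 patch features are reduced to 4 slot tokens, which illustrates why a slot-based representation is compact: a downstream policy would attend over a few slots rather than every patch.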

The project pairs the model with LIBERO+, a benchmark that provides object-centric annotations and temporal tracking and is designed to evaluate fine-grained object-relation reasoning in manipulation tasks.