Abstract

This project studies how vision-language-action models can remain reliable when robot manipulation scenes are cluttered, visually ambiguous, or missing the target object. OBEYED-VLA adds object-centric and geometry-grounded views so that the policy can focus on instruction-relevant entities and spatial evidence.

The method combines multi-view perception, task-conditioned object grounding, and masked-depth information before action decoding, which improves robustness over purely end-to-end vision-language policies.
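To make the masked-depth idea concrete, the following is a minimal sketch, not the OBEYED-VLA implementation. It assumes that grounding yields one boolean mask per camera view and that depth outside the mask is replaced by a fill value before being passed to the action decoder. The names `mask_depth`, `build_policy_inputs`, `toy_grounder`, and the `target_found` flag are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

Grid = List[List[float]]
Mask = List[List[bool]]
# Hypothetical grounder: (view name, instruction) -> boolean mask for that view.
Grounder = Callable[[str, str], Mask]


def mask_depth(depth: Grid, mask: Mask, fill: float = 0.0) -> Grid:
    """Keep depth values where the mask is True; replace the rest with `fill`."""
    if len(depth) != len(mask) or any(len(d) != len(m) for d, m in zip(depth, mask)):
        raise ValueError("depth and mask must have the same shape")
    return [
        [d if m else fill for d, m in zip(d_row, m_row)]
        for d_row, m_row in zip(depth, mask)
    ]


def build_policy_inputs(
    depth_views: Dict[str, Grid], instruction: str, ground: Grounder
) -> dict:
    """Assemble per-view masked depth plus a flag for whether any view grounded the target."""
    masked = {}
    target_found = False
    for name, depth in depth_views.items():
        mask = ground(name, instruction)  # task-conditioned grounding (assumed interface)
        masked[name] = mask_depth(depth, mask)
        target_found = target_found or any(any(row) for row in mask)
    return {
        "instruction": instruction,
        "masked_depth": masked,
        "target_found": target_found,
    }


def toy_grounder(view: str, instruction: str) -> Mask:
    """Stand-in grounder: the target is visible in one pixel of the front view only."""
    if view == "front":
        return [[False, True], [False, False]]
    return [[False, False], [False, False]]


views = {
    "front": [[1.0, 0.5], [2.0, 3.0]],
    "wrist": [[0.4, 0.4], [0.4, 0.4]],
}
inputs = build_policy_inputs(views, "pick up the red cup", toy_grounder)
```

In this sketch, a `target_found` value of `False` would correspond to the missing-target case mentioned above, which a downstream policy could use to withhold an action. How the actual system represents or handles that case is not specified here.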