Fan Zhang and Michael Gienger, "Affordance-based Robot Manipulation with Flow Matching", arXiv, 2024.
Abstract. We present a framework for assistive robot manipulation that addresses two fundamental challenges: first, efficiently adapting large-scale vision-language models to downstream scene affordance understanding tasks, especially in daily living scenarios where gathering multi-task data involving humans requires strenuous effort; second, effectively learning robot trajectories by grounding them in the visual affordance model. We tackle the first challenge with a parameter-efficient prompt tuning method that prepends learnable text prompts to the frozen vision model to predict manipulation affordances in multi-task scenarios. We then propose to learn robot trajectories guided by affordance in a single supervised policy using Flow Matching, which can handle multimodal robot action distributions in high-dimensional spaces. Finally, we introduce a real-world dataset with 10 tasks across Activities of Daily Living to test our unified framework for efficient affordance and policy learning. Our extensive evaluation shows that the proposed prompt tuning method for learning manipulation affordance with a language prompter achieves competitive performance and even outperforms other finetuning protocols across data scales, while remaining parameter efficient. Supervised learning of robot trajectories using Flow Matching also yields consistently better performance than alternative behaviour cloning methods in both training and generalization.
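To make the Flow Matching idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the commonly used straight-line path between a Gaussian noise sample and a demonstrated action, and it omits the affordance conditioning and the neural network entirely. All function names (`interpolate`, `flow_matching_loss`, `euler_sample`, `make_oracle_velocity`) are hypothetical, and the "oracle" velocity field is only a stand-in for a trained policy so that the example can run without a learning library.

```python
import random

def interpolate(x0, x1, t):
    """Point at time t on the straight line from noise x0 (t=0) to action x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the straight-line path: a constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at one (x0, x1, t)."""
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t)
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)

def euler_sample(velocity_fn, x0, steps=10):
    """Generate an action by integrating dx/dt = v(x, t) from t=0 to t=1 (Euler)."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # always < 1, so the oracle below never divides by zero
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def make_oracle_velocity(x1):
    """Assumption: stand-in for a trained network, exact only for a single target x1."""
    def velocity_fn(x, t):
        return [(b - xi) / (1.0 - t) for xi, b in zip(x, x1)]
    return velocity_fn

# Illustrative usage with a made-up 3-dimensional action.
rng = random.Random(0)
demo_action = [0.3, -0.2, 0.5]
noise = [rng.gauss(0.0, 1.0) for _ in demo_action]
oracle = make_oracle_velocity(demo_action)
loss = flow_matching_loss(oracle, noise, demo_action, t=0.4)
sample = euler_sample(oracle, noise, steps=10)
```

In an actual policy, `velocity_fn` would be a learned model that also takes observations (and, in this framework, affordance information) as input, and the loss would be averaged over many sampled noise vectors, demonstrations, and time values.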