Distilling Internet-Scale Vision-Language Models into Embodied Agents

Theodore Sumers, Kenneth Marino, Arun Ahuja, Rob Fergus, Ishita Dasgupta

Research output: Contribution to journalConference articlepeer-review


Instruction-following agents must ground language into their observation and action spaces. Yet learning to ground language is challenging, typically requiring domain-specific engineering or large quantities of human interaction data. To address this challenge, we propose using pretrained vision-language models (VLMs) to supervise embodied agents. We combine ideas from model distillation and hindsight experience replay (HER), using a VLM to retroactively generate language describing the agent's behavior. Simple prompting allows us to control the supervision signal, teaching an agent to interact with novel objects based on their names (e.g., planes) or their features (e.g., colors) in a 3D rendered environment. Fewshot prompting lets us teach abstract category membership, including pre-existing categories (food vs toys) and ad-hoc ones (arbitrary preferences over objects). Our work outlines a new and effective way to use internet-scale VLMs, repurposing the generic language grounding acquired by such models to teach task-relevant groundings to embodied agents.

Original languageEnglish (US)
Pages (from-to)32797-32818
Number of pages22
JournalProceedings of Machine Learning Research
StatePublished - 2023
Event40th International Conference on Machine Learning, ICML 2023 - Honolulu, United States
Duration: Jul 23 2023Jul 29 2023

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability


Dive into the research topics of 'Distilling Internet-Scale Vision-Language Models into Embodied Agents'. Together they form a unique fingerprint.

Cite this