VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought


NeurIPS Spotlight Paper

Carnegie Mellon University · Google DeepMind




Abstract

Large-scale generative language and vision-language models (LLMs and VLMs) excel in few-shot in-context learning for decision making and instruction following. However, they require high-quality exemplar demonstrations to be included in their context window. In this work, we ask: Can LLMs and VLMs generate their own examples from generic, sub-optimal demonstrations? We propose In-Context Abstraction Learning (ICAL), a method that builds a memory of multimodal experience from sub-optimal demonstrations and human feedback. Given a task demonstration that may contain inefficiencies or mistakes, a VLM abstracts the trajectory into a generalized program of thoughts by correcting inefficient actions and annotating cognitive abstractions: causal relationships, object state changes, temporal subgoals, and task-relevant visual elements. These programs of thought are iteratively improved and adapted through human feedback while the agent attempts to execute the trajectory in a similar environment. The resulting examples, when used as exemplars in the prompt, significantly improve decision-making in retrieval-augmented LLM and VLM agents. Moreover, as the agent's library of examples grows, it becomes more efficient, relying less on human feedback and requiring fewer environment interactions per demonstration. Our ICAL agent surpasses the state-of-the-art in dialogue-based instruction following in TEACh, multimodal web agents in VisualWebArena, and action anticipation in Ego4D. In TEACh, we achieve a 12.6% improvement in goal-condition success. In VisualWebArena, our task success rate improves over the SOTA from 14.3% to 22.7% using GPT-4V. In Ego4D action forecasting, we improve over few-shot GPT-4V and remain competitive with supervised models. We show that finetuning our retrieval-augmented in-context agent yields additional improvements. Our approach significantly reduces reliance on manual prompt engineering and consistently outperforms in-context learning from action plans that lack such programs of thought.


ICAL Method

In-Context Abstraction Learning (ICAL) aims to automate the acquisition of generalizable examples and knowledge for in-context agents. ICAL receives a language instruction \(I\) together with a noisy trajectory of observations and actions, denoted \(\xi_{noisy} = \{o_0, a_0, \ldots, o_T, a_T\}\), in a new task domain \(D\).

A new domain \(D\) represents changes in task variables not captured during VLM pretraining, such as a different environment (e.g., kitchen #1 vs. kitchen #2), task (e.g., "add the cheapest red bike to my wish list"), or user preference (e.g., "I prefer the red cup for coffee"). The core aim of ICAL is to abstract each noisy trajectory into a single example \(e\), which then forms part of a memory set \(M\). Each example \(e \in M\) pairs an optimized trajectory \(\xi_{optimized}\) with generalizable language abstractions \(L\). The objective is for \(M\) to collectively contain examples that, when used in a VLM's context window, increase the likelihood of successful task execution in the new domain, while also containing knowledge that transfers across similar tasks and contexts. This can be expressed as:

\[ \max_{M} \; \mathbb{E}[R \mid M, I, o_t, D] \]

where \(R\) is the return, i.e., the cumulative reward acquired by performing actions given the instruction \(I\), the observation \(o_t\), and the in-context example memory set \(M\).

ICAL learns effectively by generating programs of thought with four types of abstractions (an illustrative sketch follows this list):
  1. Task and Causal Abstractions: identify the essential principles and actions needed to achieve the goal, explaining how elements interconnect through cause and effect.
  2. State Changes: annotate the predicted object state changes over the demonstration (e.g., a bowl becoming clean), capturing the environment dynamics that matter for decision-making.
  3. Task Decomposition and Subgoals: break the task into steps and subgoals, with natural language annotations that detail each step and summarize the actions taken.
  4. State Abstraction: keep only the state variables interacted with during the demonstration and suggest additional relevant ones, avoiding overwhelming detail.
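
To make these abstraction types concrete, here is a minimal, hypothetical sketch of how a single abstracted example might be represented. The task, object names, field names, and all skills other than go_to and pickup are illustrative assumptions rather than ICAL's exact format.

# Hypothetical representation of one ICAL example; field names, the task,
# and most skill names are illustrative assumptions.
ical_example = {
    "instruction": "Make a cup of coffee and place it on the table.",
    # Task and causal abstractions: why actions matter, not just what they are.
    "causal_abstractions": [
        "The mug must be clean before brewing; filling a dirty mug does not "
        "satisfy the goal condition.",
    ],
    # Predicted object state changes along the corrected trajectory.
    "state_changes": [
        "mug: dirty -> clean (after rinsing in the sink)",
        "mug: empty -> filled_with_coffee (after running the coffee machine)",
    ],
    # Task decomposition into subgoals, each tied to a span of corrected actions.
    "subgoals": [
        {"subgoal": "clean the mug",
         "actions": ["go_to(mug)", "pickup(mug)", "go_to(sink)", "rinse(mug)"]},
        {"subgoal": "brew the coffee",
         "actions": ["go_to(coffee_machine)", "place(mug)", "toggle_on(coffee_machine)"]},
    ],
    # State abstraction: only task-relevant objects and attributes are kept.
    "relevant_state": ["mug (dirty)", "sink", "coffee_machine (off)", "table"],
}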

ICAL processes noisy demonstrations in two phases:

(1) the abstraction phase, where a VLM is prompted to correct errors and generate language and visual abstractions: \[ F_{abstract}: (\xi_{noisy}, I, \{e^1, \ldots, e^k\}) \rightarrow (\xi_{optimized}, L) \] and (2) the human-in-the-loop phase, where the abstractions are refined while the agent executes the trajectory in a similar environment, guided by human feedback \(H(a_t, o_t)\). This update, \(\Xi_{update}\), can be written as: \[ \Xi_{update}: (\xi_{optimized}, H(a_t, o_t), L, I, \{e^1, \ldots, e^k\}) \rightarrow (\xi'_{optimized}, L') \] Successful trajectories are stored in the memory set \(M\) to assist the agent in responding to new instructions and environments.
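
A minimal sketch of this two-phase loop is shown below, under stated assumptions: vlm_abstract, execute, and retrieve_examples are hypothetical caller-supplied functions standing in for the VLM abstraction call, the environment rollout with human feedback, and example retrieval; none of these names come from the paper.

def ical_learn(noisy_demos, memory, vlm_abstract, execute, retrieve_examples, max_retries=3):
    # Sketch of the ICAL learning loop (illustrative, not the paper's code).
    for instruction, xi_noisy in noisy_demos:
        # Phase 1: abstraction. The VLM corrects the noisy trajectory and
        # annotates causal abstractions, state changes, subgoals, and
        # task-relevant state, conditioned on previously learned examples.
        exemplars = retrieve_examples(memory, instruction, k=5)
        xi_opt, abstractions = vlm_abstract(xi_noisy, instruction, exemplars)

        # Phase 2: human-in-the-loop refinement. Execute the optimized
        # trajectory; on failure, revise it using human feedback H(a_t, o_t).
        for _ in range(max_retries):
            success, feedback = execute(xi_opt, instruction)
            if success:
                memory.append((instruction, xi_opt, abstractions))
                break
            xi_opt, abstractions = vlm_abstract(
                xi_noisy, instruction, exemplars,
                prior=(xi_opt, abstractions), feedback=feedback,
            )
    return memory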



Retrieval Augmented Generation using Abstracted Examples

Once the ICAL examples have been learned, the agent is deployed to new tasks and environments using retrieval-augmented generation.

Given the learned example set \(M\) and a new instruction \(I\), we prompt the VLM to carry out the instruction by producing an action sequence \(\{a_0, ..., a_T\}\), where each action is drawn from an action API describing the skill set \(A\) (e.g., go_to(X), pickup(X)). The top \(K\) examples from \(M\) are retrieved and included in the prompt based on their textual and visual similarity to the current scene.

The aggregated similarity score \(s\) for each example \(e\) reads:

\(s = \lambda_{I} \cdot s^I + \lambda_{\text{textual}} \cdot s^{\text{textual}} + \lambda_{\text{visual}} \cdot s^{\text{visual}}\)

where \(s^I\), \(s^{\text{textual}}\), and \(s^{\text{visual}}\) are the similarity scores for the input text instruction, the textual state, and the visual state, respectively, computed via cosine similarity using embeddings from OpenAI's text-embedding-ada-002 model and the CLIP ViT-B/32 model. The coefficients \(\lambda_{I}\), \(\lambda_{\text{textual}}\), and \(\lambda_{\text{visual}}\) are weighting hyperparameters chosen per domain on a held-out validation set.
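
Below is a minimal sketch of this retrieval step, assuming the instruction, textual-state, and visual-state embeddings have already been computed for the query and for each stored example (e.g., with text-embedding-ada-002 for text and CLIP ViT-B/32 for images). The function and key names are illustrative, not from the paper.

import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_top_k(query, memory, k=5, lam_I=1.0, lam_text=1.0, lam_vis=1.0):
    # Rank stored ICAL examples by the aggregated similarity score
    # s = lam_I * s^I + lam_text * s^textual + lam_vis * s^visual.
    # `query` and each example in `memory` are dicts holding precomputed
    # embeddings under the (illustrative) keys used below.
    scored = []
    for ex in memory:
        s = (lam_I * cosine(query["instruction_emb"], ex["instruction_emb"])
             + lam_text * cosine(query["text_state_emb"], ex["text_state_emb"])
             + lam_vis * cosine(query["visual_emb"], ex["visual_emb"]))
        scored.append((s, ex))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in scored[:k]]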

The VLM prompt contains the new instruction \(I\), the current webpage image for VisualWebArena or 12 video frames annotated with Set-of-Marks for Ego4D, a textual state description \(x_t\) (objects and their attributes for embodied agents, HTML elements for web agents), the action API \(A\), and the retrieved set of in-context examples \(\{e^1, \ldots, e^k\} \subset M\).
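
For concreteness, here is a minimal sketch of how such a prompt could be assembled from these pieces. The template and helper name are illustrative assumptions rather than ICAL's exact prompt; images are passed to the VLM separately as visual inputs.

def build_prompt(instruction, textual_state, action_api, examples):
    # Assemble the text portion of the VLM prompt; the webpage screenshot or
    # Set-of-Marks-annotated video frames are attached as image inputs.
    example_blocks = "\n\n".join(
        f"Example {i + 1}:\n{ex}" for i, ex in enumerate(examples)
    )
    return (
        f"Available actions:\n{action_api}\n\n"
        f"Retrieved examples:\n{example_blocks}\n\n"
        f"Current state:\n{textual_state}\n\n"
        f"Instruction: {instruction}\n"
        f"Produce the next actions as calls to the action API."
    )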


Results

TEACh: Household Instruction Following

Below are results on the TEACh household instruction following validation unseen split. ICAL examples significantly improve on the state-of-the-art by 12.6% in goal-condition success, outperforming agents that use raw visual demonstrations as in-context examples without abstraction learning. The left plot shows goal-condition success and the right plot shows success rate.


VisualWebArena: Autonomous Visual Web Agents

ICAL outperforms the previous state-of-the-art GPT-4V + SoM agent (Koh et al., 2024) on the VisualWebArena benchmark, improving the average success rate from 14.3% to 22.7%. The baseline uses GPT-4V with few-shot hand-designed examples and Set-of-Marks image prompting.


Ego4D: Video Action Forecasting

ICAL demonstrates superior performance on Ego4D action anticipation compared to few-shot GPT-4V with hand-written chain-of-thought examples. ICAL also remains competitive with the fully supervised baseline (Grauman et al., 2022) despite using 639x less training data.



Additional Takeaways

Continual Learning

ICAL shows continual improvement in TEACh unseen success rate as more examples are learned. Our method benefits from even a small number of learned examples, improving absolute success rate by 14.7% over chain-of-thought prompting with just 10 learned examples.

*values interpolated for visualization


LoRA finetuning on ICAL examples complements retrieval

The largest gain comes from combining LoRA fine-tuning with retrieval-augmented generation, reaching a success rate of 35.4% and a sub-task score of 55.9%. This demonstrates that consolidating the learned examples through fine-tuning, particularly when integrated with retrieval-based in-context learning, improves performance.


What's Important: Abstractions or Human-In-The-Loop?

Our ablations reveal that both the offline abstraction phase and the human-in-the-loop refinement are important for performance.



ICAL Abstraction Examples

Below we show the raw trajectories obtained from human demonstrations on the left and the ICAL-abstracted examples on the right. Click a task tab to show the example before and after applying ICAL; scroll to view the full abstractions.

TEACh: Household Instruction Following


(Interactive viewer: agent video, raw demonstration, and ICAL-abstracted example for each TEACh task.)



VisualWebArena: Autonomous Visual Web Agents


(Interactive viewer: webpage image, raw demonstration, and ICAL-abstracted example for each VisualWebArena task.)




Ego4D: Video Action Forecasting


(Interactive viewer: video, raw demonstration, and ICAL-abstracted example for each Ego4D clip.)

ICAL Video Demos


Instruction: Find me powder to make the beverage that is the same as the picture:


Instruction: What is the email of the seller of the red pallete on this page?

Human-in-the-Loop Demo

Instruction: Can you search for "Cheerios", and add the family sized blue Cheerios cereal to my cart and order it only if the total comes out to less than $43?

Instruction: <Driver> hello. <Driver> task please. <Commander> We have a lot to do! Hello! <Commander> We need to wash a mug and fill it with coffee. <Driver> ok. <Commander> The mug is on the island on a plate. <Commander> Great. Now take it to the sink to clean it. <Commander> good work. <Commander> Now we need to add the coffee. <Driver> done. <Commander> Good job! <Driver> next please. <Commander> We need to find a knife and the bread. <Driver> have knife where is bread? <Commander> The bread is in the fridge. <Commander> We need two slices of bread toasted. <Driver> done. <Driver> ok. <Driver> done. <Commander> Grab the knife again. <Commander> We need to slice the tomato and lettuce. <Commander> The tomato and lettuce need to be on the plate with the bread. <Commander> Good work! Have a great day! <Driver> done. <Driver> thank.

Instruction: <Commander> boil whole potatoes in water. <Driver> hello. <Commander> potato. <Commander> hi. <Commander> it's in the lower drawer to the left of the cooking stove. <Driver> the knife? <Commander> don't cut. boil it whole. <Driver> where do I boil it? <Commander> in the pot on the stove. <Commander> awesome. task complete.




See our paper for more!

Citation

@inproceedings{sarch2024ical,
  title     = "VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought",
  author    = "Sarch, Gabriel and Jang, Lawrence and Tarr, Michael and Cohen, William and Marino, Kenneth and Fragkiadaki, Katerina",
  booktitle = "Advances in Neural Information Processing Systems (NeurIPS)",
  year      = "2024"
}