Large-scale generative language and vision-language models (LLMs and VLMs) excel in few-shot in-context learning for decision making and instruction following. However, they require high-quality exemplar demonstrations to be included in their context window. In this work, we ask: Can LLMs and VLMs generate their own prompt examples from generic, sub-optimal demonstrations? We propose In-Context Abstraction Learning (ICAL), a method that builds a memory of multimodal experience insights from sub-optimal demonstrations and human feedback. Given a noisy demonstration in a new domain, VLMs abstract the trajectory into a general program by fixing inefficient actions and annotating cognitive abstractions: task relationships, object state changes, temporal subgoals, and task construals. These abstractions are refined and adapted interactively through human feedback while the agent attempts to execute the trajectory in a similar environment. The resulting abstractions, when used as exemplars in the prompt, significantly improve decision-making in retrieval-augmented LLM and VLM agents. Our ICAL agent surpasses the state-of-the-art in dialogue-based instruction following in TEACh, multimodal web agents in VisualWebArena, and action anticipation in Ego4D. In TEACh, we achieve a 12.6% improvement in goal-condition success. In VisualWebArena, our task success rate improves over the SOTA from 14.3% to 22.7%. In Ego4D action forecasting, we improve over few-shot GPT-4V and remain competitive with supervised models. We show finetuning our retrieval-augmented in-context agent yields additional improvements. Our approach significantly reduces reliance on expert-crafted examples and consistently outperforms in-context learning from action plans that lack such insights.
In-Context Abstraction Learning (ICAL) aims at automating the acquisition of generalizable examples and knowledge for in-context agents. ICAL operates by receiving a language instruction I with a noisy trajectory of observations and actions, denoted ξnoisy = {o0, a0, …, oT, aT} in a new task domain D.
A new domain D represents changes in task variables not captured in VLM pretraining, such as a different environment (e.g., kitchen #1 vs. kitchen #2), task (e.g., "add the cheapest red bike to my wish list"), or user preference (e.g., "I prefer the red cup for coffee"). The core aim of
\[ \max_{E} \mathbb{E}[R | e, I, o_t, D] \]
where R is the return or the cumulative reward acquired by performing actions based on the instruction I, observation ot, and in-context example memory set M.
ICAL learns effectively by generating 4 types of abstractions:ICAL processes noisy demonstrations in two phases:
(1) the abstraction phase where a VLM is prompted to correct errors and generate language and visual abstractions: \[ F_{abstract}: (\xi_{noisy}, I, \{e^1, \ldots, e^k\}) \rightarrow (\xi_{optimized}, L) \] (2) the human-in-the-loop phase, where the abstractions are refined guided by human feedback \(H(a_t, o_t)\). The model update can be represented as \(\Xi_{\text{update}}\): \[ \Xi_{update}(\xi_{optimized}, H(a_t, o_t), L, I, \{e^1, ..., e^k\}) \rightarrow \xi'_{optimized}, L' \] Successful trajectories are stored in a repository to assist the agent in learning and responding to new instructions and environments.
After the ICAL examples have been learned, ICAL is deployed for new tasks and environments using retrieval-augmented generation.
Given the learned example set M and a new instruction \(I\), we prompt the VLM to carry out the instruction by producing action sequences \(\{a_0, ..., a_T\} \in A\) from an action API that describes the skills set \(A\) (e.g., go_to(X)
, pickup(X)
), by retrieving the top \(K\) examples from M to include in the prompt based on their textual and visual similarity with the current scene.
The aggregated similarity score \(s\) for each example \(e\) reads:
\(s = \lambda_{I} \cdot s^I + \lambda_{\text{textual}} \cdot s^{\text{textual}} + \lambda_{\text{visual}} \cdot s^{\text{visual}}\)
where \(s^I\), \(s^{\text{textual}}\), and \(s^{\text{visual}}\) are the similarity scores for the input text instruction, textual state, and visual state, respectively, computed via cosine similarity using embeddings from OpenAI's text-embedding-ada-002
model and CLIP ViT-B/32
model. The coefficients \(\lambda_{I}\), \(\lambda_{\text{textual}}\), and \(\lambda_{\text{visual}}\) are weighting hyperparameters chosen in each domain by a held out validation set.
The VLM prompt contains the new instruction \(I\), the current webpage image for VisualWebArena or 12 video frames for Ego4D annotated with Set-of-Marks, a textual state description \(x_t\) describing the objects and their attributes for embodied agents and HTML elements for web agents, the action API \(A\), and the retrieved set of in-context examples \(\{e^1,...,e^k\} \in \textit{M}\).
Below are results on the TEACh household instruction following validation unseen dataset. ICAL examples significantly improves on the state-of-the-art by 12.6% in goal-condition success, outperforming agents that use the raw visual demonstrations as in context examples without abstraction learning. The left plot displays Goal Condition Success and the right plot displays Success Rate.
ICAL outperforms the previous state-of-the-art GPT4V + SoM (Koh et. Al., 2023) on the VisualWebArena benchmark, improving from 14.3% to 22.7% in average success rate. The baseline utilizes GPT4V with few-shot hand-designed examples and set of marks image prompting.
ICAL demonstrates superior performance on Ego4D action anticipation compared to hand-written few-shot GPT4V that uses chain of thought. ICAL also remains competitive with the fully supervised baseline (grauman et. Al., 2022) despite using 639x less training data.
ICAL shows continual improvements in TEACh unseen success rate with more examples learned. Our method benefits from even a small amount of examples learned, with an improvement of an absolute 14.7% success rate over chain-of-thought prompting with just 10 examples learned.
*values interpolated for visualization
The most notable advancement was achieved when LoRA fine-tuning was combined with retrieval-augmented generation, achieving a Success rate of 35.4% and a sub-task score of 55.9%. This demonstrate that consolidating the learned examples through fine-tuning, particularly when integrated with retrieval processes through in-context learning, improves performance.
Our ablations reveal the importance of both the offline and human-in-the-loop abstraction making for performance.
Below we show examples of the raw trajectories obtained from human demonstrations, and the ICAL abstracted examples on the right. Click on a task tab to show the example before and after applying ICAL. You can scroll to view the full abstractions.
Instruction: Find me powder to make the beverage that is the same as the picture:
Instruction: What is the email of the seller of the red pallete on this page?
Instruction: Can you search for "Cheerios", and add the family sized blue Cheerios cereal to my cart and order it only if the total comes out to less than $43?
@inproceedings{sarch2024ical,
title = "ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights",
author = "Sarch, Gabriel and
Jang, Lawrence and
Tarr, Michael and
Cohen, William and
Marino, Kenneth and
Fragkiadaki, Katerina",
booktitle = "",
year = "2024"}