Large-scale LLMs and VLMs excel at few-shot in-context learning, but they require high-quality examples to do so. We introduce In-Context Abstraction Learning (ICAL), which iteratively refines suboptimal trajectories into high-quality data with optimized actions and detailed reasoning. Given an inefficient demonstration, a VLM corrects the actions and annotates causal relationships, object states, subgoals, and task-relevant visuals, forming "programs of thought." With human feedback, these programs are improved as the agent executes them in a similar environment. The resulting examples, used as in-context prompts or as fine-tuning data, significantly boost decision-making while reducing the need for human feedback. ICAL surpasses the state of the art in TEACh (dialogue-based instruction following), VisualWebArena (multimodal web agents), and Ego4D (egocentric video action anticipation). In TEACh, combining fine-tuning and retrieval on ICAL examples outperforms raw human demonstrations and expert examples, achieving a 17.5% increase in goal-condition success. In VisualWebArena, retrieval-augmented GPT-4V with ICAL improves task success rate 1.6x over GPT-4V, while fine-tuning Qwen2-VL achieves a 2.8x improvement. In Ego4D, ICAL outperforms few-shot GPT-4V and remains competitive with supervised models. Overall, ICAL scales 2x better than raw human demonstrations and reduces manual prompt engineering.
In-Context Abstraction Learning (ICAL) aims to automate the acquisition of generalizable examples and knowledge for in-context agents. ICAL takes as input a language instruction \(I\) and a noisy trajectory of observations and actions, denoted \(\xi_{noisy} = \{o_0, a_0, \ldots, o_T, a_T\}\), in a new task domain \(D\).
A new domain \(D\) represents changes in task variables not captured in VLM pretraining, such as a different environment (e.g., kitchen #1 vs. kitchen #2), task (e.g., "add the cheapest red bike to my wish list"), or user preference (e.g., "I prefer the red cup for coffee"). The core aim of ICAL is to learn an example repository \(M\) that maximizes the agent's expected return when its policy is conditioned on \(M\):
\[ \max_{M} \mathbb{E}\left[ \sum_{t=0}^T r_t(o_t, a_t) \,\Big|\, a_t \sim \pi(a \mid h_t, M, I), D \right] \]
where \(r_t(o_t, a_t)\) is the reward received at time step \(t\), and \(\pi\) is the agent's policy conditioned on the interaction history \(h_t\), the learned example repository \(M\), and the instruction \(I\), all within the new domain \(D\).
ICAL processes noisy, suboptimal, or incorrect trajectories, collected from a human or from an agent policy, in two phases:
(1) The abstraction phase, where a VLM is prompted to correct action errors and generate initial programs of thought:
\[ F_{abstract}: (\xi_{noisy}, I, \{e^1, \ldots, e^k\}) \rightarrow (\xi_{optimized}, L) \]
(2) The human-in-the-loop phase, where the actions and programs of thought are refined while the agent executes them in the environment, guided by human feedback \(H(a_t, o_t)\). The update can be written as:
\[ \Xi_{update}: (\xi_{optimized}, H(a_t, o_t), L, I, \{e^1, \ldots, e^k\}) \rightarrow (\xi'_{optimized}, L') \]
Successful trajectories are stored in a repository to assist the agent in learning and responding to new instructions and environments.
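To make the two phases concrete, here is a minimal sketch of the loop. The helpers vlm_abstract, vlm_revise, get_human_feedback, and the env/memory interfaces are illustrative assumptions, not the authors' released API.

```python
# A minimal sketch of the two-phase ICAL loop (hypothetical helper functions).

def ical_learn(noisy_trajectory, instruction, env, memory, max_revisions=10):
    # Phase 1 (abstraction): the VLM corrects action errors and writes programs of
    # thought L, conditioned on previously learned examples retrieved from memory.
    examples = memory.retrieve(instruction, k=5)
    traj, thoughts = vlm_abstract(noisy_trajectory, instruction, examples)

    # Phase 2 (human-in-the-loop): execute the optimized plan; whenever an action
    # fails or a human intervenes, ask the VLM to revise both actions and thoughts.
    t, revisions = 0, 0
    while t < len(traj.actions) and revisions <= max_revisions:
        obs, success = env.step(traj.actions[t])
        feedback = get_human_feedback(traj.actions[t], obs)  # H(a_t, o_t); may be None
        if success and feedback is None:
            t += 1
            continue
        traj, thoughts = vlm_revise(traj, feedback, thoughts, instruction, examples)
        revisions += 1

    # Successful trajectories and their programs of thought are stored for reuse.
    if env.task_succeeded():
        memory.add(instruction, traj, thoughts)
    return memory
```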
Once the ICAL examples have been learned, ICAL uses retrieval-augmented generation or supervised fine-tuning to teach the agent to perform new tasks.
Retrieval Augmented Generation (RAG). Given the learned example set \(M\) and a new instruction \(I\), we prompt the VLM to carry out the instruction by producing an action sequence \(\{a_0, \ldots, a_T\}\), with each action drawn from an action API that describes the skill set \(A\) (e.g., go_to(X), pickup(X)), retrieving the top \(K\) examples from \(M\) to include in the prompt based on their textual and visual similarity to the current scene.
The aggregated similarity score \(s\) for each example \(e\) reads:
\(s = \lambda_{I} \cdot s^I + \lambda_{\text{textual}} \cdot s^{\text{textual}} + \lambda_{\text{visual}} \cdot s^{\text{visual}}\)
where \(s^I\), \(s^{\text{textual}}\), and \(s^{\text{visual}}\) are the similarity scores for the input text instruction, textual state, and visual state, respectively, computed via cosine similarity using embeddings from OpenAI's text-embedding-ada-002 model and the CLIP ViT-B/32 model. The coefficients \(\lambda_{I}\), \(\lambda_{\text{textual}}\), and \(\lambda_{\text{visual}}\) are weighting hyperparameters chosen per domain on a held-out validation set.
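A minimal sketch of this retrieval scoring follows, assuming embeddings are computed on the fly; embed_text and embed_image are stand-ins for the text-embedding-ada-002 and CLIP ViT-B/32 encoders, and the field names are illustrative.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def retrieve_top_k(query, memory, lambdas=(1.0, 1.0, 1.0), k=5):
    """Rank stored ICAL examples by aggregated similarity to the current scene.

    `query` and each stored example carry an instruction, a textual state, and a
    visual observation. The lambda weights are per-domain hyperparameters.
    """
    lam_I, lam_text, lam_vis = lambdas
    q_instr = embed_text(query["instruction"])
    q_text = embed_text(query["textual_state"])
    q_vis = embed_image(query["image"])

    scored = []
    for ex in memory:
        s = (lam_I * cosine(q_instr, embed_text(ex["instruction"]))
             + lam_text * cosine(q_text, embed_text(ex["textual_state"]))
             + lam_vis * cosine(q_vis, embed_image(ex["image"])))
        scored.append((s, ex))

    # Keep the top-K examples to place in the VLM prompt.
    return [ex for _, ex in sorted(scored, key=lambda p: p[0], reverse=True)[:k]]
```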
The VLM prompt contains the new instruction \(I\); the current webpage image for VisualWebArena or 12 video frames for Ego4D, annotated with Set-of-Marks; a textual state description \(x_t\) describing the objects and their attributes for embodied agents and the HTML elements for web agents; the action API \(A\); and the retrieved set of in-context examples \(\{e^1, \ldots, e^k\} \subset M\).
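The sketch below shows one way these pieces could be assembled into a multimodal prompt; format_example and the exact prompt wording are assumptions for illustration, not the authors' released template.

```python
def build_prompt(instruction, visual_obs, textual_state, action_api, examples):
    """Assemble the RAG prompt: retrieved ICAL examples, the action API,
    the textual state description, and the new instruction.

    `format_example` is a hypothetical helper that renders a stored example's
    instruction, annotated states, program of thought, and actions as text.
    """
    example_block = "\n\n".join(format_example(e) for e in examples)
    prompt_text = (
        "You are an agent that outputs a program of thought and actions.\n\n"
        f"Available actions:\n{action_api}\n\n"
        f"Solved examples:\n{example_block}\n\n"
        f"Current state:\n{textual_state}\n\n"
        f"Instruction: {instruction}\n"
        "Reason step by step, then output the action sequence."
    )
    # The image(s) -- a Set-of-Marks annotated screenshot or video frames --
    # are passed alongside the text to the multimodal model.
    return {"text": prompt_text, "images": visual_obs}
```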
Supervised Fine-tuning (SFT). Supervised fine-tuning trains the multimodal LLM, Qwen2-VL, to directly predict the ICAL programs of thought and corresponding action sequences for new tasks. The training dataset pairs task instructions \(I\) and multimodal inputs \(S\) (e.g., visual frames, textual descriptions) with ground-truth outputs, including intermediate reasoning steps and final actions.
The model learns to generate structured ICAL outputs and actions end-to-end, enabling it to perform tasks without relying on retrieval from learned examples.
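As a sketch, a learned ICAL example could be serialized into an instruction-tuning pair as shown below; the chat-message layout is a common fine-tuning convention and the field names are illustrative assumptions, not necessarily the exact format used.

```python
def to_sft_example(ical_example):
    """Convert a stored ICAL example into a supervised fine-tuning pair.

    The input combines the instruction with the multimodal state; the target
    concatenates the program of thought and the action sequence, so the model
    learns to emit both end-to-end.
    """
    user_content = (
        [{"type": "image", "image": frame} for frame in ical_example["frames"]]
        + [{"type": "text",
            "text": f"State: {ical_example['textual_state']}\n"
                    f"Instruction: {ical_example['instruction']}"}]
    )
    target_text = (
        f"Program of thought:\n{ical_example['thoughts']}\n"
        "Actions:\n" + "\n".join(ical_example["actions"])
    )
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": [{"type": "text", "text": target_text}]},
        ]
    }
```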
Below are results on the TEACh household instruction following validation-unseen split. ICAL examples significantly improve on the state of the art by 12.6% in goal-condition success, outperforming agents that use raw visual demonstrations as in-context examples without abstraction learning. The left plot displays goal-condition success and the right plot displays success rate.
ICAL outperforms the previous state of the art, GPT-4V + SoM (Koh et al., 2023), on the VisualWebArena benchmark, improving the average success rate from 14.3% to 22.7%. The baseline uses GPT-4V with few-shot hand-designed examples and Set-of-Marks image prompting.
ICAL demonstrates superior performance on Ego4D action anticipation compared to few-shot GPT-4V with hand-written examples and chain-of-thought prompting. ICAL also remains competitive with the fully supervised baseline (Grauman et al., 2022) despite using 639x less training data.
ICAL scales 2x better than raw human demonstrations, showing continual improvement in TEACh unseen success rate as more examples are learned. Even a small number of learned examples helps: with just 10 learned examples, ICAL achieves an absolute 14.7% improvement in success rate over chain-of-thought prompting.
*values interpolated for visualization
ICAL requires less human feedback over time by leveraging previously learned examples. In TEACh, the second half of processed examples requires significantly fewer environment steps (436±88 for the first half vs. 267±43 for the second, p=0.0143) and human feedback interventions (0.74±0.17 vs. 0.21±0.08, p=0.0089) per example.
In TEACh, using GPT-3.5, the largest gain came from combining LoRA fine-tuning with retrieval-augmented generation, achieving a success rate of 35.4% and a sub-task score of 55.9%. This demonstrates that consolidating the learned examples through fine-tuning, particularly when combined with retrieval for in-context learning, improves performance.
Our ablations reveal that both the abstraction phase and the human-in-the-loop refinement are important for performance.
Below we show examples of raw trajectories obtained from human or agent demonstrations, alongside the corresponding ICAL examples with optimized actions and programs of thought. Each task is shown before and after applying ICAL.
Instruction: Find me powder to make the beverage that is the same as the picture:
Instruction: What is the email of the seller of the red pallete on this page?
Instruction: Can you search for "Cheerios", and add the family sized blue Cheerios cereal to my cart and order it only if the total comes out to less than $43?
Instruction: <Driver> hello. <Driver> task please. <Commander> We have a lot to do! Hello! <Commander> We need to wash a mug and fill it with coffee. <Driver> ok. <Commander> The mug is on the island on a plate. <Commander> Great. Now take it to the sink to clean it. <Commander> good work. <Commander> Now we need to add the coffee. <Driver> done. <Commander> Good job! <Driver> next please. <Commander> We need to find a knife and the bread. <Driver> have knife where is bread? <Commander> The bread is in the fridge. <Commander> We need two slices of bread toasted. <Driver> done. <Driver> ok. <Driver> done. <Commander> Grab the knife again. <Commander> We need to slice the tomato and lettuce. <Commander> The tomato and lettuce need to be on the plate with the bread. <Commander> Good work! Have a great day! <Driver> done. <Driver> thank.
Instruction: <Commander> boil whole potatoes in water. <Driver> hello. <Commander> potato. <Commander> hi. <Commander> it's in the lower drawer to the left of the cooking stove. <Driver> the knife? <Commander> don't cut. boil it whole. <Driver> where do I boil it? <Commander> in the pot on the stove. <Commander> awesome. task complete.
@inproceedings{sarch2024vlm,
title={VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought},
author={Sarch, Gabriel Herbert and Jang, Lawrence and Tarr, Michael J and Cohen, William W and Marino, Kenneth and
Fragkiadaki, Katerina},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024}
}