Purpose of this lecture
A Vision-Language Model (VLM) that can describe images and answer questions is fundamentally a passive perception system. An agent, however, uses those perceptual capabilities to take actions (clicking UI elements, calling external APIs, browsing the web, or controlling a robot) to autonomously complete multi-step tasks in dynamic environments.
The transition from a passive VLM to an active agent requires modeling decision-making under uncertainty, grounding abstract language intentions into actionable physical or digital affordances, and composing multiple discrete tools in service of a high-level goal. This lecture bridges the multimodal architectures developed over the last 10 weeks with the reinforcement learning theory from Course 1, examining how VLMs function as the policy engine within a Partially Observable Markov Decision Process (POMDP).
From Passive Model to Active Agent
A VLM trained for perception answers the question: "Given this visual input, what is true?" An agent answers: "Given this visual input and a goal, what action should I take?" The distinction is not merely capability but feedback: an agent's actions change the state of the environment, which changes future visual observations, which in turn influence future actions. This creates a sequential decision process rather than a one-shot prediction problem.
The Agent Loop: At each discrete time step $t$:
- The agent receives an observation $o_t$ (e.g., a screenshot of a webpage, or a robot sensor reading).
- The agent takes an action $a_t$ (e.g., a mouse click coordinate, a keyboard input, an API call).
- The environment transitions to a new hidden state $s_{t+1}$ and returns a new observation $o_{t+1}$ and optionally a reward $r_t$.
Context as Working Memory: Unlike a single-step VLM query, an agent must maintain context across time. The agent's state representation at step $t$ consists of the overall goal $g$, the history of past observations and actions $h_{t-1} = (o_1, a_1, \dots, o_{t-1}, a_{t-1})$, and the current observation $o_t$. Many modern multimodal agents simply pass this entire unrolled history into the VLM's context window, effectively using the Transformer context as working memory.
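The loop and the working-memory idea above can be sketched in a few lines. This is a minimal illustration, not a real agent: `CounterEnv`, `Context`, and `toy_policy` are hypothetical names invented here, a text string stands in for the screenshot, and a hand-written rule stands in for the VLM.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CounterEnv:
    """Hypothetical toy environment: reach a target count by pressing '+' or '-'."""
    target: int = 3
    count: int = 0  # hidden state s_t

    def observe(self) -> str:
        # Observation o_t: a text rendering stands in for a screenshot.
        return f"display shows {self.count}"

    def step(self, action: str) -> Tuple[str, float, bool]:
        # The action changes the hidden state, which changes the next observation.
        self.count += 1 if action == "+" else -1
        done = self.count == self.target
        return self.observe(), (1.0 if done else 0.0), done


@dataclass
class Context:
    """Working memory: the goal g plus the unrolled (o, a) history."""
    goal: str
    history: List[Tuple[str, str]] = field(default_factory=list)


def toy_policy(ctx: Context, obs: str) -> str:
    # Stand-in for the VLM: compares the displayed number with the goal number.
    shown = int(obs.split()[-1])
    target = int(ctx.goal.split()[-1])
    return "+" if shown < target else "-"


def run_episode(env: CounterEnv, goal: str, max_steps: int = 10) -> Context:
    ctx = Context(goal=goal)
    obs = env.observe()
    for _ in range(max_steps):
        action = toy_policy(ctx, obs)              # a_t chosen from g, history, o_t
        next_obs, reward, done = env.step(action)  # environment transition
        ctx.history.append((obs, action))          # extend the working memory
        obs = next_obs
        if done:
            break
    return ctx


ctx = run_episode(CounterEnv(target=3), goal="make the display show 3")
print(len(ctx.history))  # number of actions taken
```

The point of the sketch is the feedback path: each action alters `count`, so the next observation depends on what the agent just did.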
The VLM as a POMDP Policy
The agent's decision process is formally modeled as a Partially Observable Markov Decision Process (POMDP), connecting directly back to the foundational theory established in Course 1:
$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \Omega, \mathcal{O}, \gamma)$$
where $\mathcal{S}$ is the unobserved true world state space, $\mathcal{A}$ is the action space, $\mathcal{T}$ is the transition dynamics, $\mathcal{R}$ is the reward function, $\Omega$ is the observation space (screenshots), $\mathcal{O}$ is the observation probability, and $\gamma$ is the discount factor.
The agent's policy $\pi_\theta$ maps the observation history and goal to a probability distribution over actions:
$$a_t \sim \pi_\theta(a_t \mid g, o_1, a_1, \dots, o_{t-1}, a_{t-1}, o_t)$$
For a VLM-based agent, the policy is implemented by the frozen (or fine-tuned) weights of the VLM. The context is formatted as a multimodal prompt, and the VLM autoregressively generates the next action as a structured text output (e.g., a JSON string describing the action type and parameters).
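One way to picture the formatting step is as a function that serializes the goal, the past observation-action pairs, and the current observation into an alternating message list. The sketch below assumes a hypothetical message schema (`role`, `content`, `image` keys) and a hypothetical `type` field in the action JSON; real VLM APIs each define their own format.

```python
import json
from typing import Dict, List, Tuple


def build_messages(goal: str,
                   history: List[Tuple[str, Dict]],
                   current_obs: str) -> List[Dict]:
    """Serialize the goal, past (observation, action) pairs, and current observation."""
    messages = [{"role": "system",
                 "content": f"Goal: {goal}. Reply with one JSON action."}]
    for past_obs, past_action in history:
        # Each past screenshot is followed by the action the agent took on it.
        messages.append({"role": "user", "image": past_obs})
        messages.append({"role": "assistant", "content": json.dumps(past_action)})
    # The current observation comes last; the model's next output is the action.
    messages.append({"role": "user", "image": current_obs})
    return messages


def decode_action(generated_text: str) -> Dict:
    """Parse the generated text into an action dict, rejecting untyped actions."""
    action = json.loads(generated_text)
    if "type" not in action:
        raise ValueError("action is missing the 'type' field")
    return action


messages = build_messages(
    goal="Book a flight to London",
    history=[("step0.png", {"type": "click", "x": 120, "y": 48})],
    current_obs="step1.png",
)
action = decode_action('{"type": "type", "text": "London"}')
print(len(messages), action["type"])
```

The message list grows by two entries per step, which is why long episodes eventually need truncation or summarization of the history.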
UI Grounding and Affordance Mapping
UI grounding is the task of identifying and interacting with interface elements from a screenshot—clicking buttons, filling forms, navigating menus—based on a natural language instruction (e.g., "click the Submit button", "type 'search query' in the search field").
UI grounding is vastly more difficult than standard visual bounding-box grounding. UI elements are small, visually ambiguous, and identified by their function rather than their appearance (a "Submit" button looks radically different across 100 different websites).
Accessibility Tree Grounding
Most operating systems and browsers expose an accessibility tree: a structured, HTML/XML-like representation of UI elements with their labels, roles, and bounding box coordinates. Early web agents bypassed vision entirely, feeding the HTML tree into a Large Language Model (LLM). However, many modern web applications (such as canvas-based apps or poorly coded websites) have broken or incomplete accessibility trees, necessitating pure visual grounding.
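As a rough illustration of the text-only approach, the sketch below flattens a tree into one line per element so it can be pasted into an LLM prompt. The dict layout (`role`, `label`, `bbox`, `children`) is an assumption made for this example and does not correspond to any specific platform's accessibility API.

```python
from typing import Dict, List


def flatten_tree(node: Dict, depth: int = 0) -> List[str]:
    """Render one node per line, indented by depth, for use in a text prompt."""
    x, y, w, h = node.get("bbox", (0, 0, 0, 0))
    line = f"{'  ' * depth}[{node['role']}] {node.get('label', '')!r} bbox=({x}, {y}, {w}, {h})"
    lines = [line]
    for child in node.get("children", []):
        lines.extend(flatten_tree(child, depth + 1))
    return lines


# Hypothetical tree for a tiny search form.
tree = {
    "role": "form", "label": "Search", "bbox": (0, 0, 400, 60),
    "children": [
        {"role": "textbox", "label": "Query", "bbox": (10, 10, 280, 40)},
        {"role": "button", "label": "Submit", "bbox": (300, 10, 80, 40)},
    ],
}
prompt_text = "\n".join(flatten_tree(tree))
print(prompt_text)
```

When the tree is missing or wrong, there is nothing to flatten, which is the case that pushes agents toward visual grounding.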
Set-of-Marks (SoM) Prompting
Predicting exact pixel coordinates with a VLM is error-prone: coordinates must be emitted as text tokens, and small numeric errors translate into missed clicks. Set-of-Marks (SoM) prompting converts the coordinate regression problem into a multiple-choice classification problem.
- An external detector (such as Grounding DINO or a dedicated UI-element detector) draws a bounding box around every actionable element on the screenshot.
- A numeric tag (e.g., [1], [2], [3]) is visually overlaid next to each box directly onto the image pixels.
- The VLM is fed this tagged image and prompted: "Which numbered element should be clicked to execute the search?"
- The VLM simply generates the text token [3]. The system parses this token and executes the click at the coordinates associated with tag 3.
This architectural hack drastically reduces grounding hallucination and improves reliability on complex interfaces.
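The bookkeeping around the model call is simple enough to sketch. The example below assumes the detector has already produced boxes as `(x_min, y_min, x_max, y_max)` tuples and that the model answers with a bracketed tag; drawing the tags onto the image is omitted, and all function names are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def assign_marks(boxes: List[Box]) -> Dict[int, Box]:
    """Give each detected element a numeric tag, starting at 1."""
    return {tag: box for tag, box in enumerate(boxes, start=1)}


def parse_mark(response: str, marks: Dict[int, Box]) -> Optional[int]:
    """Return the first bracketed tag in the response, or None if invalid."""
    match = re.search(r"\[(\d+)\]", response)
    if match is None:
        return None
    tag = int(match.group(1))
    # Reject tags that were never drawn on the image.
    return tag if tag in marks else None


def click_point(box: Box) -> Tuple[int, int]:
    """Click the center of the chosen element's bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) // 2, (y_min + y_max) // 2)


marks = assign_marks([(10, 10, 110, 40), (10, 60, 110, 90), (200, 10, 260, 40)])
tag = parse_mark("The search button is [3].", marks)
print(tag, click_point(marks[tag]))
```

Because the model only has to name a tag, an out-of-range answer can be detected and rejected, whereas a slightly wrong pixel coordinate cannot.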
Tool Use and API Calling
Multimodal agents extend beyond simple UI interaction by calling external tools: web search APIs, Python execution environments, calculator functions, and SQL database queries. Tool use enables agents to perform computations that VLMs cannot do reliably (like exact arithmetic) and to access real-time information beyond their pretraining cutoff.
Tool Representation: Each tool is described to the VLM in the system prompt by its name, description, and parameter schema (e.g., JSON Schema). The VLM selects a tool by generating a call in an agreed structured format:
{"tool": "web_search", "query": "current weather in Tokyo"}
The agent runtime (here, a Python process) pauses VLM generation, intercepts this JSON, executes the API call, and appends the API's text output to the VLM's context window. The VLM then resumes generation, conditioning on the new data.
Tool Composition:
Complex tasks require calling multiple tools in a strict logical sequence, where the output of Tool A becomes the input of Tool B. An agent asked to "create a bar chart of the top 5 programming languages" must execute:
web_search() → extract_data() → python_execute(matplotlib) → return_image().
Composing these tools correctly requires the VLM to maintain a strict working memory of what state variables have been populated and what steps remain.
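The "state variables populated so far" idea can be made concrete with a small pipeline runner. This is a sketch under simplifying assumptions: the order of steps is fixed in advance rather than chosen by a model, the three tools are hypothetical stubs, and the search result is made-up data used only to exercise the code.

```python
from typing import Callable, Dict, List, Tuple

# Each step is (state_key, function); the function reads earlier outputs from `state`.
Step = Tuple[str, Callable[[Dict], object]]


def run_pipeline(steps: List[Step]) -> Dict:
    """Run tools in order, recording each output under its own key."""
    state: Dict = {}
    for name, fn in steps:
        try:
            state[name] = fn(state)
        except Exception as exc:
            # Stop at the first failure so later tools never receive bad input.
            state["error"] = f"{name} failed: {exc}"
            break
    return state


# Hypothetical stub tools standing in for real APIs (the numbers are made up).
def web_search(state: Dict) -> str:
    return "Python 30, JavaScript 25, Java 20"


def extract_data(state: Dict) -> Dict[str, int]:
    pairs = [item.split() for item in state["web_search"].split(", ")]
    return {language: int(score) for language, score in pairs}


def render_chart(state: Dict) -> str:
    # A real agent would call matplotlib here; this stub returns a text chart.
    rows = [f"{language}: {'#' * (score // 5)}"
            for language, score in state["extract_data"].items()]
    return "\n".join(rows)


state = run_pipeline([
    ("web_search", web_search),
    ("extract_data", extract_data),
    ("render_chart", render_chart),
])
print(state["render_chart"])
```

In a VLM agent the `state` dict lives implicitly in the context window, so the model itself has to track which keys are filled; the explicit dict here only makes that bookkeeping visible.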
ReAct: Multimodal Chain-of-Thought
A VLM forced to output a tool call immediately after reading an observation will frequently hallucinate. ReAct (Reason + Act; Yao et al., 2022) is an agentic framework that prompts the VLM to alternate between explicit reasoning steps and action steps.
The ReAct Cycle:
- Observation: The VLM receives the screenshot and recent history.
- Thought (CoT): The VLM generates an autoregressive chain-of-thought: "The user wants to book a flight to London. I see the destination input field marked with tag [4]. I need to click [4] and type 'London'."
- Action: The VLM generates the structured tool call: {"action": "type", "target": "[4]", "text": "London"}
- Execution: The environment executes the typing and returns a new screenshot.
The "Thought" step makes the VLM verbalize what it sees in the visual observation before it commits to the rigid JSON action format. Because the action tokens are then generated conditioned on that verbalized observation, the intended action is explicitly grounded in text, which in practice reduces the rate of hallucinated API calls.
Implementation
import json
from dataclasses import dataclass
from typing import List, Dict, Optional
from abc import ABC, abstractmethod


@dataclass
class Observation:
    """An observation from the environment."""
    screenshot: str      # Image path or base64
    history: List[Dict]  # [{"role": "user/assistant", "content": ...}]


@dataclass
class Action:
    """An action selected by the agent."""
    tool: str
    parameters: Dict
    thought: str  # Reasoning before action


class Tool(ABC):
    """Base class for agent tools."""

    @property
    @abstractmethod
    def name(self) -> str:
        pass

    @property
    @abstractmethod
    def description(self) -> str:
        pass

    @property
    @abstractmethod
    def schema(self) -> Dict:
        """JSON Schema for tool parameters."""
        pass

    @abstractmethod
    def execute(self, **params) -> str:
        """Execute the tool and return result."""
        pass


class WebSearchTool(Tool):
    """Web search tool for real-time information."""

    @property
    def name(self) -> str:
        return "web_search"

    @property
    def description(self) -> str:
        return "Search the web for information. Use for factual queries."

    @property
    def schema(self) -> Dict:
        return {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }

    def execute(self, query: str) -> str:
        # In production: call actual search API
        return f"Search results for: {query}"


class CalculatorTool(Tool):
    """Mathematical calculator for exact arithmetic."""

    @property
    def name(self) -> str:
        return "calculator"

    @property
    def description(self) -> str:
        return "Perform exact mathematical calculations."

    @property
    def schema(self) -> Dict:
        return {
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "Math expression"}
            },
            "required": ["expression"]
        }

    def execute(self, expression: str) -> str:
        # Demo only: eval() on model output is unsafe. Removing builtins narrows
        # the attack surface but is NOT a sandbox; in production use a proper
        # expression parser or a math library.
        try:
            result = eval(expression, {"__builtins__": {}}, {})
            return str(result)
        except Exception:
            return "Error: Invalid expression"


class VLMAgent:
    """
    VLM-based agent implementing ReAct (Reason + Act) loop.

    Args:
        vlm: The vision-language model (e.g., LLaVA, GPT-4V)
        tools: List of available tools
        max_steps: Maximum ReAct steps before termination
    """

    def __init__(self, vlm, tools: List[Tool], max_steps: int = 10):
        self.vlm = vlm
        self.tools = {t.name: t for t in tools}
        self.max_steps = max_steps

    def build_system_prompt(self) -> str:
        """Build the system prompt describing available tools."""
        tools_desc = []
        for tool in self.tools.values():
            tools_desc.append(
                f"- {tool.name}: {tool.description}\n"
                f"  Parameters: {json.dumps(tool.schema)}"
            )
        return (
            "You are a multimodal agent. Given an observation and goal, you must:\n"
            "1. THINK: Reason about what action to take\n"
            "2. ACT: Call a tool using the specified JSON format\n"
            'When the goal is achieved, respond with {"tool": "finish"}.\n\n'
            "Available tools:\n" + "\n\n".join(tools_desc)
        )

    def parse_action(self, response: str) -> Optional[Action]:
        """Parse the model's response to extract thought and action."""
        try:
            # Look for JSON action block
            if '{"tool"' in response or '{"action"' in response:
                # Extract JSON from response (first '{' to last '}')
                start = response.find('{')
                end = response.rfind('}') + 1
                json_str = response[start:end]
                data = json.loads(json_str)
                tool_name = data.get('tool') or data.get('action', 'unknown')
                params = {k: v for k, v in data.items() if k not in ['tool', 'action']}
                # Extract thought (text before JSON)
                thought = response[:start].strip()
                return Action(tool=tool_name, parameters=params, thought=thought)
        except json.JSONDecodeError:
            pass
        return None

    def run(self, observation: Observation, goal: str) -> List[Dict]:
        """
        Run the ReAct loop.

        Args:
            observation: Current screenshot and history
            goal: Task goal (e.g., "Book a flight to Tokyo")

        Returns:
            List of dicts, one per step, with keys "step", "thought",
            "action", and "result" (or "status" on a parse error)
        """
        trajectory = []
        history = observation.history.copy()
        for step in range(self.max_steps):
            # Build prompt with current observation
            prompt = self.build_system_prompt()
            prompt += f"\n\nGoal: {goal}\n"
            prompt += "History:\n" + "\n".join(
                f"{h['role']}: {h['content']}" for h in history[-5:]
            )
            prompt += "\n\nWhat should I do? Think and then call a tool."
            # Get VLM response (would call actual API).
            # Note: this demo reuses the same screenshot every step; a real
            # environment would return a fresh screenshot after each action.
            response = self.vlm.generate(prompt, observation.screenshot)
            # Parse action
            action = self.parse_action(response)
            if not action:
                trajectory.append({"step": step, "status": "parse_error"})
                break
            # Execute tool ("finish" is a termination signal, not a registered tool)
            if action.tool == "finish":
                result = "Task complete"
            elif action.tool in self.tools:
                try:
                    result = self.tools[action.tool].execute(**action.parameters)
                except TypeError as exc:
                    # The model supplied parameters that do not match the tool signature
                    result = f"Error: Bad parameters for '{action.tool}': {exc}"
            else:
                result = f"Error: Unknown tool '{action.tool}'"
            # Record in trajectory
            trajectory.append({
                "step": step,
                "thought": action.thought,
                "action": {action.tool: action.parameters},
                "result": result
            })
            # Add to history
            history.append({"role": "assistant", "content": response})
            history.append({"role": "tool", "content": result})
            # Check for termination
            if action.tool == "finish" or "task complete" in result.lower():
                break
        return trajectory


# Example usage
class MockVLM:
    """Mock VLM for demonstration (always requests the same web search)."""

    def generate(self, prompt: str, image: str) -> str:
        return json.dumps({"tool": "web_search", "query": "example"})


# Initialize agent with tools
tools = [WebSearchTool(), CalculatorTool()]
agent = VLMAgent(vlm=MockVLM(), tools=tools, max_steps=5)

# Run on an observation
obs = Observation(screenshot="screenshot.png", history=[])
trajectory = agent.run(obs, goal="Find the sum of 2+2")
for step in trajectory:
    print(f"Step {step['step']}: {step}")
Key implementation patterns:
- Tool abstraction: Each tool implements name, description, schema, and execute()
- ReAct parsing: Extract JSON action block from model's text response
- History management: Maintain conversation history for context
- Termination detection: Stop when agent signals completion
Production considerations:
- Use safer JSON parsing (handle malformed responses)
- Add retry logic for failed tool calls
- Implement token budget for context truncation
- Add structured output parsing for reliability
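As one possible take on the first item, the sketch below extracts the first valid JSON object from a response even when the surrounding "thought" text contains stray braces, a case where slicing from the first `{` to the last `}` fails. It relies on `json.JSONDecoder.raw_decode` from the standard library; the function name `extract_first_json_object` is hypothetical.

```python
import json
from typing import Dict, Optional


def extract_first_json_object(text: str) -> Optional[Dict]:
    """Scan for '{' and return the first substring that decodes to a JSON object."""
    decoder = json.JSONDecoder()
    idx = text.find("{")
    while idx != -1:
        try:
            # raw_decode parses one JSON value starting at idx and ignores the rest.
            obj, _end = decoder.raw_decode(text, idx)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON at this brace; try the next one.
        idx = text.find("{", idx + 1)
    return None


response = 'I see a {curly} label. {"tool": "calculator", "expression": "2+2"} Done.'
call = extract_first_json_object(response)
print(call)
```

A caller can treat a `None` result as a signal to re-prompt the model rather than aborting the episode.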
Key takeaways
Multimodal agents extend passive VLMs into the realm of sequential action by implementing a policy over a POMDP. UI grounding maps natural language intentions to precise interface coordinates, using techniques like Set-of-Marks (SoM) prompting to convert noisy pixel regression into more reliable numeric token classification. Tool use allows agents to work around known weaknesses of VLMs by delegating arithmetic and real-time search to external APIs. The ReAct framework interleaves visual observation with explicit verbal reasoning and action generation, serving as the reasoning loop for multi-step task execution.
Conceptual questions
- POMDP Credit Assignment: A multimodal web agent is given the goal "Buy a red umbrella from Amazon." It takes 15 correct UI actions (searching, filtering by color, adding to cart), but on step 16, it clicks "Empty Cart" instead of "Checkout", failing the task. In a reinforcement learning setup where the only nonzero reward is given for task success, explain the "credit assignment problem" mathematically. How does the sparsity of the reward signal make training this VLM policy via Proximal Policy Optimization (PPO) far more expensive than Supervised Fine-Tuning (SFT) on human demonstrations?
- Set-of-Marks Context Limits: You design a Set-of-Marks (SoM) agent for a dense financial dashboard containing 400 interactive data cells. You overlay 400 numeric tags on the image. Describe two distinct failure modes that will occur: (1) a visual failure mode related to the resolution of the ViT patch tokens, and (2) a semantic failure mode related to the VLM's ability to maintain attention across 400 distinct visual regions. Propose a hierarchical SoM approach to bypass these limits.
- ReAct Hallucination Debugging: Your ReAct agent is tasked with scraping data from a table. The Thought trace outputs: "The table contains 5 rows. I will extract the data from row 1." However, the screenshot actually contains an empty table (0 rows). The agent then executes a Python script attempting to index table[0] and crashes. Explain the "hallucination cascade" (which the notes also call "LLM sycophancy") occurring here. Once the text decoder generates the false premise "The table contains 5 rows", why do subsequent decoding steps condition on that false premise, overriding the true visual evidence?
- Tool Composition Error Recovery: An agent must compose three tools: search(), parse_pdf(), and email_user(). The parse_pdf() API goes offline and returns a standard HTTP 500 error string. Describe the exact text processing that must happen within the VLM's context window to prevent the agent from blindly passing the HTTP 500 string into the email_user() function. What specific instruction-tuning data must the VLM have seen during Stage 2 pretraining to know how to execute a retry or fallback behavior?
- Sim2Real for UI Agents: A VLM is trained purely on screenshots of macOS interfaces. It achieves 95% accuracy on UI grounding. You deploy it on a Windows machine. The agent completely fails to find the "Close Window" button, because macOS puts the red close button on the top left, while Windows puts the "X" on the top right. Using the terminology of distribution shift (Week 9), is this a failure of covariate shift or spurious correlation? How does this demonstrate that UI agents suffer from the same kind of "Sim2Real" embodiment gaps as physical robots?
Looking ahead
Multimodal agents demonstrate the power of VLMs as general perception-reasoning-action systems in the digital world. The ultimate, most demanding frontier of these capabilities is physical robotics—where visual perception must connect to real-world motor control under strict latency, embodiment, and safety constraints.
Week 12: Vision-Language Models for Robotics. We examine how VLMs are integrated into hierarchical robotic control architectures, how language-conditioned visual grounding connects instructions to manipulation targets, and how closed-loop VLM-guided control operates in the physical world.
Further reading
- Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv preprint; published at ICLR 2023. (The Reason + Act framework).
- Yang, J., et al. (2023). Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. arXiv. (The SoM technique, applied in this lecture to UI grounding).
- Wu, Z., et al. (2024). OS-Copilot: Towards Generalist Computer Agents with Self-Improvement. arXiv preprint.