Week 11: Multimodal Agents and Tool Use

Purpose of this lecture

A VLM that can describe images and answer questions is fundamentally a passive perception system. An agent, however, uses those perceptual capabilities to take actions—clicking UI elements, calling external APIs, browsing the web, or controlling a robot—to autonomously complete multi-step tasks in dynamic environments.

The transition from a passive VLM to an active agent requires modeling decision-making under uncertainty, grounding abstract language intentions into actionable physical/digital affordances, and composing multiple discrete tools in service of a high-level goal. This lecture bridges the multimodal architectures developed over the last 10 weeks with the reinforcement learning theory from Course 1, examining how VLMs function as the policy engine within a Partially Observable Markov Decision Process (POMDP).

From Passive Model to Active Agent

A VLM trained for perception answers the question: "Given this visual input, what is true?" An agent answers: "Given this visual input and a goal, what action should I take?" The distinction is not merely capability but feedback: an agent's actions change the state of the environment, which updates future visual observations, which subsequently influence future actions. This creates a sequential decision process rather than a one-shot prediction problem.

The Agent Loop: At each discrete time step $t$:

1. The agent receives an observation $o_t$ (e.g., a screenshot of a webpage, or a robot sensor reading).
2. The agent takes an action $a_t \in \mathcal{A}$ (e.g., a mouse click coordinate, a keyboard input, an API call).
3. The environment transitions to a new hidden state $s_{t+1}$ and returns a new observation $o_{t+1}$ and optionally a reward $r_t$.
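The loop above can be sketched with a toy environment. Everything in this example (`ToyEnv`, `scripted_policy`, the page and button names) is hypothetical and stands in for a real browser or robot; in a real agent the VLM would take the place of `scripted_policy`.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ToyEnv:
    """Hypothetical two-step task: click 'search', then click 'submit'."""
    hidden_state: int = 0  # s_t, never shown to the agent directly

    def observe(self) -> str:
        # o_t: a text stand-in for a screenshot
        pages = ["page shows: search", "page shows: submit", "page shows: done"]
        return pages[self.hidden_state]

    def step(self, action: str) -> Tuple[str, float, bool]:
        expected = ["click search", "click submit"][self.hidden_state]
        if action == expected:
            self.hidden_state += 1      # transition to s_{t+1}
        done = self.hidden_state == 2
        reward = 1.0 if done else 0.0   # sparse reward r_t
        return self.observe(), reward, done

def scripted_policy(observation: str) -> str:
    """Placeholder for the VLM: maps the current observation to an action."""
    return "click " + observation.split(": ")[1]

def run_episode(env: ToyEnv, max_steps: int = 5) -> List[Tuple[str, str, float]]:
    trace = []
    obs = env.observe()
    for _ in range(max_steps):
        action = scripted_policy(obs)               # a_t
        next_obs, reward, done = env.step(action)   # o_{t+1}, r_t
        trace.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return trace

trace = run_episode(ToyEnv())
for obs, action, reward in trace:
    print(f"{obs!r} -> {action!r} (reward {reward})")
```

The point of the sketch is the feedback structure: each action changes the hidden state, which changes the next observation the policy sees.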

Context as Working Memory: Unlike a single-step VLM query, an agent must maintain context across time. The agent's state representation at step $t$ consists of the overall goal $g$, the history of past observations and actions $(o_1, a_1, \ldots, o_{t-1}, a_{t-1})$, and the current observation $o_t$. Many current multimodal agents pass this unrolled history directly into the VLM's context window, so the Transformer context serves as the agent's working memory. Because the window is finite, long episodes require truncating or summarizing older steps.
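As a minimal sketch of this idea, the function below assembles the goal, the history, and the current observation into one prompt and drops the oldest steps when a budget is exceeded. The name `build_context` and the character-based budget are assumptions for illustration; a real system would count tokens with the model's tokenizer and would also account for image tokens.

```python
from typing import List, Tuple

def build_context(goal: str,
                  history: List[Tuple[str, str]],
                  current_obs: str,
                  max_chars: int = 200) -> str:
    """Assemble goal, (observation, action) history, and current observation.

    Oldest steps are dropped first when the text exceeds max_chars.
    Character count is a crude stand-in for a tokenizer-based budget.
    """
    header = f"Goal: {goal}\n"
    footer = f"Current observation: {current_obs}\nNext action:"
    steps = [f"Step {i + 1}: saw {o!r}, did {a!r}\n"
             for i, (o, a) in enumerate(history)]
    # Drop the oldest steps until the prompt fits the budget.
    while steps and len(header) + sum(map(len, steps)) + len(footer) > max_chars:
        steps.pop(0)
    return header + "".join(steps) + footer

history = [
    ("home page", "click search"),
    ("search box", "type 'umbrella'"),
    ("results", "click item 1"),
]
full = build_context("buy a red umbrella", history, "product page", max_chars=1000)
short = build_context("buy a red umbrella", history, "product page", max_chars=150)
print(short)
```

Dropping the oldest steps is the simplest policy; summarizing them instead would preserve more information at the cost of an extra model call.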

The VLM as a POMDP Policy

The agent's decision process is formally modeled as a Partially Observable Markov Decision Process (POMDP), connecting directly back to the foundational theory established in Course 1:

$$\langle \mathcal{S}, \mathcal{A}, T, R, \Omega, O, \gamma \rangle$$

where $\mathcal{S}$ is the unobserved true world state space, $\mathcal{A}$ is the action space, $T(s_{t+1} \mid s_t, a_t)$ is the transition dynamics, $R(s_t, a_t)$ is the reward function, $\Omega$ is the observation space (screenshots), $O(o_t \mid s_t)$ is the observation probability, and $\gamma$ is the discount factor.
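To make the tuple concrete, here is a toy POMDP written out in Python. The two hidden states, the observation names, and all probabilities are invented for illustration; only the roles of $T$, $O$, $R$, and $\gamma$ follow the definition above.

```python
import random
from typing import Callable, Dict, Tuple

# Hypothetical two-state POMDP: the page is "loading" or "ready" (hidden state).
GAMMA = 0.9

# T(s' | s, a): waiting usually finishes loading; clicking does not change the state.
T: Dict[Tuple[str, str], Dict[str, float]] = {
    ("loading", "wait"):  {"loading": 0.3, "ready": 0.7},
    ("loading", "click"): {"loading": 1.0},
    ("ready", "wait"):    {"ready": 1.0},
    ("ready", "click"):   {"ready": 1.0},
}

# O(o | s): the screenshot is a noisy view of the hidden state.
O: Dict[str, Dict[str, float]] = {
    "loading": {"spinner": 0.9, "button": 0.1},
    "ready":   {"spinner": 0.2, "button": 0.8},
}

def R(state: str, action: str) -> float:
    """R(s, a): reward only for clicking once the page is actually ready."""
    return 1.0 if (state, action) == ("ready", "click") else 0.0

def sample(dist: Dict[str, float], rng: random.Random) -> str:
    return rng.choices(list(dist), weights=list(dist.values()))[0]

def rollout(policy: Callable[[str], str], steps: int, rng: random.Random) -> float:
    """Discounted return sum_t gamma^t * r_t of one sampled episode."""
    state, ret = "loading", 0.0
    for t in range(steps):
        obs = sample(O[state], rng)          # the agent sees o_t, not s_t
        action = policy(obs)
        ret += (GAMMA ** t) * R(state, action)
        state = sample(T[(state, action)], rng)
    return ret

def reactive_policy(obs: str) -> str:
    """Trusts only the latest observation; ignores history."""
    return "click" if obs == "button" else "wait"

value = sum(rollout(reactive_policy, 10, random.Random(seed)) for seed in range(200)) / 200
print(f"Estimated value of the reactive policy: {value:.2f}")
```

Because observations are noisy, a policy that conditions only on the latest observation sometimes clicks while the page is still loading, which is the motivation for conditioning on the full history $o_{1:t}$.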

The agent's policy $\pi$ maps the observation history and goal to a probability distribution over actions:

$$a_t \sim \pi_\theta(a \mid o_{1:t}, g)$$

For a VLM-based agent, the policy $\pi_\theta$ is implemented by the VLM itself, with $\theta$ denoting its frozen or fine-tuned weights. The goal and history are formatted as a multimodal prompt, and the VLM autoregressively generates the next action as structured text (e.g., a JSON string giving the action type and parameters).

UI Grounding and Affordance Mapping

UI grounding is the task of identifying and interacting with interface elements from a screenshot—clicking buttons, filling forms, navigating menus—based on a natural language instruction (e.g., "click the Submit button", "type 'search query' in the search field").

UI grounding is generally harder than bounding-box grounding on natural images. UI elements are small, often visually similar to one another, and defined by their function rather than their appearance: a "Submit" button can look very different from one website to the next.

Accessibility Tree Grounding

Operating systems and web browsers expose an accessibility tree: a structured representation of UI elements with their roles, labels, and (typically) bounding-box coordinates. In browsers it is derived from the DOM. Early web agents bypassed vision entirely and fed the HTML or accessibility tree to an LLM as text. However, many modern applications (canvas-based apps, or sites with missing or incorrect accessibility markup) expose incomplete trees, so those cases require purely visual grounding.
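When a usable tree exists, grounding reduces to a tree search. The sketch below uses a hand-written, simplified tree; the field names (`role`, `name`, `bbox`, `children`) and the helper functions are assumptions for illustration and do not mirror any particular platform API.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical, simplified accessibility tree; real trees carry many more fields.
TREE: Dict = {
    "role": "window", "name": "Checkout", "bbox": [0, 0, 800, 600],
    "children": [
        {"role": "textbox", "name": "Email", "bbox": [100, 100, 400, 130], "children": []},
        {"role": "group", "name": "Actions", "bbox": [100, 200, 400, 260], "children": [
            {"role": "button", "name": "Cancel", "bbox": [100, 210, 200, 250], "children": []},
            {"role": "button", "name": "Submit", "bbox": [220, 210, 320, 250], "children": []},
        ]},
    ],
}

def walk(node: Dict) -> Iterator[Dict]:
    """Depth-first traversal over every node in the tree."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)

def find_element(tree: Dict, role: str, name: str) -> Optional[Dict]:
    """Return the first node whose role matches and whose label matches case-insensitively."""
    wanted = name.lower()
    for node in walk(tree):
        if node["role"] == role and node["name"].lower() == wanted:
            return node
    return None

def center(bbox: List[int]) -> Tuple[int, int]:
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) // 2, (y0 + y1) // 2)

# "click the Submit button" resolved without looking at any pixels
submit = find_element(TREE, "button", "submit")
click_xy = center(submit["bbox"])
print(click_xy)
```

If `find_element` returns `None` (for example on a canvas-based app), the agent has to fall back to visual grounding.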

Set-of-Marks (SoM) Prompting

Asking a VLM to emit exact $(x, y)$ pixel coordinates as text is error-prone: small errors in the generated digits can move the click onto the wrong element. Set-of-Marks (SoM) prompting converts this coordinate prediction problem into a multiple-choice selection problem.

1. An external detector (such as Grounding DINO or a dedicated UI-element detector) proposes a bounding box for every actionable element in the screenshot.
2. A numeric tag (e.g., [1], [2], [3]) is drawn next to each box, directly onto the image pixels.
3. The VLM receives the tagged image and a prompt such as: "Which numbered element should be clicked to execute the search?"
4. The VLM generates a tag such as [3]. The system parses the tag and executes the click at the coordinates associated with tag 3.

This prompting technique reduces grounding errors and improves reliability on complex interfaces, provided the detector finds the relevant elements and the tags remain legible.
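The bookkeeping side of SoM (everything except the detector and the image overlay) can be sketched in a few lines. The function names and the example boxes are hypothetical; the detector output is assumed to be a list of pixel boxes.

```python
import re
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels

def assign_marks(boxes: List[Box]) -> Dict[int, Box]:
    """Give each detected box a 1-based numeric tag (drawing the overlay is omitted)."""
    return {i + 1: box for i, box in enumerate(boxes)}

def parse_mark(vlm_output: str, marks: Dict[int, Box]) -> Optional[int]:
    """Extract the first [n] tag from the model output; reject tags that were never drawn."""
    match = re.search(r"\[(\d+)\]", vlm_output)
    if match is None:
        return None
    tag = int(match.group(1))
    return tag if tag in marks else None

def click_point(tag: int, marks: Dict[int, Box]) -> Tuple[int, int]:
    """Click at the center of the box that the tag refers to."""
    x0, y0, x1, y1 = marks[tag]
    return ((x0 + x1) // 2, (y0 + y1) // 2)

# Assumed detector output for three actionable elements
boxes = [(10, 10, 110, 40), (10, 60, 110, 90), (300, 10, 360, 40)]
marks = assign_marks(boxes)

tag = parse_mark("The search button is [3].", marks)
point = click_point(tag, marks) if tag is not None else None
print(tag, point)
```

Rejecting tags that were never drawn is the main safety benefit: an invalid selection is detected before any click is issued, whereas a wrong pixel coordinate always lands somewhere.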

Tool Use and API Calling

Multimodal agents extend beyond simple UI interaction by calling external tools: web search APIs, Python execution environments, calculator functions, and SQL database queries. Tool use enables agents to perform computations that VLMs cannot do reliably (like exact arithmetic) and to access real-time information beyond their pretraining cutoff.

Tool Representation: Each tool is described to the VLM in the system prompt by its name, description, and parameter schema (e.g., JSON Schema). The VLM selects a tool by emitting a structured call in the agreed format:

json

{"tool": "web_search", "query": "current weather in Tokyo"}

The agent runtime (the orchestration code around the model, often written in Python) stops generation when the tool call is complete, parses the JSON, executes the corresponding API call, and appends the tool's text output to the VLM's context. The VLM then resumes generation conditioned on the new data.
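A minimal sketch of that intercept-and-append step is shown below, with validation added before execution. The registry layout (`TOOLS`) and the function `handle_tool_call` are assumptions for illustration, and the search tool is a stub rather than a real API.

```python
import json
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: tool name -> (required parameter names, implementation).
TOOLS: Dict[str, Tuple[List[str], Callable[..., str]]] = {
    "web_search": (["query"], lambda query: f"Search results for: {query}"),
}

def handle_tool_call(raw: str) -> str:
    """Parse a model-emitted JSON tool call, validate it, run it, and return text for the context."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"Error: malformed JSON ({exc.msg})"
    if not isinstance(call, dict) or "tool" not in call:
        return "Error: missing 'tool' field"
    name = call["tool"]
    if name not in TOOLS:
        return f"Error: unknown tool '{name}'"
    required, fn = TOOLS[name]
    args = {k: v for k, v in call.items() if k != "tool"}
    missing = [p for p in required if p not in args]
    if missing:
        return f"Error: missing parameters {missing}"
    try:
        return fn(**args)
    except TypeError as exc:  # e.g., unexpected extra parameters
        return f"Error: bad arguments ({exc})"

# The runtime appends both the call and its result to the running context.
context = ["<system prompt>", "<user goal>"]
model_output = '{"tool": "web_search", "query": "current weather in Tokyo"}'
context.append(model_output)
context.append(handle_tool_call(model_output))
print(context[-1])
```

Returning errors as text, rather than raising, lets the model read the failure on its next turn and try a corrected call.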

Tool Composition: Complex tasks require calling multiple tools in a strict logical sequence, where the output of Tool A becomes the input of Tool B. An agent asked to "create a bar chart of the top 5 programming languages" must execute: web_search() $\to$ extract_data() $\to$ python_execute(matplotlib) $\to$ return_image(). Composing these tools correctly requires the VLM to maintain a strict working memory of what state variables have been populated and what steps remain.
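The chain above can be sketched as a fixed plan over a shared state dictionary. All four tools here are hypothetical stubs: the search result is made-up placeholder text (not real language statistics), and a text bar chart stands in for a matplotlib call. In a real agent the VLM would choose each step rather than follow a hard-coded plan.

```python
from typing import Callable, Dict, List, Tuple

def web_search(query: str) -> str:
    # Stub with placeholder data; a real tool would call a search API.
    return "LangA 30, LangB 25, LangC 15, LangD 10, LangE 8, LangF 5"

def extract_data(text: str) -> List[Tuple[str, int]]:
    """Turn 'name value, name value' text into (name, value) rows."""
    rows = []
    for item in text.split(", "):
        name, share = item.rsplit(" ", 1)
        rows.append((name, int(share)))
    return rows

def top_k(rows: List[Tuple[str, int]], k: int = 5) -> List[Tuple[str, int]]:
    return sorted(rows, key=lambda r: r[1], reverse=True)[:k]

def render_chart(rows: List[Tuple[str, int]]) -> str:
    # Text bar chart as a stand-in for python_execute(matplotlib).
    return "\n".join(f"{name:<6} {'#' * share}" for name, share in rows)

def run_pipeline(query: str) -> Dict[str, object]:
    """Run each step in order; every step reads earlier outputs from `state`."""
    state: Dict[str, object] = {}
    plan: List[Tuple[str, Callable[[Dict[str, object]], object]]] = [
        ("search_text", lambda s: web_search(query)),
        ("rows", lambda s: extract_data(s["search_text"])),
        ("top_rows", lambda s: top_k(s["rows"])),
        ("chart", lambda s: render_chart(s["top_rows"])),
    ]
    for key, step in plan:
        state[key] = step(state)  # output of one tool becomes input to the next
    return state

state = run_pipeline("top programming languages")
print(state["chart"])
```

The `state` dictionary plays the role of the working memory described above: it records which intermediate values have been populated and makes a missing dependency fail loudly with a `KeyError`.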

ReAct: Multimodal Chain-of-Thought

A VLM that must output a tool call immediately after reading an observation often picks the wrong tool or invents arguments. ReAct (Reason + Act; Yao et al., 2022), originally proposed for text-only LLMs, is an agentic framework in which the model alternates between explicit reasoning steps and action steps.

The ReAct Cycle:

1. Observation: The VLM receives the screenshot and recent history.
2. Thought (CoT): The VLM generates a chain of thought: "The user wants to book a flight to London. I see the destination input field marked with tag [4]. I need to click [4] and type 'London'."
3. Action: The VLM generates the structured tool call: {"action": "type", "target": "[4]", "text": "London"}
4. Execution: The environment executes the typing and returns a new screenshot.

The "Thought" step makes the model verbalize what it sees and what it intends to do before it commits to the rigid JSON action format. Because generation is autoregressive, the action tokens are then conditioned on that explicit description, which tends to reduce malformed or unsupported tool calls. The reasoning is only as good as the perception behind it: an incorrect Thought propagates into the Action.

Implementation

python

import json
from dataclasses import dataclass
from typing import List, Dict, Optional
from abc import ABC, abstractmethod

@dataclass
class Observation:
    """An observation from the environment."""
    screenshot: str  # Image path or base64
    history: List[Dict]  # [{"role": "user/assistant", "content": ...}]

@dataclass
class Action:
    """An action selected by the agent."""
    tool: str
    parameters: Dict
    thought: str  # Reasoning before action

class Tool(ABC):
    """Base class for agent tools."""
    
    @property
    @abstractmethod
    def name(self) -> str:
        pass
    
    @property
    @abstractmethod
    def description(self) -> str:
        pass
    
    @property
    @abstractmethod
    def schema(self) -> Dict:
        """JSON Schema for tool parameters."""
        pass
    
    @abstractmethod
    def execute(self, **params) -> str:
        """Execute the tool and return result."""
        pass


class WebSearchTool(Tool):
    """Web search tool for real-time information."""
    
    @property
    def name(self) -> str:
        return "web_search"
    
    @property
    def description(self) -> str:
        return "Search the web for information. Use for factual queries."
    
    @property
    def schema(self) -> Dict:
        return {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    
    def execute(self, query: str) -> str:
        # In production: call actual search API
        return f"Search results for: {query}"


class CalculatorTool(Tool):
    """Mathematical calculator for exact arithmetic."""
    
    @property
    def name(self) -> str:
        return "calculator"
    
    @property
    def description(self) -> str:
        return "Perform exact mathematical calculations."
    
    @property
    def schema(self) -> Dict:
        return {
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "Math expression"}
            },
            "required": ["expression"]
        }
    
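    def execute(self, expression: str) -> str:
        # Demo only: eval() on model-generated text is unsafe. Emptying the
        # builtins narrows the risk but does not eliminate it; in production,
        # parse the expression and allow only arithmetic operations.
        try:
            result = eval(expression, {"__builtins__": {}}, {})
            return str(result)
        except Exception:
            return "Error: Invalid expression"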


class VLMAgent:
    """
    VLM-based agent implementing ReAct (Reason + Act) loop.
    
    Args:
        vlm: The vision-language model (e.g., LLaVA, GPT-4V)
        tools: List of available tools
        max_steps: Maximum ReAct steps before termination
    """
    
    def __init__(self, vlm, tools: List[Tool], max_steps: int = 10):
        self.vlm = vlm
        self.tools = {t.name: t for t in tools}
        self.max_steps = max_steps
    
    def build_system_prompt(self) -> str:
        """Build the system prompt describing available tools."""
        tools_desc = []
        for tool in self.tools.values():
            tools_desc.append(
                f"- {tool.name}: {tool.description}\n"
                f"  Parameters: {json.dumps(tool.schema)}"
            )
        return (
            "You are a multimodal agent. Given an observation and goal, you must:\n"
            "1. THINK: Reason about what action to take\n"
            "2. ACT: Call a tool using the specified JSON format\n\n"
            "Available tools:\n" + "\n\n".join(tools_desc)
        )
    
    def parse_action(self, response: str) -> Optional[Action]:
        """Parse the model's response to extract thought and action."""
        try:
            # Look for JSON action block
            if '{"tool"' in response or '{"action"' in response:
                # Extract JSON from response
                start = response.find('{')
                end = response.rfind('}') + 1
                json_str = response[start:end]
                data = json.loads(json_str)
                
                tool_name = data.get('tool') or data.get('action', 'unknown')
                params = {k: v for k, v in data.items() if k not in ['tool', 'action']}
                
                # Extract thought (text before JSON)
                thought = response[:start].strip()
                
                return Action(tool=tool_name, parameters=params, thought=thought)
        except json.JSONDecodeError:
            pass
        return None
    
    def run(self, observation: Observation, goal: str) -> List[Dict]:
        """
        Run the ReAct loop.
        
        Args:
            observation: Current screenshot and history
            goal: Task goal (e.g., "Book a flight to Tokyo")
        
        Returns:
            List of per-step dicts (step, thought, action, result); a failed
            parse is recorded as {"step": ..., "status": "parse_error"}
        """
        trajectory = []
        history = observation.history.copy()
        
        for step in range(self.max_steps):
            # Build prompt with current observation
            prompt = self.build_system_prompt()
            prompt += f"\n\nGoal: {goal}\n"
            prompt += "History:\n" + "\n".join(
                f"{h['role']}: {h['content']}" for h in history[-5:]
            )
            prompt += "\n\nWhat should I do? Think and then call a tool."
            
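            # Get VLM response (would call actual API). The screenshot is static in
            # this sketch; a real loop would capture a fresh one after each action.
            response = self.vlm.generate(prompt, observation.screenshot)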
            
            # Parse action
            action = self.parse_action(response)
            if not action:
                trajectory.append({"step": step, "status": "parse_error"})
                break
            
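            # Execute tool; report bad arguments or tool failures back to the model
            if action.tool in self.tools:
                try:
                    result = self.tools[action.tool].execute(**action.parameters)
                except Exception as exc:
                    result = f"Error: {type(exc).__name__}: {exc}"
            else:
                result = f"Error: Unknown tool '{action.tool}'"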
            
            # Record in trajectory
            trajectory.append({
                "step": step,
                "thought": action.thought,
                "action": {action.tool: action.parameters},
                "result": result
            })
            
            # Add to history
            history.append({"role": "assistant", "content": response})
            history.append({"role": "tool", "content": result})
            
            # Check for termination
            if action.tool == "finish" or "task complete" in result.lower():
                break
        
        return trajectory


# Example usage
class MockVLM:
    """Mock VLM for demonstration."""
    def generate(self, prompt: str, image: str) -> str:
        return json.dumps({"tool": "web_search", "query": "example"})

# Initialize agent with tools
tools = [WebSearchTool(), CalculatorTool()]
agent = VLMAgent(vlm=MockVLM(), tools=tools, max_steps=5)

# Run on an observation
obs = Observation(screenshot="screenshot.png", history=[])
trajectory = agent.run(obs, goal="Find the sum of 2+2")

for step in trajectory:
    print(f"Step {step['step']}: {step}")

Key implementation patterns:

Tool abstraction: Each tool implements name, description, schema, and execute()
ReAct parsing: Extract JSON action block from model's text response
History management: Maintain conversation history for context
Termination detection: Stop when agent signals completion

Production considerations:

Use safer JSON parsing (handle malformed responses)
Add retry logic for failed tool calls
Implement token budget for context truncation
Add structured output parsing for reliability
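As one possible take on the first point, the sketch below scans for the first decodable JSON object that carries a "tool" or "action" key, using the standard library's `json.JSONDecoder.raw_decode`. The function name `extract_action` is hypothetical. Unlike slicing from the first `{` to the last `}`, this tolerates stray braces in the Thought text and trailing text after the action.

```python
import json
from typing import Optional, Tuple

def extract_action(response: str) -> Optional[Tuple[str, dict]]:
    """Return (thought, action_dict) for the first JSON object with a 'tool' or 'action' key."""
    decoder = json.JSONDecoder()
    pos = response.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos and ignores what follows it
            obj, _end = decoder.raw_decode(response, pos)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and ("tool" in obj or "action" in obj):
            return response[:pos].strip(), obj
        pos = response.find("{", pos + 1)  # try the next candidate brace
    return None

text = (
    "I see {curly} braces in the page title. I should search first.\n"
    '{"tool": "web_search", "query": "x"} and then stop }'
)
parsed = extract_action(text)
print(parsed)
```

When the model API supports constrained or schema-guided decoding, that is usually preferable to post-hoc parsing; this sketch only covers the fallback case.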

Key takeaways

Multimodal agents extend passive VLMs into sequential decision-making by implementing a policy $\pi(a_t \mid o_{1:t}, g)$ over a POMDP. UI grounding maps natural language intentions to interface elements, using techniques like Set-of-Marks (SoM) prompting to replace error-prone pixel-coordinate prediction with selection among numbered tags. Tool use lets agents work around weaknesses of the underlying model by delegating exact arithmetic and real-time search to external APIs. The ReAct framework interleaves observation, explicit verbal reasoning, and action generation, and serves as the control loop for multi-step task execution.

Conceptual questions

POMDP Credit Assignment: A multimodal web agent is given the goal "Buy a red umbrella from Amazon." It takes 15 correct UI actions (searching, filtering by color, adding to cart), but on step 16, it clicks "Empty Cart" instead of "Checkout", failing the task. In a reinforcement learning setup where the only reward is $r=+1$ for task success and $r=0$ otherwise, explain the "credit assignment problem" mathematically. How does the sparsity of the reward signal make training this VLM policy $\pi_\theta$ via PPO far less sample-efficient than Supervised Fine-Tuning (SFT) on human demonstrations?
Set-of-Marks Context Limits: You design a Set-of-Marks (SoM) agent for a dense financial dashboard containing 400 interactive data cells. You overlay 400 numeric tags on the image. Describe two distinct failure modes that will occur: (1) A visual failure mode related to the resolution of the ViT patch tokens ( $P \times P$ ), and (2) a semantic failure mode related to the VLM's ability to maintain attention across 400 distinct visual regions. Propose a hierarchical SoM approach to bypass these limits.
ReAct Hallucination Debugging: Your ReAct agent is tasked with scraping data from a table. The Thought trace outputs: "The table contains 5 rows. I will extract the data from row 1." However, the screenshot actually contains an empty table (0 rows). The agent then executes a Python script attempting to index table[0] and crashes. Explain the "hallucination cascade" occurring here. Once the text decoder generates the false premise "The table contains 5 rows", why does subsequent generation tend to stay consistent with that false premise instead of the true visual evidence?
Tool Composition Error Recovery: An agent must compose three tools: search(), parse_pdf(), and email_user(). The parse_pdf() API goes offline and returns a standard HTTP 500 error string. Describe the text processing that must happen within the VLM's context window to prevent the agent from blindly passing the HTTP 500 string into the email_user() function. What specific instruction-tuning data must the VLM have seen during Stage 2 instruction tuning to know how to execute a retry or fallback behavior?
Sim2Real for UI Agents: A VLM is trained purely on screenshots of macOS interfaces. It achieves 95% accuracy on UI grounding. You deploy it on a Windows machine. The agent completely fails to find the "Close Window" button, because macOS places the red close button at the top left, while Windows places the "X" at the top right. Using the terminology of distribution shift (Week 9), is this a failure of covariate shift or spurious correlation? How does this demonstrate that UI agents suffer from the same kind of "Sim2Real" embodiment gap as physical robots?

Solutions

Credit assignment. With reward only at success, one wrong action among sixteen zeroes the whole episode's return, and PPO must propagate that single scalar back across all steps with no per-step signal — high variance, needing enormous sample counts to discover which action mattered. SFT on human demos provides a dense correct-action label at every step, so it is far more sample-efficient.
Set-of-Marks limits. (1) Visual: with 400 tags on one screenshot, each tag spans only a few pixels, often less than a single $P \times P$ ViT patch, so digits blur, overlap, and become unreadable. (2) Semantic: the VLM cannot ground reliably across 400 distinct regions in one pass; attention dilutes and tag IDs get mismatched. Hierarchical SoM addresses both: segment the dashboard into a few regions, select one, then zoom in and re-mark only that region's cells.
Hallucination cascade. Once the decoder emits "5 rows," that false text enters the context and subsequent self-attention conditions on it as established fact (autoregressive models attend to their own outputs), so later tokens stay consistent with the false premise instead of rechecking the contradicting visual tokens. The strong textual prior it just created outweighs the visual evidence — confirmation lock-in.
Tool error recovery. The VLM must read the HTTP 500 string in context, classify it as an error rather than data, and branch to a retry or fallback instead of piping it into email_user(). This requires Stage-2 instruction-tuning data containing tool-error traces and recovery behavior (observe error → reason → retry/alternative), so the model learned to inspect tool outputs before chaining them.
UI sim2real. Training only on macOS then failing on Windows (close button left vs right) is covariate shift — the input distribution $P(x)$ changed while the task is unchanged — compounded by reliance on a spurious positional cue (close = top-left). It is the same embodiment/distribution gap robots face going sim→real: a policy overfit to training-environment statistics breaks under deployment shift.
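For the credit-assignment solution, a short numeric sketch may help. It computes the return-to-go $G_t = \sum_{k \ge 0} \gamma^k r_{t+k}$ for the 16-step umbrella episode; the function name and $\gamma = 0.99$ are assumptions for illustration.

```python
from typing import List

def returns_to_go(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """G_t = sum_{k>=0} gamma^k * r_{t+k}, computed backwards over one episode."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

failed = [0.0] * 16               # 15 good actions, then "Empty Cart": no reward anywhere
succeeded = [0.0] * 15 + [1.0]    # same prefix, correct final click

g_failed = returns_to_go(failed)
g_succeeded = returns_to_go(succeeded)

print("failed episode, step 1 return:   ", g_failed[0])
print("succeeded episode, step 1 return:", round(g_succeeded[0], 3))
```

In the failed episode every step, including the 15 correct ones, receives a return of exactly zero, so a return-weighted policy-gradient update on that trajectory alone carries no information about which step was wrong. A learned value baseline can help, but it must itself be fit from the same sparse signal, whereas SFT supplies a target action at every step.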

Looking ahead

Multimodal agents demonstrate the power of VLMs as general perception-reasoning-action systems in the digital world. The ultimate, most demanding frontier of these capabilities is physical robotics—where visual perception must connect to real-world motor control under strict latency, embodiment, and safety constraints.

Week 12: Vision-Language Models for Robotics. We examine how VLMs are integrated into hierarchical robotic control architectures, how language-conditioned visual grounding connects instructions to manipulation targets, and how closed-loop VLM-guided control operates in the physical world.

Purpose of this lecture#

From Passive Model to Active Agent#

The Agent Loop: At each discrete time step $t$ :

The agent receives an observation $o_t$ (e.g., a screenshot of a webpage, or a robot sensor reading).
The agent takes an action $a_t \in \mathcal{A}$ (e.g., a mouse click coordinate, a keyboard input, an API call).
The environment transitions to a new hidden state $s_{t+1}$ and returns a new observation $o_{t+1}$ and optionally a reward $r_t$ .

The VLM as a POMDP Policy#

The agent's decision process is formally modeled as a Partially Observable Markov Decision Process (POMDP), connecting directly back to the foundational theory established in Course 1:

\langle \mathcal{S}, \mathcal{A}, T, R, \Omega, O, \gamma \rangle

The agent's policy $\pi$ maps the observation history and goal to a probability distribution over actions:

a_t \sim \pi_\theta(a \mid o_{1:t}, g)

UI Grounding and Affordance Mapping#

Accessibility Tree Grounding#

Set-of-Marks (SoM) Prompting#

An external classical CV tool (like Grounding DINO or a UI-element detector) draws a bounding box around every actionable element on the screenshot.
A numeric tag (e.g., [1], [2], [3]) is visually overlaid next to each box directly onto the image pixels.
The VLM is fed this tagged image and prompted: "Which numbered element should be clicked to execute the search?"
The VLM simply generates the text token [3]. The system parses this token and executes the physical click at the coordinates associated with tag 3.

This architectural hack drastically reduces grounding hallucination and improves reliability on complex interfaces.

Tool Use and API Calling#

json

{"tool": "web_search", "query": "current weather in Tokyo"}

ReAct: Multimodal Chain-of-Thought#

The ReAct Cycle:

Observation: The VLM receives the screenshot and recent history.
Thought (CoT): The VLM generates an autoregressive chain-of-thought: "The user wants to book a flight to London. I see the destination input field marked with tag [4]. I need to click [4] and type 'London'."
Action: The VLM generates the structured tool call: {"action": "type", "target": "[4]", "text": "London"}
Execution: The environment executes the typing and returns a new screenshot.

Implementation#

python · runs in browser

import json
from dataclasses import dataclass
from typing import List, Dict, Optional, Callable
from abc import ABC, abstractmethod

@dataclass
class Observation:
    """An observation from the environment."""
    screenshot: str  # Image path or base64
    history: List[Dict]  # [{"role": "user/assistant", "content": ...}]

@dataclass
class Action:
    """An action selected by the agent."""
    tool: str
    parameters: Dict
    thought: str  # Reasoning before action

class Tool(ABC):
    """Base class for agent tools."""
    
    @property
    @abstractmethod
    def name(self) -> str:
        pass
    
    @property
    @abstractmethod
    def description(self) -> str:
        pass
    
    @property
    @abstractmethod
    def schema(self) -> Dict:
        """JSON Schema for tool parameters."""
        pass
    
    @abstractmethod
    def execute(self, **params) -> str:
        """Execute the tool and return result."""
        pass


class WebSearchTool(Tool):
    """Web search tool for real-time information."""
    
    @property
    def name(self) -> str:
        return "web_search"
    
    @property
    def description(self) -> str:
        return "Search the web for information. Use for factual queries."
    
    @property
    def schema(self) -> Dict:
        return {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    
    def execute(self, query: str) -> str:
        # In production: call actual search API
        return f"Search results for: {query}"


class CalculatorTool(Tool):
    """Mathematical calculator for exact arithmetic."""
    
    @property
    def name(self) -> str:
        return "calculator"
    
    @property
    def description(self) -> str:
        return "Perform exact mathematical calculations."
    
    @property
    def schema(self) -> Dict:
        return {
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "Math expression"}
            },
            "required": ["expression"]
        }
    
    def execute(self, expression: str) -> str:
        # In production: use safe eval or math library
        try:
            result = eval(expression)  # Use safer alternatives in production!
            return str(result)
        except:
            return "Error: Invalid expression"


class VLMAgent:
    """
    VLM-based agent implementing ReAct (Reason + Act) loop.
    
    Args:
        vlm: The vision-language model (e.g., LLaVA, GPT-4V)
        tools: List of available tools
        max_steps: Maximum ReAct steps before termination
    """
    
    def __init__(self, vlm, tools: List[Tool], max_steps: int = 10):
        self.vlm = vlm
        self.tools = {t.name: t for t in tools}
        self.max_steps = max_steps
    
    def build_system_prompt(self) -> str:
        """Build the system prompt describing available tools."""
        tools_desc = []
        for tool in self.tools.values():
            tools_desc.append(
                f"- {tool.name}: {tool.description}\n"
                f"  Parameters: {json.dumps(tool.schema)}"
            )
        return (
            "You are a multimodal agent. Given an observation and goal, you must:\n"
            "1. THINK: Reason about what action to take\n"
            "2. ACT: Call a tool using the specified JSON format\n\n"
            "Available tools:\n" + "\n\n".join(tools_desc)
        )
    
    def parse_action(self, response: str) -> Optional[Action]:
        """Parse the model's response to extract thought and action."""
        try:
            # Look for JSON action block
            if '{"tool"' in response or '{"action"' in response:
                # Extract JSON from response
                start = response.find('{')
                end = response.rfind('}') + 1
                json_str = response[start:end]
                data = json.loads(json_str)
                
                tool_name = data.get('tool') or data.get('action', 'unknown')
                params = {k: v for k, v in data.items() if k not in ['tool', 'action']}
                
                # Extract thought (text before JSON)
                thought = response[:start].strip()
                
                return Action(tool=tool_name, parameters=params, thought=thought)
        except json.JSONDecodeError:
            pass
        return None
    
    def run(self, observation: Observation, goal: str) -> List[Dict]:
        """
        Run the ReAct loop.
        
        Args:
            observation: Current screenshot and history
            goal: Task goal (e.g., "Book a flight to Tokyo")
        
        Returns:
            List of (thought, action, result) tuples
        """
        trajectory = []
        history = observation.history.copy()
        
        for step in range(self.max_steps):
            # Build prompt with current observation
            prompt = self.build_system_prompt()
            prompt += f"\n\nGoal: {goal}\n"
            prompt += f"History:\n" + "\n".join(
                f"{h['role']}: {h['content']}" for h in history[-5:]
            )
            prompt += "\n\nWhat should I do? Think and then call a tool."
            
            # Get VLM response (would call actual API)
            response = self.vlm.generate(prompt, observation.screenshot)
            
            # Parse action
            action = self.parse_action(response)
            if not action:
                trajectory.append({"step": step, "status": "parse_error"})
                break
            
            # Execute tool
            if action.tool in self.tools:
                result = self.tools[action.tool].execute(**action.parameters)
            else:
                result = f"Error: Unknown tool '{action.tool}'"
            
            # Record in trajectory
            trajectory.append({
                "step": step,
                "thought": action.thought,
                "action": {action.tool: action.parameters},
                "result": result
            })
            
            # Add to history
            history.append({"role": "assistant", "content": response})
            history.append({"role": "tool", "content": result})
            
            # Check for termination
            if action.tool == "finish" or "task complete" in result.lower():
                break
        
        return trajectory


# Example usage
class MockVLM:
    """Mock VLM for demonstration."""
    def generate(self, prompt: str, image: str) -> str:
        return json.dumps({"tool": "web_search", "query": "example"})

# Initialize agent with tools
tools = [WebSearchTool(), CalculatorTool()]
agent = VLMAgent(vlm=MockVLM(), tools=tools, max_steps=5)

# Run on an observation
obs = Observation(screenshot="screenshot.png", history=[])
trajectory = agent.run(obs, goal="Find the sum of 2+2")

for step in trajectory:
    print(f"Step {step['step']}: {step}")

Key implementation patterns:

Tool abstraction: Each tool implements name, description, schema, and execute()
ReAct parsing: Extract JSON action block from model's text response
History management: Maintain conversation history for context
Termination detection: Stop when agent signals completion

Production considerations:

Use safer JSON parsing (handle malformed responses)
Add retry logic for failed tool calls
Implement token budget for context truncation
Add structured output parsing for reliability

Key takeaways#

Conceptual questions#

POMDP Credit Assignment: A multimodal web agent is given the goal "Buy a red umbrella from Amazon." It takes 15 correct UI actions (searching, filtering by color, adding to cart), but on step 16, it clicks "Empty Cart" instead of "Checkout", failing the task. In a reinforcement learning setup where the only reward is $r=+1$ for task success and $r=0$ otherwise, explain the "credit assignment problem" mathematically. How does the sparsity of the reward signal make training this VLM policy $\pi_\theta$ via PPO computationally intractable compared to Supervised Fine-Tuning (SFT) on human demonstrations?
Set-of-Marks Context Limits: You design a Set-of-Marks (SoM) agent for a dense financial dashboard containing 400 interactive data cells. You overlay 400 numeric tags on the image. Describe two distinct failure modes that will occur: (1) A visual failure mode related to the resolution of the ViT patch tokens ( $P \times P$ ), and (2) a semantic failure mode related to the VLM's ability to maintain attention across 400 distinct visual regions. Propose a hierarchical SoM approach to bypass these limits.
ReAct Hallucination Debugging: Your ReAct agent is tasked with scraping data from a table. The Thought trace outputs: "The table contains 5 rows. I will extract the data from row 1." However, the screenshot actually contains an empty table (0 rows). The agent then executes a Python script that attempts to index table[0] and crashes. Explain the "hallucination cascade" occurring here. Once the text decoder generates the false premise "The table contains 5 rows", why do subsequent decoding steps tend to stay consistent with that premise instead of re-checking the contradicting visual evidence?
Tool Composition Error Recovery: An agent must compose three tools: search(), parse_pdf(), and email_user(). The parse_pdf() API goes offline and returns a standard HTTP 500 error string. Describe the exact text processing that must happen within the VLM's context window to prevent the agent from blindly passing the HTTP 500 string into the email_user() function. What specific data must the VLM have seen during Stage 2 instruction tuning to know how to execute a retry or fallback behavior?
Sim2Real for UI Agents: A VLM is trained purely on screenshots of macOS interfaces. It achieves 95% accuracy on UI grounding. You deploy it on a Windows machine. The agent completely fails to find the "Close Window" button, because macOS puts the close button at the top left of a window, while Windows puts it at the top right. Using the terminology of distribution shift (Week 9), is this a failure of covariate shift or spurious correlation? How does this demonstrate that UI agents suffer from the same "Sim2Real" embodiment gaps as physical robots?

Solutions

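Credit assignment. Let the episode return be $R(\tau) = \sum_{t=1}^{T} r_t$, which is 1 on success and 0 otherwise. The policy-gradient estimate $\nabla_\theta J \approx \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid g, o_{1:t}, a_{1:t-1})\, \hat{A}_t$ has to derive every per-step advantage $\hat{A}_t$ from that single terminal scalar. With $R(\tau) = 0$, the 15 correct actions are penalized alongside the one wrong action on step 16 unless a learned critic can tell them apart, so the estimate has high variance and PPO needs an enormous number of sampled episodes to discover which action mattered. SFT on human demonstrations provides a dense correct-action label at every step, so it is far more sample-efficient.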
Set-of-Marks limits. (1) Visual: with 400 tags on one screenshot, each tag shrinks toward or below the size of a single $P \times P$ ViT patch, so one patch token can cover several digits or overlapping tags and the numbers become unreadable. (2) Semantic: the VLM cannot hold reliable grounding across 400 distinct regions in one pass; attention dilutes and IDs get mismatched. Hierarchical SoM addresses both: segment the dashboard into a few regions, select one, then zoom in and re-mark only that region's cells.
Hallucination cascade. Once the decoder emits "5 rows," that false text enters the context and subsequent self-attention conditions on it as established fact (autoregressive models attend to their own outputs), so later tokens stay consistent with the false premise instead of rechecking the contradicting visual tokens. The strong textual prior it just created outweighs the visual evidence — confirmation lock-in.
Tool error recovery. The VLM must read the HTTP 500 string in context, classify it as an error rather than data, and branch to a retry or fallback instead of piping it into email_user(). This requires Stage-2 instruction-tuning data containing tool-error traces and recovery behavior (observe error → reason → retry/alternative), so the model learned to inspect tool outputs before chaining them.
UI sim2real. Training only on macOS then failing on Windows (close button left vs right) is covariate shift — the input distribution $P(x)$ changed while the task is unchanged — compounded by reliance on a spurious positional cue (close = top-left). It is the same embodiment/distribution gap robots face going sim→real: a policy overfit to training-environment statistics breaks under deployment shift.
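
The hierarchical Set-of-Marks idea from the second solution can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not a reference implementation: it assumes the 400 cells form a regular 20 × 20 grid (the question only says 400 cells), and `build_regions` and `resolve_mark` are hypothetical names. The point is the arithmetic: two rounds of at most 25 marks replace one round of 400.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row, col) of a dashboard cell


def build_regions(n_rows: int, n_cols: int, region_size: int) -> Dict[int, List[Cell]]:
    """Group grid cells into square regions keyed by a coarse mark ID (from 1)."""
    regions_per_row = -(-n_cols // region_size)  # ceiling division
    regions: Dict[int, List[Cell]] = {}
    for row in range(n_rows):
        for col in range(n_cols):
            region_id = (row // region_size) * regions_per_row + (col // region_size) + 1
            regions.setdefault(region_id, []).append((row, col))
    return regions


def resolve_mark(regions: Dict[int, List[Cell]], region_id: int, local_mark: int) -> Cell:
    """Map (coarse region mark, fine mark within the zoomed region) to a cell."""
    return regions[region_id][local_mark - 1]


# Assumed layout: 20 x 20 cells split into 5 x 5 regions.
regions = build_regions(n_rows=20, n_cols=20, region_size=5)
coarse_marks = len(regions)                               # marks in the first pass
fine_marks = max(len(cells) for cells in regions.values())  # marks after zooming
print(coarse_marks, fine_marks)  # 16 25
print(resolve_mark(regions, region_id=1, local_mark=7))  # (1, 1)
```

In the first pass the VLM sees only the coarse region marks; after it picks a region, the agent crops and upscales that region and overlays the fine marks, so each tag spans many more patches than in the flat 400-tag overlay.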
