Week 11: Multimodal Agents and Tool Use

✦Learning Outcomes
  • Implement UI grounding and web navigation with VLMs
  • Apply tool use and API calling with VLMs
  • Connect VLM (vision-language model) agents to RL (reinforcement learning) concepts from Course 1
◆Prerequisites
  • Week 6: LLaVA - Conversational VLM foundations
  • Course 1 (RL) background is helpful for POMDP (partially observable Markov decision process) concepts

Purpose of this lecture

A VLM that can describe images and answer questions is fundamentally a passive perception system. An agent, however, uses those perceptual capabilities to take actions (clicking UI elements, calling external APIs, browsing the web, or controlling a robot) to autonomously complete multi-step tasks in dynamic environments.

The transition from a passive VLM to an active agent requires modeling decision-making under uncertainty, grounding abstract language intentions into actionable physical/digital affordances, and composing multiple discrete tools in service of a high-level goal. This lecture bridges the multimodal architectures developed over the last 10 weeks with the reinforcement learning theory from Course 1, examining how VLMs function as the policy engine within a Partially Observable Markov Decision Process (POMDP).


From Passive Model to Active Agent

A VLM trained for perception answers the question: "Given this visual input, what is true?" An agent answers: "Given this visual input and a goal, what action should I take?" The distinction is not merely capability but feedback: an agent's actions change the state of the environment, which updates future visual observations, which subsequently influence future actions. This creates a sequential decision process rather than a one-shot prediction problem.

The Agent Loop: At each discrete time step $t$:

  1. The agent receives an observation $o_t$ (e.g., a screenshot of a webpage, or a robot sensor reading).
  2. The agent takes an action $a_t \in \mathcal{A}$ (e.g., a mouse click coordinate, a keyboard input, an API call).
  3. The environment transitions to a new hidden state $s_{t+1}$ and returns a new observation $o_{t+1}$ and optionally a reward $r_t$.
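
The three steps above can be sketched as a short loop. This is a minimal illustration, not the lecture's implementation: `ToyFormEnv`, `toy_policy`, and `run_episode` are hypothetical names, observations are text stand-ins for screenshots, and the hand-written policy stands in for a VLM.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyFormEnv:
    """Hypothetical environment: fill a field, then submit the form."""
    filled: bool = False      # hidden state s_t, never shown directly to the agent
    submitted: bool = False

    def observe(self) -> str:
        # o_t: a text stand-in for a screenshot
        return "form: field=" + ("filled" if self.filled else "empty")

    def step(self, action: str) -> Tuple[str, float, bool]:
        # Transition s_t -> s_{t+1}, then emit o_{t+1}, r_t, and a done flag
        if action == "type":
            self.filled = True
        elif action == "submit" and self.filled:
            self.submitted = True
        reward = 1.0 if self.submitted else 0.0
        return self.observe(), reward, self.submitted


def toy_policy(observation: str) -> str:
    # Stand-in for the VLM: choose an action from the current observation
    return "type" if "empty" in observation else "submit"


def run_episode(env: ToyFormEnv, max_steps: int = 5) -> List[Tuple[int, str, float]]:
    obs = env.observe()                        # step 1: receive o_t
    trace = []
    for t in range(max_steps):
        action = toy_policy(obs)               # step 2: choose a_t
        obs, reward, done = env.step(action)   # step 3: environment transitions
        trace.append((t, action, reward))
        if done:
            break
    return trace


trace = run_episode(ToyFormEnv())
print(trace)
```

Note that the policy only ever sees the string returned by `observe()`, never the `filled` or `submitted` fields, which is what makes the setting partially observable.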

Context as Working Memory: Unlike a single-step VLM query, an agent must maintain context across time. The agent's state representation at step $t$ consists of the overall goal $g$, the history of past observations and actions $(o_1, a_1, \ldots, o_{t-1}, a_{t-1})$, and the current observation $o_t$. Modern multimodal agents simply pass this entire unrolled history into the VLM's context window, effectively using the LLM's Transformer context as its working memory.
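
As a rough sketch of how that state representation might be serialized into a prompt, the function below concatenates the goal, past observation/action pairs, and the current observation. The name `build_context`, the text format, and the `max_turns` truncation are all assumptions for illustration; real agents would attach screenshots as images rather than strings.

```python
from typing import List, Tuple


def build_context(goal: str,
                  history: List[Tuple[str, str]],
                  current_obs: str,
                  max_turns: int = 3) -> str:
    """Serialize g, the last `max_turns` (o_i, a_i) pairs, and o_t into one prompt."""
    lines = [f"Goal: {goal}"]
    first = max(0, len(history) - max_turns)   # index of the oldest turn kept
    for i, (obs, act) in enumerate(history[first:], start=first + 1):
        lines.append(f"Observation {i}: {obs}")
        lines.append(f"Action {i}: {act}")
    lines.append(f"Observation {len(history) + 1}: {current_obs}")
    lines.append("Next action:")
    return "\n".join(lines)


history = [("home page", "click [2]"), ("search page", "type 'umbrella'")]
prompt = build_context("Buy a red umbrella", history, "results page")
print(prompt)
```

Truncating to the most recent turns is one simple way to stay inside a finite context window; it trades away older observations, so a summary of dropped turns is a common alternative.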


The VLM as a POMDP Policy

The agent's decision process is formally modeled as a Partially Observable Markov Decision Process (POMDP), connecting directly back to the foundational theory established in Course 1:

$$\langle \mathcal{S}, \mathcal{A}, T, R, \Omega, O, \gamma \rangle$$

where $\mathcal{S}$ is the unobserved true world state space, $\mathcal{A}$ is the action space, $T(s_{t+1} \mid s_t, a_t)$ is the transition dynamics, $R(s_t, a_t)$ is the reward function, $\Omega$ is the observation space (screenshots), $O(o_t \mid s_t)$ is the observation probability, and $\gamma$ is the discount factor.

The agent's policy $\pi$ maps the observation history and goal to a probability distribution over actions:

$$a_t \sim \pi_\theta(a \mid o_{1:t}, g)$$

For a VLM-based agent, the policy $\pi_\theta$ is implemented by the frozen (or fine-tuned) weights of the VLM. The context is formatted as a multimodal prompt, and the VLM autoregressively generates the next action as a structured text output (e.g., a JSON string describing the action type and parameters).
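
Because the action arrives as generated text, it usually has to be decoded and checked against the action space before anything is executed. The sketch below assumes a small hypothetical action space (`ACTION_SPACE`) and a hypothetical helper `decode_action`; the key names follow the JSON style used in this lecture but are otherwise illustrative.

```python
import json
from typing import Any, Dict

# Hypothetical action space A for a UI agent: action type -> required parameters
ACTION_SPACE = {
    "click": {"target"},
    "type": {"target", "text"},
    "scroll": {"direction"},
}


def decode_action(generated_text: str) -> Dict[str, Any]:
    """Turn the model's text output into a validated action, or raise ValueError."""
    try:
        action = json.loads(generated_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")
    kind = action.get("action")
    if kind not in ACTION_SPACE:
        raise ValueError(f"unknown action type: {kind!r}")
    missing = ACTION_SPACE[kind] - set(action)
    if missing:
        raise ValueError(f"missing parameters: {sorted(missing)}")
    return action


decoded = decode_action('{"action": "type", "target": "[4]", "text": "London"}')
print(decoded)
```

Rejecting malformed output here, rather than at execution time, keeps invalid samples from the policy out of the environment.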


UI Grounding and Affordance Mapping

UI grounding is the task of identifying and interacting with interface elements from a screenshot—clicking buttons, filling forms, navigating menus—based on a natural language instruction (e.g., "click the Submit button", "type 'search query' in the search field").

UI grounding is vastly more difficult than standard visual bounding-box grounding. UI elements are small, visually ambiguous, and identified by their function rather than their appearance (a "Submit" button looks radically different across 100 different websites).

Accessibility Tree Grounding

Most operating systems and browsers expose an accessibility tree: a structured HTML/XML-style representation of UI elements with their labels, roles, and bounding box coordinates. Early web agents bypassed vision entirely, feeding this tree (or the page HTML) into an LLM as text. However, many modern web applications (such as canvas-based apps or sites with poor accessibility markup) have incomplete or broken accessibility trees, necessitating pure visual grounding.
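
To make the text-only approach concrete, the sketch below flattens a toy accessibility tree into numbered lines that an LLM could read. The `AXNode` class, its field names, and the set of interactive roles are hypothetical; real accessibility APIs expose richer and differently named attributes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AXNode:
    """Hypothetical accessibility-tree node (field names are illustrative)."""
    role: str
    label: str = ""
    bbox: Tuple[int, int, int, int] = (0, 0, 0, 0)   # x, y, width, height
    children: List["AXNode"] = field(default_factory=list)


INTERACTIVE_ROLES = {"button", "textbox", "link", "checkbox"}


def serialize(root: AXNode) -> List[str]:
    """Depth-first walk that emits one numbered line per interactive element."""
    lines: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.role in INTERACTIVE_ROLES:
            lines.append(f"[{len(lines) + 1}] {node.role} '{node.label}' bbox={node.bbox}")
        stack.extend(reversed(node.children))   # reversed so children pop in document order
    return lines


page = AXNode("form", children=[
    AXNode("textbox", "Search", (10, 10, 200, 30)),
    AXNode("button", "Submit", (220, 10, 80, 30)),
    AXNode("group", children=[AXNode("link", "Help", (10, 60, 40, 20))]),
])
element_lines = serialize(page)
print("\n".join(element_lines))
```

If a node is missing its label or role, as happens with broken trees, it simply drops out of this listing, which is the failure mode that motivates visual grounding.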

Set-of-Marks (SoM) Prompting

Predicting exact $(x, y)$ pixel coordinates with a VLM is noisy and error-prone, and a miss of a few pixels can land on the wrong element. Set-of-Marks (SoM) prompting converts the coordinate regression problem into a multiple-choice classification problem.

  1. An external detector (such as Grounding DINO or a dedicated UI-element detector) draws a bounding box around every actionable element on the screenshot.
  2. A numeric tag (e.g., [1], [2], [3]) is visually overlaid next to each box directly onto the image pixels.
  3. The VLM is fed this tagged image and prompted: "Which numbered element should be clicked to execute the search?"
  4. The VLM simply generates the text token [3]. The system parses this token and executes the click at the coordinates associated with tag 3.

This prompting technique reduces grounding hallucination and improves reliability on complex interfaces, because the model only has to choose among a fixed set of visible tags.
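
The bookkeeping around the four SoM steps can be sketched as follows. Drawing the tags onto the image is omitted, the detector output is hard-coded, and the function names (`assign_marks`, `parse_mark`, `click_point`) are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels


def assign_marks(boxes: List[Box]) -> Dict[int, Box]:
    """Steps 1-2: number detected boxes in reading order (top to bottom, left to right)."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {tag: box for tag, box in enumerate(ordered, start=1)}


def parse_mark(response: str, marks: Dict[int, Box]) -> Optional[int]:
    """Step 4: read the first bracketed tag, rejecting tags that were never drawn."""
    match = re.search(r"\[(\d+)\]", response)
    if match is None:
        return None
    tag = int(match.group(1))
    return tag if tag in marks else None


def click_point(box: Box) -> Tuple[int, int]:
    """Click at the centre of the selected box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) // 2, (y_min + y_max) // 2)


# Assumed detector output; a real system would obtain these boxes from a detector.
detected = [(300, 20, 380, 50), (20, 20, 280, 50), (20, 100, 120, 130)]
marks = assign_marks(detected)
tag = parse_mark("The search button is [2].", marks)
target = click_point(marks[tag]) if tag is not None else None
print(tag, target)
```

Validating the tag against `marks` matters: a model can still emit a number that was never overlaid, and that case should be treated as a grounding failure rather than a click.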


Tool Use and API Calling

Multimodal agents extend beyond simple UI interaction by calling external tools: web search APIs, Python execution environments, calculator functions, and SQL database queries. Tool use enables agents to perform computations that VLMs cannot do reliably (like exact arithmetic) and to access real-time information beyond their pretraining cutoff.

Tool Representation: Each tool is described to the VLM in the system prompt by its name, description, and parameter schema (e.g., JSON Schema). The VLM selects tools by generating a specific formatting syntax:

{"tool": "web_search", "query": "current weather in Tokyo"}

The agent runtime (here, a Python execution environment) pauses VLM generation, intercepts this JSON, executes the API call, and appends the API's text output to the VLM's context window. The VLM then resumes generation, conditioning on the new data.
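
A minimal sketch of that interception step is shown below, assuming the whole model turn is either a JSON tool call or plain text. `fake_web_search`, `TOOLS`, and `handle_model_output` are hypothetical names, and the search tool is a stub rather than a real API.

```python
import json
from typing import Callable, Dict, List


def fake_web_search(query: str) -> str:
    # Hypothetical stub; a real tool would call a search API here
    return f"(stub) top result for '{query}'"


TOOLS: Dict[str, Callable[..., str]] = {"web_search": fake_web_search}


def handle_model_output(output: str, context: List[Dict[str, str]]) -> bool:
    """Append the model turn; if it is a tool call, run it and append the result.

    Returns True when a tool ran, meaning the model should be called again.
    """
    context.append({"role": "assistant", "content": output})
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return False                       # plain-text answer, nothing to execute
    if not isinstance(call, dict) or "tool" not in call:
        return False
    name = call.pop("tool")
    tool = TOOLS.get(name)
    if tool is None:
        result = f"Error: unknown tool '{name}'"
    else:
        try:
            result = tool(**call)          # remaining keys are the tool arguments
        except TypeError as exc:
            result = f"Error: bad arguments for '{name}': {exc}"
    context.append({"role": "tool", "content": result})
    return True


context: List[Dict[str, str]] = []
ran = handle_model_output(
    '{"tool": "web_search", "query": "current weather in Tokyo"}', context
)
print(ran, context[-1])
```

Errors are returned to the model as tool output instead of being raised, so the next generation step can see what went wrong and choose a different call.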

Tool Composition: Complex tasks require calling multiple tools in a strict logical sequence, where the output of Tool A becomes the input of Tool B. An agent asked to "create a bar chart of the top 5 programming languages" must execute: web_search() → extract_data() → python_execute(matplotlib) → return_image(). Composing these tools correctly requires the VLM to maintain a strict working memory of what state variables have been populated and what steps remain.
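
One way to picture the "which state variables are populated" bookkeeping is an explicit plan in which each step names the key it reads and the key it writes. In the sketch below the three tool functions are stubs that reuse the names from the text, the data values are made-up placeholders, and `run_plan` is a hypothetical helper; the final return_image() step is left out.

```python
from typing import Any, Callable, Dict, List, Tuple

# (step name, function, input key, output key)
Step = Tuple[str, Callable[[Any], Any], str, str]


def web_search(query: str) -> str:
    # Stub returning made-up placeholder values, not real survey data
    return "lang_a:30,lang_b:20"


def extract_data(text: str) -> List[Tuple[str, int]]:
    rows = [item.split(":") for item in text.split(",")]
    return [(name, int(count)) for name, count in rows]


def python_execute(rows: List[Tuple[str, int]]) -> str:
    # A real tool would run matplotlib; this stub draws a text bar chart
    return "\n".join(f"{name} {'#' * (count // 10)}" for name, count in rows)


PLAN: List[Step] = [
    ("web_search", web_search, "query", "raw_text"),
    ("extract_data", extract_data, "raw_text", "rows"),
    ("python_execute", python_execute, "rows", "chart"),
]


def run_plan(plan: List[Step], state: Dict[str, Any]) -> Dict[str, Any]:
    """Run steps in order, refusing any step whose input has not been produced yet."""
    for name, fn, in_key, out_key in plan:
        if in_key not in state:
            raise KeyError(f"step '{name}' needs '{in_key}', which is not populated yet")
        state[out_key] = fn(state[in_key])
    return state


state = run_plan(PLAN, {"query": "top programming languages"})
print(state["chart"])
```

In an actual agent the VLM chooses the steps itself; the explicit check here only illustrates the dependency that the model has to track implicitly in its context.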


ReAct: Multimodal Chain-of-Thought

A VLM forced to output a tool call immediately after reading an observation will frequently hallucinate. ReAct (Reason + Act; Yao et al., 2022) is an agentic framework that forces the VLM to alternate between explicit reasoning steps and action steps.

The ReAct Cycle:

  1. Observation: The VLM receives the screenshot and recent history.
  2. Thought (CoT): The VLM generates an autoregressive chain-of-thought: "The user wants to book a flight to London. I see the destination input field marked with tag [4]. I need to click [4] and type 'London'."
  3. Action: The VLM generates the structured tool call: {"action": "type", "target": "[4]", "text": "London"}
  4. Execution: The environment executes the typing and returns a new screenshot.

The "Thought" step makes the VLM state in text what it sees and what it intends to do before it commits to the rigid JSON action format. Because generation is autoregressive, the action tokens are then conditioned on that verbalized observation. By explicitly grounding its intended action in this way, ReAct reduces the hallucination rate of API calls.
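
A single ReAct turn can be split into its thought and action parts with a few lines of parsing. The "Thought:" and "Action:" labels below are an assumed prompt convention, and `split_react_step` is a hypothetical helper; the example text reuses the flight-booking turn from the cycle above.

```python
import json
from typing import Any, Dict, Tuple


def split_react_step(response: str) -> Tuple[str, Dict[str, Any]]:
    """Split one model turn into (thought text, parsed action dict)."""
    thought_part, sep, action_part = response.partition("Action:")
    if not sep:
        raise ValueError("response has no 'Action:' section")
    thought = thought_part.replace("Thought:", "", 1).strip()
    action = json.loads(action_part.strip())
    return thought, action


response = (
    "Thought: I see the destination input field marked with tag [4]. "
    "I need to click [4] and type 'London'.\n"
    'Action: {"action": "type", "target": "[4]", "text": "London"}'
)
thought, action = split_react_step(response)
print(thought)
print(action)
```

Keeping the thought alongside the action in the trajectory is useful for debugging, since a wrong action can often be traced to a wrong statement in the thought.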

Implementation

import json
from dataclasses import dataclass
from typing import Dict, List, Optional
from abc import ABC, abstractmethod

@dataclass
class Observation:
    """An observation from the environment."""
    screenshot: str  # Image path or base64
    history: List[Dict]  # [{"role": "user/assistant", "content": ...}]

@dataclass
class Action:
    """An action selected by the agent."""
    tool: str
    parameters: Dict
    thought: str  # Reasoning before action

class Tool(ABC):
    """Base class for agent tools."""
    
    @property
    @abstractmethod
    def name(self) -> str:
        pass
    
    @property
    @abstractmethod
    def description(self) -> str:
        pass
    
    @property
    @abstractmethod
    def schema(self) -> Dict:
        """JSON Schema for tool parameters."""
        pass
    
    @abstractmethod
    def execute(self, **params) -> str:
        """Execute the tool and return result."""
        pass


class WebSearchTool(Tool):
    """Web search tool for real-time information."""
    
    @property
    def name(self) -> str:
        return "web_search"
    
    @property
    def description(self) -> str:
        return "Search the web for information. Use for factual queries."
    
    @property
    def schema(self) -> Dict:
        return {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    
    def execute(self, query: str) -> str:
        # In production: call actual search API
        return f"Search results for: {query}"


class CalculatorTool(Tool):
    """Mathematical calculator for exact arithmetic."""
    
    @property
    def name(self) -> str:
        return "calculator"
    
    @property
    def description(self) -> str:
        return "Perform exact mathematical calculations."
    
    @property
    def schema(self) -> Dict:
        return {
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "Math expression"}
            },
            "required": ["expression"]
        }
    
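    def execute(self, expression: str) -> str:
        # Evaluate with a small AST walker instead of eval(), so only numeric
        # literals and basic arithmetic (+, -, *, /, unary minus) are accepted.
        import ast
        import operator

        ops = {
            ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv,
            ast.USub: operator.neg,
        }

        def ev(node):
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            if isinstance(node, ast.BinOp) and type(node.op) in ops:
                return ops[type(node.op)](ev(node.left), ev(node.right))
            if isinstance(node, ast.UnaryOp) and type(node.op) in ops:
                return ops[type(node.op)](ev(node.operand))
            raise ValueError("unsupported expression")

        try:
            return str(ev(ast.parse(expression, mode="eval").body))
        except (SyntaxError, ValueError, ZeroDivisionError, OverflowError):
            return "Error: Invalid expression"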


class VLMAgent:
    """
    VLM-based agent implementing ReAct (Reason + Act) loop.
    
    Args:
        vlm: The vision-language model (e.g., LLaVA, GPT-4V)
        tools: List of available tools
        max_steps: Maximum ReAct steps before termination
    """
    
    def __init__(self, vlm, tools: List[Tool], max_steps: int = 10):
        self.vlm = vlm
        self.tools = {t.name: t for t in tools}
        self.max_steps = max_steps
    
    def build_system_prompt(self) -> str:
        """Build the system prompt describing available tools."""
        tools_desc = []
        for tool in self.tools.values():
            tools_desc.append(
                f"- {tool.name}: {tool.description}\n"
                f"  Parameters: {json.dumps(tool.schema)}"
            )
        return (
            "You are a multimodal agent. Given an observation and goal, you must:\n"
            "1. THINK: Reason about what action to take\n"
            "2. ACT: Call a tool using the specified JSON format\n\n"
            "Available tools:\n" + "\n\n".join(tools_desc)
        )
    
    def parse_action(self, response: str) -> Optional[Action]:
        """Parse the model's response to extract thought and action."""
        try:
            # Look for JSON action block
            if '{"tool"' in response or '{"action"' in response:
                # Extract JSON from response
                start = response.find('{')
                end = response.rfind('}') + 1
                json_str = response[start:end]
                data = json.loads(json_str)
                
                tool_name = data.get('tool') or data.get('action', 'unknown')
                params = {k: v for k, v in data.items() if k not in ['tool', 'action']}
                
                # Extract thought (text before JSON)
                thought = response[:start].strip()
                
                return Action(tool=tool_name, parameters=params, thought=thought)
        except json.JSONDecodeError:
            pass
        return None
    
    def run(self, observation: Observation, goal: str) -> List[Dict]:
        """
        Run the ReAct loop.
        
        Args:
            observation: Current screenshot and history
            goal: Task goal (e.g., "Book a flight to Tokyo")
        
        Returns:
            List of per-step dicts with keys "step", "thought", "action", "result"
            (or "step" and "status" when the response cannot be parsed)
        """
        trajectory = []
        history = observation.history.copy()
        
        for step in range(self.max_steps):
            # Build prompt with current observation
            prompt = self.build_system_prompt()
            prompt += f"\n\nGoal: {goal}\n"
            prompt += f"History:\n" + "\n".join(
                f"{h['role']}: {h['content']}" for h in history[-5:]
            )
            prompt += "\n\nWhat should I do? Think and then call a tool."
            
            # Get VLM response (would call actual API)
            response = self.vlm.generate(prompt, observation.screenshot)
            
            # Parse action
            action = self.parse_action(response)
            if not action:
                trajectory.append({"step": step, "status": "parse_error"})
                break
            
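            # Execute tool ("finish" is a reserved name that ends the episode)
            if action.tool == "finish":
                result = "task complete"
            elif action.tool in self.tools:
                try:
                    result = self.tools[action.tool].execute(**action.parameters)
                except TypeError as exc:
                    # Arguments did not match the tool's signature
                    result = f"Error: Bad arguments for '{action.tool}': {exc}"
            else:
                result = f"Error: Unknown tool '{action.tool}'"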
            
            # Record in trajectory
            trajectory.append({
                "step": step,
                "thought": action.thought,
                "action": {action.tool: action.parameters},
                "result": result
            })
            
            # Add to history
            history.append({"role": "assistant", "content": response})
            history.append({"role": "tool", "content": result})
            
            # Check for termination
            if action.tool == "finish" or "task complete" in result.lower():
                break
        
        return trajectory


# Example usage
class MockVLM:
    """Mock VLM for demonstration."""
    def generate(self, prompt: str, image: str) -> str:
        return json.dumps({"tool": "web_search", "query": "example"})

# Initialize agent with tools
tools = [WebSearchTool(), CalculatorTool()]
agent = VLMAgent(vlm=MockVLM(), tools=tools, max_steps=5)

# Run on an observation
obs = Observation(screenshot="screenshot.png", history=[])
trajectory = agent.run(obs, goal="Find the sum of 2+2")

for step in trajectory:
    print(f"Step {step['step']}: {step}")

Key implementation patterns:

  1. Tool abstraction: Each tool implements name, description, schema, and execute()
  2. ReAct parsing: Extract JSON action block from model's text response
  3. History management: Maintain conversation history for context
  4. Termination detection: Stop when agent signals completion

Production considerations:

  • Use safer JSON parsing (handle malformed responses)
  • Add retry logic for failed tool calls
  • Implement token budget for context truncation
  • Add structured output parsing for reliability
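
Two of these considerations, tolerant JSON extraction and retrying failed tool calls, can be sketched as below. The helper names `extract_first_json` and `call_with_retry` are hypothetical, and the retry policy (fixed attempt count, doubling delay) is one reasonable choice among many rather than a recommendation from the lecture.

```python
import json
import time
from typing import Any, Callable, Dict, Optional


def extract_first_json(text: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object embedded in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)   # parse starting at this brace
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass                                       # not valid JSON here, keep scanning
        start = text.find("{", start + 1)
    return None


def call_with_retry(fn: Callable[[], str], attempts: int = 3, base_delay: float = 0.5) -> str:
    """Retry a failing tool call, doubling the delay between attempts."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    return f"Error: tool failed after {attempts} attempts: {last_error}"


parsed = extract_first_json(
    'I think {this} is right: {"tool": "calculator", "expression": "2+2"} done'
)
print(parsed)
```

Unlike slicing from the first `{` to the last `}`, scanning with `raw_decode` tolerates stray braces in the reasoning text and trailing text after the action.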

Key takeaways

Multimodal agents extend passive VLMs into the realm of sequential action by implementing a policy $\pi(a_t \mid o_{1:t}, g)$ over a POMDP. UI grounding maps natural language intentions to precise interface coordinates, using techniques like Set-of-Marks (SoM) prompting to convert noisy pixel regression into highly reliable numeric token classification. Tool use allows agents to bypass the inherent limitations of transformer architectures by delegating arithmetic and real-time search to external APIs. The ReAct framework interleaves visual observation with explicit verbal reasoning and action generation, serving as the cognitive engine for multi-step task execution.


Conceptual questions

  1. POMDP Credit Assignment: A multimodal web agent is given the goal "Buy a red umbrella from Amazon." It takes 15 correct UI actions (searching, filtering by color, adding to cart), but on step 16, it clicks "Empty Cart" instead of "Checkout", failing the task. In a reinforcement learning setup where the only reward is $r = +1$ for task success and $r = 0$ otherwise, explain the "credit assignment problem" mathematically. How does the sparsity of the reward signal make training this VLM policy $\pi_\theta$ via PPO (Proximal Policy Optimization) computationally intractable compared to Supervised Fine-Tuning (SFT) on human demonstrations?
  2. Set-of-Marks Context Limits: You design a Set-of-Marks (SoM) agent for a dense financial dashboard containing 400 interactive data cells. You overlay 400 numeric tags on the image. Describe two distinct failure modes that will occur: (1) A visual failure mode related to the resolution of the ViT patch tokens ($P \times P$), and (2) a semantic failure mode related to the VLM's ability to maintain attention across 400 distinct visual regions. Propose a hierarchical SoM approach to bypass these limits.
  3. ReAct Hallucination Debugging: Your ReAct agent is tasked with scraping data from a table. The Thought trace outputs: "The table contains 5 rows. I will extract the data from row 1." However, the visual screenshot actually contains an empty table (0 rows). The agent then executes a Python script attempting to index table[0] and crashes. Explain the psychological phenomenon of "LLM sycophancy" or "hallucination cascade" occurring here. Once the text decoder generates the false premise "The table contains 5 rows", why do the subsequent self-attention layers mathematically lock into that false reality, overriding the true visual evidence?
  4. Tool Composition Error Recovery: An agent must compose three tools: search(), parse_pdf(), and email_user(). The parse_pdf() API goes offline and returns a standard HTTP 500 error string. Describe the exact text processing that must happen within the VLM's context window to prevent the agent from blindly passing the HTTP 500 string into the email_user() function. What specific instruction-tuning data must the VLM have seen during Stage 2 instruction tuning to know how to execute a retry or fallback behavior?
  5. Sim2Real for UI Agents: A VLM is trained purely on screenshots of macOS interfaces. It achieves 95% accuracy on UI grounding. You deploy it on a Windows machine. The agent completely fails to find the "Close Window" button, because macOS puts the red "X" on the top left, while Windows puts it on the top right. Using the terminology of distribution shift (Week 9), is this a failure of covariate shift or spurious correlation? How does this demonstrate that UI agents suffer from the exact same "Sim2Real" embodiment gaps as physical robots?
✦Solutions
  1. Credit assignment. With reward only at success, one wrong action among sixteen zeroes the whole episode's return, and PPO must propagate that single scalar back across all steps with no per-step signal — high variance, needing enormous sample counts to discover which action mattered. SFT on human demos provides a dense correct-action label at every step, so it is far more sample-efficient.
  2. Set-of-Marks limits. (1) Visual: 400 tiny tags exceed the ViT's $P \times P$ patch resolution, so tags blur and overlap and become unreadable. (2) Semantic: the VLM cannot hold reliable grounding across 400 distinct regions in one pass; attention dilutes and IDs get mismatched. Hierarchical SoM fixes both: segment the dashboard into a few regions, select one, then zoom in and re-mark only that region's cells.
  3. Hallucination cascade. Once the decoder emits "5 rows," that false text enters the context and subsequent self-attention conditions on it as established fact (autoregressive models attend to their own outputs), so later tokens stay consistent with the false premise instead of rechecking the contradicting visual tokens. The strong textual prior it just created outweighs the visual evidence — confirmation lock-in.
  4. Tool error recovery. The VLM must read the HTTP 500 string in context, classify it as an error rather than data, and branch to a retry or fallback instead of piping it into email_user(). This requires Stage-2 instruction-tuning data containing tool-error traces and recovery behavior (observe error → reason → retry/alternative), so the model learned to inspect tool outputs before chaining them.
  5. UI sim2real. Training only on macOS then failing on Windows (close button left vs right) is covariate shift (the input distribution $P(x)$ changed while the task is unchanged), compounded by reliance on a spurious positional cue (close = top-left). It is the same embodiment/distribution gap robots face going sim→real: a policy overfit to training-environment statistics breaks under deployment shift.
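
The credit assignment answer in solution 1 can be checked with a little arithmetic. The sketch below computes the return-to-go for every step of a 16-step episode under the sparse reward from question 1; `returns_to_go` is a hypothetical helper and $\gamma = 1$ is assumed for simplicity.

```python
from typing import List


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """G_t = r_t + gamma * G_{t+1}, computed backwards from the last step."""
    g = 0.0
    out = []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


# Failed episode: 15 correct actions, then the wrong click, so reward is 0 everywhere
failed = returns_to_go([0.0] * 16)
# Successful episode: the only non-zero reward arrives at the final step
succeeded = returns_to_go([0.0] * 15 + [1.0])

print(failed)
print(succeeded)
```

In the failed episode every step, including the 15 correct ones, receives a return of 0, and in the successful one every step receives 1. The return alone therefore says nothing about which action mattered, whereas a demonstration supplies a target action at each of the 16 steps.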

Looking ahead

Multimodal agents demonstrate the power of VLMs as general perception-reasoning-action systems in the digital world. The ultimate, most demanding frontier of these capabilities is physical robotics—where visual perception must connect to real-world motor control under strict latency, embodiment, and safety constraints.

Week 12: Vision-Language Models for Robotics. We examine how VLMs are integrated into hierarchical robotic control architectures, how language-conditioned visual grounding connects instructions to manipulation targets, and how closed-loop VLM-guided control operates in the physical world.


Further reading

  • Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR. (The Reason + Act framework).
  • Yang, J., et al. (2023). Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. arXiv. (The SoM technique for UI grounding).
  • Zheng, S., et al. (2024). OSCopilot: Towards Generalist Computer Agents with Self-Improvement. ICLR.