Week 14: Vision-Language Capstone

✦Learning Outcomes
  • Complete a domain-specific fine-tuning or robotics capstone project
  • Integrate architectural choices with real-world constraints
  • Synthesize Course 4 content into deployable multimodal systems
◆Prerequisites
  • Weeks 1-6: Foundational VLM (Vision-Language Model) architectures (ViT, CLIP, BLIP, LLaVA)
  • Week 8: Fine-Tuning - LoRA and parameter-efficient methods
  • Week 12: Robotics - VLMs for robotics (Track B)

This chapter integrates concepts from all previous weeks into a coherent system design methodology.

The preceding thirteen weeks established the theoretical and structural foundations of Vision-Language Models: from ViT perception and contrastive pretraining, through generative architectures like LLaVA and Flamingo, to the physical embodiment of VLMs as robotic agents governed by strict safety and alignment protocols.

The capstone synthesizes these isolated concepts into a complete, end-to-end practitioner methodology. In physical AI, combining a perfect vision encoder with a perfect language model does not guarantee a functioning system; the architectural interfaces, the data curation, the evaluation rigor, and the hardware latency budgets dictate success. Two detailed case study tracks—Domain-Specific Fine-Tuning (Track A) and Embodied System Design (Track B)—ground this methodology in concrete engineering decisions, culminating in a rigorous framework for building deployable multimodal systems.

The VLM System Design Process

Designing a VLM system for a real-world application requires a strict sequence of decisions. Errors made at Step 1 compound through every later step, so a flawed task formulation cannot be repaired by careful work at Step 5.

Step 1: Task Formulation & Constraint Mapping. Define the task precisely in terms of inputs, outputs, and physical constraints. "Build a VLM for autonomous driving" is an invalid formulation. "Classify traffic light states from 1080p dashboard video at 30 Hz with $\leq 15\text{ ms}$ latency and $\geq 99.99\%$ True Positive Rate" is valid. The latency constraint immediately rules out autoregressive text generation (LLaVA), pointing instead to a fast Perceiver-style encoder or a direct linear probe on a frozen CLIP backbone.
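As a rough illustration of how a latency constraint prunes architectures, the sketch below sums per-stage latencies and compares them with the 15 ms budget. Every stage name and number here is a hypothetical placeholder, not a measurement; the point is only that the check can be written down before any model is trained.

```python
# Hypothetical stage latencies in milliseconds; substitute measured numbers.
LATENCY_BUDGET_MS = 15.0
FRAME_RATE_HZ = 30.0


def frame_period_ms(rate_hz: float) -> float:
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 / rate_hz


def total_latency_ms(stages: dict[str, float]) -> float:
    """End-to-end latency of a pipeline whose stages run sequentially."""
    return sum(stages.values())


def fits_budget(stages: dict[str, float], budget_ms: float = LATENCY_BUDGET_MS) -> bool:
    """True if the summed stage latency stays within the budget."""
    return total_latency_ms(stages) <= budget_ms


# Candidate 1: linear probe on a frozen image encoder (illustrative numbers).
linear_probe = {"preprocess": 2.0, "frozen_encoder": 9.0, "linear_head": 0.5}

# Candidate 2: autoregressive decoding, 20 tokens at an assumed 25 ms per token.
autoregressive = {"preprocess": 2.0, "encoder": 9.0, "decode": 20 * 25.0}

for name, stages in [("linear_probe", linear_probe), ("autoregressive", autoregressive)]:
    print(f"{name}: {total_latency_ms(stages):.1f} ms, fits={fits_budget(stages)}")
```

Note that the 30 Hz frame rate (about 33 ms between frames) and the 15 ms latency bound are separate constraints: a pipelined system could meet the first while violating the second.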

Step 2: Data Inventory & Modality Matching. Catalog the available data relative to the target task complexity. For a domain with 500 labeled examples, full fine-tuning of a 7B model is impractical: the model will overfit and catastrophically forget its pretrained capabilities. Use LoRA with strong regularization toward the pretrained weights instead. Furthermore, match the modalities: if the task requires predicting continuous mechanical forces, the VLM architecture must be augmented to process 1D force-torque sequences (Week 7) alongside RGB images.

Step 3: Architecture Selection. Apply the architectural decision framework:

  • Does the task require sub-centimeter physical grounding? Use an MAE/DINOv2 vision backbone, not CLIP.
  • Does it require multi-camera video support without quadratic context explosion? Use Flamingo's depth-distributed gated cross-attention.
  • Does it require generating complex reasoning chains? Use an instruction-tuned LLM (Large Language Model) backbone with ReAct prompting.

Step 4: Pretraining and Fine-Tuning Strategy. Select the adaptation method. Projector-only adaptation works for pure visual domain shifts; LLM LoRA works for adapting the reasoning style; QLoRA is required if compute is bottlenecked to a single GPU.
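To make the cost of LoRA concrete, the following sketch counts the trainable parameters it adds. It assumes a LLaMA-style 7B decoder with hidden size 4096 and 32 layers, adapts two square attention matrices per layer, and uses a nominal total of 7 billion parameters; these dimensions are assumptions for illustration, not values stated in this course.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one d_out x d_in weight: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Assumed model shape (LLaMA-style 7B); adjust to the actual backbone.
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
ADAPTED_MATRICES_PER_LAYER = 2  # e.g. the query and value projections
NOMINAL_TOTAL_PARAMS = 7_000_000_000


def total_lora_params(rank: int) -> int:
    """Trainable LoRA parameters across all adapted matrices."""
    per_matrix = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, rank)
    return NUM_LAYERS * ADAPTED_MATRICES_PER_LAYER * per_matrix


for rank in (8, 32):
    n = total_lora_params(rank)
    print(f"rank={rank}: {n:,} trainable params "
          f"({100 * n / NOMINAL_TOTAL_PARAMS:.2f}% of the base model)")
```

Under these assumptions even rank 32 trains well under 1% of the base model's parameters, which is why LoRA-style methods remain viable when labeled data and GPU memory are both scarce.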

Step 5: Adversarial Evaluation Design. Define the evaluation suite before training begins. Relying purely on in-distribution metrics like CIDEr or standard VQA accuracy hides the Sim2Real gap until deployment. Construct explicit contrast sets to test spurious correlations and compositional failures (Week 9).
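One way to operationalize a contrast set is to pair each example with a minimally edited variant whose correct label flips, and to credit the model only when it gets both right. The sketch below uses strings as stand-ins for images and a deliberately broken toy model; `ContrastPair`, `shortcut_model`, and the example data are all hypothetical.

```python
from typing import Callable, NamedTuple


class ContrastPair(NamedTuple):
    original: str         # stand-in for an in-distribution input
    original_label: str
    perturbed: str        # minimally edited input whose correct label flips
    perturbed_label: str


def accuracy_on_originals(predict: Callable[[str], str], pairs: list[ContrastPair]) -> float:
    """Conventional accuracy, measured only on the unperturbed inputs."""
    return sum(predict(p.original) == p.original_label for p in pairs) / len(pairs)


def contrast_consistency(predict: Callable[[str], str], pairs: list[ContrastPair]) -> float:
    """Fraction of pairs where BOTH the original and the perturbed input are correct."""
    both = sum(
        predict(p.original) == p.original_label and predict(p.perturbed) == p.perturbed_label
        for p in pairs
    )
    return both / len(pairs)


def shortcut_model(x: str) -> str:
    """Toy model that ignores its input and always predicts the majority class."""
    return "red"


pairs = [
    ContrastPair("red light, clear day", "red", "green light, clear day", "green"),
    ContrastPair("red light, rain", "red", "green light, rain", "green"),
]

print("accuracy on originals:", accuracy_on_originals(shortcut_model, pairs))
print("contrast consistency:", contrast_consistency(shortcut_model, pairs))
```

The toy model scores perfectly on the originals while failing every pair, which is the failure pattern a contrast set is meant to expose.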


Track A: Domain-Specific VLM Fine-Tuning

Target Task: Fine-tune a LLaVA-1.5-7B model to generate structured radiology reports for chest X-rays. Input: A chest X-ray image and a clinical context prompt ("Patient is a 65-year-old male with chronic cough"). Output: A structured report with sections for Findings, Impression, and Recommendations.

System Architecture & Training Recipe

Medical imagery represents a severe visual domain shift from the web-scraped natural images on which CLIP was pretrained. Therefore, the visual features must be explicitly adapted.

  1. Stage 1 (Alignment): Unfreeze the MLP projector and the CLIP ViT-L encoder using a small LoRA rank ($r=8$). Train on 100K simple (X-ray, one-sentence finding) pairs. This corrects the visual domain shift, forcing the ViT to represent medical anomalies rather than ignoring them as noise.
  2. Stage 2 (Instruction Tuning): Freeze the ViT. Apply QLoRA ($r=32$) to the LLM's attention matrices ($W_Q$, $W_V$). Fine-tune on full report generation using clinical context prompts. To prevent catastrophic forgetting of general language capabilities, use Data Mixing: inject 5% standard conversational data into each batch.
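The data-mixing step in Stage 2 can be sketched as a batch builder that reserves a fixed share of each batch for general conversational samples. This is one possible reading of "inject 5%"; the function name `mixed_batches`, the placeholder datasets, and the batch size of 40 are assumptions chosen so that 5% is a whole number of samples.

```python
import random
from typing import Iterator


def mixed_batches(domain: list, general: list, batch_size: int,
                  general_fraction: float, seed: int) -> Iterator[list]:
    """Yield batches mixing domain samples with a fixed share of general samples."""
    rng = random.Random(seed)
    n_general = round(batch_size * general_fraction)
    n_domain = batch_size - n_general
    if n_domain <= 0:
        raise ValueError("general_fraction leaves no room for domain samples")
    shuffled = domain[:]          # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    # Step through the domain data; a trailing partial batch is dropped.
    for start in range(0, len(shuffled) - n_domain + 1, n_domain):
        batch = shuffled[start:start + n_domain] + rng.sample(general, n_general)
        rng.shuffle(batch)        # avoid general samples always sitting at the end
        yield batch


# Placeholder datasets: tuples standing in for (image, report) and chat samples.
domain_examples = [("report", i) for i in range(380)]
general_examples = [("chat", i) for i in range(100)]

batches = list(mixed_batches(domain_examples, general_examples,
                             batch_size=40, general_fraction=0.05, seed=0))
print(len(batches), "batches;", sum(s[0] == "chat" for s in batches[0]), "chat samples in the first")
```

Fixing the count per batch, rather than sampling each slot independently, keeps the mixing ratio exact in every gradient step; a per-sample Bernoulli draw would be an equally reasonable design with more variance.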

Ablation Study Design

A rigorous ablation study isolates the contribution of each component.

| Condition | ViT LoRA | Projector | LLM LoRA | Expected Metric Impact |
|---|---|---|---|---|
| Baseline | Frozen | Frozen | Frozen | Near 0% (zero-shot failure) |
| A (Projector Only) | Frozen | Tuned | Frozen | Minor improvement; LLM tone is wrong |
| B (Standard LLaVA) | Frozen | Tuned | Tuned | Good structure, but misses subtle visual anomalies |
| C (Full Recipe) | Tuned | Tuned | Tuned | Peak medical accuracy |

The primary evaluation metric is RadGraph F1, which uses a specialized clinical NLP parser to compute precision/recall over extracted medical entities, completely bypassing the flawed $n$-gram matching of BLEU.
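The idea behind an entity-level score can be shown with a simplified set-based F1. This is not the RadGraph implementation: it assumes entities have already been extracted by some parser and represents each as a hypothetical (entity, status) tuple, ignoring the relation structure a full clinical graph would carry.

```python
def entity_f1(predicted: set, reference: set) -> float:
    """F1 over exact matches between predicted and reference entity tuples."""
    if not predicted and not reference:
        return 1.0  # nothing to find and nothing claimed
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical parser output for a reference report and a generated report.
reference_entities = {
    ("pleural effusion", "present"),
    ("pneumothorax", "absent"),
    ("cardiomegaly", "present"),
}
generated_entities = {
    ("pleural effusion", "present"),
    ("pneumothorax", "absent"),
    ("consolidation", "present"),  # hallucinated finding
}

print(f"entity F1 = {entity_f1(generated_entities, reference_entities):.3f}")
```

Because the score depends only on which clinical facts are asserted, two reports with very different wording but identical findings score the same, which is the property $n$-gram metrics lack.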


Track B: Embodied VLM System Design

Target Task: Design a VLM-based perception and planning system for a robotic arm performing bimanual cloth folding. Constraints: The cloth deforms dynamically. The robot operates at 50 Hz.

System Architecture (System 1 / System 2 Paradigm)

As derived in Course 2 and Course 4, we split the architecture to handle the latency mismatch:

1. Semantic Task Planner (System 2): A 7B VLM runs asynchronously at 1 Hz on a local GPU server. It receives the high-level language goal ("Fold the towel in half") and overhead RGB-D images. It uses Chain-of-Thought (ReAct) to reason about the cloth's current state and outputs a dense Goal Embedding Vector $e_\text{goal}$ representing the next geometric sub-goal (e.g., "grasp bottom-left corner and pull to top-left corner").

2. High-Frequency Controller (System 1): An Action Chunking with Transformers (ACT) policy or a Diffusion Policy runs at 50 Hz directly on the robot's edge-compute board. It takes the continuous stream of joint positions, local wrist-camera feeds, and the goal embedding $e_\text{goal}$ from the VLM, and outputs 6-DOF joint velocities (for a Diffusion Policy, by running its iterative denoising steps).
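The timing relationship between the two systems can be sketched as a discrete-time simulation in which the controller ticks at 50 Hz and never waits for the planner. The planner latency of 40 ticks (0.8 s) and the string goal labels are hypothetical; a real system would run the planner in a separate process and pass embeddings rather than strings.

```python
CONTROL_HZ = 50
PLANNER_HZ = 1
TICKS_PER_PLAN = CONTROL_HZ // PLANNER_HZ  # a new plan is requested every 50 ticks


def simulate(n_ticks: int, planner_latency_ticks: int = 40) -> list:
    """Return the goal the controller used at each tick (None = no goal yet)."""
    latest_goal = None   # most recent goal delivered by the slow planner
    pending = None       # (tick at which the result becomes available, goal)
    goals_used = []
    for tick in range(n_ticks):
        # System 2: kick off a planning request at 1 Hz; the result arrives later.
        if tick % TICKS_PER_PLAN == 0:
            pending = (tick + planner_latency_ticks, f"goal_from_tick_{tick}")
        # Deliver the planner output once its (assumed) latency has elapsed.
        if pending is not None and tick >= pending[0]:
            latest_goal = pending[1]
            pending = None
        # System 1: act on whatever goal is currently available, without blocking.
        goals_used.append(latest_goal)
    return goals_used


goals = simulate(100)
print("tick 39:", goals[39], "| tick 40:", goals[40], "| tick 90:", goals[90])
```

Two consequences fall out of the sketch: the controller needs a safe default (such as holding position) while no goal exists, and every goal it acts on is stale by at least the planner latency, so the sub-goal representation has to remain valid while the cloth keeps moving.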

Closed-Loop Failure Analysis Tree

If the robot fails to fold the cloth, a rigorous debugging tree must be traversed:

  1. Perception Failure (System 2): Did the VLM correctly ground the corners of the cloth? Check the 2D bounding boxes. If correct, check the depth map projection. If the depth map is noisy due to the cloth's lack of texture, the 3D $(X, Y, Z)$ target sent to System 1 is corrupted.
  2. Quantization/Action Failure (System 1): Did the Diffusion Policy generate a smooth trajectory? If the action was tokenized (RT-2 style), check the quantization bin error.
  3. Safety Layer Trigger: Did a Control Barrier Function (CBF) intercept and override the planned trajectory to prevent a self-collision? If the CBF $\gamma$ parameter is too conservative, the robot will stall mid-fold.
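The stalling behavior in the third branch can be reproduced with a one-dimensional toy model. The sketch assumes a clearance distance $d$ with dynamics $\dot{d} = u$, a barrier $h(d) = d - d_{\min}$, and the standard CBF condition $\dot{h} \geq -\gamma h$, which for this scalar system reduces to $u \geq -\gamma (d - d_{\min})$. All numeric values are illustrative, not taken from a real robot.

```python
def cbf_filter(u_nominal: float, d: float, d_min: float, gamma: float) -> float:
    """Velocity closest to u_nominal that satisfies u >= -gamma * (d - d_min)."""
    return max(u_nominal, -gamma * (d - d_min))


def simulate_approach(gamma: float, d0: float = 0.30, d_min: float = 0.05,
                      u_nominal: float = -0.5, dt: float = 0.02, steps: int = 50) -> list:
    """Euler-integrate one second of a 50 Hz approach toward the safety boundary."""
    d = d0
    trace = [d]
    for _ in range(steps):
        u = cbf_filter(u_nominal, d, d_min, gamma)  # safety layer overrides the policy
        d += dt * u
        trace.append(d)
    return trace


permissive = simulate_approach(gamma=10.0)
conservative = simulate_approach(gamma=0.5)
print(f"gamma=10.0: final clearance {permissive[-1]:.3f} m")
print(f"gamma=0.5:  final clearance {conservative[-1]:.3f} m")
```

Both runs stay above the minimum clearance, but the small-$\gamma$ run is throttled from the first step and covers far less distance in the same second. When debugging a stalled fold, logging how often the filter output differs from the nominal command is a quick way to tell whether the safety layer is the bottleneck.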

Multi-Dimensional Evaluation Framework

Benchmark accuracy is a single, flawed dimension. A deployment-ready VLM must report a radar chart of metrics:

  1. Task Accuracy: Grounding IoU (with strict threshold $\theta = 0.9$ for robotics), RadGraph F1, or task success rate.
  2. Calibration (ECE): Does the model's softmax confidence correlate with reality? If the VLM is 95% confident about a hallucinated object, the system is fundamentally unsafe.
  3. Robustness (Sim2Real Gap): Measure the performance drop when transitioning from high-resolution, static training images to motion-blurred, dynamically occluded feeds from physical webcams.
  4. Safety (Hallucination Rate): Evaluated using CHAIR or adversarial typographic attacks (testing if text pasted on an object overrides physical visual geometry).
  5. Efficiency: P95 Inference latency (in milliseconds) and VRAM usage.
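Two of the metrics above reduce to short computations. The sketch below implements Expected Calibration Error with equal-width confidence bins and a nearest-rank percentile for latency; the bin count of 10 and the sample data are assumptions, and production code would typically use an established library.

```python
def expected_calibration_error(confidences: list, correct: list, n_bins: int = 10) -> float:
    """Weighted mean gap between accuracy and mean confidence, over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


def percentile_nearest_rank(values: list, percentile: int) -> float:
    """Nearest-rank percentile (integer percentile in 1..100)."""
    ordered = sorted(values)
    rank = (percentile * len(ordered) + 99) // 100  # integer ceiling of p * n / 100
    return ordered[rank - 1]


# A model that is 95% confident four times but right only twice.
overconfident_ece = expected_calibration_error([0.95] * 4, [1, 1, 0, 0])
print(f"ECE = {overconfident_ece:.2f}")

# Hypothetical latency samples in milliseconds.
latencies_ms = [float(v) for v in range(1, 101)]
print("P95 latency:", percentile_nearest_rank(latencies_ms, 95), "ms")
```

Reporting P95 rather than the mean matters for closed-loop control, because a handful of slow inferences can violate a deadline even when average latency looks comfortable.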

Hardware and Reproducibility Mandates

VLM research suffers from severe reproducibility failures because hardware latency and hyperparameter choices are deeply intertwined with algorithmic success. A complete VLM system report must explicitly document:

  1. Hardware Constraints: Exact GPU model, VRAM limits, and FP16/INT4 quantization usage. A model that runs at 2Hz on an A100 may run at 0.1Hz on a Jetson Orin, entirely breaking a robotics closed-loop pipeline.
  2. Generation Hyperparameters: Temperature $\tau$, top-$p$ nucleus sampling, and classifier-free guidance scales. Changing $\tau$ from 0.1 to 0.7 can turn a near-deterministic grounding model into a stochastic hallucination engine.
  3. Random Seeds: Explicitly document seeds for data shuffling, network initialization, and the diffusion sampling noise schedules.
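A lightweight way to satisfy these mandates is to serialize every item into a manifest stored alongside the results. The sketch below is one possible layout; the `RunManifest` fields and all values are hypothetical, and it demonstrates seeding only with Python's standard `random` module (framework-specific seeding would need to be added for an actual training run).

```python
import json
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    gpu_model: str
    vram_gb: int
    weight_precision: str   # e.g. "fp16" or "int4"
    temperature: float
    top_p: float
    data_shuffle_seed: int
    init_seed: int
    sampling_seed: int


# Illustrative values only; record what was actually used.
manifest = RunManifest(
    gpu_model="A100", vram_gb=80, weight_precision="int4",
    temperature=0.1, top_p=0.9,
    data_shuffle_seed=1234, init_seed=5678, sampling_seed=91011,
)

# Sorted keys make manifests diff cleanly between runs.
serialized = json.dumps(asdict(manifest), sort_keys=True, indent=2)
print(serialized)


def shuffled_indices(n: int, seed: int) -> list:
    """Dataset order derived from a dedicated, documented seed."""
    rng = random.Random(seed)  # separate generator so other code cannot perturb it
    order = list(range(n))
    rng.shuffle(order)
    return order


order_a = shuffled_indices(10, manifest.data_shuffle_seed)
order_b = shuffled_indices(10, manifest.data_shuffle_seed)
```

Using one dedicated generator per purpose, rather than a single global seed, means that changing the number of sampling calls does not silently change the data order.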

Course Retrospective: Foundation Models for Physical AI

This four-course sequence has traced the evolution of physical artificial intelligence from first mathematical principles.

We began in Course 1 (Reinforcement Learning), establishing the formal mathematics of Markov Decision Processes, Bellman equations, and the theoretical limits of optimal control and policy gradients (PPO, Proximal Policy Optimisation). In Course 2 (Robotics), we grounded those abstractions in physical reality, transforming abstract actions into continuous torques constrained by Euler-Lagrange dynamics, sensor noise, and Control Barrier Functions. In Course 3 (Generative Models), we explored the frontiers of probability, learning to model infinitely complex, high-dimensional distributions using Diffusion, Flow Matching, and VAEs. Finally, in Course 4 (Vision-Language Models), we achieved semantic alignment, connecting the unstructured pixels of the physical world to the abstract reasoning of human language through contrastive pretraining, grounding, and multimodal agentic loops.

The future of physical AI relies on the synthesis of these domains: using generative diffusion to power continuous robotic control, guided by the deep semantic reasoning of vision-language agents, all governed by the rigorous safety mathematics of classical control theory.


Capstone Project

Option A (Fine-Tuning Track): Select a highly specific, proprietary vision-language task (e.g., parsing architectural blueprints, identifying manufacturing defects on PCBs).

  1. Curate a dataset of at least 1,000 labeled examples.
  2. Fine-tune a LLaVA-1.5 or InstructBLIP model using QLoRA.
  3. Design and execute a strict three-condition ablation study (Zero-shot baseline, Projector-only tuning, Full QLoRA tuning).
  4. Evaluate on at least three dimensions (Task Accuracy, Calibration, Hallucination Rate). Write a 2,000-word engineering report detailing the specific VRAM footprint and the mathematical source of your model's primary failure mode.

Option B (Embodied Track): Design a complete VLM-based perception and planning system for an autonomous drone inspecting wind turbines. The design document must specify:

  1. The Perception Module (How does it compress continuous 4K video feeds? Refer to Perceiver/Flamingo architectures).
  2. The Reasoning Module (How does the VLM output structured navigation coordinates?).
  3. The Execution Module (How do those coordinates map to the drone's continuous control policy?).
  4. The Safety Layer (Define the Control Barrier Function that prevents the drone from crashing into the turbine blades).
  5. Provide a 50-trial testing protocol explicitly designed to evaluate performance under visual Covariate Shift (e.g., testing at night vs. day).
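When writing the testing protocol, it helps to know how much a 50-trial budget can actually resolve. The sketch below computes a Wilson score interval for a success rate; the 25/25 day-night split and the success counts are hypothetical, and the choice of the Wilson interval is a suggestion rather than a course requirement.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Hypothetical outcomes: 25 daytime trials and 25 night trials.
conditions = {"day": (23, 25), "night": (17, 25)}
for name, (successes, trials) in conditions.items():
    low, high = wilson_interval(successes, trials)
    print(f"{name}: {successes}/{trials} success, 95% interval [{low:.2f}, {high:.2f}]")
```

With only 25 trials per condition the intervals are wide, so a protocol that wants to claim a day-versus-night gap should either concentrate trials on the shifted condition or state the interval alongside the raw success rate.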
