Week 14: Vision-Language Capstone

Prerequisites

Weeks 1-6: Foundational VLM architectures (ViT, CLIP, BLIP, LLaVA)
Week 8: Fine-Tuning - LoRA and parameter-efficient methods
Week 12: Robotics - VLM for robotics (Track B)

This chapter integrates concepts from all previous weeks into a coherent system design methodology.

The preceding thirteen weeks established the theoretical and structural foundations of Vision-Language Models: from ViT perception and contrastive pretraining, through generative architectures like LLaVA and Flamingo, to the physical embodiment of VLMs as robotic agents governed by strict safety and alignment protocols.

The capstone synthesizes these isolated concepts into a complete, end-to-end practitioner methodology. In physical AI, combining a perfect vision encoder with a perfect language model does not guarantee a functioning system; the architectural interfaces, the data curation, the evaluation rigor, and the hardware latency budgets dictate success. Two detailed case study tracks—Domain-Specific Fine-Tuning (Track A) and Embodied System Design (Track B)—ground this methodology in concrete engineering decisions, culminating in a rigorous framework for building deployable multimodal systems.

The VLM System Design Process

Designing a VLM system for a real-world application requires a strict sequence of decisions. An error made at Step 1 propagates into every later step and is compounded by Step 5.

Step 1: Task Formulation & Constraint Mapping. Define the task precisely in terms of inputs, outputs, and physical constraints. "Build a VLM for autonomous driving" is an invalid formulation. "Classify traffic light states from 1080p dashboard video at 30Hz with $\leq 15\text{ms}$ latency and $\geq 99.99\%$ True Positive Rate" is valid. The latency constraint immediately rules out autoregressive text generation (LLaVA), pointing instead to a fast Perceiver-style encoder or a direct linear probe on a frozen CLIP backbone.
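The latency argument in Step 1 is simple arithmetic, and it can help to write it down before choosing an architecture. The sketch below is a minimal budget check; the prefill time, per-token decode time, token count, and probe latency are placeholder assumptions for illustration, not measurements of any real model.

```python
# Minimal latency-budget check for the traffic-light example.
# All timing numbers below are ASSUMED placeholders; measure your own hardware.

FRAME_RATE_HZ = 30
LATENCY_BUDGET_MS = 15.0


def frame_period_ms(rate_hz: float) -> float:
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 / rate_hz


def autoregressive_latency_ms(prefill_ms: float, ms_per_token: float, n_tokens: int) -> float:
    """Rough cost of generating n_tokens one at a time after a prefill pass."""
    return prefill_ms + ms_per_token * n_tokens


def fits_budget(latency_ms: float, budget_ms: float = LATENCY_BUDGET_MS) -> bool:
    return latency_ms <= budget_ms


# Assumed: 40 ms prefill, 20 ms per decoded token, 5-token answer.
generative_ms = autoregressive_latency_ms(prefill_ms=40.0, ms_per_token=20.0, n_tokens=5)

# Assumed: one forward pass of a frozen encoder plus a linear head.
probe_ms = 8.0

print(f"frame period: {frame_period_ms(FRAME_RATE_HZ):.1f} ms")
print(f"generative:   {generative_ms:.1f} ms, fits budget: {fits_budget(generative_ms)}")
print(f"linear probe: {probe_ms:.1f} ms, fits budget: {fits_budget(probe_ms)}")
```

Under these assumed numbers the generative path exceeds the budget several times over, while a single forward pass can fit inside it. The conclusion depends entirely on the measured values, which is why the budget should be checked on the target hardware.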

Step 2: Data Inventory & Modality Matching. Catalog the available data relative to the target task complexity. For a domain with 500 labeled examples, full fine-tuning of a 7B model will almost certainly overfit and cause catastrophic forgetting; a parameter-efficient method such as LoRA, combined with strong regularization toward the pretrained weights, is the practical choice. Furthermore, match the modalities: if the task requires predicting continuous mechanical forces, the VLM architecture must be augmented to process 1D force-torque sequences (Week 7) alongside RGB images.
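To see why LoRA is the natural fit for a small dataset, it helps to count trainable parameters. A rank-$r$ LoRA update to a $d_\text{out} \times d_\text{in}$ weight matrix trains $r\,(d_\text{in} + d_\text{out})$ parameters instead of $d_\text{out}\, d_\text{in}$. The sketch below assumes a LLaMA-7B-like shape (hidden size 4096, 32 layers) and adapts two square attention matrices per layer; the choice of rank and of adapted matrices is illustrative.

```python
# Trainable-parameter count: full fine-tuning vs. LoRA on the same matrices.
# Shapes are ASSUMED (LLaMA-7B-like); adjust to the model actually used.

def full_params(d_out: int, d_in: int) -> int:
    """Parameters in a dense d_out x d_in weight matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the low-rank pair B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


HIDDEN = 4096           # assumed hidden size
N_LAYERS = 32           # assumed number of transformer layers
ADAPTED_PER_LAYER = 2   # e.g. the query and value projections
RANK = 8

n_matrices = N_LAYERS * ADAPTED_PER_LAYER
lora_total = n_matrices * lora_params(HIDDEN, HIDDEN, RANK)
full_total = n_matrices * full_params(HIDDEN, HIDDEN)

print(f"LoRA trainable params: {lora_total:,}")
print(f"Full params in the same matrices: {full_total:,}")
print(f"Fraction trained: {lora_total / full_total:.4%}")
```

With these shapes the adapters hold about 4.2 million trainable parameters, under half a percent of the matrices they modify, which is a far better match for a few hundred labeled examples.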

Step 3: Architecture Selection. Apply the architectural decision framework:

Does the task require sub-centimeter physical grounding? Use an MAE/DINOv2 vision backbone, not CLIP.
Does it require multi-camera video support without quadratic context explosion? Use Flamingo's depth-distributed gated cross-attention.
Does it require generating complex reasoning chains? Use an instruction-tuned LLM backbone with ReAct prompting.
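The three questions above can be encoded as a small lookup, which makes the decision explicit and reviewable in a design document. This is only a sketch: `TaskRequirements` and `recommend_components` are hypothetical names introduced here, and the returned strings simply restate the recommendations from the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskRequirements:
    """Hypothetical container for the three yes/no questions in Step 3."""
    needs_fine_spatial_grounding: bool
    needs_multi_camera_video: bool
    needs_reasoning_chains: bool


def recommend_components(req: TaskRequirements) -> List[str]:
    """Map each 'yes' answer to the component the framework suggests."""
    picks = []
    if req.needs_fine_spatial_grounding:
        picks.append("MAE/DINOv2 vision backbone")
    if req.needs_multi_camera_video:
        picks.append("Flamingo-style gated cross-attention")
    if req.needs_reasoning_chains:
        picks.append("instruction-tuned LLM backbone with ReAct prompting")
    return picks


# Example: a cloth-folding planner that needs grounding and reasoning.
cloth_task = TaskRequirements(
    needs_fine_spatial_grounding=True,
    needs_multi_camera_video=False,
    needs_reasoning_chains=True,
)
for component in recommend_components(cloth_task):
    print(component)
```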

Step 4: Pretraining and Fine-Tuning Strategy. Select the adaptation method. Projector-only adaptation suits modest visual domain shifts; LoRA on the LLM suits adapting the reasoning style; QLoRA is the practical choice when training memory is limited to a single GPU.

Step 5: Adversarial Evaluation Design. Define the evaluation suite before training begins. Relying purely on in-distribution metrics such as CIDEr or standard VQA accuracy hides failures that only appear at deployment, leaving a Sim2Real gap unmeasured. Construct explicit contrast sets to test spurious correlations and compositional failures (Week 9).
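A contrast set pairs each example with a minimally edited version whose correct answer flips. The sketch below scores a toy "model" on such pairs; the data, the question ("Is the object yellow?"), and the shortcut predictor are invented for illustration. The point is that accuracy on the original examples can be perfect while pairwise consistency is zero.

```python
# Contrast-set scoring on toy data. Everything here is ILLUSTRATIVE.
# Implicit question for every example: "Is the object yellow?"

def accuracy(examples, predict):
    """Fraction of (input, label) examples answered correctly."""
    return sum(predict(x) == y for x, y in examples) / len(examples)


def contrast_consistency(pairs, predict):
    """Fraction of pairs where BOTH the original and the edit are correct."""
    both_right = sum(
        predict(x_orig) == y_orig and predict(x_edit) == y_edit
        for (x_orig, y_orig), (x_edit, y_edit) in pairs
    )
    return both_right / len(pairs)


def shortcut_model(description: str) -> str:
    """A predictor that keys on the object category instead of its colour."""
    return "yes" if "banana" in description else "no"


pairs = [
    (("yellow banana", "yes"), ("green banana", "no")),
    (("red apple", "no"), ("yellow apple", "yes")),
]
originals = [orig for orig, _ in pairs]

print("accuracy on originals:", accuracy(originals, shortcut_model))
print("contrast consistency: ", contrast_consistency(pairs, shortcut_model))
```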

Track A: Domain-Specific VLM Fine-Tuning

Target Task: Fine-tune a LLaVA-1.5-7B model to generate structured radiology reports for chest X-rays. Input: A chest X-ray image and a clinical context prompt ("Patient is a 65-year-old male with chronic cough"). Output: A structured report with sections for Findings, Impression, and Recommendations.

System Architecture & Training Recipe

Medical imagery is a severe visual domain shift from the natural images (ImageNet/COCO) used to pretrain CLIP. Therefore, the visual features must be explicitly adapted.

Stage 1 (Alignment): Train the MLP projector and adapt the CLIP ViT-L encoder with low-rank LoRA adapters ($r=8$). Train on 100K simple (X-ray, one-sentence finding) pairs. This corrects the visual domain shift, pushing the ViT to represent medical anomalies rather than discarding them as noise.
Stage 2 (Instruction Tuning): Freeze the ViT. Apply QLoRA ($r=32$) to the LLM's attention matrices ($W_Q, W_V$). Fine-tune on full report generation using clinical context prompts. To limit catastrophic forgetting of general language capabilities, use data mixing: include 5% standard conversational data in each batch.
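The data-mixing step in Stage 2 can be sketched as a batch sampler that draws a fixed fraction of each batch from a general conversational pool. This is a minimal illustration under assumed names (`mixed_batch`, `general_frac`) and toy string datasets; a real pipeline would apply the same idea inside its data loader.

```python
import random

# Toy pools standing in for the two data sources (ILLUSTRATIVE only).
domain_pool = [f"report-{i}" for i in range(100)]
general_pool = [f"chat-{i}" for i in range(20)]


def mixed_batch(domain, general, batch_size, general_frac=0.05, rng=None):
    """Draw one batch with roughly `general_frac` of items from `general`."""
    rng = rng or random.Random(0)
    n_general = round(batch_size * general_frac)
    n_domain = batch_size - n_general
    batch = rng.sample(domain, n_domain) + rng.sample(general, n_general)
    rng.shuffle(batch)  # avoid a fixed position for the general examples
    return batch


batch = mixed_batch(domain_pool, general_pool, batch_size=40, rng=random.Random(14))
n_chat = sum(item.startswith("chat-") for item in batch)
print(f"{n_chat} of {len(batch)} examples are conversational")
```

With a batch size of 40 the 5% ratio yields two conversational examples per batch. For very small batches the rounding can drop the general share to zero, in which case mixing at the dataset level rather than per batch is the safer design.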

Ablation Study Design

A rigorous ablation study isolates the contribution of each component by changing one trainable part at a time.

| Condition | ViT LoRA | Projector | LLM LoRA | Expected Metric Impact |
|---|---|---|---|---|
| Baseline | Frozen | Frozen | Frozen | Near 0% (zero-shot failure) |
| A (Projector Only) | Frozen | Tuned | Frozen | Minor improvement; LLM tone is wrong |
| B (Standard LLaVA) | Frozen | Tuned | Tuned | Good structure, but misses subtle visual anomalies |
| C (Full Recipe) | Tuned | Tuned | Tuned | Peak medical accuracy |

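The primary evaluation metric is RadGraph F1, which uses a specialized clinical NLP parser to extract medical entities and their relations from the generated and reference reports, then computes precision and recall over the extracted items. This scores clinical content directly and avoids the surface-level $n$-gram matching of BLEU, which rewards shared wording rather than correct findings.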
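The scoring half of an entity-level metric is ordinary set-based precision, recall, and F1. The sketch below assumes the entities have already been extracted as `(entity, status)` tuples; it does not implement the RadGraph parser itself, and the example findings are invented.

```python
# Set-based F1 over extracted entities. This is NOT the RadGraph parser;
# it only illustrates the scoring step, on invented example data.

def entity_f1(predicted, reference):
    """F1 between two collections of hashable entity tuples."""
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # both reports empty: treat as perfect agreement
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reference_entities = {
    ("pleural effusion", "present"),
    ("pneumothorax", "absent"),
    ("cardiomegaly", "present"),
}
predicted_entities = {
    ("pleural effusion", "present"),
    ("pneumothorax", "absent"),
    ("nodule", "present"),  # hallucinated finding, and cardiomegaly is missed
}

print(f"entity F1 = {entity_f1(predicted_entities, reference_entities):.3f}")
```

Two of three predicted entities match and two of three reference entities are recovered, so precision, recall, and F1 are all 2/3. A report with fluent wording but a wrong finding is penalized here, whereas an $n$-gram metric may barely notice.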

Track B: Embodied VLM System Design

Target Task: Design a VLM-based perception and planning system for a robotic arm performing bimanual cloth folding. Constraints: The cloth deforms dynamically. The robot operates at 50Hz.

System Architecture (System 1 / System 2 Paradigm)

As derived in Course 2 and Course 4, we split the architecture to handle the latency mismatch:

1. Semantic Task Planner (System 2): A 7B VLM runs asynchronously at 1 Hz on a local GPU server. It receives the high-level language goal ("Fold the towel in half") and overhead RGB-D images. It uses Chain-of-Thought (ReAct) to reason about the cloth's current state and outputs a dense Goal Embedding Vector $e_\text{goal}$ representing the next geometric sub-goal (e.g., "grasp bottom-left corner and pull to top-left corner").

2. High-Frequency Controller (System 1): An Action Chunking Transformer (ACT) or Diffusion Policy runs at 50Hz directly on the robot's edge-compute board. It takes the continuous stream of joint positions, local wrist-camera feeds, and the goal embedding $e_\text{goal}$ from the VLM, and outputs 6-DOF joint velocities. When a Diffusion Policy is used, these commands are produced by its iterative denoising steps.
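The rate relationship between the two systems can be sketched as a loop in which the controller runs every tick and holds the most recent goal until the planner refreshes it. The version below is a synchronous simulation with hypothetical `plan_fn` and `control_fn` callbacks; a real deployment would run the planner asynchronously in its own process, and the planner's inference delay is ignored here.

```python
# Synchronous simulation of a 1 Hz planner feeding a 50 Hz controller.
# plan_fn / control_fn are HYPOTHETICAL stand-ins for the VLM and the policy.

CONTROL_HZ = 50
PLANNER_HZ = 1


def run(seconds, plan_fn, control_fn):
    """Run the control loop, refreshing the goal once per planner period."""
    ticks_per_plan = CONTROL_HZ // PLANNER_HZ
    goal = None
    log = []
    for tick in range(seconds * CONTROL_HZ):
        if tick % ticks_per_plan == 0:
            goal = plan_fn(tick)            # System 2: slow, infrequent
        log.append(control_fn(tick, goal))  # System 1: every control tick
    return log


def toy_planner(tick):
    return f"subgoal-{tick // CONTROL_HZ}"


def toy_controller(tick, goal):
    # A real policy would return joint velocities conditioned on the goal;
    # here we just record which goal was active at this tick.
    return goal


log = run(seconds=3, plan_fn=toy_planner, control_fn=toy_controller)
print(len(log), "control ticks,", len(set(log)), "distinct goals")
```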

Closed-Loop Failure Analysis Tree

If the robot fails to fold the cloth, a rigorous debugging tree must be traversed:

Perception Failure (System 2): Did the VLM correctly ground the corners of the cloth? Check the 2D bounding boxes. If they are correct, check the depth map projection. If the depth map is noisy because the cloth lacks texture, the 3D $(X,Y,Z)$ target sent to System 1 is corrupted even though the 2D grounding was right.
Quantization/Action Failure (System 1): Did the Diffusion policy generate a smooth trajectory? If the action was tokenized (RT-2 style), check the quantization bin error.
Safety Layer Trigger: Did a Control Barrier Function (CBF) intercept and override the planned trajectory to prevent a self-collision? If the CBF $\gamma$ parameter is too conservative, the robot will stall mid-fold.
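The stalling behaviour in the last branch can be reproduced in one dimension. For a single integrator $\dot{x} = u$ with safe set $h(x) = x_\text{max} - x \geq 0$, the CBF condition $\dot{h} + \gamma h \geq 0$ reduces to $u \leq \gamma\,(x_\text{max} - x)$. The sketch below is a toy model of that filter, not a self-collision CBF for a real arm; the limits, speeds, and step size are assumed values.

```python
# 1-D CBF safety filter for x_dot = u with h(x) = x_max - x.
# Toy model with ASSUMED numbers; a real arm needs a multi-dimensional CBF.

def cbf_filter(u_nominal, x, x_max, gamma):
    """Clip the command to the largest velocity allowed by h_dot + gamma*h >= 0."""
    u_bound = gamma * (x_max - x)
    return min(u_nominal, u_bound)


def simulate(gamma, x_max=1.0, target=0.9, u_cmd=0.5, dt=0.02, steps=100):
    """Drive toward `target` for `steps` ticks at 50 Hz and return the final x."""
    x = 0.0
    for _ in range(steps):
        u_nominal = u_cmd if x < target else 0.0
        x += dt * cbf_filter(u_nominal, x, x_max, gamma)
    return x


final_permissive = simulate(gamma=5.0)
final_conservative = simulate(gamma=0.1)

print(f"gamma=5.0 -> final x = {final_permissive:.3f}")
print(f"gamma=0.1 -> final x = {final_conservative:.3f}")
```

With the larger $\gamma$ the filter never binds before the target and the motion completes. With the small $\gamma$ the allowed velocity is capped far below the commanded one from the first tick, so after two seconds the state has covered only a fraction of the distance. Both runs stay inside the safe set, which is why this failure shows up as a stall rather than a safety violation.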

Multi-Dimensional Evaluation Framework

Benchmark accuracy is a single, flawed dimension. A deployment-ready VLM must report a radar chart of metrics:

Task Accuracy: Grounding IoU (with strict threshold $\theta=0.9$ for robotics), RadGraph F1, or task success rate.
Calibration (ECE): Does the model's softmax confidence match its empirical accuracy? If the VLM is 95% confident about a hallucinated object, the system is fundamentally unsafe.
Robustness (Sim2Real Gap): Measure the performance drop when moving from high-resolution, static training images to motion-blurred, dynamically occluded feeds from physical webcams.
Safety (Hallucination Rate): Evaluated using CHAIR or adversarial typographic attacks (testing if text pasted on an object overrides physical visual geometry).
Efficiency: P95 Inference latency (in milliseconds) and VRAM usage.
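Two of these metrics are easy to compute incorrectly, so a reference sketch is useful. Expected Calibration Error bins predictions by confidence and averages the gap between mean confidence and accuracy in each bin, weighted by bin size. P95 latency is taken here with the nearest-rank method. The function names and the sample data are illustrative assumptions.

```python
import math

# Reference sketches for ECE and P95 latency on ILLUSTRATIVE data.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted mean |accuracy - confidence| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        acc = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(acc - mean_conf)
    return ece


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# A model that is 95% confident but right only half the time.
ece = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 0, 0])
print(f"ECE = {ece:.2f}")

latencies_ms = list(range(1, 101))  # 1 ms .. 100 ms, one sample each
print("P95 latency =", percentile_nearest_rank(latencies_ms, 95), "ms")
```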

Hardware and Reproducibility Mandates

VLM research suffers from severe reproducibility failures because hardware latency and hyperparameter choices are deeply intertwined with algorithmic success. A complete VLM system report must document:

Hardware Constraints: Exact GPU model, VRAM limits, and FP16/INT4 quantization usage. A model that runs at 2Hz on an A100 may run at 0.1Hz on a Jetson Orin, entirely breaking a robotics closed-loop pipeline.
Generation Hyperparameters: Temperature $\tau$, top-$p$ nucleus sampling, and classifier-free guidance scales. Changing $\tau$ from 0.1 to 0.7 turns a near-deterministic grounding model into a visibly stochastic one that hallucinates far more often.
Random Seeds: Explicitly document seeds for data shuffling, network initialization, and diffusion sampling noise.
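The effect of temperature and the value of a recorded seed can both be seen on a three-way toy distribution. The sketch below applies temperature scaling to assumed logits and then samples with a seeded generator; the logits and seed values are arbitrary illustrations.

```python
import math
import random

# Temperature scaling and seeded sampling on ASSUMED toy logits.

def softmax_with_temperature(logits, tau):
    """Softmax of logits / tau, computed stably by subtracting the max."""
    scaled = [value / tau for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_tokens(probs, n, seed):
    """Draw n token indices from probs with a documented seed."""
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=n)


logits = [2.0, 1.0, 0.0]
p_cold = softmax_with_temperature(logits, tau=0.1)
p_warm = softmax_with_temperature(logits, tau=0.7)

print(f"tau=0.1 -> top-1 probability {p_cold[0]:.4f}")
print(f"tau=0.7 -> top-1 probability {p_warm[0]:.4f}")

# Same seed, same samples: the run can be reproduced exactly.
run_a = sample_tokens(p_warm, n=10, seed=1234)
run_b = sample_tokens(p_warm, n=10, seed=1234)
print("reproducible:", run_a == run_b)
```

At $\tau = 0.1$ nearly all probability mass sits on the top logit, so sampling almost always returns it. At $\tau = 0.7$ roughly a quarter of the mass moves to the alternatives, so an unrecorded temperature or seed is enough to make two runs of the same checkpoint disagree.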

Course Retrospective: Foundation Models for Physical AI

This four-course sequence has traced the evolution of physical artificial intelligence from first mathematical principles.

We began in Course 1 (Reinforcement Learning), establishing the formal mathematics of Markov Decision Processes, Bellman equations, and the theoretical limits of optimal control and policy gradients (PPO). In Course 2 (Robotics), we grounded those abstractions in physical reality, transforming abstract actions into continuous torques constrained by Euler-Lagrange dynamics, sensor noise, and Control Barrier Functions. In Course 3 (Generative Models), we explored the frontiers of probability, learning to model infinitely complex, high-dimensional distributions using Diffusion, Flow Matching, and VAEs. Finally, in Course 4 (Vision-Language Models), we achieved semantic alignment, connecting the unstructured pixels of the physical world to the abstract reasoning of human language through contrastive pretraining, grounding, and multimodal agentic loops.

The future of physical AI relies on the synthesis of these domains: using generative diffusion to power continuous robotic control, guided by the deep semantic reasoning of vision-language agents, all governed by the rigorous safety mathematics of classical control theory.

Capstone Project

Option A (Fine-Tuning Track): Select a highly specific, proprietary vision-language task (e.g., parsing architectural blueprints, identifying manufacturing defects on PCBs).

Curate a dataset of at least 1,000 labeled examples.
Fine-tune a LLaVA-1.5 or InstructBLIP model using QLoRA.
Design and execute a strict three-condition ablation study (Zero-shot baseline, Projector-only tuning, Full QLoRA tuning).
Evaluate on at least three dimensions (Task Accuracy, Calibration, Hallucination Rate). Write a 2,000-word engineering report detailing the specific VRAM footprint and the mathematical source of your model's primary failure mode.

Option B (Embodied Track): Design a complete VLM-based perception and planning system for an autonomous drone inspecting wind turbines. The design document must specify, with equations where applicable:

The Perception Module (How does it compress continuous 4K video feeds? Refer to Perceiver/Flamingo architectures).
The Reasoning Module (How does the VLM output structured navigation coordinates?).
The Execution Module (How do those coordinates map to the drone's continuous control policy?).
The Safety Layer (Define the Control Barrier Function that prevents the drone from crashing into the turbine blades).
Provide a 50-trial testing protocol explicitly designed to evaluate performance under visual Covariate Shift (e.g., testing at night vs. day).
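With only 50 trials, a raw success rate says little without an interval around it. One reasonable way to report the results is a Wilson score interval per condition. The sketch below assumes an even 25/25 split between day and night and uses invented outcome counts purely to show the calculation.

```python
import math

# Wilson score interval for a success rate. Trial counts are INVENTED.

def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% confidence interval (z = 1.96) for a binomial proportion."""
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    return center - half_width, center + half_width


# Hypothetical protocol: 25 daytime trials and 25 night-time trials.
results = {"day": (23, 25), "night": (15, 25)}

for condition, (successes, trials) in results.items():
    low, high = wilson_interval(successes, trials)
    print(f"{condition}: {successes}/{trials} succeeded, "
          f"95% interval [{low:.2f}, {high:.2f}]")
```

Intervals this wide are typical at 25 trials per condition, so a protocol that aims to detect a covariate-shift effect should state in advance how large a drop it is able to resolve.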
