Week 9: Evaluation and Robustness

✦Learning Outcomes
  • Evaluate VLMs using modern reasoning benchmarks
  • Diagnose distribution shifts and spurious correlations in VLM (Vision-Language Model) deployment
  • Design robust evaluation protocols for physical deployment
◆Prerequisites
  • Completion of most Course 4 weeks recommended
  • Understanding of VLM (Vision-Language Model) architectures from earlier weeks

Purpose of this lecture

Building a Vision-Language Model and achieving high accuracy on a public benchmark are not the same as building a system that works reliably in physical deployment. Evaluation metrics for generative multimodal tasks are mathematically messy, frequently misunderstood, and routinely misaligned with actual deployment objectives. A model boasting "State-of-the-Art" (SOTA) on a VQA leaderboard will frequently fail when mounted on a robot traversing a slightly novel environment.

This lecture examines the evaluation landscape across captioning, Visual Question Answering (VQA), and grounding tasks. We analyze the mathematics behind historical metrics, explore the distribution shift phenomena (spurious correlations, covariate shift) that expose the gap between benchmarks and reality, and outline the adversarial testing methodologies needed to stress-test a VLM's robustness before physical deployment. Testing can expose failures but cannot prove their absence, so the goal is evidence of robustness rather than proof.


Captioning Metrics: The problem of scoring language

Image captioning is evaluated by comparing a machine-generated caption to a set of human-written reference captions. Because language is highly flexible, exact string matching is useless. The field historically adapted metrics from Machine Translation, each capturing a different mathematical aspect of text similarity:

BLEU-4 (Bilingual Evaluation Understudy)

BLEU (Papineni et al., 2002) measures simple $n$-gram precision: what fraction of the $n$-grams in the generated caption appear anywhere in the reference captions? BLEU-4 (using up to 4-grams) is computed as:

$$\text{BLEU-4} = \text{BP} \cdot \exp\!\left(\sum_{n=1}^{4} w_n \log p_n\right)$$

where $p_n$ is the clipped $n$-gram precision, $w_n = 1/4$ are uniform weights, and $\text{BP}$ is the Brevity Penalty:

$$\text{BP} = \begin{cases} 1 & \text{if } c > r \\ \exp(1 - r/c) & \text{if } c \leq r \end{cases}$$

(where $c$ is the length of the candidate and $r$ is the reference length).

The Flaw: BLEU is poor at capturing semantic meaning. It rewards models for matching common surface patterns ("A photo of a...") while heavily penalizing valid paraphrases. A generated caption that says "A small canine" instead of the reference "A little dog" shares only the unigram "A" with the reference, so its bigram and higher precisions are zero and its unsmoothed BLEU score is $0$, despite the caption being perfectly accurate.
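The definitions above can be sketched in a few lines of Python. This is a minimal, unsmoothed, sentence-level illustration (production BLEU is normally computed at corpus level, often with smoothing); the function names `ngrams`, `clipped_precision`, and `bleu4` are chosen for this sketch and do not come from any library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Clipped n-gram precision p_n: each candidate n-gram count is capped
    at the largest count of that n-gram in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0  # candidate is shorter than n tokens
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu4(candidate, references):
    """Unsmoothed sentence-level BLEU-4 with uniform weights w_n = 1/4."""
    c = len(candidate)
    # Reference length r: the reference closest in length to the candidate.
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    precisions = [clipped_precision(candidate, references, n) for n in range(1, 5)]
    if c == 0 or min(precisions) == 0.0:
        return 0.0  # log(0) is undefined, so the unsmoothed score is 0
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(0.25 * math.log(p) for p in precisions))


paraphrase_score = bleu4("a small canine".split(), ["a little dog".split()])
exact_score = bleu4("a dog catches a frisbee".split(),
                    ["a dog catches a frisbee".split()])
short_score = bleu4("a dog catches a".split(),
                    ["a dog catches a frisbee".split()])
```

Here `paraphrase_score` is zero because no bigram matches, `exact_score` is one, and `short_score` is reduced only by the brevity penalty $\exp(1 - 5/4)$.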

CIDEr (Consensus-based Image Description Evaluation)

CIDEr (Vedantam et al., 2015) was designed specifically for image captioning. It weights $n$-gram matches by their TF-IDF (Term Frequency-Inverse Document Frequency) scores across the entire dataset.

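$$w_k(c_i) = \frac{n_k(c_i)}{\sum_{\omega \in \Omega} n_\omega(c_i)} \times \log\left(\frac{|I|}{\sum_{I_j \in I} \min\left(1, \sum_{q} n_k(s_{jq})\right)}\right)$$

Here $n_k(c)$ is the number of times $n$-gram $k$ occurs in caption $c$, $\Omega$ is the vocabulary of all $n$-grams, $I$ is the set of images in the dataset, and $s_{jq}$ is the $q$-th reference caption of image $I_j$. The first factor is the term frequency of $k$ in the caption; the second is its inverse document frequency, which counts how many images have at least one reference containing $k$.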

By down-weighting common $n$-grams (like "a", "the", "on a") and up-weighting rare, informative ones (like "stethoscope" or "skateboard"), CIDEr rewards the model for identifying the actual subjects of the image. For each $n$-gram length, the score is the average cosine similarity between the TF-IDF vector of the candidate and those of the references; the final CIDEr score averages over $n = 1, \dots, 4$. While CIDEr correlates better with human judgment than BLEU, optimizing a VLM directly against CIDEr (using Reinforcement Learning, e.g., SCST) can result in "CIDEr-hacking," where the model outputs grammatically broken strings of rare nouns ("dog frisbee park catch") to maximize the TF-IDF dot product.
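To make the TF-IDF weighting concrete, here is a deliberately simplified sketch that scores unigrams only. Real CIDEr also scores 2-, 3-, and 4-grams, and the CIDEr-D variant adds count clipping and a length penalty, so this is an illustration of the mechanism rather than a reimplementation. The toy corpus `REFERENCES` and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: image id -> list of tokenized reference captions.
REFERENCES = {
    0: ["a dog catches a frisbee in the park".split()],
    1: ["a man rides a skateboard in the street".split()],
    2: ["a cat sits on a couch in the room".split()],
    3: ["a woman holds an umbrella in the rain".split()],
}


def document_frequency(references):
    """Number of images whose references contain each word at least once."""
    df = Counter()
    for captions in references.values():
        seen = set()
        for caption in captions:
            seen.update(caption)
        df.update(seen)
    return df


def tfidf(tokens, df, num_images):
    """TF-IDF vector of one caption; unseen words get df = 1 to avoid log(N/0)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: (count / total) * math.log(num_images / max(1, df[word]))
            for word, count in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cider_unigram(candidate, image_id, references):
    """Average TF-IDF cosine similarity between a candidate and its references."""
    df = document_frequency(references)
    num_images = len(references)
    cand_vec = tfidf(candidate, df, num_images)
    refs = references[image_id]
    return sum(cosine(cand_vec, tfidf(ref, df, num_images)) for ref in refs) / len(refs)


keyword_score = cider_unigram("dog frisbee park catches".split(), 0, REFERENCES)
fluent_score = cider_unigram("a dog plays in the park".split(), 0, REFERENCES)
```

In this toy corpus "a", "in", and "the" appear for every image, so their IDF is zero and they contribute nothing. The ungrammatical keyword string therefore gets a higher score than the fluent sentence, which is the unigram-level mechanism behind CIDEr-hacking. The higher-order $n$-gram terms in real CIDEr reduce this effect but do not remove it.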

SPICE (Semantic Propositional Image Caption Evaluation)

SPICE (Anderson et al., 2016) abandons $n$-grams entirely. It uses a dependency parser to convert both the generated caption and the references into semantic scene graphs (objects, their attributes, and the relations between them). It then computes an F1 score over the matched graph tuples (e.g., (dog, holds, frisbee)). SPICE scores semantic content rather than surface wording, which makes it far more robust to paraphrasing than BLEU or CIDEr, but it is computationally expensive and relies on the accuracy of a brittle external language parser.
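The final scoring step can be sketched as a set-based F1. The sketch below assumes the parsing stage has already produced tuples (the tuples here are written by hand), and it uses exact matching only, whereas SPICE itself also accepts synonym matches. The function name `tuple_f1` is hypothetical.

```python
def tuple_f1(candidate_tuples, reference_tuples):
    """F1 over scene-graph tuples, using exact tuple matching."""
    cand, ref = set(candidate_tuples), set(reference_tuples)
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hand-written tuples standing in for parser output (an assumption of this sketch).
reference = [("dog",), ("frisbee",), ("dog", "brown"), ("dog", "holds", "frisbee")]
candidate = [("dog",), ("frisbee",), ("dog", "holds", "frisbee")]

score = tuple_f1(candidate, reference)
```

The candidate is fully correct (precision 1) but misses the attribute tuple (recall 3/4), giving an F1 of 6/7. Any parser error changes the tuple sets directly, which is why the metric inherits the parser's brittleness.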


VQA Evaluation and Linguistic Priors

Visual Question Answering (VQA) evaluation appears straightforward—compare the generated answer to the ground-truth string—but it hides deep structural biases.

VQA v2 Accuracy: To account for human disagreement (e.g., is the car "red" or "maroon"?), the standard VQA v2 benchmark collects 10 human answers per question. The accuracy for a predicted answer $\hat{a}$ is defined as:

$$\text{Acc}(\hat{a}) = \min\!\left(1, \frac{\sum_{i=1}^{10} \mathbb{1}_{\{a_i = \hat{a}\}}}{3}\right)$$

An answer receives full credit ($1.0$) if at least 3 out of 10 humans agreed with it.
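A direct transcription of this formula, as a sketch: the official evaluation script additionally normalizes answer strings and averages over subsets of annotators, which is omitted here. The function name `vqa_accuracy` and the example answers are hypothetical.

```python
def vqa_accuracy(prediction, human_answers):
    """Simplified VQA v2 accuracy: min(1, number of matching annotators / 3)."""
    matches = sum(1 for answer in human_answers if answer == prediction)
    return min(1.0, matches / 3)


# Ten hypothetical annotator answers for "What is on the plate?"
answers = ["pizza"] * 6 + ["pie"] * 2 + ["food"] * 2

full_credit = vqa_accuracy("pizza", answers)    # 6 matches, capped at 1.0
partial_credit = vqa_accuracy("pie", answers)   # 2 matches -> 2/3
no_credit = vqa_accuracy("flatbread", answers)  # 0 matches -> 0.0
```

Note that "flatbread" receives zero credit no matter how reasonable it is, because the comparison is exact string equality against the annotators' wording.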

The Problem of Linguistic Priors

The VQA v2 dataset suffers from severe class imbalance. For example, in the training set, if the question begins with "What sport is...", the answer is "tennis" over 40% of the time. If the question is "Is there a clock?", the answer is "yes" 70% of the time.

A "blind" neural network that completely ignores the image and simply memorizes the conditional text probability $P(\text{answer} \mid \text{question text})$ will achieve surprisingly high accuracy. When a VLM achieves 70% on a VQA benchmark, the score alone cannot tell us whether the model actually possesses visual reasoning capabilities or has simply memorized the linguistic priors of the dataset.
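One way to measure how much a benchmark score could be explained by priors is to fit such a blind baseline and report its accuracy next to the model's. The sketch below is a minimal version that predicts the most frequent training answer for each question prefix; the data, the three-word prefix, and the function names are illustrative assumptions, and it scores with exact match rather than the soft VQA accuracy.

```python
from collections import Counter, defaultdict


def question_prefix(question, num_words=3):
    """First few lowercase words of a question, used as the only feature."""
    return " ".join(question.lower().split()[:num_words])


def fit_blind_baseline(train_pairs, num_words=3):
    """Map each question prefix to its most frequent training answer."""
    counts = defaultdict(Counter)
    for question, answer in train_pairs:
        counts[question_prefix(question, num_words)][answer] += 1
    return {prefix: c.most_common(1)[0][0] for prefix, c in counts.items()}


def blind_accuracy(baseline, test_pairs, num_words=3):
    """Exact-match accuracy of the image-free baseline."""
    correct = sum(1 for question, answer in test_pairs
                  if baseline.get(question_prefix(question, num_words)) == answer)
    return correct / len(test_pairs)


# Hypothetical (question, answer) pairs; no images are used anywhere.
TRAIN = [
    ("What sport is he playing?", "tennis"),
    ("What sport is shown here?", "tennis"),
    ("What sport is this?", "baseball"),
    ("Is there a clock?", "yes"),
    ("Is there a dog?", "yes"),
    ("Is there a car?", "no"),
]
TEST = [
    ("What sport is being played?", "tennis"),
    ("What sport is on TV?", "soccer"),
    ("Is there a bench?", "yes"),
    ("Is there a bird?", "yes"),
]

baseline = fit_blind_baseline(TRAIN)
prior_accuracy = blind_accuracy(baseline, TEST)
```

On this toy data the baseline reaches 75% without ever seeing a pixel. A VLM's score is only informative to the extent that it exceeds this kind of floor.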


Grounding evaluation

Visual grounding tasks (mapping text to specific bounding boxes) are evaluated with rigid localization metrics.

Intersection over Union (IoU): For a predicted bounding box $\hat{B}$ and ground-truth box $B$, the Jaccard index is:

$$\text{IoU}(\hat{B}, B) = \frac{\text{Area}(\hat{B} \cap B)}{\text{Area}(\hat{B} \cup B)}$$

Acc@IoU $\theta$: The standard metric is binary accuracy, where a prediction is deemed correct if $\text{IoU} \geq \theta$ (typically $\theta = 0.5$).

The Robotics Reality Gap: In standard computer vision, an IoU of $0.5$ is considered a "success." However, if a robot uses a bounding box with an IoU of $0.5$ to calculate the kinematics for grasping a delicate object, the gripper may miss the object entirely or crash into the table. Benchmark grounding metrics are too forgiving for physical AI deployment, where grasping typically requires sub-centimeter precision (or full pixel-wise segmentation).
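The following sketch computes IoU and Acc@IoU for axis-aligned boxes and shows how large a localization error the $0.5$ threshold tolerates. Boxes are assumed to be `(x1, y1, x2, y2)` tuples in pixels; the function names and the example boxes are hypothetical.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(pred, truth):
    """Intersection over Union (Jaccard index) of two boxes."""
    inter_box = (max(pred[0], truth[0]), max(pred[1], truth[1]),
                 min(pred[2], truth[2]), min(pred[3], truth[3]))
    inter = box_area(inter_box)  # zero when the boxes do not overlap
    union = box_area(pred) + box_area(truth) - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(preds, truths, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(1 for p, t in zip(preds, truths) if iou(p, t) >= threshold)
    return hits / len(truths)


truth = (0, 0, 100, 100)   # 100 x 100 px object
pred = (30, 0, 130, 100)   # same size, shifted 30 px to the right

shifted_iou = iou(pred, truth)                        # 7000 / 13000
passes_standard = acc_at_iou([pred], [truth], 0.5)    # counted as correct
passes_strict = acc_at_iou([pred], [truth], 0.9)      # counted as wrong
```

A box displaced by 30% of the object's width still counts as a success at $\theta = 0.5$. For a gripper, that offset can be the difference between a grasp and a collision.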


Distribution Shift: The Benchmark-Reality Gap

VLMs trained on web-scale data face systematic failure modes when deployed in specialized domains. The core problem is that benchmark accuracy estimates the expectation $\mathbb{E}_{(x,y)\sim P_\text{test}} [\mathcal{L}(\hat{y}, y)]$, where the test distribution $P_\text{test}$ is essentially the same as the training distribution $P_\text{train}$. In reality, the deployment distribution $P_\text{deploy}$ can differ substantially from both.

Covariate Shift: The input distribution $P(x)$ changes, but the causal relationship $P(y \mid x)$ stays the same. For VLMs, this happens when the visual domain shifts (e.g., training on iPhone photos of streets, deploying on grainy infrared drone footage). If the VLM's visual encoder memorized high-frequency texture statistics rather than structural geometry, performance plummets.

Spurious Correlations: The model exploits statistical correlations that co-occur in the training data but are not causally related. If every training image of a "doctor" features a person wearing a white coat in a hospital, the VLM will learn the function $f(\text{white coat}) \to \text{doctor}$. If deployed in a laboratory where scientists wear white coats, the VLM will confidently and consistently hallucinate doctors.

Compositional Generalization Failure: A VLM may perfectly understand the concepts "red", "cube", "blue", and "sphere". However, when asked to evaluate the composition "A red cube balancing on top of a blue sphere", it fails. Standard benchmarks test the marginal distributions of concepts; true robustness requires testing the joint, compositional distribution.
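One diagnostic for these failure modes, which is a common practice rather than something this lecture prescribes, is to break accuracy down by (context, label) group instead of reporting a single average. The sketch below uses hypothetical records for the white-coat example; the field layout and function names are assumptions.

```python
from collections import defaultdict


def overall_accuracy(records):
    """Plain average accuracy over (context, label, prediction) records."""
    return sum(1 for _, label, pred in records if pred == label) / len(records)


def group_accuracies(records):
    """Accuracy for each (context, label) group."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for context, label, pred in records:
        group = (context, label)
        totals[group] += 1
        hits[group] += int(pred == label)
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical predictions from a model that keys on white coats.
RECORDS = [
    ("hospital", "doctor", "doctor"),
    ("hospital", "doctor", "doctor"),
    ("hospital", "doctor", "doctor"),
    ("hospital", "doctor", "doctor"),
    ("laboratory", "scientist", "doctor"),
    ("laboratory", "scientist", "doctor"),
    ("laboratory", "scientist", "doctor"),
    ("laboratory", "scientist", "scientist"),
]

average = overall_accuracy(RECORDS)
per_group = group_accuracies(RECORDS)
worst_group = min(per_group.values())
```

The average (62.5%) hides the fact that the model is perfect in the context it was trained on and fails 75% of the time once the shortcut stops holding. If the deployment distribution is dominated by the weak group, deployed accuracy approaches the worst-group number, not the average.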


Robustness Evaluation Methodology

Rigorous VLM engineering requires going beyond standard test sets and adding adversarial evaluation:

1. Contrast Sets / Counterfactuals: Instead of testing a single image, test pairs of images that are nearly identical but differ in one crucial semantic detail that changes the answer. If the VLM answers correctly on Image A ("The light is red") but fails to change its answer on Image B ("The light is green"), it is relying on a spurious correlation or a language prior, not visual grounding.

2. Cross-Dataset Transfer (Sim2Real for VLMs): Evaluate a model trained on clean, high-resolution web data (e.g., COCO) on a dataset like VizWiz (images taken by blind users: often blurry, poorly framed, off-center). The performance drop between these two datasets is analogous to the Sim2Real gap discussed in Course 2 (Robotics). It quantifies how much of the model's performance was overfitted to the aesthetic framing of internet photography.

3. Typographic Attacks: CLIP-based vision encoders are highly susceptible to "reading" text in an image and allowing it to override visual geometry. Placing a sticky note that says "IPHONE" on an apple will cause many VLMs to confidently classify the object as an iPhone. Robustness testing must include adversarial text overlays to ensure the model respects physical geometry over OCR features.
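For the contrast-set protocol in item 1, a useful summary is pair-level accuracy: a pair counts only if the model is right on both members. The sketch below assumes the model is exposed as a `predict` callable; `prior_only_model`, the file names, and the pairs are hypothetical placeholders, and no image is actually loaded.

```python
def contrast_set_scores(pairs, predict):
    """Return (per-example accuracy, pair accuracy) over contrast pairs.

    Each pair is ((input_a, answer_a), (input_b, answer_b)) with different answers.
    """
    example_hits = 0
    pair_hits = 0
    for (input_a, answer_a), (input_b, answer_b) in pairs:
        correct_a = predict(input_a) == answer_a
        correct_b = predict(input_b) == answer_b
        example_hits += int(correct_a) + int(correct_b)
        pair_hits += int(correct_a and correct_b)
    return example_hits / (2 * len(pairs)), pair_hits / len(pairs)


def prior_only_model(example):
    """Hypothetical stand-in for a VLM that ignores the image entirely."""
    return "red"


# Hypothetical contrast pairs; the image paths are placeholders and never opened.
PAIRS = [
    (({"image": "light_red.png", "question": "What color is the light?"}, "red"),
     ({"image": "light_green.png", "question": "What color is the light?"}, "green")),
    (({"image": "car_red.png", "question": "What color is the car?"}, "red"),
     ({"image": "car_blue.png", "question": "What color is the car?"}, "blue")),
]

example_accuracy, pair_accuracy = contrast_set_scores(PAIRS, prior_only_model)
```

The image-blind stand-in scores 50% per example but 0% per pair, so the pair metric exposes a shortcut that example-level accuracy rewards. The same harness can score typographic attacks by pairing each clean image with its text-overlay version.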


Key takeaways

Standard VLM metrics are flawed proxies for human judgment. BLEU relies on rigid $n$-gram matching, while CIDEr reweights based on TF-IDF to emphasize rare words, though both can be "hacked." VQA accuracy is heavily inflated by linguistic priors and dataset class imbalances. In grounding, the standard $\text{IoU} \geq 0.5$ success threshold is too loose to guarantee safety in physical robotics applications. Because benchmark test sets share the training distribution's flaws, high leaderboard scores mask severe vulnerabilities to covariate shift, spurious correlations, and compositional failures. Rigorous deployment requires adversarial contrast sets, cross-dataset transfer testing, and evaluations explicitly designed to break the model's reliance on statistical shortcuts.


Conceptual questions

  1. CIDEr TF-IDF Mathematics: Consider a generated caption that consists purely of five highly specific, rare nouns concatenated together (e.g., "stethoscope scalpel syringe clipboard doctor") with zero grammatical structure. Using the TF-IDF formulation of CIDEr, explain mathematically why this garbage string might achieve a higher CIDEr score than a perfectly grammatical sentence ("A medical professional is holding a stethoscope in the hospital"). If you were using Reinforcement Learning to optimize a VLM's captioning head, how would you modify the reward function to prevent this specific "CIDEr-hacking" collapse?
  2. VQA Accuracy and Annotator Variance: A VQA v2 question asks "What color is the car?" 4 humans answered "maroon", 4 humans answered "burgundy", and 2 answered "dark red". The VLM predicts "red". Calculate the exact VQA accuracy score for the VLM's prediction. Explain why evaluating open-ended VQA tasks using exact string-matching against human annotations structurally penalizes VLMs for utilizing a broader, more descriptive vocabulary than the average annotator.
  3. Spurious Correlations vs. Causal Features: A VQA model is trained on a dataset where 98% of questions asking "What is the person riding?" correspond to images containing snow, and the answer is "snowboard." You deploy the model, and when shown an image of a person riding a skateboard on concrete, it predicts "snowboard." Formulate a mathematical argument defining whether the model learned $P(\text{snowboard} \mid \text{riding})$ (a language prior) or $P(\text{snowboard} \mid \text{white background pixels})$ (a visual spurious correlation). Design a specific contrast set of two images that would definitively prove which shortcut the model is exploiting.
  4. IoU Thresholds in Robotics: A VLM grounding model achieves 92% Acc@IoU=0.5 on a benchmark. A robotics team integrates this model to direct a robotic arm to grasp a coffee mug by its handle. During physical trials, the robot repeatedly smashes its gripper into the main body of the mug. Draw a 2D bounding box diagram demonstrating how a predicted bounding box can achieve an IoU of $0.55$ with the ground truth, yet completely exclude the spatial coordinates of the mug's handle. Why must physical AI systems evaluate grounding models using point-wise precision or Acc@IoU=0.9?
  5. Typographic Attacks on the Joint Space: You apply a typographic attack by pasting a piece of paper reading "STOP" onto a speed limit sign. A CLIP-based zero-shot classifier predicts the image is a stop sign. Using the mathematics of the joint image-text embedding space (from Week 3), explain exactly why the visual encoder $f_\theta$ mapped the raw pixels of the word "STOP" to a vector that achieved a higher cosine similarity with the text embedding for "A photo of a stop sign" than the actual physical geometry of the speed limit sign did.
✦Solutions
  1. CIDEr hacking. CIDEr weights $n$-gram matches by TF-IDF, so rare high-IDF nouns dominate the score; a string of correct rare nouns can outscore a fluent sentence that "spends" matches on low-IDF function words. To stop an RL captioner from exploiting this, add a fluency term to the reward (a language-model likelihood, BERTScore, or a grammaticality discriminator) so reward is not pure $n$-gram overlap.
  2. VQA accuracy. VQA score is $\min(\text{matching annotators}/3,\ 1)$. "red" matches none of maroon/burgundy/dark red, so the score is $\min(0/3,\ 1) = 0$. Exact string matching penalizes correct-but-differently-worded answers, structurally punishing a richer vocabulary than the annotator consensus.
  3. Spurious correlation. Build a contrast set with the same question for both images: image A is a skateboard on snow/white background, image B is a snowboard on concrete/dark background. If the model answers "snowboard" for both, its output does not depend on the pixels and it learned the language prior $P(\text{snowboard} \mid \text{riding})$. If predictions track the background (snow → "snowboard", concrete → something else) the model learned $P(\text{snowboard} \mid \text{white pixels})$, the visual shortcut. If they track the object regardless of background it learned the causal feature. The two images isolate which cue drives the prediction.
  4. IoU vs handle. A predicted box shifted onto the mug body can overlap the ground-truth box enough for IoU $0.55$, because the large body dominates both intersection and union, while entirely excluding the thin handle region. IoU rewards bulk-area overlap, not part coverage, so physical grasping (which needs the handle's coordinates) must evaluate with point-wise precision or a high threshold like Acc@IoU=0.9.
  5. Typographic attack. CLIP learned that pixels spelling "STOP" co-occur with stop-sign captions, so its encoder maps the rendered word to a region of the joint space near the "a photo of a stop sign" text embedding, yielding a higher cosine similarity than the actual speed-sign geometry achieves. The model reads the text semantically instead of recognizing the physical object, so the pasted word overrides the true sign.

Looking ahead

Evaluation reveals the boundaries of what VLMs can naturally perceive and generate. The next question is how to forcefully control what they generate, injecting strict structural constraints into the generative process.

Week 10: ControlNet and Controlled Generation. We examine how structural conditioning (edges, depth maps, pose skeletons) is used to mathematically steer diffusion-based models, how zero-initialized control networks integrate with frozen backbones, and how VLMs serve as high-level semantic controllers that translate language into structured signals for physical action.


Further reading

  • Vedantam, R., et al. (2015). CIDEr: Consensus-based Image Description Evaluation. CVPR.
  • Anderson, P., et al. (2016). SPICE: Semantic Propositional Image Caption Evaluation. ECCV.
  • Gardner, M., et al. (2020). Evaluating Models' Local Decision Boundaries via Contrast Sets. Findings of EMNLP. (Adversarial robustness evaluation).