Week 9: Evaluation and Robustness

Purpose of this lecture

Building a Vision-Language Model and achieving high accuracy on a public benchmark are not the same as building a system that works reliably in physical deployment. Evaluation metrics for generative multimodal tasks are mathematically messy, frequently misunderstood, and routinely misaligned with actual deployment objectives. A model boasting "State-of-the-Art" (SOTA) on a VQA leaderboard will frequently fail when mounted on a robot traversing a slightly novel environment.

This lecture rigorously examines the evaluation landscape across captioning, Visual Question Answering (VQA), and grounding tasks. We analyze the mathematics behind historical metrics, explore the distribution shift phenomena (spurious correlations, covariate shift) that expose the gap between benchmarks and reality, and outline the adversarial testing methodologies needed to stress-test a VLM's robustness before physical deployment.

Captioning Metrics: The problem of scoring language

Image captioning is evaluated by comparing a machine-generated caption to a set of human-written reference captions. Because language is highly flexible, exact string matching is useless. The field historically adapted metrics from Machine Translation, each capturing a different mathematical aspect of text similarity:

BLEU-4 (Bilingual Evaluation Understudy)

BLEU (Papineni et al., 2002) measures modified $n$-gram precision: what fraction of the $n$-grams in the generated caption also appear in the reference captions, with each $n$-gram's count clipped to the maximum number of times it occurs in any single reference? BLEU-4 (using up to 4-grams) is computed as:

$$\text{BLEU-4} = \text{BP} \cdot \exp\!\left(\sum_{n=1}^{4} w_n \log p_n\right)$$

where $p_n$ is the clipped $n$-gram precision, $w_n = 1/4$ are uniform weights, and $\text{BP}$ is the Brevity Penalty:

$$\text{BP} = \begin{cases} 1 & \text{if } c > r \\ \exp(1 - r/c) & \text{if } c \leq r \end{cases}$$

(where $c$ is the length of the candidate and $r$ is the effective reference length, typically the length of the reference closest in length to the candidate).

The Flaw: BLEU is exceptionally poor at capturing semantic meaning. It heavily rewards models for matching common syntactic structures ("A photo of a...") while heavily penalizing valid paraphrases. A generated caption that says "A small canine" instead of the reference "A little dog" shares only the unigram "A" with the reference. Every higher-order precision is zero, so the unsmoothed geometric mean, and with it the BLEU score, is $0$, despite the caption being perfectly accurate.
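The formula can be made concrete with a short sketch. The code below is a minimal, unsmoothed sentence-level BLEU-4, assuming lowercase whitespace tokenization and the "closest reference length" convention for $r$; the function names `ngrams` and `bleu4` are illustrative, and production toolkits add smoothing and corpus-level aggregation that are omitted here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, references):
    """Unsmoothed sentence-level BLEU-4 (assumes whitespace tokenization)."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    log_precision_sum = 0.0
    for n in range(1, 5):
        cand_counts = ngrams(cand, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # a single zero precision zeroes the geometric mean
        log_precision_sum += 0.25 * math.log(clipped / total)
    c = len(cand)
    # Effective reference length: the reference length closest to c (ties -> shorter).
    r = min((len(ref) for ref in refs), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(log_precision_sum)

print(bleu4("a small canine", ["a little dog"]))  # 0.0: no bigram matches
```

Because the score is a geometric mean, a caption shorter than four tokens, or one with no matching 4-gram, scores exactly zero in this unsmoothed form, which is one reason sentence-level BLEU is so unforgiving for short captions.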

CIDEr (Consensus-based Image Description Evaluation)

CIDEr (Vedantam et al., 2015) was designed specifically for image captioning. It weights $n$-gram matches by their TF-IDF (Term Frequency-Inverse Document Frequency) scores computed across the entire dataset. For an $n$-gram $k$ in a caption $c_i$, the weight is:

$$w_k(c_i) = \frac{n_k(c_i)}{\sum_{\omega \in \Omega} n_\omega(c_i)} \times \log\left(\frac{|I|}{\sum_{I_j \in I} \min\left(1, \sum_{s \in S_j} n_k(s)\right)}\right)$$

where $n_k(c)$ is the number of times $n$-gram $k$ occurs in caption $c$, $\Omega$ is the vocabulary of all $n$-grams, $I$ is the set of all images in the dataset, and $S_j$ is the set of reference captions for image $I_j$. The first factor is the term frequency. The second is the inverse document frequency, whose denominator counts the images that contain $k$ in at least one reference caption.

By down-weighting common $n$-grams (like "a", "the", "on a") and up-weighting rare, informative ones (like "stethoscope" or "skateboard"), CIDEr rewards the model for identifying the actual subjects of the image. The final CIDEr score is the cosine similarity between the TF-IDF vectors of the candidate and each reference, averaged over the references and over $n$-gram orders $n = 1$ to $4$. While CIDEr correlates better with human judgment than BLEU, optimizing a VLM directly against CIDEr (using Reinforcement Learning, e.g., SCST) often results in "CIDEr-hacking," where the model outputs grammatically broken strings of rare nouns ("dog frisbee park catch") to maximize the TF-IDF dot product.
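The "CIDEr-hacking" effect can be reproduced on a toy scale. The sketch below is a deliberately simplified, unigram-only version of the TF-IDF cosine score (full CIDEr also uses 2- to 4-grams and stemming); the three-image corpus `toy_corpus` and the function names are hypothetical and exist only for illustration.

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, num_images):
    """Map each word to (term frequency) * (inverse document frequency)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vec = {}
    for word, count in counts.items():
        # Words never seen in any reference are treated as having document frequency 1.
        idf = math.log(num_images / max(1, doc_freq.get(word, 0)))
        vec[word] = (count / total) * idf
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def cider_unigram(candidate, references, corpus):
    """Unigram-only CIDEr-style score: mean TF-IDF cosine over the references."""
    num_images = len(corpus)
    doc_freq = Counter()
    for refs in corpus.values():
        words_in_image = set()
        for ref in refs:
            words_in_image.update(ref.split())
        doc_freq.update(words_in_image)  # count each word once per image
    cand_vec = tfidf_vector(candidate.split(), doc_freq, num_images)
    sims = [cosine(cand_vec, tfidf_vector(ref.split(), doc_freq, num_images))
            for ref in references]
    return sum(sims) / len(sims)

# Hypothetical toy dataset: one reference caption per image.
toy_corpus = {
    "img1": ["a dog catches a frisbee in the park"],
    "img2": ["a man rides a skateboard in the street"],
    "img3": ["a cat sleeps on the sofa in a room"],
}
keyword_score = cider_unigram("dog frisbee park", toy_corpus["img1"], toy_corpus)
fluent_score = cider_unigram("a dog is in the park", toy_corpus["img1"], toy_corpus)
print(keyword_score, fluent_score)
```

In this toy corpus the words "a", "in", and "the" occur in every image, so their IDF is zero and they contribute nothing. The ungrammatical keyword string therefore scores higher than the fluent sentence, which is the same incentive an RL-trained captioner can exploit.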

SPICE (Semantic Propositional Image Caption Evaluation)

SPICE (Anderson et al., 2016) abandons $n$-grams entirely. It uses a dependency parser to convert both the generated caption and the references into semantic scene graphs whose tuples encode objects, attributes, and relations. It then computes an F1 score over the matched tuples (e.g., (dog, holds, frisbee)). SPICE targets semantic content directly and is far less sensitive to paraphrasing than $n$-gram metrics, but it is computationally expensive and relies on the accuracy of a brittle external language parser.
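The scoring step, once the scene graphs exist, is a plain set-overlap F1. The sketch below assumes the parsing has already been done and that tuples match only when they are exactly equal; real SPICE obtains the tuples from a parser and also allows synonym matches, both of which are omitted here. The function name `tuple_f1` and the example tuples are illustrative.

```python
def tuple_f1(candidate_tuples, reference_tuples):
    """F1 over scene-graph tuples, using exact tuple equality as the match rule."""
    cand = set(candidate_tuples)
    ref = set(reference_tuples)
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hand-written tuples standing in for parser output (objects, attributes, relations).
candidate = [("dog",), ("frisbee",), ("dog", "holds", "frisbee")]
reference = [("dog",), ("frisbee",), ("park",),
             ("dog", "brown"), ("dog", "holds", "frisbee")]
print(tuple_f1(candidate, reference))  # precision 3/3, recall 3/5
```

Word order and function words never enter the computation, which is why the metric tolerates rephrasing, and also why any parser error propagates directly into the score.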

VQA Evaluation and Linguistic Priors

Visual Question Answering (VQA) evaluation appears straightforward—compare the generated answer to the ground-truth string—but it hides deep structural biases.

VQA v2 Accuracy: To account for human disagreement (e.g., is the car "red" or "maroon"?), the standard VQA v2 benchmark collects 10 human answers per question. The accuracy for a predicted answer $\hat{a}$ is defined as:

$$\text{Acc}(\hat{a}) = \min\!\left(1, \frac{\sum_{i=1}^{10} \mathbb{1}_{\{a_i = \hat{a}\}}}{3}\right)$$

An answer receives full credit ($1.0$) if at least 3 of the 10 annotators gave it. Fewer matches earn partial credit: one match scores $1/3$ and two score $2/3$.
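As a minimal sketch, the formula translates directly into a few lines of Python. This version assumes answers are already normalized strings and compares them exactly; the function name `vqa_accuracy` and the example annotations are illustrative, and any answer normalization or annotator-subset averaging a particular evaluation script performs is not modeled here.

```python
def vqa_accuracy(prediction, human_answers):
    """min(1, matches / 3), with exact string matching against the annotations."""
    matches = sum(1 for answer in human_answers if answer == prediction)
    return min(1.0, matches / 3)

# Hypothetical annotations for one question: 8 say "maroon", 2 say "red".
annotations = ["maroon"] * 8 + ["red"] * 2
print(vqa_accuracy("maroon", annotations))   # full credit
print(vqa_accuracy("red", annotations))      # partial credit, 2/3
print(vqa_accuracy("crimson", annotations))  # no credit
```

Note that "crimson" scores zero even though it may describe the car well: the metric only knows string identity.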

The Problem of Linguistic Priors

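VQA datasets carry strong answer priors. In the original VQA (v1) training set, for example, if the question begins with "What sport is...", the answer is "tennis" over 40% of the time. If the question is "Is there a clock?", the answer is "yes" 70% of the time. VQA v2 (Goyal et al., 2017) was constructed to weaken these priors by pairing each question with complementary images that have different answers, but it reduces the imbalance rather than eliminating it.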

A "blind" neural network that completely ignores the image and simply memorizes the conditional text probability $P(\text{answer} \mid \text{question text})$ will achieve surprisingly high accuracy. When a VLM achieves 70% on a VQA benchmark, it is mathematically ambiguous whether the model actually possesses visual reasoning capabilities, or if it has simply memorized the linguistic priors of the dataset.
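A "blind" baseline of this kind is easy to build, which is what makes it a useful sanity check. The sketch below memorizes the most frequent answer for each question prefix and never receives an image; the training pairs, the three-word prefix length, and the function names are all hypothetical choices for illustration.

```python
from collections import Counter, defaultdict

def question_prefix(question, prefix_len=3):
    """Lowercased first few words of the question."""
    return " ".join(question.lower().split()[:prefix_len])

def fit_prior(train_pairs, prefix_len=3):
    """Memorize the most frequent answer per question prefix (no images used)."""
    by_prefix = defaultdict(Counter)
    for question, answer in train_pairs:
        by_prefix[question_prefix(question, prefix_len)][answer] += 1
    return {prefix: counts.most_common(1)[0][0] for prefix, counts in by_prefix.items()}

def predict_blind(prior, question, prefix_len=3, default="yes"):
    """Answer from the question text alone, falling back to a default answer."""
    return prior.get(question_prefix(question, prefix_len), default)

# Hypothetical toy training data.
train_pairs = [
    ("What sport is this?", "tennis"),
    ("What sport is shown?", "tennis"),
    ("What sport is being played?", "baseball"),
    ("Is there a clock?", "yes"),
    ("Is there a dog?", "yes"),
    ("Is there a cat?", "no"),
]
prior = fit_prior(train_pairs)
print(predict_blind(prior, "What sport is the man playing?"))
print(predict_blind(prior, "Is there a bench?"))
```

Reporting the accuracy of such a baseline next to the VLM's score shows how much of the headline number could be explained by question text alone.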

Grounding evaluation

Visual grounding tasks (mapping text to specific bounding boxes) are evaluated with rigid localization metrics.

Intersection over Union (IoU): For a predicted bounding box $\hat{B}$ and ground-truth box $B$, the Jaccard index is:

$$\text{IoU}(\hat{B}, B) = \frac{\text{Area}(\hat{B} \cap B)}{\text{Area}(\hat{B} \cup B)}$$

Acc@IoU $\theta$: The standard metric is binary accuracy, where a prediction is deemed correct if $\text{IoU} \geq \theta$ (typically $\theta = 0.5$).
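Both definitions fit in a few lines. The sketch below assumes axis-aligned boxes given as `(x1, y1, x2, y2)` with `x1 < x2` and `y1 < y2`; that box format and the function names `iou` and `acc_at_iou` are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union for axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0

def acc_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of predictions whose IoU with the paired ground truth meets the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

preds = [(0, 0, 10, 10), (5, 0, 15, 10)]
truth = [(0, 0, 10, 10), (0, 0, 10, 10)]
print(acc_at_iou(preds, truth))  # first box matches exactly, second has IoU 1/3
```

The binary threshold discards all information about how far off a "correct" box is, which matters for the deployment discussion that follows.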

The Robotics Reality Gap: In standard computer vision, an IoU of $0.5$ is considered a "success." However, if a robot uses a bounding box with an IoU of $0.5$ to calculate the kinematics for grasping a delicate object, the gripper will likely miss the object entirely or crash into the table. Benchmark grounding metrics are fundamentally too forgiving for physical AI deployment, where sub-centimeter precision (or full pixel-wise segmentation) is strictly required.
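A short worked case shows how loose the threshold is. Assume, purely for illustration, that the predicted box has the same width $w$ and height $h$ as the ground truth but is shifted horizontally by $s$, with $0 \leq s \leq w$. The overlap has width $w - s$ and the union has width $w + s$, so:

```latex
\text{IoU}(s) = \frac{(w - s)\,h}{(w + s)\,h} = \frac{w - s}{w + s},
\qquad
\text{IoU}(s) \geq 0.5 \iff 2(w - s) \geq w + s \iff s \leq \frac{w}{3}
```

Under this assumption, a box displaced by up to one third of the object's width still counts as correct at $\theta = 0.5$. For an object that appears 90 pixels wide, that is a 30-pixel localization error accepted as a success.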

Distribution Shift: The Benchmark-Reality Gap

VLMs trained on web-scale data face systematic failure modes when deployed in specialized domains. The core mathematical problem is that a benchmark score estimates the expectation $\mathbb{E}_{(x,y)\sim P_\text{test}} [\mathcal{L}(\hat{y}, y)]$, where the test set is drawn from essentially the same distribution as the training set ($P_\text{test} \approx P_\text{train}$). In reality, the deployment distribution $P_\text{deploy}$ can be radically different.

Covariate Shift: The input distribution $P(x)$ changes, but the conditional relationship $P(y \mid x)$ stays the same. For VLMs, this happens when the visual domain shifts (e.g., training on iPhone photos of streets, deploying on grainy infrared drone footage). If the VLM's visual encoder relies on high-frequency texture statistics rather than structural geometry, performance plummets.

Spurious Correlations: The model exploits statistical correlations that co-occur in the training data but are not causally related. If every training image of a "doctor" features a person wearing a white coat in a hospital, the VLM will learn the function $f(\text{white coat}) \to \text{doctor}$ . If deployed in a laboratory where scientists wear white coats, the VLM will confidently and consistently hallucinate doctors.

Compositional Generalization Failure: A VLM may perfectly understand the concepts "red", "cube", "blue", and "sphere". However, when asked to evaluate the composition "A red cube balancing on top of a blue sphere", it fails. Standard benchmarks test the marginal distributions of concepts; true robustness requires testing the joint, compositional distribution.
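One common way to probe the joint distribution is a held-out-composition split: every individual concept appears during training, but some attribute-object pairs appear only at test time. The sketch below is a minimal illustration under that assumption; the function name `compositional_split` and the toy vocabulary are hypothetical.

```python
from itertools import product

def compositional_split(attributes, objects, held_out):
    """Split attribute-object pairs so that held-out compositions appear only in test."""
    held_out = set(held_out)
    all_pairs = list(product(attributes, objects))
    train = [pair for pair in all_pairs if pair not in held_out]
    test = [pair for pair in all_pairs if pair in held_out]
    # Every primitive must still be seen in training; only the pairing is novel.
    seen_attributes = {attribute for attribute, _ in train}
    seen_objects = {obj for _, obj in train}
    if seen_attributes != set(attributes) or seen_objects != set(objects):
        raise ValueError("a held-out pair removes a primitive from the training set")
    return train, test

train_pairs, test_pairs = compositional_split(
    attributes=["red", "blue"],
    objects=["cube", "sphere"],
    held_out=[("red", "sphere"), ("blue", "cube")],
)
print(train_pairs)  # [('red', 'cube'), ('blue', 'sphere')]
print(test_pairs)   # [('red', 'sphere'), ('blue', 'cube')]
```

A model that scores well on the training pairs but poorly on the test pairs has learned the marginals ("red", "sphere") without learning to bind them.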

Robustness Evaluation Methodology

Rigorous VLM engineering requires going beyond standard test sets and adding adversarial evaluation:

1. Contrast Sets / Counterfactuals: Instead of testing a single image, test pairs of images that are nearly identical but differ in one crucial semantic detail that changes the answer. If the VLM answers correctly on Image A ("The light is red") but fails to change its answer on Image B ("The light is green"), it is relying on a spurious correlation or a language prior, not visual grounding.

2. Cross-Dataset Transfer (Sim2Real for VLMs): Evaluate a model trained on clean, high-resolution web data (e.g., COCO) on a dataset like VizWiz (images taken by blind users: often blurry, poorly framed, off-center). The performance drop between these two datasets is closely analogous to the Sim2Real gap discussed in Course 2 (Robotics). It quantifies how much of the model's performance was overfitted to the aesthetic framing of internet photography.

3. Typographic Attacks: CLIP-based vision encoders are highly susceptible to "reading" text in an image and allowing it to override visual geometry. Placing a sticky note that says "IPHONE" on an apple will cause many VLMs to confidently classify the object as an iPhone. Robustness testing must include adversarial text overlays to ensure the model respects physical geometry over OCR features.
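The contrast-set idea in item 1 can be scored with a paired metric that gives credit only when both images in a pair are answered correctly. The sketch below assumes a hypothetical `model(image, question)` callable that returns an answer string; the stub models and the dictionary "images" are stand-ins so the example runs without a real VLM.

```python
def evaluate_contrast_set(model, pairs):
    """Return (per-image accuracy, pair accuracy) over a list of contrast pairs.

    Each pair is (question, (image_a, answer_a), (image_b, answer_b)).
    """
    single_correct = 0
    pair_correct = 0
    for question, (image_a, answer_a), (image_b, answer_b) in pairs:
        ok_a = model(image_a, question) == answer_a
        ok_b = model(image_b, question) == answer_b
        single_correct += int(ok_a) + int(ok_b)
        pair_correct += int(ok_a and ok_b)  # credit only if both halves are right
    return single_correct / (2 * len(pairs)), pair_correct / len(pairs)

# Stub "images": dictionaries standing in for real pixels.
contrast_pairs = [
    ("What color is the light?", ({"light": "red"}, "red"), ({"light": "green"}, "green")),
    ("What color is the light?", ({"light": "red"}, "red"), ({"light": "yellow"}, "yellow")),
]

def prior_model(image, question):
    return "red"           # ignores the image entirely

def grounded_model(image, question):
    return image["light"]  # actually reads the image

print(evaluate_contrast_set(prior_model, contrast_pairs))
print(evaluate_contrast_set(grounded_model, contrast_pairs))
```

The prior-driven stub reaches 50% per-image accuracy but 0% pair accuracy, which is the signature of a model that is not using the image. The same harness can score typographic attacks by treating the clean image and the image with overlaid text as a pair.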

Key takeaways

Standard VLM metrics are flawed proxies for human judgment. BLEU relies on rigid $n$-gram matching, while CIDEr reweights based on TF-IDF to emphasize rare words, though both can be mathematically "hacked." VQA accuracy is heavily inflated by linguistic priors and dataset class imbalances. In grounding, the standard $\text{IoU} \geq 0.5$ success threshold is too loose to guarantee safety in physical robotics applications. Because benchmark test sets share the training distribution's flaws, high leaderboard scores mask severe vulnerabilities to covariate shift, spurious correlations, and compositional failures. Rigorous deployment requires adversarial contrast sets, cross-dataset transfer testing, and evaluations explicitly designed to break the model's reliance on statistical shortcuts.

Conceptual questions

CIDEr TF-IDF Mathematics: Consider a generated caption that consists purely of five highly specific, rare nouns concatenated together (e.g., "stethoscope scalpel syringe clipboard doctor") with zero grammatical structure. Using the TF-IDF formulation of CIDEr, explain mathematically why this garbage string might achieve a higher CIDEr score than a perfectly grammatical sentence ("A medical professional is holding a stethoscope in the hospital"). If you were using Reinforcement Learning to optimize a VLM's captioning head, how would you modify the reward function to prevent this specific "CIDEr-hacking" collapse?
VQA Accuracy and Annotator Variance: A VQA v2 question asks "What color is the car?" 4 humans answered "maroon", 4 humans answered "burgundy", and 2 answered "dark red". The VLM predicts "red". Calculate the exact VQA accuracy score for the VLM's prediction. Explain why evaluating open-ended VQA tasks using exact string-matching against human annotations structurally penalizes VLMs for utilizing a broader, more descriptive vocabulary than the average annotator.
Spurious Correlations vs. Causal Features: A VQA model is trained on a dataset where 98% of questions asking "What is the person riding?" correspond to images containing snow, and the answer is "snowboard." You deploy the model, and when shown an image of a person riding a skateboard on concrete, it predicts "snowboard." Formulate a mathematical argument defining whether the model learned $P(\text{snowboard} \mid \text{riding})$ (a language prior) or $P(\text{snowboard} \mid \text{white background pixels})$ (a visual spurious correlation). Design a specific contrast set of two images that would definitively prove which shortcut the model is exploiting.
IoU Thresholds in Robotics: A VLM grounding model achieves 92% Acc@IoU=0.5 on a benchmark. A robotics team integrates this model to direct a robotic arm to grasp a coffee mug by its handle. During physical trials, the robot repeatedly smashes its gripper into the main body of the mug. Draw a 2D bounding box diagram demonstrating how a predicted bounding box can mathematically achieve an IoU of $0.55$ with the ground truth, yet completely exclude the spatial coordinates of the mug's handle. Why must physical AI systems evaluate grounding models using point-wise precision or Acc@IoU=0.9?
Typographic Attacks on the Joint Space: You apply a typographic attack by pasting a piece of paper reading "STOP" onto a speed limit sign. A CLIP-based zero-shot classifier predicts the image is a stop sign. Using the mathematics of the joint image-text embedding space (from Week 3), explain exactly why the visual encoder $f_\theta$ mapped the raw pixels of the word "STOP" to a vector that achieved a higher cosine similarity with the text embedding for "A photo of a stop sign" than the actual physical geometry of the speed limit sign did.

Solutions

CIDEr hacking. CIDEr weights n-gram matches by TF-IDF, so rare high-IDF nouns dominate the score; a string of correct rare nouns can outscore a fluent sentence that "spends" matches on low-IDF function words. To stop an RL captioner from exploiting this, add a fluency term to the reward — a language-model likelihood, BERTScore, or a grammaticality discriminator — so reward is not pure n-gram overlap.
VQA accuracy. VQA score is $\min(\text{matching annotators}/3,\ 1)$. "red" matches none of maroon/burgundy/dark red, so the score is $0/3 = 0$. Exact string matching penalizes correct-but-differently-worded answers, structurally punishing a richer vocabulary than the annotator consensus.
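Spurious correlation. The two shortcuts make different predictions when the question is held fixed and the image varies: a language prior $P(\text{snowboard} \mid \text{riding})$ yields "snowboard" for every image, whereas a visual shortcut $P(\text{snowboard} \mid \text{white pixels})$ yields "snowboard" only when snow-like pixels are present. Build a contrast set: image A is a skateboard on snow/white background, image B is a snowboard on concrete/dark background. If predictions track the background (snow → "snowboard", concrete → something else) the model learned the visual shortcut; if it answers "snowboard" for both, and for unrelated riding images as well, it learned the language prior; if predictions track the object regardless of background it learned the causal feature. The deployment failure itself (a skateboard on concrete, with no snow in the frame) already points toward the language prior. The two images isolate which cue drives the prediction.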
IoU vs handle. A predicted box shifted onto the mug body can overlap the ground-truth box enough for IoU $0.55$ — the large body dominates both intersection and union — while entirely excluding the thin handle region. IoU rewards bulk-area overlap, not part coverage, so physical grasping (which needs the handle's coordinates) must evaluate with point-wise precision or a high threshold like Acc@IoU=0.9.
Typographic attack. CLIP learned that pixels spelling "STOP" co-occur with stop-sign captions, so its encoder maps the rendered word to a region of the joint space near the "a photo of a stop sign" text embedding — a higher cosine similarity than the actual speed-sign geometry achieves. The model reads text semantically rather than the physical object, so the pasted word overrides the true sign.

Looking ahead

Evaluation reveals the boundaries of what VLMs can naturally perceive and generate. The next question is how to forcefully control what they generate, injecting strict structural constraints into the generative process.

Week 10: ControlNet and Controlled Generation. We examine how structural conditioning (edges, depth maps, pose skeletons) is used to mathematically steer diffusion-based models, how zero-initialized control networks integrate with frozen backbones, and how VLMs serve as high-level semantic controllers that translate language into structured signals for physical action.
