Week 10: Evaluating Generative Models

Purpose of this lecture#

Generative models cannot be evaluated with a single accuracy number, because there is no ground-truth output to compare against. Evaluation instead means measuring the quality and diversity of an entire distribution, which calls for different tools. This lecture examines the main metrics used in practice: bits-per-dimension (BPD) for likelihood-based models, Fréchet Inception Distance (FID) for sample quality and diversity, Inception Score, and the precision-recall framework that separately quantifies fidelity and coverage. Understanding how these metrics fail, and in which regimes they mislead, is as important as knowing how to compute them.

Bits-per-dimension#

Bits-per-dimension (BPD) measures the average number of bits required to encode a test example under the model, in units of bits per input dimension:

$$\text{BPD} = -\frac{\log_2 p_\theta(x)}{D}$$

where $D$ is the number of input dimensions (e.g., $D = 3 \times H \times W$ for an RGB image). BPD is a direct measure of the negative log-likelihood normalized by dimensionality, enabling comparison across different data types and resolutions.
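As a small worked example of the formula above, the sketch below converts a per-example negative log-likelihood into BPD. It assumes the likelihood is reported in nats (natural logarithm), which is common but not universal; the function name `bits_per_dim` and the numeric value are illustrative only.

```python
import math

def bits_per_dim(nll_nats: float, num_dims: int) -> float:
    """Convert a per-example negative log-likelihood (in nats) to bits per dimension."""
    # Dividing by ln(2) converts nats to bits; dividing by D normalizes per dimension.
    return nll_nats / (num_dims * math.log(2.0))

# Hypothetical example: one 32x32 RGB image, so D = 3 * 32 * 32 = 3072.
D = 3 * 32 * 32
nll_nats = 6400.0  # made-up value, for illustration only
print(f"{bits_per_dim(nll_nats, D):.3f} bits/dim")
```

If the model reports an ELBO instead of an exact likelihood, the same conversion applied to the negative ELBO gives an upper bound on BPD rather than the value itself.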

For DDPM-based models, the negative ELBO provides a tractable upper bound on the negative log-likelihood and hence on BPD; the simplified training objective $\mathcal{L}_\text{simple}$ is a reweighted sum of the ELBO terms, so minimizing it does not minimize the bound exactly. For flow models, the exact log-likelihood is computable. For VAEs, the ELBO lower-bounds $\log p_\theta(x)$, so the negative ELBO (converted to bits and divided by $D$) upper-bounds BPD. For GANs, BPD is undefined without a separate density estimation step.

Limitations of BPD: BPD rewards models that assign high likelihood to every test example, and much of that likelihood is spent on pixel-level detail that humans barely perceive. A model can reach low training BPD simply by memorizing training data with minor perturbations, which is why BPD must be reported on held-out data. Conversely, a model that generates highly realistic images may have worse BPD than a model that generates blurry samples but spreads probability mass broadly: log-likelihood heavily penalizes assigning near-zero density to any test example, so covering the whole distribution matters more than sharpness. BPD is therefore not well correlated with human perceptual quality.

Fréchet Inception Distance#

FID (Heusel et al., 2017) measures the distance between the distribution of features extracted from real images and generated images, where the features are from a pretrained Inception-v3 network. The feature distributions are modeled as Gaussians, and FID is the Fréchet distance between these two Gaussians:

$$\text{FID} = \|\mu_r - \mu_g\|^2 + \text{tr}\!\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the real and generated feature distributions. Lower FID indicates more similar feature distributions (better quality).

Computation: (1) extract Inception-v3 features from $N$ real images and $N$ generated images at the final average-pooling layer (2048-dimensional, just before the classification head); (2) fit Gaussian moments (mean and covariance) to each feature set; (3) compute the Fréchet distance. Typically $N = 50{,}000$ samples; smaller $N$ introduces significant statistical bias.
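Steps (2) and (3) can be sketched in a few lines of NumPy, assuming the features have already been extracted into arrays of shape `(N, d)`. The function name `fid_from_features` is illustrative, and the synthetic arrays below merely stand in for Inception-v3 features. The trace of the matrix square root is computed from the eigenvalues of $\Sigma_r \Sigma_g$, which are real and non-negative for positive semi-definite covariances.

```python
import numpy as np

def fid_from_features(feats_r: np.ndarray, feats_g: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (N, d) feature arrays."""
    mu_r, mu_g = feats_r.mean(axis=0), feats_g.mean(axis=0)
    sigma_r = np.cov(feats_r, rowvar=False)  # rows are samples, columns are features
    sigma_g = np.cov(feats_g, rowvar=False)
    # tr((Sigma_r Sigma_g)^{1/2}) = sum of square roots of the eigenvalues of the product.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()  # clip tiny negative round-off
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)

# Synthetic stand-ins for real and generated features (d = 8 instead of 2048).
rng = np.random.default_rng(0)
real_feats = rng.standard_normal((2000, 8))
shifted_feats = real_feats + 3.0  # same covariance, every mean shifted by 3
print(fid_from_features(real_feats, real_feats))     # identical sets: essentially zero
print(fid_from_features(real_feats, shifted_feats))  # pure mean shift: 8 * 3^2 = 72
```

Production implementations typically use a dedicated matrix square root routine and add numerical safeguards for near-singular covariances; this sketch omits them.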

FID properties: FID captures both quality (generated images with artifacts have features that differ from real features) and diversity (a mode-collapsed generator that produces only one type of image has low-variance features, far from the high-variance real distribution). FID also captures the alignment between generated and real distributions: a model that produces realistic but off-distribution images (e.g., a model trained on faces generating alien faces) will have high FID.

FID limitations: the Gaussian approximation to the feature distribution is inaccurate for multimodal distributions. Features from Inception-v3 (trained on ImageNet classification) may not capture aesthetically important properties for specialized domains (medical images, satellite imagery). FID scores are not comparable across evaluation protocols (different $N$, different feature extractors, different real data splits).

FID statistical analysis: sample size effects#

FID computed with finite samples $N$ suffers from variance (randomness in which samples are drawn) and bias (FID is a nonlinear function of the estimated mean and covariance, so its expected value at finite $N$ differs from the true value). Both effects impact published FID values and must be understood for fair comparison.

The standard error of the estimate scales as $O(1/\sqrt{N})$; using $N = 5{,}000$ instead of $N = 50{,}000$ increases it by a factor of $\sqrt{10} \approx 3.16$ (the variance itself grows tenfold). This variability is rarely reported.

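Bias arises because the Fréchet distance is a nonlinear function of the estimated moments. Although $\hat\Sigma$ is an unbiased estimator of $\Sigma$, the matrix square root is nonlinear, so $\mathbb{E}[\hat\Sigma^{1/2}] \neq \Sigma^{1/2}$ (Jensen's inequality). Because the square root is concave, the cross term $\text{tr}((\hat\Sigma_r\hat\Sigma_g)^{1/2})$ tends to be underestimated, and since it enters with a factor of $-2$, the trace term $\text{tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2})$ is overestimated. Sampling noise in the means adds a further positive contribution of order $(\text{tr}\,\Sigma_r + \text{tr}\,\Sigma_g)/N$ to $\|\hat\mu_r - \hat\mu_g\|^2$. A simple sanity check confirms the direction: FID is non-negative, so for a perfect generator (true FID of zero) any finite-sample estimate can only be too large. When $N < d = 2048$, the sample covariance is also rank-deficient, which makes the estimate especially unreliable. The net effect is an upward bias that shrinks as $N$ grows: $\text{FID}(N=2048) > \text{FID}(N=10000) > \text{FID}(N=50000)$ for the same model, with the differences due to bias rather than model quality.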
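The direction of the finite-sample bias can be checked with a small simulation. The sketch below draws two independent samples from the same Gaussian, so the true Fréchet distance is exactly zero and any positive estimate is pure estimation error. The dimension `d = 16`, the sample sizes, and the helper names are illustrative choices, not part of any standard protocol.

```python
import numpy as np

def fid_from_features(feats_r, feats_g):
    """Fréchet distance between Gaussians fitted to two (N, d) feature arrays."""
    mu_r, mu_g = feats_r.mean(axis=0), feats_g.mean(axis=0)
    sigma_r = np.cov(feats_r, rowvar=False)
    sigma_g = np.cov(feats_g, rowvar=False)
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)

def mean_fid_same_distribution(n, d=16, trials=5, seed=0):
    """Average FID estimate between two independent N(0, I) samples of size n."""
    rng = np.random.default_rng(seed)
    estimates = [
        fid_from_features(rng.standard_normal((n, d)), rng.standard_normal((n, d)))
        for _ in range(trials)
    ]
    return float(np.mean(estimates))

# The true distance is 0 in every case; the estimate approaches it only as n grows.
for n in (64, 512, 4096):
    print(n, round(mean_fid_same_distribution(n), 4))
```

The estimates stay above zero and shrink as the sample size increases, which is the pattern that makes FID values at different $N$ incomparable.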

Practical recommendation (Bińkowski et al., 2021): Always use $N \geq 50{,}000$. Many papers use $N = 10{,}000$ (biased) or $N = 2{,}048$ (severely biased). Results at different sample sizes are not comparable. To estimate bias, compute FID at multiple sizes and extrapolate via $\text{FID}(N) \approx \text{FID}(\infty) + C/N$.
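The extrapolation can be sketched as an ordinary least-squares line fit of FID against $1/N$: the intercept estimates $\text{FID}(\infty)$ and the slope estimates $C$. The function name `extrapolate_fid` and the FID values below are made up for illustration; real use would substitute FID scores measured at several sample sizes.

```python
import numpy as np

def extrapolate_fid(sample_sizes, fid_values):
    """Fit FID(N) = FID_inf + C / N by least squares; return (FID_inf, C)."""
    inv_n = 1.0 / np.asarray(sample_sizes, dtype=float)
    # np.polyfit returns coefficients highest degree first: [slope, intercept].
    slope, intercept = np.polyfit(inv_n, np.asarray(fid_values, dtype=float), deg=1)
    return float(intercept), float(slope)

# Hypothetical measurements that follow FID(N) = 4.0 + 10000 / N exactly.
sizes = [5_000, 10_000, 20_000, 50_000]
fids = [6.0, 5.0, 4.5, 4.2]
fid_inf, c = extrapolate_fid(sizes, fids)
print(f"FID(inf) ~ {fid_inf:.3f}, C ~ {c:.1f}")
```

With real measurements the points will not lie exactly on a line, so repeating each FID computation with different random subsets gives a sense of how stable the intercept is.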

Inception Score#

Inception Score (IS) (Salimans et al., 2016) measures two properties using an Inception-v3 classifier: (1) each generated image should be sharply classifiable (the conditional $p(y \mid x)$ should be peaked); (2) generated images should collectively cover many classes (the marginal $p(y) = \int p(y \mid x) p_g(x) dx$ should be uniform):

$$\text{IS} = \exp\!\left(\mathbb{E}_x\!\left[D_\text{KL}(p(y \mid x) \| p(y))\right]\right)$$

Higher IS indicates sharper, more diverse outputs. IS evaluates only the generated distribution, not its relationship to real data, and it is blind to diversity within a class: a model that produces one perfectly sharp image per class attains a near-maximal score. IS has largely been replaced by FID in modern evaluations because FID directly compares generated and real distributions.
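The IS formula can be sketched directly from a matrix of class probabilities, assuming each row holds $p(y \mid x)$ for one generated image as produced by some classifier. The function name `inception_score` and the toy probability matrices are illustrative; a real evaluation would use Inception-v3 softmax outputs.

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """probs: (N, K) array whose rows are p(y | x) for each generated sample."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal class distribution p(y)
    # Per-sample KL(p(y|x) || p(y)); eps guards against log(0).
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

K = 10
sharp_and_diverse = np.eye(K)                   # one confident sample per class
single_class = np.tile(np.eye(K)[0], (K, 1))    # every sample is class 0
uncertain = np.full((K, K), 1.0 / K)            # every prediction is uniform

print(inception_score(sharp_and_diverse))  # close to K, the maximum
print(inception_score(single_class))       # close to 1, the minimum
print(inception_score(uncertain))          # close to 1, the minimum
```

The first case also illustrates the blind spot: ten samples, one per class, reach the maximum score even though each class is represented by a single image.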

Precision and recall for generative models#

Precision and recall (Kynkäänniemi et al., 2019; Sajjadi et al., 2018) separately quantify the two properties that FID folds into a single number, fidelity and diversity:

Precision measures sample quality (what fraction of generated samples fall within the support of the real distribution):

$$\text{Precision} = \frac{1}{|Y_g|} \sum_{y \in Y_g} \mathbb{1}[y \in \text{support}(Y_r)]$$

Recall measures coverage (what fraction of the real distribution is covered by the generated distribution):

$$\text{Recall} = \frac{1}{|Y_r|} \sum_{y \in Y_r} \mathbb{1}[y \in \text{support}(Y_g)]$$

Support membership is approximated using $k$-nearest-neighbor balls in the Inception feature space: a point counts as inside the support of a set if it falls within the ball around any member of that set, where each ball's radius is the distance from that member to its $k$-th nearest neighbor. A GAN with mode collapse has high precision (all generated samples look realistic) but low recall (only a small subset of real modes are covered). A VAE typically has low precision (blurry samples outside the sharp image manifold) but higher recall (covers more of the data distribution).
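A minimal NumPy sketch of the $k$-nearest-neighbor construction is shown below, assuming features are already available as `(N, d)` arrays. The helper names, the choice `k = 5`, and the two-cluster toy data are illustrative assumptions; the brute-force distance matrix is only practical for small sample counts.

```python
import numpy as np

def pairwise_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Euclidean distances between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def in_support(queries: np.ndarray, refs: np.ndarray, k: int) -> np.ndarray:
    """True for each query lying inside at least one reference k-NN ball."""
    # After sorting, column 0 is the self-distance (0), so column k is the k-th neighbor.
    radii = np.sort(pairwise_dist(refs, refs), axis=1)[:, k]
    return (pairwise_dist(queries, refs) <= radii[None, :]).any(axis=1)

def precision_recall(real: np.ndarray, gen: np.ndarray, k: int = 5):
    precision = in_support(gen, real, k).mean()  # generated samples inside real support
    recall = in_support(real, gen, k).mean()     # real samples inside generated support
    return float(precision), float(recall)

# Toy "mode collapse": real data has two clusters, the generator covers only one.
rng = np.random.default_rng(0)
real = np.concatenate([rng.normal(0.0, 1.0, (100, 2)), rng.normal(10.0, 1.0, (100, 2))])
collapsed = rng.normal(0.0, 0.3, (200, 2))
print(precision_recall(real, collapsed, k=5))  # high precision, low recall
```

The collapsed generator scores well on precision because its samples sit inside a real cluster, while recall is low because the second cluster and the outskirts of the first are never reached.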

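Density and coverage (Naeem et al., 2020) modify this framework to be more robust to outliers. Density replaces the binary precision test with a count: for each generated sample, it counts how many real-sample $k$-nearest-neighbor balls contain it, then normalizes by $k$ and by the number of generated samples. Coverage is the fraction of real samples whose $k$-nearest-neighbor ball contains at least one generated sample, so the balls are always built around real data rather than around possibly outlying generated samples.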
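As a rough sketch of density and coverage under the same brute-force assumptions (small arrays of pre-extracted features, an illustrative `k = 5`, hypothetical function names), both quantities come from a single membership matrix that records which generated samples fall inside which real-sample balls.

```python
import numpy as np

def pairwise_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Euclidean distances between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def density_coverage(real: np.ndarray, gen: np.ndarray, k: int = 5):
    # k-NN radius of each real sample (column 0 after sorting is the self-distance).
    radii = np.sort(pairwise_dist(real, real), axis=1)[:, k]
    # inside[j, i] is True when generated sample j lies in the ball of real sample i.
    inside = pairwise_dist(gen, real) <= radii[None, :]
    density = inside.sum() / (k * gen.shape[0])  # average ball count per sample, over k
    coverage = inside.any(axis=0).mean()         # real balls hit by some generated sample
    return float(density), float(coverage)

rng = np.random.default_rng(0)
real = rng.standard_normal((300, 4))
print(density_coverage(real, rng.standard_normal((300, 4))))  # matched distributions
print(density_coverage(real, real + 1000.0))                  # disjoint: (0.0, 0.0)
```

Unlike precision, density is not capped at 1: generated samples concentrated in dense real regions fall inside many balls at once.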

Conditional evaluation: TIFA, VQAScore, and DrawBench#

Text-to-image models present a distinct evaluation challenge: the model must generate images that are not only realistic but also faithful to the text prompt. Metrics like CLIP Score measure semantic alignment but fail to capture compositional properties such as counting, spatial relations, and attribute binding. Several recent benchmarks directly address these failures.

| TIFA | VQAScore | DrawBench |
| --- | --- | --- |
| Faithfulness evaluation via QA: decomposing prompts into factual questions and using VQA models to verify the generated image. | Calculating the probability that a VQA model (e.g., LLaVA) answers "Yes" to a prompt-derived question about the image. | A comprehensive dataset of 200 prompts specifically designed to test compositional reasoning and attribute binding. |

TIFA (Hu et al., 2023): Text-to-Image Faithfulness evaluation with Question Answering decomposes faithfulness evaluation into verifiable facts. The pipeline is: (1) use a large language model (e.g., GPT-3.5) to decompose the text prompt into a list of factual questions about the image (e.g., "Is there a red car?" "How many cars are in the image?" "Is the car to the left of the tree?"); (2) generate an image from the prompt using the model under evaluation; (3) use a visual question answering (VQA) model to answer each question about the generated image; (4) compute TIFA score as the fraction of questions answered correctly. This directly measures compositional faithfulness and avoids the bag-of-words limitation of CLIP embeddings. For example, on the prompt "a red car to the left of a blue truck," TIFA asks "Is the car red?", "Is there a truck?", "Is the car on the left?" and "Is the truck blue?", then checks whether a VQA model answers all four correctly. A model that generates a blue car and red truck would fail multiple questions despite having high CLIP Score.
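Step (4) of the pipeline reduces to a simple fraction once the questions and VQA answers exist. The sketch below hard-codes both for the running example; in practice the questions would come from the language model and the answers from the VQA model, neither of which is called here. The function name `tifa_score` and the answer dictionaries are hypothetical.

```python
def tifa_score(expected: dict, predicted: dict) -> float:
    """Fraction of questions whose VQA answer matches the expected answer."""
    if not expected:
        raise ValueError("need at least one question")
    correct = sum(
        predicted.get(question, "").strip().lower() == answer.strip().lower()
        for question, answer in expected.items()
    )
    return correct / len(expected)

# Questions for the prompt "a red car to the left of a blue truck".
expected = {
    "Is the car red?": "yes",
    "Is there a truck?": "yes",
    "Is the car on the left?": "yes",
    "Is the truck blue?": "yes",
}
# Hypothetical VQA answers for an image showing a blue car left of a red truck.
swapped_colors = {
    "Is the car red?": "no",
    "Is there a truck?": "yes",
    "Is the car on the left?": "yes",
    "Is the truck blue?": "no",
}
print(tifa_score(expected, swapped_colors))  # 2 of 4 questions pass
```

The swapped-color image contains every object named in the prompt, yet it fails both attribute questions, which is the kind of error an embedding-similarity score tends to miss.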

VQAScore (Lin et al., 2024) simplifies this further by asking a single VQA model a direct yes/no question: "Does this image show [prompt]?" and using the model's confidence for "Yes" as a score between 0 and 1. While less fine-grained than TIFA, VQAScore is more robust because it does not rely on prompt decomposition and is directly validated against human preference data. VQAScore correlates better with human judgments of prompt adherence than CLIP Score on challenging compositional prompts.

DrawBench and PartiPrompts: Rather than metrics, these are curated prompt sets designed to stress-test specific model capabilities. DrawBench contains 200 diverse prompts spanning abstract concepts, specific objects, spatial relationships, and styles. PartiPrompts, released alongside the Parti model, contains over 1,600 prompts, each labeled with a content category and a challenge aspect (for example basic object presence, fine-grained detail, quantity, or complex compositions). Models are ranked using human A/B preference testing on these prompt sets, providing a clearer picture of failure modes.

GenEval (Ghosh et al., 2023) provides an automated benchmark that generates images from highly structured prompts (object type, color, count, spatial position) and uses object detectors and instance segmentation to verify whether the generated image matches the specification. For example, given a prompt "a red car and two blue cars," GenEval runs an object detector to find the cars, classifies the color of each detected instance, counts instances, and verifies spatial relations (if specified). This is more precise than VQA but requires translation of natural language prompts into structured specifications.

Key insight: CLIP Score alone is insufficient for evaluating text-to-image models on compositional tasks because CLIP embeddings are trained for semantic similarity, not fine-grained attribute and spatial verification. A model can generate high-CLIP-Score images that violate counting, spatial arrangement, or attribute binding constraints. Comprehensive evaluation of text-to-image models therefore requires combining CLIP Score (for semantic coverage) with VQA-based metrics (for compositional accuracy) and human preference data (for overall quality).

CLIP Score and human evaluation#

CLIP Score measures the alignment between a generated image and its text prompt using CLIP embeddings:

$$\text{CLIP-S}(x, c) = \cos(\text{CLIP}_I(x), \text{CLIP}_T(c))$$

A higher CLIP Score indicates that the image semantically matches the prompt. CLIP Score is particularly useful for evaluating text-to-image models where alignment with the conditioning text is a primary objective. However, CLIP Score is bounded by CLIP's own semantic understanding and will fail to capture alignment for prompts that describe fine spatial arrangements, counting, or nuanced attribute binding.
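Given an image embedding and a text embedding, the score is a plain cosine similarity. The sketch below assumes the two embeddings have already been produced by a CLIP image encoder and text encoder; the short vectors are made-up stand-ins and the function name `clip_score` is illustrative.

```python
import numpy as np

def clip_score(image_emb, text_emb) -> float:
    """Cosine similarity between an image embedding and a text embedding."""
    x = np.asarray(image_emb, dtype=float)
    c = np.asarray(text_emb, dtype=float)
    return float(x @ c / (np.linalg.norm(x) * np.linalg.norm(c)))

# Made-up 2-dimensional embeddings, for illustration only.
print(clip_score([3.0, 4.0], [4.0, 3.0]))  # 24 / 25 = 0.96
print(clip_score([1.0, 0.0], [0.0, 1.0]))  # orthogonal embeddings: 0.0
```

Because the score depends only on the angle between two vectors, two prompts that map to nearly the same text embedding (for example with swapped attributes) receive nearly the same score for a given image.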

Human evaluation remains the gold standard for perceptual quality. Methods include Elo ratings (derived from pairwise comparisons between models), Likert-scale ratings (quality scores 1–5), and prompt alignment surveys (does the image match this description?). Human evaluation is expensive, slow, and has high inter-rater variance, but it captures qualities that automatic metrics miss: aesthetic appeal, prompt adherence for complex scenes, and artifact detection.
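To make the pairwise-comparison approach concrete, the sketch below applies the standard Elo update after one human judgment between two models. The step size `k = 32` and the 400-point scale are the conventional chess defaults, used here as assumptions; an actual evaluation may choose different constants or fit ratings to all comparisons at once.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update two ratings after one comparison.

    score_a is 1.0 if model A's image was preferred, 0.0 if B's was, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two models start level; a rater prefers model A's image.
print(elo_update(1500.0, 1500.0, score_a=1.0))  # (1516.0, 1484.0)
```

An upset moves ratings more than an expected result, so a few hundred comparisons are usually enough for the ordering of clearly different models to settle, although the exact number depends on rater agreement.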

RLHF-based reward models trained on human preference data can serve as proxies for human evaluation in differentiable training pipelines. ImageReward, LAION Aesthetics score, and PickScore are trained reward models that predict human preference and are used both as evaluation metrics and as fine-tuning objectives.

Evaluation for robotics and embodied generation#

When generative models synthesize robot training data (sim2real transfer, domain randomization) or plan actions (diffusion policies), evaluation must measure task-relevant fidelity rather than aesthetic quality. FID on ImageNet features may be irrelevant if downstream performance depends on lighting, material properties, or object geometry accuracy.

Task success rate is the gold standard: train a policy on generated data and measure real-world task completion rate. If data from a model with poor (high) FID yields 95% task success while data from a model with good (low) FID yields 80%, the high-FID model is the more valuable one for this task. This is expensive (robot hours), but directly measures transfer.

Domain gap metrics avoid real-world deployment: train a binary domain classifier to distinguish real from generated observations. Classifier accuracy $\approx 50\%$ indicates minimal domain gap; $\approx 95\%$ indicates large gap. Compute this on task-relevant features (e.g., from the downstream grasp detection network) rather than Inception features.

Controllability vs. realism: ControlNet-style pose-controllable generation may be more useful than a more realistic (lower-FID) diffusion model with no pose control. For domain randomization, a low-FID model with random backgrounds is less useful than a higher-FID model with exact pose specification. Evaluation should report both realism and controllability.

Diffusion policies: For action generation, evaluation metrics are task success rate and trajectory smoothness (low acceleration), not FID. Generation quality is entirely measured by downstream task performance.

Metric failure modes#

Every metric can be exploited or can mislead. Common failure modes:

FID overfitting: a model that memorizes training data will achieve near-zero FID when the reference statistics come from the training set, but a much higher FID against held-out real images. FID should always be computed against evaluation-set real images, not training-set images.

Mode dropping vs. mode averaging: FID penalizes both mode dropping (missing modes → high $\|\mu_r - \mu_g\|$) and mode averaging (generating mean-like blurry images → low $\Sigma_g$). However, a model that drops half the modes while generating the other half perfectly may achieve similar FID to a model that covers all modes blurrily. Recall is needed to distinguish these cases.

CLIP Score vs. visual quality: high CLIP Score does not imply high visual quality. A model can generate high-CLIP-Score images with significant visual artifacts if the text-image alignment is strong but the image realism is poor. CLIP Score should be used jointly with FID.

Evaluation sample size: FID with $N < 10{,}000$ samples has high variance and systematically overestimates the true FID (upward bias) because the Gaussian moments are fitted from finite samples. Published results should state the sample size and report confidence intervals on FID.

GenAI context#

Evaluation methodology for generative models is an active research problem that mirrors challenges across all machine learning domains. The difficulty arises not from computing any single metric, but from the fundamental problem that no single metric captures all aspects of model quality. This problem manifests distinctly in different domains, but the underlying principle is universal: Goodhart's Law.

| Evaluation concept | Course 3 (Generative Models) | Course 1 (RL) | Course 2 (Robotics) | Course 4 (VLMs) |
|---|---|---|---|---|
| Gold standard | Human preference (Elo) | Expected return $J(\pi) = \mathbb{E}[\sum_t r_t]$ | Task success rate on physical robot | Human VQA accuracy |
| Proxy metric | FID (feature distribution) | Surrogate reward model | Sim task success rate | BLEU, CIDEr, SPICE |
| Diversity measure | Recall (coverage of real modes) | State visitation entropy $H(\rho^\pi)$ | Coverage of training distribution | Diversity of generated captions |
| Goodhart's Law failure | Optimizing FID → texture collapse on Inception features | Reward hacking (C1W12) | Optimizing sim reward → sim2real gap | Optimizing BLEU → degenerate captions |
| Sample efficiency | FID needs $N \geq 50k$ samples | Policy evaluation variance: $O(1/(1-\gamma)^2 N)$ | Behavior cloning sample complexity (C2W5) | VQA evaluation: hundreds of task-specific questions |

Goodhart's Law and metric optimization: Directly optimizing FID causes the model to match Inception-v3 feature statistics without regard for perceptual realism — generating unusual color statistics, unnatural textures, or artifacts invisible to Inception but obvious to humans. The same principle applies in RL (reward hacking from learned proxies) and robotics (sim2real gap from simulator overfitting).

Mitigation via KL regularization: Add constraints to stay close to a reference. In RLHF: $r_\text{adjusted} = r_\theta(x) - \beta D_\text{KL}(\pi \| \pi_\text{ref})$. For generative models, combine the likelihood objective with metric and diversity terms, for example $\mathcal{L} = -\log p_\theta(x) + \lambda_\text{FID} \cdot \text{FID}(\theta) - \lambda_\text{div} \cdot \text{Diversity}(\theta)$, where the diversity term enters with a negative sign so that minimizing $\mathcal{L}$ rewards diversity. Robust optimization requires multiple metrics and explicit constraints, not single-metric optimization.
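The RLHF-style adjusted reward can be illustrated on a toy discrete distribution. The sketch below assumes the policy and the reference are both categorical distributions over the same small set of outputs, which is a simplification of the sequence-level or trajectory-level KL used in practice; the function names and the value `beta = 0.1` are illustrative.

```python
import numpy as np

def kl_divergence(p, q, eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def kl_penalized_reward(reward: float, policy_probs, ref_probs, beta: float = 0.1) -> float:
    """Proxy reward minus beta times the divergence from the reference model."""
    return reward - beta * kl_divergence(policy_probs, ref_probs)

ref = [0.9, 0.1]
print(kl_penalized_reward(1.0, ref, ref))         # no drift, so no penalty
print(kl_penalized_reward(1.0, [0.5, 0.5], ref))  # drift from the reference is penalized
```

A larger `beta` keeps the fine-tuned model closer to the reference at the cost of slower improvement on the proxy reward, so it is usually tuned alongside held-out human evaluation.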

Key takeaways#

BPD measures log-likelihood per dimension; it correlates poorly with perceptual quality. FID captures both quality and diversity but is biased upward at finite sample sizes, noticeably so for $N < 50{,}000$, so scores at different $N$ are not comparable. Precision and recall separately measure sample quality and coverage. For text-to-image models, CLIP Score is insufficient; VQA-based metrics (TIFA, VQAScore) measure compositional accuracy. For robotics, task success rate is the gold standard. Human evaluation is the gold standard for perceptual quality but is expensive. All metrics have failure modes; robust evaluation requires multiple metrics, and optimizing against a metric requires KL regularization to guard against Goodhart's Law failures.

Conceptual questions#

A model achieves FID = 3 on COCO, computed with 50,000 generated samples against 50,000 real samples. A second evaluation uses 5,000 samples of each and reports FID = 5 for the same model. Explain the statistical bias that causes the FID to increase with smaller sample size. Derive qualitatively how the finite-sample error in the covariance estimate $\hat\Sigma_g$ affects the Fréchet distance term $\text{tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2})$.
A text-to-image model trained on LAION-5B achieves Precision = 0.85 and Recall = 0.45 on COCO-30K. Interpret these numbers: what does Precision = 0.85 tell you about the quality of generated images? What does Recall = 0.45 tell you about the diversity? What intervention (guidance scale, architecture change, training data) would most effectively improve Recall without sacrificing Precision?
A researcher trains a DDPM on a dataset of 1,000 face images and achieves FID = 2 on the training set but FID = 50 on a held-out test set. Propose a statistical test that would distinguish between (a) overfitting (memorization of training images) and (b) a distribution mismatch between train and test sets. What training regularization would you add to reduce overfitting in this small-data regime?
CLIP Score measures cosine similarity between image and text embeddings. Identify three properties of generated images that CLIP Score would fail to penalize: (a) one related to counting objects, (b) one related to spatial arrangement, and (c) one related to fine-grained physical realism. For each, propose an alternative automatic metric that would detect the failure.
ImageReward is a reward model trained on human pairwise preferences. If used as a fine-tuning objective (reward-weighted imitation learning on the generated images humans prefer), describe the Goodhart's Law failure mode that occurs when the model is optimized directly for ImageReward score. What safeguard (analogous to the KL penalty in RLHF for language models) would you add to prevent this failure?

Solutions

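FID is a nonlinear function of the sample moments, so plugging in finite-sample estimates gives a biased result even though $\hat\mu$ and $\hat\Sigma$ are individually unbiased. Sampling noise inflates $\|\hat\mu_r-\hat\mu_g\|^2$ by roughly $(\text{tr}\,\Sigma_r+\text{tr}\,\Sigma_g)/N$, and because the matrix square root is concave ($\mathbb{E}[\hat\Sigma^{1/2}]\neq\Sigma^{1/2}$, Jensen), the cross term $\text{tr}((\hat\Sigma_r\hat\Sigma_g)^{1/2})$ tends to be underestimated; since it enters $\text{tr}(\Sigma_r+\Sigma_g-2(\Sigma_r\Sigma_g)^{1/2})$ with a factor of $-2$, the trace term is overestimated. Both effects shrink roughly as $1/N$, so FID$(5000)>$ FID$(50000)$ is an artifact of sample size, not a worse model. With $N<d=2048$ the sample covariance $\hat\Sigma_g$ would additionally be rank-deficient, making the estimate even less reliable.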
Precision $0.85$ : 85% of generated images lie within the real-data manifold — high quality, few artifacts. Recall $0.45$ : the generator covers only ~45% of the real distribution's diversity — significant mode dropping. The most effective single lever to raise recall without hurting precision is lowering the guidance scale (high CFG trades diversity for fidelity); secondary options are more diverse training data or reduced truncation.
(a) Nearest-neighbor test: for each generated sample find its nearest training image in feature space; if distances are near-zero (near-duplicates) the model is memorizing. Compare the generated→train distance distribution to train→train. (b) Disambiguate from distribution mismatch by also computing FID between the real test set and the train set: if that is also $\approx 50$ , train and test simply differ in distribution; if it is small, the high generated-test FID indicates overfitting. Regularize small-data training with aggressive (differentiable) augmentation / ADA, weight decay, a smaller model, and early stopping.
(a) Counting — "three cats" rendered with two; CLIP's bag-of-concepts ignores count. Detect with GenEval/object-detector counts or a VQA "how many cats?" question. (b) Spatial arrangement — "cube left of sphere" with positions swapped; CLIP is largely relation-insensitive. Detect with TIFA/VQAScore spatial questions or GenEval position checks. (c) Fine physical realism — extra fingers / broken anatomy; CLIP semantic match stays high. Detect with FID, human evaluation, or a dedicated artifact/anatomy detector.
Optimizing ImageReward directly causes reward hacking: the model exploits idiosyncrasies of the reward model (oversaturation, RM-favored compositions), producing high-reward but degenerate, low-diversity images. Safeguard: a KL penalty to the pretrained reference, $r_\text{adj}=\text{ImageReward}-\beta\,D_\text{KL}(p_\theta\|p_\text{ref})$ (exactly the RLHF trick), keeping the fine-tuned model close to the base distribution — plus reward-model ensembling and early stopping.

Looking ahead#

Model evaluation closes the loop on the generation pipeline. The next lectures pivot from generating data to using generative models for downstream tasks.

Week 11: Representation Learning with Generative Models. We examine how generative pretraining produces powerful internal representations — from masked autoencoders to diffusion model feature maps — and how these representations accelerate downstream perception, robotics, and planning tasks.
