Purpose of this lecture
Thirteen weeks assembled a complete generative modeling curriculum: probabilistic foundations, VAEs, GANs, EBMs, normalizing flows, DDPM, flow matching, conditioning, latent diffusion, evaluation, representation learning, world models, and safety. Each lecture addressed a specific component of the generative modeling pipeline.
This final lecture addresses the harder question: what does it mean to build and evaluate a generative system with scientific rigor? A compelling demo is not evidence of capability; a single FID number is not evidence of general quality; an ablation that was not pre-registered is not evidence of necessity. This lecture develops the experimental discipline that separates published research from exploratory tinkering — and that determines whether a deployed system will perform as reported when exposed to the full distribution of real-world inputs.
The generative pipeline as a system
A generative system is not a sequence of independent design choices — it is an interconnected system where decisions in one component propagate to others in ways that are often nonlinear and non-obvious.
Data Pipeline Decisions
Data pipeline decisions determine what distribution the model will learn. Deduplication rates affect memorization; image resolution affects spatial fidelity; caption quality for text-to-image models determines semantic grounding. A model cannot learn structure that is absent from or inconsistent in the training data, no matter the architecture.
Architecture Decisions
Architecture decisions determine the model's capacity and inductive biases. A diffusion transformer (DiT) operating on image patches has no inherent spatial locality bias — it must learn spatial structure from data. A convolutional U-Net has strong locality bias built in, which accelerates learning but may limit long-range coherence. The right architecture depends on the dataset scale and the downstream use case.
Training Decisions
Training decisions determine whether the architecture's capacity is utilized. Noise schedule, learning rate schedule, batch size, EMA coefficient, and gradient clipping all affect the resulting model. The interaction between noise schedule and model capacity is particularly important: a model undertrained at high noise levels will produce globally incoherent samples even if reconstruction at low noise levels is sharp.
Inference Decisions
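Inference decisions determine sample quality for the deployed model. Guidance scale, number of sampling steps, and sampler choice (DDIM, DPM-Solver, flow ODE) all affect the fidelity-diversity tradeoff. The classifier-free guidance dropout rate also matters, but it is fixed at training time: it determines how well the unconditional branch is learned and therefore how the model responds to guidance at inference. The optimal inference configuration is task-dependent.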
Experimental methodology
The scientific method for generative model research requires treating experimental design as carefully as model design.
Controlled Comparisons
Controlled comparisons: changing more than one variable between compared systems prevents attribution of performance differences to any single cause. The gold standard is an ablation study: start from the full system, remove one component at a time, and measure the performance impact of each removal. A well-designed ablation answers: which components are necessary (large performance drop when removed), which are redundant (small performance drop), and which interact (removing A and B together costs more, or less, than the sum of the costs of removing each alone). Detecting interactions therefore requires at least one joint removal in addition to the single-component removals.
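The interaction criterion can be made concrete with a little arithmetic. The sketch below assumes a lower-is-better metric such as FID; the function name `ablation_effects` and all numbers are hypothetical and serve only to illustrate the calculation.

```python
def ablation_effects(full, without_a, without_b, without_both):
    """Per-component costs and interaction term for a lower-is-better metric (e.g. FID)."""
    cost_a = without_a - full          # degradation from removing A alone
    cost_b = without_b - full          # degradation from removing B alone
    cost_both = without_both - full    # degradation from removing A and B together
    # Positive interaction: the joint removal hurts more than the two single removals combined.
    interaction = cost_both - (cost_a + cost_b)
    return {"cost_a": cost_a, "cost_b": cost_b,
            "cost_both": cost_both, "interaction": interaction}


# Hypothetical FID values, for illustration only.
effects = ablation_effects(full=6.0, without_a=7.0, without_b=8.0, without_both=12.0)
print(effects)
```

An interaction term near zero suggests the two components contribute independently; a large positive or negative value suggests that single-component removals alone would misattribute their importance.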
Reproducibility Requirements
Reproducibility requirements: fixed random seeds for all stochastic evaluations; exact specification of the evaluation protocol (number of samples, FID implementation, guidance scale); checkpoint identification (training step, effective batch size, hyperparameters). Results without these specifications cannot be reproduced, compared across papers, or validated.
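One lightweight way to enforce these requirements is to record the evaluation protocol as a single immutable object and attach its hash to every reported number. The following is a minimal sketch using only the standard library; the class name `EvalProtocol`, its field names, and the example values are illustrative assumptions, not a prescribed format.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvalProtocol:
    """Everything needed to rerun an evaluation; field names are illustrative."""
    checkpoint_step: int
    effective_batch_size: int
    num_samples: int
    fid_implementation: str   # name and version of the FID code that was used
    guidance_scale: float
    sampler: str
    num_sampling_steps: int
    seed: int

    def fingerprint(self) -> str:
        """Short stable hash to attach to every reported number."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


# Hypothetical configuration values.
protocol = EvalProtocol(
    checkpoint_step=400_000, effective_batch_size=256, num_samples=50_000,
    fid_implementation="my-fid-script v1", guidance_scale=3.0,
    sampler="ddim", num_sampling_steps=50, seed=0,
)
print(protocol.fingerprint())
```

Two results with different fingerprints were produced under different protocols and should not be compared directly.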
Baseline Selection
Baseline selection: a result is only meaningful relative to what was already possible. The appropriate baselines are: (1) a simple baseline that requires minimal implementation (a VAE or a DDPM with standard settings); (2) the best published method on the same benchmark; (3) the ablated version of the proposed system that isolates the novel contribution. A method that outperforms only weak baselines provides weak evidence.
Avoiding Cherry-Picking
Cherry-picking — selecting the best samples for visual presentation — is the most common source of misleading results in generative model papers. Quantitative metrics (FID, precision/recall) computed on random samples rather than curated selections provide protection against cherry-picking, but even quantitative metrics can be optimized specifically during development while hiding failures in other evaluation dimensions.
Scaling laws for generative models
Performance improvements in generative models follow predictable power-law relationships that guide resource allocation and enable informed architectural decisions.
Neural Scaling Laws
Neural scaling laws (Hoffmann et al., 2022; Chen et al., 2023 for diffusion) describe how performance metrics improve predictably with model scale, data scale, and compute. For language models, the Chinchilla scaling law provides: $L(N, D) = E + A / N^{\alpha} + B / D^{\beta}$, where $N$ is the number of parameters, $D$ is the number of training tokens, $E$ is the irreducible loss, and $A$, $B$, $\alpha$, $\beta$ are fitted constants. This law reveals that performance depends on both model and data scale in a balanced way, not dominated by either alone.
Diffusion Model Scaling
Diffusion model scaling exhibits similar power-law relationships. FID scales approximately as $N^{-a}$ with model parameters $N$ (holding data and compute fixed), and as $D^{-b}$ with training data $D$, for positive exponents $a$ and $b$. These exponents are empirical observations from large-scale training studies that guide resource allocation: doubling parameters and doubling data have different returns depending on the current operating point. A model operating in the parameter-limited regime (trained to convergence on fixed data) benefits more from additional parameters, while a model in the data-limited regime (parameters far from saturation) benefits more from additional training examples.
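Power-law exponents of this kind are usually estimated by a straight-line fit in log-log space. The sketch below is a minimal pure-Python version; the function name `fit_power_law` is hypothetical, and the data are synthetic and noise-free (generated from a known exponent), not measurements from any study.

```python
import math


def fit_power_law(xs, ys):
    """Fit y ≈ c * x**(-k) by ordinary least squares in log-log space; returns (c, k)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    count = len(xs)
    mean_x = sum(lx) / count
    mean_y = sum(ly) / count
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic data generated from FID = 200 * N**(-0.2); not real measurements.
params = [1e7, 3e7, 1e8, 3e8, 1e9]
fids = [200.0 * p ** -0.2 for p in params]
c, k = fit_power_law(params, fids)
print(f"c = {c:.1f}, exponent k = {k:.3f}")
```

A pure power law has no floor. If the metric saturates at some irreducible value, the log-log fit will underestimate the exponent unless that floor is subtracted or fitted as an extra parameter.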
Compute-Optimal Frontier
Compute-optimal frontier (Kaplan scaling law; Chinchilla refinement): for a fixed FLOPs budget $C$, the optimal parameter count $N^{*}$ and dataset size $D^{*}$ satisfy $N^{*} \propto C^{0.5}$ and $D^{*} \propto C^{0.5}$. This implies equal allocation to model and data: spending all compute on a very large model trained for few steps (overparameterized regime), or on a small model trained for many steps (overtrained regime), both underperform the balanced allocation. For diffusion models, this principle is particularly important for text-to-image systems, where the cost of data is training compute, not acquisition.
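The balanced allocation can be derived in closed form from a parametric loss of the form $L(N, D) = E + A / N^{\alpha} + B / D^{\beta}$ under a compute constraint. The sketch below assumes the common transformer approximation $C \approx 6ND$ (an assumption brought in for illustration, and one that need not hold for diffusion backbones) and uses made-up coefficients rather than fitted values.

```python
def loss(n_params, n_tokens, E, A, B, alpha, beta):
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


def compute_optimal(C, A, B, alpha, beta, flops_per_param_token=6.0):
    """Minimise L(N, D) subject to C = flops_per_param_token * N * D (closed form)."""
    budget = C / flops_per_param_token                      # equals N * D
    g = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_opt = g * budget ** (beta / (alpha + beta))
    d_opt = budget / n_opt
    return n_opt, d_opt


# Illustrative coefficients only (not fitted values); alpha == beta gives N*, D* ∝ C**0.5.
coef = dict(A=400.0, B=400.0, alpha=0.3, beta=0.3)
for C in (1e18, 1e20, 1e22):
    n_opt, d_opt = compute_optimal(C, **coef)
    print(f"C={C:.0e}  N*={n_opt:.3e}  D*={d_opt:.3e}")
```

The closed form follows from substituting $D = C / (6N)$ into the loss and setting the derivative with respect to $N$ to zero. When the two exponents are equal, a 100× increase in compute raises both the optimal parameter count and the optimal dataset size by 10×.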
Practical Implications
Practical implication for generative model research: before proposing a new architecture, establish the scaling baseline. The critical question is whether the proposed method improves FID at the same parameter count as the baseline, or achieves the same FID with fewer parameters, and whether that advantage holds across scales. A single-point comparison at one fixed parameter count is uninformative if the scaling curves differ, because the curves may cross; comparing full scaling curves is what enables proper architectural attribution.
Saturation and Emergence
Saturation and capability emergence: at certain scale thresholds, qualitatively new capabilities emerge that are absent at smaller scales. Text-to-image models show emergent prompt binding — the ability to correctly associate multiple attributes with multiple objects (e.g., "a red sphere and a blue cube") — only above approximately 1B parameters. Video diffusion models show temporal coherence emergence around 10B parameters, where the model transitions from generating frame-by-frame incoherent motion to smooth, continuous trajectories. These emergence points are not predicted by smooth scaling laws and require empirical verification at scale, making early-stage scaling studies essential for long-term research planning.
Ablation design
A minimal ablation set for a text-to-image latent diffusion system should answer:
Latent Compression
Is the latent compression necessary? Compare pixel-space diffusion at the same compute budget. This quantifies the efficiency gain of LDM compression.
Conditioning Mechanism
Is the conditioning mechanism necessary? Compare CLIP text conditioning vs. no conditioning vs. a simple class label. This quantifies the semantic grounding provided by the conditioning.
Classifier-Free Guidance
Is CFG necessary at the chosen guidance scale? Compare three settings: no guidance, the standard guidance scale, and a high guidance scale. This quantifies the fidelity-diversity tradeoff and identifies the guidance scale where the improvement saturates.
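For reference, the guidance scale enters sampling through a linear combination of the conditional and unconditional noise predictions. The sketch below uses one common convention, stated in the docstring; other papers parameterise the scale differently, so the convention here is an assumption and the toy vectors are hypothetical.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: move from the unconditional toward the conditional prediction.

    Convention assumed here: w = 0 is unconditional, w = 1 is purely conditional
    (no guidance), and w > 1 amplifies the conditioning signal.
    """
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions for a 3-dimensional sample (hypothetical numbers).
eps_u = [0.0, 1.0, -1.0]
eps_c = [1.0, 1.0, 0.0]
for w in (1.0, 3.0, 7.5):
    print(w, cfg_combine(eps_u, eps_c, w))
```

Because the convention differs between codebases, an ablation report should state which formula the guidance scale refers to.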
Noise Schedule
Is the noise schedule optimal? Compare linear vs. cosine schedules. This quantifies the effect of schedule design on sample quality at different generation budgets.
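To make the comparison concrete, the two schedules can be expressed through their cumulative signal level (often written as alpha-bar). The sketch below assumes the widely used defaults: a linear beta range of 1e-4 to 0.02 and a cosine offset of 0.008. These values are assumptions for illustration, and the function names are hypothetical.

```python
import math


def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level for a linear beta schedule."""
    alpha_bar, out = 1.0, []
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        alpha_bar *= 1.0 - beta
        out.append(alpha_bar)
    return out


def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal level for the cosine schedule, normalised so the level at step 0 is 1."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t + 1) / f(0) for t in range(T)]


T = 1000
lin, cos_sched = linear_alpha_bar(T), cosine_alpha_bar(T)
for t in (0, 249, 499, 749, 999):
    print(f"t={t + 1:4d}  linear={lin[t]:.4f}  cosine={cos_sched[t]:.4f}")
```

The printout shows that the linear schedule destroys most of the signal well before the midpoint, while the cosine schedule spreads the corruption more evenly across steps. Practical implementations typically also clip the per-step beta of the cosine schedule near the final step.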
Inference Steps
Does sample quality scale with inference steps? Measure FID at 5, 10, 20, 50, and 250 sampling steps. This characterizes the quality-compute tradeoff and identifies the knee of the curve.
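One simple operational definition of the knee is the smallest step count whose FID is within a chosen tolerance of the best value observed. The sketch below uses that definition; the function name, the tolerance, and the FID values are hypothetical.

```python
def knee_step_count(fid_by_steps, tolerance=0.5):
    """Smallest step count whose FID is within `tolerance` of the best FID observed."""
    best = min(fid_by_steps.values())
    return min(steps for steps, fid in fid_by_steps.items() if fid <= best + tolerance)


# Hypothetical measurements, for illustration only.
fid_by_steps = {5: 38.2, 10: 15.7, 20: 8.9, 50: 6.4, 250: 6.2}
print(knee_step_count(fid_by_steps))  # -> 50
```

The tolerance should be chosen relative to the confidence interval of the FID estimate; a tolerance smaller than the measurement noise makes the knee location arbitrary.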
Capstone project tracks
The course capstone is an opportunity to apply experimental rigor to a generative modeling problem of your choice. Three project tracks are provided below, each with specific technical specifications. A well-executed project on a smaller scale is more valuable than an incomplete project at large scale.
| Track A: Arch | Track B: Eval | Track C: Apps |
| --- | --- | --- |
| Rigorously evaluate a modification to the DDPM or flow matching training pipeline with controlled ablations. | Design a new evaluation framework (TIFA-style or VQAScore-style) for a specific generative sub-domain. | Deploy a generative model to a downstream task like robotics, medical imaging, or audio synthesis. |
Track A — Architecture and Training Ablation.
Goal: rigorously evaluate a proposed modification to the DDPM or flow matching training pipeline.
Requirements: (1) Implement a baseline DDPM or flow matching model on a standard dataset (CIFAR-10, CelebA-HQ 64×64, or LSUN bedroom). (2) Propose and implement one architectural or training change (e.g., modified noise schedule, alternative conditioning architecture, latent compression ratio). (3) Run a controlled ablation: full system vs. system without the proposed change vs. simple baseline. (4) Report FID with 95% confidence intervals at three sample sizes. (5) Generate qualitative failure examples that illustrate where each model breaks down.
Deliverables: training code, evaluation script with fixed seeds, FID curves with confidence bands, and 10–15 generated samples per configuration.
Track B — Conditioning and Control.
Goal: implement and evaluate a conditioning mechanism on a specialized domain.
Requirements: (1) Collect or curate a small dataset with conditioning signal (sketch-to-photo, pose-to-person, or audio-to-spectrogram; 500–2000 samples is sufficient). (2) Implement either ControlNet-style structural conditioning or classifier-free guidance on a pretrained base model. (3) Evaluate three metrics: image quality (FID), conditioning adherence (task-specific metric such as SSIM for structural correspondence or pose distance for pose-to-person), and failure rate at high guidance scales, where diversity breaks down. (4) Document the guidance scale tradeoff curve: generate samples across a range of guidance scales and plot adherence vs. diversity.
Deliverables: conditioning implementation, evaluation dataset with ground truth conditioning signals, guidance curve plots, and qualitative examples across the guidance spectrum.
Track C — Generative Models for Physical AI.
Goal: apply a generative model to a physical AI task (robotics, autonomous driving, or sequential decision making).
Requirements: (1) Train a diffusion policy or world model on a simulated environment (D4RL locomotion task, Isaac Gym manipulation, or a custom 2D gridworld environment). (2) Evaluate on task success rate with 95% confidence intervals over 10+ random seeds. (3) Ablate the generative model component: compare to a non-generative baseline (behavioral cloning, PPO without world model) to demonstrate the value of generative modeling. (4) Analyze failure cases: categorize when the generative model fails (e.g., distribution shift, long horizon, sparse reward regime) and estimate failure rates for each category.
Deliverables: training script, evaluation protocol with fixed seeds, success rate curves with confidence intervals, ablation comparison, failure analysis with 5–10 visualized failure trajectories.
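For the success-rate confidence intervals in Track C, one reasonable choice is the Wilson score interval, which behaves better than the normal approximation when the rate is near 0 or 1 or the number of episodes is small. The sketch below treats episodes as independent Bernoulli trials, which is an assumption; the function name and the counts are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95% coverage)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Hypothetical result: 37 successful episodes out of 50.
low, high = wilson_interval(37, 50)
print(f"success rate 0.74, 95% CI [{low:.3f}, {high:.3f}]")
```

When results are aggregated over training seeds rather than over episodes, episodes from the same seed are correlated. In that case an interval computed over per-seed success rates is the more defensible quantity to report.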
For all tracks: use fixed random seeds for all stochastic components, report all hyperparameters (learning rate, batch size, number of steps, network architecture), provide all generated samples in an appendix or supplementary materials. A well-executed Track A project on CIFAR-10 with 50k evaluation samples is more scientifically valuable than an under-documented Track C project with impressive but unvalidated results at large scale.
Multi-dimensional evaluation
A complete evaluation of a generative system requires multiple metrics reported jointly, because each metric captures a different aspect of performance and can be independently optimized at the expense of the others.
For text-to-image systems:
FID on a standard benchmark (COCO-30K or MS-COCO) with 50,000 samples measures the overall fidelity and diversity of the joint distribution. Report with 95% confidence intervals from bootstrap sampling.
Precision and recall separate the two properties that FID conflates: quality (precision) and coverage (recall). A system with high precision but low recall is a mode-collapsed model that produces realistic but narrow outputs.
CLIP Score measures prompt adherence. A system can achieve low FID and high CLIP Score by memorizing the evaluation set, so both should be verified against a training-set-disjoint evaluation split.
DrawBench / PartiPrompts evaluation uses a curated prompt set covering challenging semantic compositions (counting, spatial relationships, attribute binding) and reports human preference win rates against baselines. This captures capabilities that FID and CLIP Score miss.
Safety evaluation reports the rate of harmful content generation under a standardized red-teaming prompt set, with and without safety filters. Safety evaluation must be on-distribution for the deployment context.
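The bootstrap confidence intervals requested above can be sketched with a generic percentile bootstrap. The example below applies it to a per-sample score whose statistic is a mean (as with CLIP Score); the function name and the synthetic scores are hypothetical. For FID, the same resampling idea applies, but the resampled units are feature vectors and FID is recomputed on each resample.

```python
import random
import statistics


def bootstrap_ci(values, statistic, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for `statistic` over `values`."""
    rng = random.Random(seed)          # fixed seed so the interval itself is reproducible
    n = len(values)
    estimates = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - confidence) / 2 * n_resamples)
    hi_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Synthetic per-prompt scores, for illustration only.
rng = random.Random(1)
scores = [rng.gauss(30.0, 3.0) for _ in range(500)]
low, high = bootstrap_ci(scores, statistics.fmean)
print(f"mean = {statistics.fmean(scores):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

The bootstrap captures sampling variability only. It does not remove the finite-sample bias of FID, so intervals at different sample sizes are not directly comparable.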
Systematic failure analysis
Every generative system fails on some inputs. Documenting failures systematically is a positive scientific contribution — it tells practitioners where the system will not perform and guides future work. Failure modes should be categorized and their rates estimated:
| Attribute binding | Counting | Spatial relationships |
| --- | --- | --- |
| "Red cube on a blue sphere" generates objects with swapped colors, indicating a failure in attribute grounding. | "Three dogs" generates two or four, revealing limitations in the model's numeracy and instance separation. | "Object A left of B" generates the wrong arrangement, failing to model relative coordinates and composition. |
Attribute binding failures: "a red cube on a blue sphere" generates the right objects with wrong color assignments. Report the frequency on standard benchmarks.
Counting failures: "three dogs" generates two or four dogs. Report the rate as a function of object count.
Spatial relationship failures: "object A to the left of object B" generates the wrong spatial arrangement. Report the rate by relationship type.
Style-content entanglement: requesting a photorealistic image in a specific artistic style produces an image that is neither photorealistic nor artistically consistent.
For each failure mode, document: the frequency in standard evaluation, the input conditions that trigger it, and whether it would be detected by standard automatic metrics (FID, CLIP Score).
Deployment engineering: quantization, distillation, and production inference
Deployment of generative models to production systems requires careful attention to inference latency, memory footprint, and reliability under load. While training-time considerations focus on quality, deployment-time considerations focus on efficiency and robustness.
Quantization
Quantization reduces model weights from 32-bit floating point (float32) to lower precision representations to reduce memory footprint and increase inference throughput. INT8 quantization maps weights to 8-bit integers using per-channel or per-layer scaling: $W_q = \mathrm{round}(W / s)$ where $s = \max|W| / 127$; dequantize for computation: $\hat{W} = s \cdot W_q$. INT8 quantization reduces memory by 4× and increases throughput by 2–3× on modern hardware with GEMM INT8 acceleration (e.g., TensorRT on NVIDIA GPUs). Quality degradation is typically <1 FID point for diffusion models, making INT8 the practical choice for deployment where memory is constrained.
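The scaling and rounding steps can be illustrated on a toy matrix. The sketch below implements symmetric per-channel quantization in pure Python, treating each row as one output channel; it is a didactic approximation of what production toolchains do, and the function names and weight values are hypothetical.

```python
def quantize_int8_per_channel(weights):
    """Symmetric per-channel INT8 quantization of a 2-D weight matrix (rows = output channels)."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0   # avoid division by zero for all-zero rows
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Map INT8 codes back to floating point: w_hat = scale * q."""
    return [[s * q for q in row] for row, s in zip(q_rows, scales)]


# Toy 2x4 weight matrix (hypothetical values) with very different row magnitudes.
W = [[0.10, -0.50, 0.25, 1.27], [0.002, -0.004, 0.001, 0.003]]
Wq, scales = quantize_int8_per_channel(W)
W_hat = dequantize(Wq, scales)
max_err = max(abs(a - b) for r, rh in zip(W, W_hat) for a, b in zip(r, rh))
print(Wq, scales, max_err)
```

The second row shows why per-channel scaling matters: with a single per-layer scale set by the largest weight in the matrix, every entry of that row would round to zero.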
4-bit Quantization
4-bit quantization (GGUF format for LLMs, GPTQ for weights) performs aggressive quantization to 4 bits per weight, reducing memory by 8× but requiring careful per-channel or per-group scaling to limit accuracy degradation. 4-bit quantization is primarily used for large language models but is increasingly applied to diffusion transformer models where memory becomes a bottleneck (e.g., DiT-XL models on edge devices). Calibration on a representative sample of training data is essential to maintain quality.
Knowledge Distillation
Knowledge distillation for diffusion trains a small student diffusion model to match the output distribution of a large teacher using a feature-matching loss: $\mathcal{L}_{\text{feat}} = \sum_{l} \lVert f_l^{\text{student}} - f_l^{\text{teacher}} \rVert_2^2$, where $f_l$ are intermediate features at layer $l$. A student model (e.g., 100M parameters) can achieve FID comparable to the teacher (1B parameters) on standard benchmarks through distillation, reducing inference cost by roughly 10×. Distillation is especially effective when combined with consistency distillation, where the student is trained to match single-step or multi-step teacher outputs, enabling very fast generation (4–8 steps for high-quality output).
ONNX and TensorRT Export
ONNX and TensorRT export: for production deployment to diverse hardware, export the trained model to ONNX (Open Neural Network Exchange) for hardware-agnostic representation. Compile with TensorRT for GPU inference acceleration, which applies graph fusion, operator optimization, and precision calibration automatically. A production text-to-image system (Stable Diffusion 1.5 on an A10G GPU) generates 512×512 images in 0.8s with DDIM-25 sampling steps after TensorRT compilation, compared to 3.5s with PyTorch eager mode — a 4.4× speedup from compilation alone.
Dynamic Batching and Streaming
Dynamic batching and streaming: production serving systems (NVIDIA Triton, vLLM, Replicate, BentoML) batch multiple user requests to maximize GPU utilization and reduce per-request cost. Batching raises throughput but can add queueing delay to individual requests, which is why tail latency matters more than the mean. For diffusion models, batching across requests is straightforward since each inference is independent and stateless. Monitoring P95 latency and error rates under realistic load is essential before public deployment. A typical target is P95 latency < 5s for a 512×512 image generation on a single GPU, achievable with batch size 4–8 and DDIM-25 steps.
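Checking a P95 target from load-test logs requires only a percentile computation. The sketch below uses the nearest-rank definition; the function name and the latency values are hypothetical, and monitoring stacks may use interpolating definitions that give slightly different numbers.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-request latencies in seconds from a load test.
latencies = [1.9, 2.1, 2.0, 2.4, 2.2, 2.3, 6.8, 2.1, 2.0, 2.5,
             2.2, 2.6, 2.1, 2.3, 2.0, 2.4, 2.2, 2.1, 5.4, 2.3]
p50 = percentile_nearest_rank(latencies, 50)
p95 = percentile_nearest_rank(latencies, 95)
print(f"P50 = {p50:.1f}s, P95 = {p95:.1f}s, meets 5s target: {p95 < 5.0}")
```

In this toy log the median looks healthy while the P95 misses the 5s target, which is the situation that mean or median monitoring alone would hide.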
GenAI context
This course on generative modeling for physical AI is part of a broader four-course curriculum. The table below maps Course 3 topics to foundational concepts from Course 1 (reinforcement learning) and Course 2 (robot learning), and to downstream applications in Course 4 (vision-language models).
| Course 3 topic | Connection to Course 1 (RL) | Connection to Course 2 (Robotics) | Connection to Course 4 (VLMs) |
|---|---|---|---|
| VAE (Week 2) | Variational inference in model-based RL; latent Q-learning | CVAE for robot action generation (Week 8) | Visual tokenization in VLMs; discrete latent bottlenecks |
| GAN (Week 3) | Adversarial IRL / GAIL for policy learning | Sim2Real with discriminator; adversarial domain randomization (Week 7) | GAN-based image-text alignment; early DALL-E variant training |
| EBM (Week 4) | Reward as energy-based model; GAIL training | EBM-based obstacle avoidance and safety constraints | VLM reward models as EBMs; preference optimization |
| Normalizing flows (Week 5) | Flow-based policy distribution (normalizing flow actor) | Flow-matching diffusion policy (Week 9); exact likelihood training | Continuous visual embedding spaces; normalizing flow VLMs |
| DDPM (Week 6) | Diffusion for offline RL trajectory prior; goal-conditioned planning | Diffusion policy for robot control (Week 9); multi-step action generation | DALL-E 2 image generation with CLIP; text-to-image via DDPM |
| Flow matching (Week 7) | Optimal transport trajectory optimization for planning | OT-CFM robot policy (50 Hz control, Week 9); continuous action policies | FLUX.1 text-to-image generation; efficient generative modeling |
| Conditioning (Week 8) | Reward conditioning (goal-conditioned RL); multi-task policies | Goal-conditioned imitation learning (Week 5); contextual policies | LLaVA / BLIP-2 text-to-image conditioning; multi-modal fusion |
| LDM (Week 9) | Model-based RL in latent space; latent world models | Latent policy representations; compact state abstraction | Stable Diffusion for image generation; latent diffusion VLMs |
| Evaluation (Week 10) | RL evaluation: sample complexity, confidence intervals | Robot task success rate with CI; generalization metrics | VLM evaluation: BLEU, VQAScore, human preference |
| Representations (Week 11) | State representations for Q-learning; feature learning | R3M, Voltron, ViT features for robot perception | CLIP, MAE, DINOv2 backbone for VLMs; foundation visual features |
| World models (Week 12) | DreamerV3, TD-MPC, model-based RL in imagination | Sim2Real as world model training; environment modeling | Video prediction with VLMs; temporal reasoning |
| Safety (Week 13) | RLHF, DPO for LLM alignment; safe RL constraints | Safety constraints on robots; collision avoidance, reward validation | VLM safety: InstructBLIP, red-teaming; harmful content filtering |
Synthesis
Synthesis: Every generative model is a world model. The generative modeling curriculum reveals a fundamental insight for physical AI: the distinction between "generative models" (Course 3) and "robot learning" (Course 2) is architectural, not conceptual. A VAE that compresses robot observations to latent states is a perceptual world model. A diffusion policy that generates action sequences conditioned on observations is a generative inverse dynamics model. A flow matching model trained on robot trajectories is both a motion generator and a trajectory prior for planning. The same mathematical frameworks (ELBO, score matching, optimal transport) operate seamlessly at the intersection of both domains. This unification is what makes foundation models for physical AI (the curriculum's central theme) a coherent research program rather than an assembly of disconnected methods. The next frontier is scaling these unified models to the complexity of real-world physical interaction while maintaining safety, interpretability, and human alignment.
Course conclusion
The generative modeling curriculum has traced a complete arc from the first principles of variational inference through the full modern stack: VAEs, GANs, energy-based models, normalizing flows, DDPM, flow matching, conditioning, latent diffusion, evaluation, representation learning, world models, safety, and alignment.
Every topic connects to the themes established at the start. The ELBO from Week 1 reappears in the DDPM ELBO, the RSSM training objective, and the VLM alignment objectives. The score function from Week 4 is the denoising target of DDPM and the vector field of flow matching. The KL divergence that regularizes the VAE latent space is the same KL that constrains policy updates in RLHF. Flow matching's optimal transport objective is the Schrödinger bridge of Week 5 in the limit of no diffusion. The representation learning of Week 11 is the foundation for the VLA models of Course 2.
The open problems that remain are the same problems the field has always faced, now at larger scale and higher stakes: long-range coherence in high-dimensional generation, formal guarantees on safety constraints, sample efficiency in low-data regimes, and the alignment of model behavior with human values under distribution shift. Understanding both what has been solved and what remains open defines technical depth in this field.
Conceptual questions
- An ablation study for a latent diffusion model shows: full system → FID 6.2; no latent compression (pixel-space) → FID 8.5; no CFG → FID 7.1; no CLIP conditioning → FID 14.3; linear noise schedule (vs. cosine) → FID 6.8. Interpret these results: rank the components by importance, identify which result is most surprising relative to prior expectations, and propose two follow-up experiments that would deepen understanding of the most important component.
- A paper reports FID = 5.3 on MS-COCO with 30,000 samples and no confidence interval. A follow-up evaluation with 50,000 samples measures FID = 4.1 for the same model. Explain the statistical mechanism that causes FID to increase with fewer samples. Derive how the bias scales with sample size, and propose a minimum sample size for reporting FID that limits the bias to less than 0.5 FID points.
- A generative model achieves CLIP Score = 32 and FID = 8 on standard evaluations but shows a 40% failure rate on spatial relationship prompts ("A is to the left of B"). The CLIP Score for spatial relationship prompts is 27 (lower than average). Design a comprehensive evaluation protocol that would detect this spatial failure mode using automatic metrics, and explain why standard CLIP Score evaluation fails to flag it.
- Trace the role of the ELBO through the generative modeling curriculum: show how the VAE ELBO (Week 2), the DDPM ELBO (Week 6), and the RSSM ELBO (Week 12) are all instances of the same general bound $\log p(x) \geq \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{\mathrm{KL}}(q(z \mid x) \,\|\, p(z))$. For each model, identify what plays the role of the observation $x$, the latent $z$, the prior $p(z)$, and the approximate posterior $q(z \mid x)$.
- Synthesize the full Course 3 curriculum by tracing the design decisions for a language-conditioned video generation system that is safe for public deployment. For each of the following components, identify: (a) the technical choice required, (b) the evaluation metric for that component, and (c) the failure mode if the component is absent: (i) video compression (VAE), (ii) generation model (diffusion vs. flow matching), (iii) text conditioning mechanism, (iv) safety filter, (v) content watermarking.