Week 14: Generative AI Capstone — Generative Models

Purpose of this lecture#

Thirteen weeks assembled a complete generative modeling curriculum: probabilistic foundations, VAEs, GANs, EBMs, normalizing flows, DDPM, flow matching, conditioning, latent diffusion, evaluation, representation learning, world models, and safety. Each lecture addressed a specific component of the generative modeling pipeline.

This final lecture addresses the harder question: what does it mean to build and evaluate a generative system with scientific rigor? A compelling demo is not evidence of capability; a single FID number is not evidence of general quality; an ablation that was not pre-registered is not evidence of necessity. This lecture develops the experimental discipline that separates published research from exploratory tinkering — and that determines whether a deployed system will perform as reported when exposed to the full distribution of real-world inputs.

The generative pipeline as a system#

A generative system is not a sequence of independent design choices — it is an interconnected system where decisions in one component propagate to others in ways that are often nonlinear and non-obvious.

Data Pipeline Decisions#

Data pipeline decisions determine what distribution the model will learn. Deduplication rates affect memorization; image resolution affects spatial fidelity; caption quality for text-to-image models determines semantic grounding. A model cannot learn structure that is absent from or inconsistent in the training data, no matter the architecture.

Architecture Decisions#

Architecture decisions determine the model's capacity and inductive biases. A diffusion transformer (DiT) operating on image patches has no inherent spatial locality bias — it must learn spatial structure from data. A convolutional U-Net has strong locality bias built in, which accelerates learning but may limit long-range coherence. The right architecture depends on the dataset scale and the downstream use case.

Training Decisions#

Training decisions determine whether the architecture's capacity is utilized. Noise schedule, learning rate schedule, batch size, EMA coefficient, and gradient clipping all affect the resulting model. The interaction between noise schedule and model capacity is particularly important: a model undertrained at high noise levels will produce globally incoherent samples even if reconstruction at low noise levels is sharp.

Inference Decisions#

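Inference decisions determine sample quality for the deployed model. Guidance scale, number of sampling steps, and sampler choice (DDIM, DPM-Solver, flow ODE) all affect the fidelity-diversity tradeoff. The classifier-free guidance dropout rate also shapes this tradeoff, but it is fixed at training time: it controls how well the unconditional branch is learned and therefore how the model responds to a given guidance scale. The optimal inference configuration is task-dependent.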

Experimental methodology#

The scientific method for generative model research requires treating experimental design as carefully as model design.

Controlled Comparisons#

Controlled comparisons: changing more than one variable between compared systems prevents attribution of performance differences to any single cause. The gold standard is an ablation study: start from the full system, remove one component at a time, and measure the performance impact of each removal. A well-designed ablation answers: which components are necessary (large performance drop when removed), which are redundant (small performance drop), and which interact (removing A and B together costs more, or less, than the sum of the costs of removing A and B individually). Detecting an interaction therefore requires at least one run with both components removed.
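The necessary / redundant / interacting distinction reduces to simple arithmetic on the ablation table. The sketch below assumes a lower-is-better metric such as FID and measures interaction as the joint removal cost minus the sum of the single removal costs; the function names and all numbers are hypothetical, chosen only to illustrate the calculation.

```python
def removal_cost(full: float, ablated: float) -> float:
    """Metric increase caused by removing a component (lower-is-better metric)."""
    return ablated - full


def interaction(full: float, no_a: float, no_b: float, no_ab: float) -> float:
    """Joint removal cost minus the sum of the two single removal costs.

    Positive: each component partly compensates for the other's absence.
    Near zero: the two contributions are roughly additive.
    """
    joint = removal_cost(full, no_ab)
    return joint - (removal_cost(full, no_a) + removal_cost(full, no_b))


# Hypothetical FID values, for illustration only.
fid = {"full": 6.0, "no_a": 7.0, "no_b": 6.5, "no_ab": 9.0}

cost_a = removal_cost(fid["full"], fid["no_a"])
cost_b = removal_cost(fid["full"], fid["no_b"])
extra = interaction(fid["full"], fid["no_a"], fid["no_b"], fid["no_ab"])
print(f"cost(A)={cost_a:.1f}  cost(B)={cost_b:.1f}  interaction={extra:.1f}")
```

An interaction estimate is only meaningful when it exceeds the run-to-run noise of the metric, so each cell of the table should itself be averaged over seeds.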

Reproducibility Requirements#

Reproducibility requirements: fixed random seeds for all stochastic evaluations; exact specification of the evaluation protocol (number of samples, FID implementation, guidance scale); checkpoint identification (training step, effective batch size, hyperparameters). Results without these specifications cannot be reproduced, compared across papers, or validated.
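One lightweight way to meet these requirements is to store the evaluation protocol as a single serializable record next to the results. The sketch below is an assumption about how such a record could look: the `EvalProtocol` field names, the `seeded_rng` helper, and all values are illustrative and do not come from a specific library or from this course's codebase.

```python
import json
import random
from dataclasses import asdict, dataclass

import numpy as np


@dataclass(frozen=True)
class EvalProtocol:
    """Everything needed to rerun an evaluation (field names are illustrative)."""
    seed: int
    num_samples: int
    fid_implementation: str
    guidance_scale: float
    checkpoint_step: int
    effective_batch_size: int


def seeded_rng(seed: int) -> np.random.Generator:
    """Seed Python's global RNG and return a dedicated NumPy generator."""
    random.seed(seed)
    return np.random.default_rng(seed)


# Hypothetical values; record whatever the real run actually used.
protocol = EvalProtocol(
    seed=0,
    num_samples=50_000,
    fid_implementation="example-fid-v1",
    guidance_scale=5.0,
    checkpoint_step=400_000,
    effective_batch_size=256,
)
record = json.dumps(asdict(protocol), sort_keys=True)

# Two generators built from the same seed produce identical noise.
noise_a = seeded_rng(protocol.seed).standard_normal(4)
noise_b = seeded_rng(protocol.seed).standard_normal(4)
print(record)
```

A deep learning framework has its own random state that must be seeded as well; that step is omitted here to keep the sketch framework-independent.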

Baseline Selection#

Baseline selection: a result is only meaningful relative to what was already possible. The appropriate baselines are: (1) a simple baseline that requires minimal implementation (a VAE or a DDPM with standard settings); (2) the best published method on the same benchmark; (3) the ablated version of the proposed system that isolates the novel contribution. A method that outperforms only weak baselines provides weak evidence.

Avoiding Cherry-Picking#

Cherry-picking, meaning the selection of the best samples for visual presentation, is a common source of misleading results in generative model papers. Quantitative metrics (FID, precision/recall) computed on random samples rather than curated selections provide protection against cherry-picking, but metrics can themselves be over-optimized during development, which hides failures in evaluation dimensions that were not tracked.

Scaling laws for generative models#

Performance improvements in generative models follow predictable power-law relationships that guide resource allocation and enable informed architectural decisions.

Neural Scaling Laws#

Neural scaling laws (Hoffmann et al., 2022; Chen et al., 2023 for diffusion) describe how performance metrics improve predictably with model scale, data scale, and compute. For language models, the Chinchilla scaling law provides: $L(N, D) = E + A/N^\alpha + B/D^\beta$ where $N$ is the number of parameters, $D$ is the number of training tokens, and $E$ is the irreducible loss. This law reveals that performance depends on both model and data scale in a balanced way, not dominated by either alone.
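The parametric form is easy to explore numerically. The sketch below evaluates $L(N, D)$ and compares the effect of doubling parameters with that of doubling data; the default constants are placeholders of roughly the magnitude reported for language models and should be read as assumptions, not as fitted values for any diffusion model.

```python
def parametric_loss(n_params: float, n_tokens: float, *, E: float = 1.69,
                    A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """L(N, D) = E + A / N**alpha + B / D**beta.

    The default constants are illustrative assumptions, not values
    taken from this lecture.
    """
    return E + A / n_params**alpha + B / n_tokens**beta


base = parametric_loss(1e9, 2e10)
double_params = parametric_loss(2e9, 2e10)
double_data = parametric_loss(1e9, 4e10)
print(f"base={base:.3f}  2x params={double_params:.3f}  2x data={double_data:.3f}")
```

Which of the two doublings helps more depends on the operating point $(N, D)$, which is the sense in which neither term dominates.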

Diffusion Model Scaling#

Diffusion model scaling exhibits similar power-law relationships. FID scales approximately as $N^{-0.5}$ with model parameters (holding data and compute fixed), and as $D^{-0.3}$ with training data. These exponents are empirical observations from large-scale training studies, not universal constants, and they guide resource allocation: doubling parameters and doubling data have different returns depending on the current operating point. A model in the parameter-limited regime (trained to convergence on fixed data) benefits more from additional parameters, while a model in the data-limited regime (parameters far from saturation) benefits more from additional training examples.

Compute-Optimal Frontier#

Compute-optimal frontier (the Chinchilla refinement of the Kaplan scaling law): for a fixed FLOPs budget $C$, the optimal parameter count $N^*$ and dataset size $D^*$ satisfy $N^* \propto C^{0.5}$ and $D^* \propto C^{0.5}$. This implies that model size and data should grow in equal proportion as compute grows. Spending all compute on a very large model trained for few steps (the undertrained regime) or on a small model trained for many steps (the capacity-limited regime) underperforms the balanced allocation. For diffusion models, this principle is particularly important for text-to-image systems, where the dominant cost of additional data is the compute needed to train on it, not its acquisition.
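The equal exponents follow from minimizing the parametric loss $L(N, D) = E + A/N^\alpha + B/D^\beta$ under a compute constraint. The derivation below is a sketch that assumes the common approximation $C \approx 6ND$ for training FLOPs, which is an assumption not stated in this lecture.

$$
\min_{N, D}\; \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\quad \text{subject to} \quad 6ND = C
$$

Substituting $D = C/(6N)$ and setting the derivative with respect to $N$ to zero gives

$$
\alpha A\, N^{-\alpha-1} = \beta B \left(\frac{6}{C}\right)^{\beta} N^{\beta-1}
\quad \Longrightarrow \quad
N^{*} = G \left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}}, \qquad
D^{*} = G^{-1} \left(\frac{C}{6}\right)^{\frac{\alpha}{\alpha+\beta}}, \qquad
G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}
$$

When $\alpha \approx \beta$, both exponents are close to $0.5$, which recovers $N^* \propto C^{0.5}$ and $D^* \propto C^{0.5}$. If the fitted exponents differ, the allocation tilts toward whichever of $N$ or $D$ has the smaller exponent in the loss.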

Practical Implications#

Practical implication for generative model research: before proposing a new architecture, establish the scaling baseline. The critical question is: does the proposed method improve FID at the same parameter count as the baseline, or does it achieve the same FID with fewer parameters? A comparison at a single parameter count cannot answer this if the two methods scale differently, because the ranking may reverse at another scale. Reporting the full scaling curve for both methods is more informative and allows an improvement to be attributed to the architecture instead of to the operating point.
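Comparing scaling curves amounts to fitting a power law to each method and checking where the curves cross. The sketch below fits $\text{FID} = c \cdot N^{-k}$ by least squares in log-log space; the data are synthetic, the helper names are illustrative, and the model ignores any irreducible FID floor, so it is a starting point and not a complete analysis.

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y = c * x**(-k) by linear regression in log-log space; return (c, k)."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(intercept)), float(-slope)


def crossover(c1, k1, c2, k2):
    """Scale x at which c1 * x**-k1 equals c2 * x**-k2 (requires k1 != k2)."""
    return float((c2 / c1) ** (1.0 / (k2 - k1)))


# Synthetic FID measurements at five parameter counts (illustrative only).
params = np.array([5e7, 1e8, 2e8, 4e8, 8e8])
fid_baseline = 3.0e4 * params**-0.45
fid_proposed = 1.0e5 * params**-0.52

c_base, k_base = fit_power_law(params, fid_baseline)
c_prop, k_prop = fit_power_law(params, fid_proposed)
n_cross = crossover(c_base, k_base, c_prop, k_prop)
print(f"baseline exponent={k_base:.2f}  proposed exponent={k_prop:.2f}  "
      f"curves cross near N={n_cross:.2e}")
```

In this synthetic case the proposed method is worse below the crossover and better above it, which is exactly the situation a single-point comparison would misreport.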

Saturation and Emergence#

Saturation and capability emergence: at certain scale thresholds, qualitatively new capabilities emerge that are absent at smaller scales. Text-to-image models show emergent prompt binding — the ability to correctly associate multiple attributes with multiple objects (e.g., "a red sphere and a blue cube") — only above approximately 1B parameters. Video diffusion models show temporal coherence emergence around 10B parameters, where the model transitions from generating frame-by-frame incoherent motion to smooth, continuous trajectories. These emergence points are not predicted by smooth scaling laws and require empirical verification at scale, making early-stage scaling studies essential for long-term research planning.

Ablation design#

A minimal ablation set for a text-to-image latent diffusion system should answer:

Latent Compression#

Is the latent compression necessary? Compare pixel-space diffusion at the same compute budget. This quantifies the efficiency gain of LDM compression.

Conditioning Mechanism#

Is the conditioning mechanism necessary? Compare CLIP text conditioning vs. no conditioning vs. a simple class label. This quantifies the semantic grounding provided by the conditioning.

Classifier-Free Guidance#

Is CFG necessary at the chosen guidance scale? Compare $s = 1$ (no guidance), $s = 5$ (standard), and $s = 15$ (high guidance). This quantifies the fidelity-diversity tradeoff and identifies the guidance scale where the improvement saturates.
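The guidance scales in this ablation enter sampling through the classifier-free guidance combination of the conditional and unconditional noise predictions. A minimal sketch, assuming the convention in which $s = 1$ recovers the purely conditional prediction (consistent with the "no guidance" label above); `cfg_combine` and the toy arrays are illustrative names and values.

```python
import numpy as np


def cfg_combine(eps_uncond, eps_cond, scale):
    """Guided prediction: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


# Toy noise predictions standing in for two forward passes of the model.
eps_uncond = np.array([0.0, 1.0, -1.0])
eps_cond = np.array([0.5, 1.5, -2.0])

guided = {s: cfg_combine(eps_uncond, eps_cond, s) for s in (1.0, 5.0, 15.0)}
for s, eps in guided.items():
    print(f"s={s:>4}: {eps}")
```

Larger scales extrapolate further along the direction from the unconditional to the conditional prediction, which is why very high values trade diversity for adherence.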

Noise Schedule#

Is the noise schedule optimal? Compare linear vs. cosine schedules. This quantifies the effect of schedule design on sample quality at different generation budgets.
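The two schedules can be compared directly through the cumulative signal level $\bar\alpha_t$. The sketch below assumes commonly used settings that this lecture does not specify: a linear $\beta$ schedule from $10^{-4}$ to $0.02$ over $T = 1000$ steps, and a cosine schedule with offset $0.008$.

```python
import numpy as np


def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)


def cosine_alpha_bar(T=1000, offset=0.008):
    """alpha_bar_t = f(t) / f(0) with f(t) = cos^2(((t/T + offset) / (1 + offset)) * pi/2)."""
    t = np.arange(T + 1) / T
    f = np.cos((t + offset) / (1.0 + offset) * np.pi / 2.0) ** 2
    return (f / f[0])[1:]


lin_ab = linear_alpha_bar()
cos_ab = cosine_alpha_bar()
for step in (250, 500, 750):
    print(f"t={step}: linear={lin_ab[step - 1]:.3f}  cosine={cos_ab[step - 1]:.3f}")
```

Under these assumed settings the cosine schedule retains more signal in the middle of the chain, so fewer timesteps are spent on inputs that are already close to pure noise.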

Inference Steps#

Does sample quality scale with inference steps? Measure FID at 5, 10, 20, 50, and 250 sampling steps. This characterizes the quality-compute tradeoff and identifies the knee of the curve.

Capstone project tracks#

The course capstone is an opportunity to apply experimental rigor to a generative modeling problem of your choice. Three project tracks are provided below, each with specific technical specifications. A well-executed project on a smaller scale is more valuable than an incomplete project at large scale.

| Track A: Arch | Track B: Eval | Track C: Apps |
| --- | --- | --- |
| Rigorously evaluate a modification to the DDPM or flow matching training pipeline with controlled ablations. | Design a new evaluation framework (TIFA-style or VQAScore-style) for a specific generative sub-domain. | Deploy a generative model to a downstream task like robotics, medical imaging, or audio synthesis. |

Track A — Architecture and Training Ablation.

Goal: rigorously evaluate a proposed modification to the DDPM or flow matching training pipeline.

Requirements: (1) Implement a baseline DDPM or flow matching model on a standard dataset (CIFAR-10, CelebA-HQ 64×64, or LSUN bedroom). (2) Propose and implement one architectural or training change (e.g., modified noise schedule, alternative conditioning architecture, latent compression ratio). (3) Run a controlled ablation: full system vs. system without the proposed change vs. simple baseline. (4) Report FID with 95% confidence intervals at 3 sample sizes ( $N = 5000, 20000, 50000$ ). (5) Generate qualitative failure examples that illustrate where each model breaks down.

Deliverables: training code, evaluation script with fixed seeds, FID curves with confidence bands, and 10–15 generated samples per configuration.
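The FID confidence bands required for Track A can be produced with a percentile bootstrap over the generated features. The sketch below is NumPy-only and assumes the features have already been extracted; the toy 8-dimensional Gaussians stand in for Inception features, and the function names are illustrative. It resamples only the generated set, so the interval reflects sampling variability of the generated images and not of the reference statistics.

```python
import numpy as np


def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between two Gaussians N(mu1, cov1) and N(mu2, cov2)."""
    diff = mu1 - mu2
    # Tr((cov1 cov2)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov1 @ cov2, which are real and non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)


def fid_from_features(real, fake):
    """FID from two (num_samples, dim) feature arrays."""
    return frechet_distance(real.mean(axis=0), np.cov(real, rowvar=False),
                            fake.mean(axis=0), np.cov(fake, rowvar=False))


def bootstrap_fid_ci(real, fake, n_boot=200, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling the generated features."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(fake), size=len(fake))
        stats.append(fid_from_features(real, fake[idx]))
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(stats, [tail, 1.0 - tail])
    return float(lo), float(hi)


# Toy features for illustration only.
rng = np.random.default_rng(0)
real_feats = rng.standard_normal((1000, 8))
fake_feats = rng.standard_normal((1000, 8)) + 0.2

point = fid_from_features(real_feats, fake_feats)
ci_low, ci_high = bootstrap_fid_ci(real_feats, fake_feats, n_boot=100)
print(f"FID={point:.3f}  95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

Because resampling with replacement introduces duplicates, bootstrap replicates of FID tend to sit slightly above the point estimate; an alternative is to generate several independent sample sets with different seeds and report the spread across them.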

Track B — Conditioning and Control.

Goal: implement and evaluate a conditioning mechanism on a specialized domain.

Requirements: (1) Collect or curate a small dataset with conditioning signal (sketch-to-photo, pose-to-person, or audio-to-spectrogram; 500–2000 samples is sufficient). (2) Implement either ControlNet-style structural conditioning or classifier-free guidance on a pretrained base model. (3) Evaluate three metrics: image quality (FID), conditioning adherence (task-specific metric such as SSIM for structural correspondence or pose distance for pose-to-person), and failure rate at high guidance scales (e.g., at $s = 20$ where diversity breaks down). (4) Document the guidance scale tradeoff curve — generate samples at $s \in \{1, 3, 5, 10, 15, 20\}$ and plot adherence vs. diversity.

Deliverables: conditioning implementation, evaluation dataset with ground truth conditioning signals, guidance curve plots, and qualitative examples across the guidance spectrum.

Track C — Generative Models for Physical AI.

Goal: apply a generative model to a physical AI task (robotics, autonomous driving, or sequential decision making).

Requirements: (1) Train a diffusion policy or world model on a simulated environment (D4RL locomotion task, Isaac Gym manipulation, or a custom 2D gridworld environment). (2) Evaluate on task success rate with 95% confidence intervals over 10+ random seeds. (3) Ablate the generative model component — compare to a non-generative baseline (behavioral cloning, PPO without world model) to demonstrate the value of generative modeling. (4) Analyze failure cases: categorize when the generative model fails (e.g., distribution shift, long horizon, sparse reward regime) and estimate failure rates for each category.

Deliverables: training script, evaluation protocol with fixed seeds, success rate curves with confidence intervals, ablation comparison, failure analysis with 5–10 visualized failure trajectories.
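For Track C, a confidence interval over seeds can be obtained by bootstrapping the per-seed success rates. The sketch below uses hypothetical success rates for ten seeds and a percentile bootstrap of the mean; the function name and the numbers are illustrative assumptions.

```python
import numpy as np


def bootstrap_mean_ci(per_seed_rates, n_boot=10_000, level=0.95, seed=0):
    """Mean success rate across seeds with a percentile bootstrap interval."""
    rates = np.asarray(per_seed_rates, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row is one bootstrap resample of the seeds, drawn with replacement.
    idx = rng.integers(0, len(rates), size=(n_boot, len(rates)))
    means = rates[idx].mean(axis=1)
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(means, [tail, 1.0 - tail])
    return float(rates.mean()), float(lo), float(hi)


# Hypothetical success rates, one per training seed.
rates = [0.72, 0.80, 0.76, 0.68, 0.84, 0.78, 0.74, 0.70, 0.82, 0.76]
mean, ci_low, ci_high = bootstrap_mean_ci(rates)
print(f"success rate = {mean:.3f}  95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

With only about ten seeds the percentile bootstrap tends to be somewhat narrow, so a t-based interval is a reasonable cross-check.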

For all tracks: use fixed random seeds for all stochastic components, report all hyperparameters (learning rate, batch size, number of steps, network architecture), provide all generated samples in an appendix or supplementary materials. A well-executed Track A project on CIFAR-10 with 50k evaluation samples is more scientifically valuable than an under-documented Track C project with impressive but unvalidated results at large scale.

Multi-dimensional evaluation#

A complete evaluation of a generative system requires multiple metrics reported jointly, because each metric captures a different aspect of performance and can be independently optimized at the expense of the others.

For text-to-image systems:

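FID on a standard benchmark (for example MS-COCO) measures the overall fidelity and diversity of the generated distribution. State the sample count explicitly: the COCO-30K protocol uses 30,000 samples, while 50,000 is the common default elsewhere, and values computed at different sample counts are not directly comparable. Report with 95% confidence intervals from bootstrap sampling.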

Precision and recall separate the two properties that FID merges into a single number: quality (precision) and coverage (recall). A system with high precision but low recall is a mode-collapsed model that produces realistic but narrow outputs.

CLIP Score measures prompt adherence. A system can achieve low FID and high CLIP Score by memorizing the evaluation set, so both should be verified against a training-set-disjoint evaluation split.

DrawBench / PartiPrompts evaluation uses a curated prompt set covering challenging semantic compositions (counting, spatial relationships, attribute binding) and reports human preference win rates against baselines. This captures capabilities that FID and CLIP Score miss.

Safety evaluation reports the rate of harmful content generation under a standardized red-teaming prompt set, with and without safety filters. Safety evaluation must be on-distribution for the deployment context.
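Of the metrics listed above, precision and recall are the least obvious to compute. One common construction estimates each distribution's support with k-nearest-neighbor balls in feature space: precision is the fraction of generated samples that fall inside the support of the real set, and recall is the reverse. The sketch below is a NumPy-only illustration on toy 2-D data; the function names are hypothetical, and real evaluations would use deep features and far more samples.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distances between rows of a and rows of b."""
    sq = (a**2).sum(axis=1)[:, None] + (b**2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.sqrt(np.clip(sq, 0.0, None))


def knn_radii(x, k):
    """Distance from each point to its k-th nearest neighbor within x."""
    d = pairwise_dist(x, x)
    # After sorting, column 0 is the (near-)zero self-distance.
    return np.sort(d, axis=1)[:, k]


def coverage(queries, support, k=3):
    """Fraction of queries lying inside at least one k-NN ball of the support set."""
    radii = knn_radii(support, k)
    d = pairwise_dist(queries, support)
    return float((d <= radii[None, :]).any(axis=1).mean())


def precision_recall(real, fake, k=3):
    """Precision: fake inside real support. Recall: real inside fake support."""
    return coverage(fake, real, k), coverage(real, fake, k)


# Toy mode-collapse example: generated samples cover only the center.
rng = np.random.default_rng(0)
real = rng.standard_normal((400, 2))
fake = 0.1 * rng.standard_normal((400, 2))

precision, recall = precision_recall(real, fake)
print(f"precision={precision:.2f}  recall={recall:.2f}")
```

The collapsed generator scores high precision and low recall, the signature described above for realistic but narrow outputs.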

Systematic failure analysis#

Every generative system fails on some inputs. Documenting failures systematically is a positive scientific contribution — it tells practitioners where the system will not perform and guides future work. Failure modes should be categorized and their rates estimated:

| Attr Binding | Counting | Spatial Rel |
| --- | --- | --- |
| "Red cube on a blue sphere" generates objects with swapped colors, indicating a failure in attribute grounding. | "Three dogs" generates two or four, revealing limitations in the model's numeracy and instance separation. | "Object A left of B" generates the wrong arrangement, failing to model relative coordinates and composition. |

Attribute binding failures: "a red cube on a blue sphere" generates the right objects with wrong color assignments. Report the frequency on standard benchmarks.

Counting failures: "three dogs" generates two or four dogs. Report the rate as a function of object count.

Spatial relationship failures: "object A to the left of object B" generates the wrong spatial arrangement. Report the rate by relationship type.

Style-content entanglement: requesting a photorealistic image in a specific artistic style produces an image that is neither photorealistic nor artistically consistent.

For each failure mode, document: the frequency in standard evaluation, the input conditions that trigger it, and whether it would be detected by standard automatic metrics (FID, CLIP Score).

Deployment engineering: quantization, distillation, and production inference#

Deployment of generative models to production systems requires careful attention to inference latency, memory footprint, and reliability under load. While training-time considerations focus on quality, deployment-time considerations focus on efficiency and robustness.

Quantization#

Quantization reduces model weights from 32-bit floating point (float32) to lower precision representations to reduce memory footprint and increase inference throughput. INT8 quantization maps weights to 8-bit integers using per-channel or per-layer scaling: $w_\text{int8} = \text{round}(w / s)$ where $s = \max(|w|) / 127$ ; dequantize for computation: $w \approx w_\text{int8} \cdot s$ . INT8 quantization reduces memory by 4× and increases throughput by 2–3× on modern hardware with GEMM INT8 acceleration (e.g., TensorRT on NVIDIA GPUs). Quality degradation is typically <1 FID point for diffusion models, making INT8 the practical choice for deployment where memory is constrained.
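The quantize and dequantize formulas above translate directly into a few lines of NumPy. This is a minimal sketch of symmetric per-channel quantization applied to a random weight matrix; it illustrates the arithmetic only and makes no claim about how TensorRT or any other runtime implements it.

```python
import numpy as np


def quantize_int8(w, axis=0):
    """Symmetric per-channel INT8: s = max|w| / 127 for each slice along `axis`."""
    reduce_axes = tuple(i for i in range(w.ndim) if i != axis)
    max_abs = np.abs(w).max(axis=reduce_axes, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0)  # avoid division by zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)


def dequantize(q, scale):
    """Recover an approximation of the weights: w is about q * s."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)  # toy weight matrix

q, scale = quantize_int8(w, axis=0)
w_hat = dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())
print(f"bytes: {w.nbytes} -> {q.nbytes}  max abs error: {max_err:.4f}")
```

Round-to-nearest bounds the per-weight error by half a quantization step, $s/2$, so channels with a single large outlier get a coarse grid for all their other weights; this is the motivation for per-channel scaling over per-layer scaling.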

4-bit Quantization#

4-bit quantization (for example with the GPTQ post-training method, or in the quantized weight types stored in GGUF files for LLMs) reduces each weight to 4 bits, cutting memory by 8× relative to float32 but requiring careful per-channel or per-group scaling to limit accuracy degradation. 4-bit quantization is primarily used for large language models but is increasingly applied to diffusion transformer models where memory becomes a bottleneck (e.g., DiT-XL models on edge devices). Calibration on a representative sample of training data is essential to maintain quality.
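Per-group scaling and the 8× memory figure can be illustrated with plain round-to-nearest quantization. The sketch below is not GPTQ and not the GGUF storage layout: it is a simplified stand-in that assumes a group size of 32, a symmetric range of $[-7, 7]$, and two 4-bit values packed per byte, with the per-group scales stored separately.

```python
import numpy as np


def quantize_int4_groups(w, group_size=32):
    """Symmetric 4-bit round-to-nearest with one scale per group of weights."""
    assert w.size % group_size == 0, "weight count must be divisible by group_size"
    flat = w.reshape(-1, group_size)
    max_abs = np.abs(flat).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0)
    q = np.clip(np.round(flat / scale), -7, 7).astype(np.int8)
    return q, scale.astype(np.float32)


def pack_nibbles(q):
    """Pack pairs of 4-bit values (stored as int8 in [-7, 7]) into single bytes."""
    u = (q + 8).astype(np.uint8)  # shift to the unsigned range 1..15
    return (u[:, 0::2] << 4) | u[:, 1::2]


def unpack_nibbles(packed):
    """Inverse of pack_nibbles."""
    out = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.int8)
    out[:, 0::2] = (packed >> 4).astype(np.int8) - 8
    out[:, 1::2] = (packed & 0x0F).astype(np.int8) - 8
    return out


rng = np.random.default_rng(0)
w = rng.standard_normal((16, 64)).astype(np.float32)  # toy weight matrix

q, scale = quantize_int4_groups(w, group_size=32)
packed = pack_nibbles(q)
w_hat = (unpack_nibbles(packed).astype(np.float32) * scale).reshape(w.shape)
print(f"bytes: {w.nbytes} -> {packed.nbytes} (+ {scale.nbytes} for scales)")
```

The scales add a small overhead on top of the packed weights, which is why the effective compression of group-wise schemes is slightly below the nominal 8×.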

Knowledge Distillation#

Knowledge distillation for diffusion trains a small student diffusion model to match a large teacher. One common form uses a feature-matching loss: $\mathcal{L}_\text{distill} = \mathbb{E}_{t, x_t}[\|\phi_l^\text{student}(x_t) - \phi_l^\text{teacher}(x_t)\|^2]$ where $\phi_l$ are intermediate features at layer $l$ and the expectation is over timesteps $t$ and noised inputs $x_t$. A student (e.g., 100M parameters) can approach the FID of the teacher (1B parameters) on standard benchmarks through distillation, reducing inference cost by roughly 10×. Distillation is especially effective when combined with consistency distillation, where the student is trained to match single-step or multi-step teacher outputs, enabling very fast generation (4–8 steps for high-quality output).
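The feature-matching term is a batch-averaged squared distance between student and teacher features. The sketch below computes the loss value only (no gradients) and adds an optional linear projection, which is an assumption introduced here because a smaller student usually has narrower layers than the teacher; all names and shapes are illustrative.

```python
import numpy as np


def feature_matching_loss(student_feats, teacher_feats, proj=None):
    """Batch mean of ||phi_student - phi_teacher||^2.

    `proj` is an optional (student_dim, teacher_dim) matrix that maps student
    features into the teacher's feature space (hypothetical helper).
    """
    if proj is not None:
        student_feats = student_feats @ proj
    diff = student_feats - teacher_feats
    return float(np.mean(np.sum(diff**2, axis=-1)))


rng = np.random.default_rng(0)
teacher = rng.standard_normal((4, 6))   # batch of 4, teacher width 6
student = rng.standard_normal((4, 3))   # student width 3
proj = rng.standard_normal((3, 6))      # would be learned jointly with the student

loss = feature_matching_loss(student, teacher, proj)
print(f"feature-matching loss = {loss:.3f}")
```

In training, this value would be averaged over sampled timesteps and noised inputs and minimized with respect to the student (and projection) parameters, with the teacher frozen.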

ONNX and TensorRT Export#

ONNX and TensorRT export: for production deployment to diverse hardware, export the trained model to ONNX (Open Neural Network Exchange) for hardware-agnostic representation. Compile with TensorRT for GPU inference acceleration, which applies graph fusion, operator optimization, and precision calibration automatically. A production text-to-image system (Stable Diffusion 1.5 on an A10G GPU) generates 512×512 images in 0.8s with DDIM-25 sampling steps after TensorRT compilation, compared to 3.5s with PyTorch eager mode — a 4.4× speedup from compilation alone.

Dynamic Batching and Streaming#

Dynamic batching and streaming: production serving systems (NVIDIA Triton, vLLM, Replicate, BentoML) batch multiple user requests to maximize GPU utilization and throughput. Batching trades some per-request latency for that throughput, since a request may wait for the batch to fill and a larger batch takes longer per step. For diffusion models, batching across requests is straightforward since each inference is independent and stateless. Monitoring P95 latency and error rates under realistic load is essential before public deployment. A typical target is P95 latency < 5s for a 512×512 image generation on a single GPU, achievable with batch size 4–8 and DDIM-25 steps.
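Checking a latency target comes down to computing percentiles over a load-test log. The sketch below assumes per-request latencies in seconds and a boolean error flag per request; the synthetic log-normal latencies, the 1% error rate, and the `latency_report` helper are illustrative assumptions.

```python
import numpy as np


def latency_report(latencies_s, errors, p95_target_s=5.0):
    """Summarize a load test: median, P95, error rate, and target check."""
    lat = np.asarray(latencies_s, dtype=float)
    p95 = float(np.percentile(lat, 95))
    return {
        "p50": float(np.percentile(lat, 50)),
        "p95": p95,
        "error_rate": float(np.mean(errors)),
        "meets_target": p95 < p95_target_s,
    }


# Synthetic load-test log, for illustration only.
rng = np.random.default_rng(0)
latencies = rng.lognormal(mean=0.5, sigma=0.4, size=1000)
errors = rng.random(1000) < 0.01

report = latency_report(latencies, errors)
print(report)
```

Tail percentiles are only stable with enough requests, so the load test should run long enough that the top 5% contains more than a handful of samples.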

GenAI context#

This course on generative modeling for physical AI is part of a broader four-course curriculum. The table below maps Course 3 topics to foundational concepts from Course 1 (reinforcement learning) and Course 2 (robot learning), and to downstream applications in Course 4 (vision-language models).

| Course 3 topic | Connection to Course 1 (RL) | Connection to Course 2 (Robotics) | Connection to Course 4 (VLMs) |
|---|---|---|---|
| VAE (Week 2) | Variational inference in model-based RL; latent Q-learning | CVAE for robot action generation (Week 8) | Visual tokenization in VLMs; discrete latent bottlenecks |
| GAN (Week 3) | Adversarial IRL / GAIL for policy learning | Sim2Real with discriminator; adversarial domain randomization (Week 7) | GAN-based image-text alignment; early DALL-E variant training |
| EBM (Week 4) | Reward as energy-based model; GAIL training | EBM-based obstacle avoidance and safety constraints | VLM reward models as EBMs; preference optimization |
| Normalizing flows (Week 5) | Flow-based policy distribution (normalizing flow actor) | Flow-matching diffusion policy (Week 9); exact likelihood training | Continuous visual embedding spaces; normalizing flow VLMs |
| DDPM (Week 6) | Diffusion for offline RL trajectory prior; goal-conditioned planning | Diffusion policy for robot control (Week 9); multi-step action generation | DALL-E 2 image generation with CLIP; text-to-image via DDPM |
| Flow matching (Week 7) | Optimal transport trajectory optimization for planning | OT-CFM robot policy (50 Hz control, Week 9); continuous action policies | FLUX.1 text-to-image generation; efficient generative modeling |
| Conditioning (Week 8) | Reward conditioning (goal-conditioned RL); multi-task policies | Goal-conditioned imitation learning (Week 5); contextual policies | LLaVA / BLIP-2 image-conditioned text generation; multi-modal fusion |
| LDM (Week 9) | Model-based RL in latent space; latent world models | Latent policy representations; compact state abstraction | Stable Diffusion for image generation; latent diffusion VLMs |
| Evaluation (Week 10) | RL evaluation: sample complexity, confidence intervals | Robot task success rate with CI; generalization metrics | VLM evaluation: BLEU, VQAScore, human preference |
| Representations (Week 11) | State representations for Q-learning; feature learning | R3M, Voltron, ViT features for robot perception | CLIP, MAE, DINOv2 backbone for VLMs; foundation visual features |
| World models (Week 12) | DreamerV3, TD-MPC, model-based RL in imagination | Sim2Real as world model training; environment modeling | Video prediction with VLMs; temporal reasoning |
| Safety (Week 13) | RLHF, DPO for LLM alignment; safe RL constraints | Safety constraints on robots; collision avoidance, reward validation | VLM safety: InstructBLIP, red-teaming; harmful content filtering |

Synthesis#

Synthesis: Every generative model is a world model. The generative modeling curriculum reveals a fundamental insight for physical AI: the distinction between "generative models" (Course 3) and "robot learning" (Course 2) is architectural, not conceptual. A VAE that compresses robot observations to latent states is a perceptual world model. A diffusion policy that generates action sequences conditioned on observations is a generative inverse dynamics model. A flow matching model trained on robot trajectories is both a motion generator and a trajectory prior for planning. The same mathematical frameworks (ELBO, score matching, optimal transport) operate seamlessly at the intersection of both domains. This unification is what makes foundation models for physical AI (the curriculum's central theme) a coherent research program rather than an assembly of disconnected methods. The next frontier is scaling these unified models to the complexity of real-world physical interaction while maintaining safety, interpretability, and human alignment.

Course conclusion#

The generative modeling curriculum has traced a complete arc from the first principles of variational inference through the full modern stack: VAEs, GANs, energy-based models, normalizing flows, DDPM, flow matching, conditioning, latent diffusion, evaluation, representation learning, world models, safety, and alignment.

Every topic connects to the themes established at the start. The ELBO from Week 1 reappears in the DDPM ELBO, the RSSM training objective, and the VLM alignment objectives. The score function from Week 4 is the denoising target of DDPM and the vector field of flow matching. The KL divergence that regularizes the VAE latent space is the same KL that constrains policy updates in RLHF. Flow matching's optimal transport objective is the Schrödinger bridge of Week 5 in the limit of no diffusion. The representation learning of Week 11 is the foundation for the VLA models of Course 2.

The open problems that remain are the same problems the field has always faced, now at larger scale and higher stakes: long-range coherence in high-dimensional generation, formal guarantees on safety constraints, sample efficiency in low-data regimes, and the alignment of model behavior with human values under distribution shift. Understanding both what has been solved and what remains open defines technical depth in this field.

Conceptual questions#

An ablation study for a latent diffusion model shows: full system → FID 6.2; no latent compression (pixel-space) → FID 8.5; no CFG (guidance $s = 1$ ) → FID 7.1; no CLIP conditioning → FID 14.3; linear noise schedule (vs. cosine) → FID 6.8. Interpret these results: rank the components by importance, identify which result is most surprising relative to prior expectations, and propose two follow-up experiments that would deepen understanding of the most important component.
A paper reports FID = 4.1 on MS-COCO with 30,000 samples and no confidence interval. A follow-up evaluation with 50,000 samples measures FID = 5.3 for the same model. Explain the finite-sample bias of FID: state in which direction it moves the estimate as the sample count shrinks, and whether it can account for this discrepancy. Derive how the bias scales with sample size, and propose a minimum sample size for reporting FID that limits the bias to less than 0.5 FID points.
A generative model achieves CLIP Score = 32 and FID = 8 on standard evaluations but shows a 40% failure rate on spatial relationship prompts ("A is to the left of B"). The CLIP Score for spatial relationship prompts is 27 (lower than average). Design a comprehensive evaluation protocol that would detect this spatial failure mode using automatic metrics, and explain why standard CLIP Score evaluation fails to flag it.
Trace the role of the ELBO through the generative modeling curriculum: show how the VAE ELBO (Week 2), the DDPM ELBO (Week 6), and the RSSM ELBO (Week 12) are all instances of the same general bound $\log p_\theta(x) \geq \mathbb{E}_{q_\phi}[\log p_\theta(x \mid z)] - D_\text{KL}(q_\phi(z \mid x) \| p(z))$ . For each model, identify what plays the role of the observation $x$ , the latent $z$ , the prior $p(z)$ , and the approximate posterior $q_\phi(z \mid x)$ .
Synthesize the full Course 3 curriculum by tracing the design decisions for a language-conditioned video generation system that is safe for public deployment. For each of the following components, identify: (a) the technical choice required, (b) the evaluation metric for that component, and (c) the failure mode if the component is absent: (i) video compression (VAE), (ii) generation model (diffusion vs. flow matching), (iii) text conditioning mechanism, (iv) safety filter, (v) content watermarking.

Solutions

Ranked by FID increase when removed: no CLIP conditioning ( $+8.1$ ) ≫ no latent compression ( $+2.3$ ) > no CFG ( $+0.9$ ) > linear vs. cosine schedule ( $+0.6$ ). So conditioning is by far the most important component. Most surprising: pixel-space diffusion is worse than the latent model at equal compute ( $+2.3$ ) — naively compression should lose information, but at fixed compute it lets the diffusion model spend capacity on semantics rather than pixels. Follow-ups on conditioning: (a) vary text-encoder quality/size (CLIP vs. T5 vs. class-label-only) to isolate the source of grounding; (b) report CLIP-Score / compositional metrics, not just FID, since conditioning mainly affects prompt adherence.
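Finite-sample FID is biased upward. Treating the reference statistics as exact, the mean term satisfies $\mathbb{E}\|\hat\mu_g-\mu_r\|^2=\|\mu_g-\mu_r\|^2+\operatorname{Tr}(\Sigma_g)/N$ , and because the matrix square root is concave, Jensen's inequality makes the estimate of $\operatorname{Tr}((\hat\Sigma_g\Sigma_r)^{1/2})$ too small on average, which inflates FID further. Rank deficiency of $\hat\Sigma_g$ only arises when $N<d=2048$ and is not the issue at $30k$ or $50k$ . The bias scales roughly as $C/N$ with $C>0$ (FID $(N)\approx$ FID $(\infty)+C/N$ ), so sample count alone would make FID $(30k)$ slightly higher than FID $(50k)$ , not lower. The reported gap (4.1 at $30k$ versus 5.3 at $50k$ ) therefore points to protocol differences such as the reference set, the feature extractor implementation, or image preprocessing. To keep the bias below $0.5$ , choose $N\ge 2C$ , with $C$ estimated by measuring FID at several sample sizes and fitting against $1/N$ ; the community standard $N=50{,}000$ is a reasonable default. Always report confidence intervals, since a $30k$ figure without a CI is not comparable.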
Build a structured spatial-relations benchmark (prompts with explicit left/right/above/below between two detectable objects), generate images, and verify each relation programmatically with an object detector (GenEval/YOLO localizes the objects and checks coordinates) or a VQA model (VQAScore/TIFA asks "is A left of B?"); report per-relation accuracy. Plain CLIP Score fails because CLIP embeds a bag-of-concepts — it registers that A, B, and "left" are present but is largely invariant to their arrangement, so a swapped layout still scores high.
The shared bound is $\log p_\theta(x)\ge\mathbb{E}_{q_\phi}[\log p_\theta(x\mid z)]-D_\text{KL}(q_\phi(z\mid x)\|p(z))$ . VAE: $x$ = image, $z$ = continuous latent code, $p(z)=\mathcal{N}(0,I)$ , $q_\phi$ = learned encoder. DDPM: $x=x_0$ , $z=x_{1:T}$ the noised chain, $p(z)=p(x_T)\prod_{t=2}^{T}p_\theta(x_{t-1}\mid x_t)$ the learned reverse chain with $p(x_T)=\mathcal{N}(0,I)$ , $q_\phi=q(x_{1:T}\mid x_0)$ the fixed (non-learned) forward noising process. RSSM: $x=o_{1:T}$ observation sequence, $z=s_{1:T}$ latent states, $p(z)$ = learned prior dynamics $p(s_t\mid s_{t-1},a_{t-1})$ , $q_\phi$ = the filtering posterior $q(s_t\mid s_{t-1},a_{t-1},o_t)$ .
(i) Video compression (VAE): choose a spatiotemporal 3D-VAE with mild KL reg; metric = reconstruction PSNR/LPIPS + temporal consistency; absent → pixel-space video diffusion is computationally infeasible. (ii) Generation model: flow matching (OT-CFM) or DDPM/DiT; metric = FID/FVD and steps-vs-quality; absent → poor sample quality or unusable inference speed. (iii) Text conditioning: cross-attention over T5/CLIP embeddings with CFG; metric = CLIP-Score/VQAScore prompt adherence; absent → no controllability, prompt ignored. (iv) Safety filter: input + output harm/NSFW classifiers tested on a red-team set; metric = harmful-content rate with/without the filter; absent → unsafe generations reach users. (v) Watermarking: an invisible, robust watermark (e.g., Stable Signature / tree-ring); metric = detection accuracy and robustness to compression/crop; absent → no provenance, misuse cannot be attributed or detected.
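The relation FID $(N)\approx$ FID $(\infty)+C/N$ from the second solution can be turned into a small fitting procedure: measure FID at several sample sizes, regress on $1/N$, and read off both the extrapolated value and the sample size needed for a given bias. The numbers below are synthetic and the helper name is illustrative.

```python
import numpy as np


def fit_fid_bias(sample_sizes, fid_values):
    """Fit FID(N) = fid_inf + C / N by linear regression on 1/N."""
    inv_n = 1.0 / np.asarray(sample_sizes, dtype=float)
    slope, intercept = np.polyfit(inv_n, np.asarray(fid_values, dtype=float), deg=1)
    return float(intercept), float(slope)  # (fid_inf, C)


# Synthetic measurements generated from fid_inf = 5.0 and C = 9000.
sizes = np.array([5_000, 10_000, 20_000, 50_000])
fids = 5.0 + 9000.0 / sizes

fid_inf, bias_const = fit_fid_bias(sizes, fids)
min_n = bias_const / 0.5  # smallest N with bias C / N below 0.5
print(f"FID_inf={fid_inf:.2f}  C={bias_const:.0f}  N for bias < 0.5: {min_n:.0f}")
```

With real measurements each FID value carries its own sampling noise, so the fit should use several independent sample sets per size.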
