
Week 3: Generative Adversarial Networks

✦Learning Outcomes
  • Analyze the mode collapse and training instability failure modes arising from JSD geometry
  • Implement Wasserstein GAN with gradient penalty and explain why the Wasserstein distance provides better gradients
  • Compare FID, precision, and recall as evaluation metrics for generative models
◆Prerequisites
  • Week 1: Probabilistic Foundations - KL divergence concepts
  • Week 2: Variational Autoencoders - Understanding divergence minimization

Basic knowledge of neural network classifiers is helpful but not required.

Purpose of this lecture

Generative adversarial networks (GANs; Goodfellow et al., 2014) produce the sharpest image samples of any generative model family but are also the most difficult to train. Understanding GANs deeply requires understanding the divergence geometry that underlies their objective, the mode collapse and training instability failures that follow from that geometry, and the Wasserstein reformulation that partially resolves both. GAN training dynamics also provide the conceptual foundation for understanding adversarial examples, discriminative fine-tuning of diffusion models, and RLHF (reinforcement learning from human feedback) for image generation.


The min-max objective

A GAN consists of a generator $G_\theta: \mathcal{Z} \to \mathcal{X}$ that maps noise $z \sim p(z)$ to samples, and a discriminator (or critic) $D_\phi: \mathcal{X} \to [0, 1]$ that classifies inputs as real (from $p_\text{data}$) or generated (from $p_G$). The training objective is:

$$\min_\theta \max_\phi \; V(\theta, \phi) = \mathbb{E}_{x \sim p_\text{data}}\!\left[\log D_\phi(x)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log(1 - D_\phi(G_\theta(z)))\right]$$

The discriminator maximizes $V$: it wants to assign high $D(x)$ to real data and low $D(G(z))$ to generated samples. The generator minimizes $V$: it wants $D(G(z))$ to be high, fooling the discriminator.

The optimal discriminator: for fixed $G$, the discriminator that maximizes $V$ is:

$$D^*(x) = \frac{p_\text{data}(x)}{p_\text{data}(x) + p_G(x)}$$

This is a monotone function of the density ratio $p_\text{data}(x)/p_G(x)$. The derivation: writing $V$ as an integral over $x$, the integrand at each $x$ is $p_\text{data}(x) \log D + p_G(x) \log(1-D)$. Setting the derivative with respect to $D$ to zero gives $p_\text{data}(x)/D - p_G(x)/(1-D) = 0$, that is, $p_\text{data}(x)(1-D) = p_G(x)\,D$, which yields the expression above.
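As a quick sanity check on the derivation, the pointwise maximizer can be found by brute force and compared with the closed form. This is a minimal sketch: the two density values are made-up numbers for a single point $x$, and the function name is hypothetical.

```python
import numpy as np

def pointwise_objective(d, p_data, p_g):
    """Integrand of V at a single x: p_data * log(d) + p_g * log(1 - d)."""
    return p_data * np.log(d) + p_g * np.log(1.0 - d)

# Hypothetical density values at one point x (illustrative numbers only).
p_data, p_g = 0.3, 0.1

# Brute-force search over candidate discriminator outputs in (0, 1).
grid = np.linspace(1e-4, 1.0 - 1e-4, 100_001)
d_numeric = grid[np.argmax(pointwise_objective(grid, p_data, p_g))]

# Closed form from the derivation: D* = p_data / (p_data + p_G).
d_closed = p_data / (p_data + p_g)

print(f"grid search: {d_numeric:.4f}, closed form: {d_closed:.4f}")
```

The two values should agree up to the grid resolution, which is all the check is meant to show.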

Substituting $D^*$ into $V$ recovers the Jensen-Shannon divergence:

$$V(\theta, D^*) = \mathbb{E}_{p_\text{data}}\!\left[\log \frac{p_\text{data}}{p_\text{data} + p_G}\right] + \mathbb{E}_{p_G}\!\left[\log \frac{p_G}{p_\text{data} + p_G}\right]$$

Add and subtract $\log 2$ inside each expectation:

$$= \mathbb{E}_{p_\text{data}}\!\left[\log \frac{2\, p_\text{data}}{p_\text{data} + p_G}\right] + \mathbb{E}_{p_G}\!\left[\log \frac{2\, p_G}{p_\text{data} + p_G}\right] - \log 4$$

Each ratio now compares a density with the mixture distribution $m = (p_\text{data} + p_G)/2$, so both expectations are KL terms:

$$= D_\text{KL}(p_\text{data} \| m) + D_\text{KL}(p_G \| m) - \log 4 = 2\, D_\text{JS}(p_\text{data} \| p_G) - \log 4$$

where the Jensen-Shannon divergence is $D_\text{JS}(p \| q) = \frac{1}{2}D_\text{KL}(p \| m) + \frac{1}{2}D_\text{KL}(q \| m)$ with $m = (p+q)/2$. At the global optimum $p_G = p_\text{data}$, the JSD equals zero and $V = -\log 4$. GAN training thus minimizes the JSD between the data and generator distributions, but only when the discriminator is trained to optimality at each generator update step.
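The identity $V(\theta, D^*) = 2\,D_\text{JS} - \log 4$ can be checked numerically on a finite support. The sketch below assumes two randomly drawn discrete distributions standing in for $p_\text{data}$ and $p_G$; the helper names are hypothetical.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p == 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence with mixture m = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_d(p_data, p_g):
    """V(theta, D*) with D* = p_data / (p_data + p_g), summed over the support."""
    d_star = p_data / (p_data + p_g)
    return float(np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star)))

rng = np.random.default_rng(0)
p_data = rng.dirichlet(np.ones(6))  # stand-in for the data distribution
p_g = rng.dirichlet(np.ones(6))     # stand-in for the generator distribution

lhs = value_at_optimal_d(p_data, p_g)
rhs = 2.0 * jsd(p_data, p_g) - np.log(4.0)
print(lhs, rhs)
```

Setting `p_g = p_data` in the same functions gives $V = -\log 4$, matching the global optimum stated above.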


Training instability and the JSD failure mode

The JSD interpretation reveals a fundamental difficulty. When $p_\text{data}$ and $p_G$ have disjoint support (which is likely in high dimensions, where both distributions concentrate on low-dimensional manifolds), the JSD equals its maximum value $\log 2$ regardless of the distance between the manifolds. The gradient of JSD with respect to the generator parameters is therefore zero almost everywhere: the discriminator can perfectly classify real from generated samples, and the generator receives no useful training signal.

In practice, early GAN training alternates between discriminator and generator updates without waiting for the discriminator to converge. This keeps the discriminator imperfect, providing non-zero gradients. But the balance is delicate: an overly strong discriminator saturates its output to 0 for all generated samples, zeroing gradients; an overly weak discriminator provides misleading signal. This tension is the root cause of GAN training instability.

Mode collapse is the failure mode where the generator produces only a subset of the data distribution's modes. If $G$ learns to produce samples that fool the current discriminator, and the discriminator then updates to reject them, the generator may shift to a different mode, cycling through modes without converging. The exact JSD is zero only when $p_G = p_\text{data}$, so in principle it does penalize missing modes, but the generator never sees the exact JSD. It sees only the gradient of the current discriminator, which measures how distinguishable individual generated samples are from $p_\text{data}$ and carries no explicit signal about how many modes of $p_\text{data}$ are covered.


Wasserstein GANs

Wasserstein GANs (WGAN; Arjovsky et al., 2017) replace the JSD with the Wasserstein-1 (Earth Mover's) distance, which has gradients even when the distributions have disjoint support:

$$W_1(p_\text{data}, p_G) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim p_\text{data}}[f(x)] - \mathbb{E}_{x \sim p_G}[f(x)]$$

where the supremum is over 1-Lipschitz functions $f$. The Wasserstein distance measures the minimum "work" required to transform $p_G$ into $p_\text{data}$, treating the distribution as a mass that must be transported. Unlike JSD, $W_1(p_\text{data}, p_G)$ is continuous and differentiable in $p_G$ even when the supports are disjoint, providing meaningful gradients throughout training.

Geometric intuition: why EMD provides gradients while JSD does not. Consider two 1D point masses: $p_\text{data} = \delta(x - 0)$ (mass at 0) and $p_G = \delta(x - \delta)$ (mass at offset $\delta$; the outer $\delta(\cdot)$ is the Dirac delta, the inner $\delta$ is a scalar offset). The JSD between any two non-overlapping distributions equals $\log 2$ regardless of $\delta$, so the gradient $\nabla_\delta D_\text{JS} = 0$ for all $\delta \neq 0$. The generator gets no signal about which direction to move. The Wasserstein-1 distance between these same distributions is $W_1 = |\delta|$, with gradient $\nabla_\delta W_1 = \text{sign}(\delta)$: a constant-magnitude signal whose descent direction points toward $\delta = 0$ everywhere. For continuous distributions with partially overlapping supports, the contrast is less extreme but the principle holds: the EMD counts "how far" the mass must move, while the JSD counts only "whether" the distributions overlap. In high dimensions, where data and generator distributions concentrate on disjoint low-dimensional manifolds, the EMD advantage is essentially universal throughout training.
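The point-mass example can be reproduced on a discretized line. The sketch below assumes a uniform grid and represents each point mass as a one-hot histogram; `point_mass` and `w1_1d` are hypothetical helper names, and the 1D Wasserstein distance is computed as the integral of the absolute CDF difference.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1_1d(p, q, grid):
    """W1 on a uniform 1D grid: integral of |CDF_p - CDF_q|."""
    dx = grid[1] - grid[0]
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * dx)

grid = np.linspace(-5.0, 5.0, 1001)  # spacing 0.01

def point_mass(loc):
    """One-hot histogram with all mass on the grid cell nearest to loc."""
    p = np.zeros_like(grid)
    p[np.argmin(np.abs(grid - loc))] = 1.0
    return p

p_data = point_mass(0.0)
offsets = [0.5, 1.0, 2.0, 4.0]
jsd_values = [jsd(p_data, point_mass(d)) for d in offsets]
w1_values = [w1_1d(p_data, point_mass(d), grid) for d in offsets]

for d, j, w in zip(offsets, jsd_values, w1_values):
    print(f"offset={d:.1f}  JSD={j:.4f}  W1={w:.4f}")
```

The JSD column stays at $\log 2 \approx 0.6931$ for every offset, while the $W_1$ column tracks the offset itself, which is the contrast the paragraph describes.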

The WGAN critic (no longer a binary classifier) approximates the 1-Lipschitz function $f$ that maximizes the expected difference. The 1-Lipschitz constraint is enforced either by weight clipping (WGAN original: clip all critic weights to $[-c, c]$) or gradient penalty (WGAN-GP: add $\lambda \cdot (\|\nabla_x f(x)\|_2 - 1)^2$ evaluated at interpolations between real and generated samples).

WGAN-GP is the standard WGAN variant:

$$\mathcal{L}_\text{WGAN-GP} = \underbrace{\mathbb{E}_{p_G}[f_\phi(x)] - \mathbb{E}_{p_\text{data}}[f_\phi(x)]}_{\text{Wasserstein estimate (critic)}} + \lambda\, \mathbb{E}_{\hat{x}}\!\left[(\|\nabla_{\hat{x}} f_\phi(\hat{x})\|_2 - 1)^2\right]$$

where $\hat{x} = \alpha x_\text{real} + (1-\alpha) x_\text{fake}$ for $\alpha \sim \text{Uniform}(0,1)$. This produces more stable training and better mode coverage than standard GAN training, at the cost of reduced peak sample quality.
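The critic loss can be sketched in NumPy for a tiny critic whose input gradient is available in closed form. Everything here is an illustrative assumption: the one-hidden-layer tanh critic, its random weights, the toy data, and the penalty weight `lam = 10.0` (a commonly used value that this lecture does not specify). In an autodiff framework the input gradient would come from the framework rather than from a hand-written formula.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden = 4, 16

# Hypothetical toy critic f(x) = v . tanh(W x + b), with random weights.
W = rng.normal(scale=0.5, size=(hidden, dim))
b = rng.normal(scale=0.1, size=hidden)
v = rng.normal(scale=0.5, size=hidden)

def critic(x):
    """Critic values for a batch x of shape (batch, dim); returns shape (batch,)."""
    return np.tanh(x @ W.T + b) @ v

def critic_input_grad(x):
    """Analytic gradient of the critic with respect to x; shape (batch, dim)."""
    h = np.tanh(x @ W.T + b)           # (batch, hidden)
    return ((1.0 - h ** 2) * v) @ W    # chain rule through tanh

def wgan_gp_critic_loss(x_real, x_fake, lam=10.0):
    """Critic loss: E_G[f] - E_data[f] + lam * E[(||grad f(x_hat)|| - 1)^2]."""
    alpha = rng.uniform(size=(x_real.shape[0], 1))
    x_hat = alpha * x_real + (1.0 - alpha) * x_fake   # random interpolates
    grad_norm = np.linalg.norm(critic_input_grad(x_hat), axis=1)
    penalty = lam * np.mean((grad_norm - 1.0) ** 2)
    wasserstein_term = critic(x_fake).mean() - critic(x_real).mean()
    return wasserstein_term + penalty, penalty

x_real = rng.normal(loc=1.0, size=(64, dim))    # stand-in for real samples
x_fake = rng.normal(loc=-1.0, size=(64, dim))   # stand-in for generated samples
loss, penalty = wgan_gp_critic_loss(x_real, x_fake)
print(f"critic loss = {loss:.4f}, gradient penalty = {penalty:.4f}")
```

The penalty is two-sided: it pushes the gradient norm at the interpolates toward 1 from both directions, rather than only penalizing norms above 1.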


GAN training dynamics in practice

Practical GAN training requires balancing the discriminator and generator update rates carefully. Several heuristics have become standard.

Number of critic steps per generator step: in WGAN, the critic should be near-optimal at each generator step, which is usually approximated by training it for $n_\text{critic} = 5$ steps per generator step. Standard GAN training uses $n_\text{critic} = 1$. With only one critic step, the discriminator may be too weak, providing misleading gradients; with too many, a standard (sigmoid-output) discriminator may saturate (outputting near-0/1 everywhere), also killing gradients.
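The update schedule can be illustrated on a deliberately tiny problem. The sketch below is a toy under stated assumptions, not a realistic training loop: the data are 1D Gaussian with mean 3, the generator is a pure shift $G(z) = z + \theta$, the critic is linear, $f(x) = wx$, and the Lipschitz constraint is enforced by clipping $w$ (the original WGAN scheme). All learning rates and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all values are illustrative assumptions, not from the lecture):
# data ~ N(3, 1), generator G(z) = z + theta, linear critic f(x) = w * x.
true_mean = 3.0
theta, w = 0.0, 0.0
n_critic = 5                   # critic steps per generator step
lr_critic, lr_gen = 0.1, 0.01
clip = 1.0                     # |w| <= 1 keeps the linear critic 1-Lipschitz
batch = 256

history = []
for step in range(2000):
    for _ in range(n_critic):
        x_real = rng.normal(true_mean, 1.0, size=batch)
        x_fake = rng.normal(0.0, 1.0, size=batch) + theta
        # Critic ascends E_data[f] - E_G[f] = w * (mean_real - mean_fake).
        w += lr_critic * (x_real.mean() - x_fake.mean())
        w = float(np.clip(w, -clip, clip))
    # Generator descends -E[f(G(z))] = -w * (E[z] + theta); d/dtheta = -w.
    theta += lr_gen * w
    history.append(theta)

final_theta = float(np.mean(history[-200:]))
print(f"average theta over the last 200 steps: {final_theta:.3f}")
```

In this toy problem $\theta$ drifts toward the data mean and then hovers around it instead of settling exactly, a small-scale picture of the oscillation that alternating updates on a saddle-point objective tend to produce.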

Learning rates and optimizer choices: GANs are notoriously sensitive to learning rate. The standard modern recommendation is Adam with $\beta_1 = 0.0$ (no first-moment momentum) and $\beta_2 = 0.9$, with separate learning rates for generator ($\text{lr}_G = 10^{-4}$) and discriminator ($\text{lr}_D = 4 \times 10^{-4}$). The asymmetric learning rates reflect that the discriminator task is often easier than the generator task. Using $\beta_1 = 0$ rather than the default $\beta_1 = 0.9$ reduces gradient oscillations that arise from the adversarial dynamics.

Minibatch discrimination: a collapse countermeasure that has become a standard network component. The discriminator receives information about the batch statistics (diversity across the mini-batch) in addition to individual samples, allowing it to detect when the generator is producing identical samples. Without this, a generator can collapse to a single point that achieves low loss because a discriminator that scores each sample in isolation has no way to penalize reduced diversity.
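One simple way to hand batch statistics to the discriminator is a minibatch standard-deviation feature. Note that this is a swapped-in, simplified variant (the form popularized by Progressive GAN), not the original learned minibatch discrimination layer; the function name and toy feature shapes are assumptions for illustration.

```python
import numpy as np

def append_minibatch_std(features):
    """Append one batch-diversity scalar to every row of a (batch, dim) array."""
    std_per_dim = features.std(axis=0)     # spread of each feature across the batch
    diversity = std_per_dim.mean()         # single scalar summary
    extra = np.full((features.shape[0], 1), diversity)
    return np.concatenate([features, extra], axis=1)

rng = np.random.default_rng(0)
diverse = rng.normal(size=(32, 8))                       # varied samples
collapsed = np.tile(rng.normal(size=(1, 8)), (32, 1))    # 32 copies of one sample

diverse_out = append_minibatch_std(diverse)
collapsed_out = append_minibatch_std(collapsed)

print("diversity feature, diverse batch:  ", diverse_out[0, -1])
print("diversity feature, collapsed batch:", collapsed_out[0, -1])
```

Because the extra feature is near zero for a collapsed batch, later discriminator layers can learn to treat a low value as evidence that the batch is generated.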

Two-timescale update rule (TTUR; Heusel et al., 2017): provides a theoretical foundation for separate learning rates by showing that the GAN game converges to a local Nash equilibrium when the discriminator uses a larger learning rate than the generator (converging faster to its local optimal response), analogous to the two-timescale asymptotic theory in actor-critic RL (reinforcement learning). This connection to the dual timescale convergence analysis from actor-critic methods (Course 1) is not coincidental: both are saddle-point optimization problems where stability requires one player to adapt faster than the other.


Spectral normalization

Spectral normalization (Miyato et al., 2018) enforces the Lipschitz constraint on the discriminator by dividing each weight matrix $W$ by its spectral norm (largest singular value $\sigma_\text{max}(W)$):

$$\bar{W} = W / \sigma_\text{max}(W)$$

Each normalized layer then has spectral norm 1, and with 1-Lipschitz activations (such as ReLU) the product of the per-layer spectral norms bounds the Lipschitz constant of the full network by 1. Spectral normalization is computationally cheaper than gradient penalty and has become a widely used stabilization method in GAN architectures such as SAGAN and BigGAN.
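The spectral norm is usually estimated by power iteration rather than a full SVD. The sketch below is a standalone NumPy illustration under assumed settings: the function name and the random test matrix are hypothetical, and it runs many iterations for accuracy, whereas training implementations commonly run a single iteration per step and carry the vector `u` over between steps.

```python
import numpy as np

def spectral_normalize(W, n_iters=500, seed=0):
    """Return (W / sigma, sigma), with sigma estimated by power iteration."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=W.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iters):
        v = W.T @ u                 # approximate right singular vector
        v /= np.linalg.norm(v)
        u = W @ v                   # approximate left singular vector
        u /= np.linalg.norm(u)
    sigma = float(u @ W @ v)        # estimate of the largest singular value
    return W / sigma, sigma

rng = np.random.default_rng(1)
W = rng.normal(size=(64, 32))       # stand-in for one layer's weight matrix
W_bar, sigma_est = spectral_normalize(W)

# Reference value from a full SVD, for comparison only.
sigma_exact = float(np.linalg.svd(W, compute_uv=False)[0])
print(f"power iteration: {sigma_est:.4f}, SVD: {sigma_exact:.4f}")
```

After normalization the largest singular value of `W_bar` is approximately 1, so the layer, viewed as a linear map, is approximately 1-Lipschitz.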


Conditional GANs

Conditional GANs (cGANs) augment both generator and discriminator with a conditioning signal $y$ (class label, text description, or image):

$$\min_\theta \max_\phi \; V(\theta, \phi) = \mathbb{E}_{(x, y) \sim p_\text{data}}\!\left[\log D_\phi(x, y)\right] + \mathbb{E}_{z, y}\!\left[\log(1 - D_\phi(G_\theta(z, y), y))\right]$$

The generator learns to produce samples that match conditioning $y$; the discriminator learns to assess whether $x$ is a plausible sample for the given $y$, not merely whether $x$ looks real. Conditioning $y$ can be injected through concatenation, conditional batch normalization, or cross-attention in the generator and discriminator.

Projection discriminator (Miyato and Koyama, 2018) injects label information through an inner product, $D(x, y) = \phi(x)^T v_y + f(\phi(x))$, where $\phi(x)$ is a learned feature vector (here $\phi$ denotes the feature extractor, not the discriminator parameters of the previous equations), $v_y$ is a learned class embedding, and $f$ is a learned scalar-valued function. This separates the class-conditional and class-unconditional components of the discriminator.
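The projection form can be sketched with a toy feature extractor. All parameters and shapes below are hypothetical: a single ReLU layer stands in for the feature extractor, a linear head stands in for the unconditional term, and the output is treated as a logit (a sigmoid would map it into $[0, 1]$).

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, feat_dim, n_classes = 10, 16, 3

# Hypothetical parameters, randomly initialized for illustration.
W_feat = rng.normal(scale=0.3, size=(feat_dim, in_dim))   # feature extractor
V = rng.normal(scale=0.3, size=(n_classes, feat_dim))     # class embeddings v_y
w_out = rng.normal(scale=0.3, size=feat_dim)              # unconditional head

def features(x):
    """One ReLU layer standing in for the feature extractor."""
    return np.maximum(x @ W_feat.T, 0.0)

def projection_logit(x, y):
    """Logit = <features(x), v_y> + unconditional_head(features(x))."""
    h = features(x)                            # (batch, feat_dim)
    conditional = np.sum(h * V[y], axis=1)     # inner product with the label embedding
    unconditional = h @ w_out                  # label-independent realism score
    return conditional + unconditional

x = rng.normal(size=(5, in_dim))
y = np.array([0, 1, 2, 1, 0])
logits = projection_logit(x, y)
print(logits)
```

Changing only the labels changes only the inner-product term, which is the separation between class-conditional and class-unconditional components that the text describes.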


Progressive growing and StyleGAN

Progressive GAN (Karras et al., 2018) trains GANs at increasing resolutions, starting at $4 \times 4$ and progressively adding layers for higher resolution. This stabilizes training because low-resolution stages are easy and provide well-behaved gradients.

StyleGAN (Karras et al., 2019) introduces an explicit style latent $w = f(z)$ (a learned mapping of the noise) injected into each layer through adaptive instance normalization (AdaIN), enabling disentangled control over coarse (pose, shape) and fine (texture, color) attributes. StyleGAN produces the sharpest samples of any GAN and provides the most interpretable latent space: directions in $w$-space correspond to semantic edits.
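The AdaIN operation itself is short enough to sketch directly. The shapes, the random affine maps from $w$ to per-channel scale and shift, and the function name are assumptions for illustration only; this is not the StyleGAN implementation.

```python
import numpy as np

def adain(x, gamma, beta, eps=1e-5):
    """Adaptive instance normalization.

    x: feature maps of shape (batch, channels, height, width).
    gamma, beta: per-sample, per-channel scale and shift, shape (batch, channels).
    """
    mean = x.mean(axis=(2, 3), keepdims=True)
    std = x.std(axis=(2, 3), keepdims=True)
    x_norm = (x - mean) / (std + eps)          # removes the incoming channel statistics
    return gamma[:, :, None, None] * x_norm + beta[:, :, None, None]

rng = np.random.default_rng(0)
batch, channels, latent = 2, 4, 8
x = rng.normal(loc=3.0, scale=2.0, size=(batch, channels, 16, 16))
w = rng.normal(size=(batch, latent))           # style latents

# Hypothetical learned affine maps from w to (gamma, beta).
A_gamma = rng.normal(scale=0.1, size=(latent, channels))
A_beta = rng.normal(scale=0.1, size=(latent, channels))
gamma = 1.0 + w @ A_gamma
beta = w @ A_beta

out = adain(x, gamma, beta)
print(out.mean(axis=(2, 3))[0], beta[0])
```

After the operation, each channel's mean and standard deviation are set by `beta` and `gamma` rather than by the incoming features, which is why the style has to be re-injected at every layer to control every scale.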


Evaluating GANs: Fréchet Inception Distance

Unlike likelihood-based models, GANs cannot be evaluated by test-set log-likelihood. The standard evaluation metric is Fréchet Inception Distance (FID; Heusel et al., 2017), which measures the distance between the feature distributions of real and generated images using statistics extracted from a pretrained InceptionV3 network.

Let $\mu_r, \Sigma_r$ be the mean and covariance of InceptionV3 features over real images, and $\mu_g, \Sigma_g$ over generated images. The FID is:

$$\text{FID} = \|\mu_r - \mu_g\|_2^2 + \text{tr}\!\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$$

Lower FID is better: zero would mean the feature distributions are identical. FID is sensitive to both sample quality (incorrect feature statistics) and mode coverage (missing modes shift the mean). It does not decompose these contributions, which is why Precision and Recall metrics (Kynkäänniemi et al., 2019) are increasingly used alongside FID: Precision measures the fraction of generated samples that fall within the real distribution's support (sample quality), while Recall measures the fraction of real data covered by the generated distribution (mode coverage). A model with high Precision but low Recall is mode-dropping; a model with high Recall but low Precision is generating blurry or unrealistic samples.
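The Fréchet distance formula can be sketched on arbitrary feature arrays. In the example below, random Gaussian vectors stand in for InceptionV3 features (an assumption made purely so the snippet is self-contained), and the trace of the matrix square root is computed from the eigenvalues of $\Sigma_r \Sigma_g$, which are real and non-negative for positive semidefinite covariances.

```python
import numpy as np

def frechet_distance(feats_r, feats_g):
    """Frechet distance between Gaussians fitted to two (n, dim) feature arrays."""
    mu_r, mu_g = feats_r.mean(axis=0), feats_g.mean(axis=0)
    sigma_r = np.cov(feats_r, rowvar=False)
    sigma_g = np.cov(feats_g, rowvar=False)
    # tr((Sigma_r Sigma_g)^{1/2}) equals the sum of square roots of the
    # eigenvalues of Sigma_r Sigma_g; clip tiny negative values from round-off.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(5000, 8))    # stand-in for real-image features
shift = np.full(8, 0.5)              # pure mean shift, covariance unchanged

fid_same = frechet_distance(real, real)
fid_shifted = frechet_distance(real, real + shift)
print(f"identical sets: {fid_same:.6f}, mean-shifted set: {fid_shifted:.6f}")
```

With identical feature sets the distance is zero up to round-off, and a pure mean shift contributes exactly the squared norm of the shift ($8 \times 0.5^2 = 2$ here), showing how the mean term and the covariance term enter separately.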


GenAI context: GANs beyond image generation

The GAN framework's influence extends well beyond image generation. RLHF (reinforcement learning from human feedback) for language models is structurally analogous: a reward model trained on human preferences acts as the discriminator, and policy gradient updates act as the generator update. The policy is trained to produce outputs that the reward model rates highly, just as the generator is trained to produce samples that the discriminator cannot distinguish from real data. The divergence minimization connection holds too: RLHF with KL penalty minimizes reverse KL between the policy and a reference, which parallels the mode-seeking behavior of GAN training.

The adversarial game also appears in diffusion model distillation: ADD (Sauer et al., 2023) uses a discriminator to provide gradients that sharpen the samples of a single-step distilled diffusion model, recovering GAN-level sharpness while retaining the stability of diffusion-based training. This hybrid approach, which uses the discriminator's gradient signal without using the JSD objective end-to-end, is among the leading approaches to fast, high-quality image generation. In robotic imitation learning, generative adversarial imitation learning (GAIL; Ho and Ermon, 2016) applies the GAN framework directly to trajectory matching: the discriminator distinguishes expert from policy trajectories, and the policy is trained with the discriminator's signal as a reward, eliminating the need for manual reward design.


Key takeaways

The GAN objective minimizes the Jensen-Shannon divergence between the data and generator distributions when the discriminator is trained to optimality. The JSD saturates at its maximum value $\log 2$ when the distributions have disjoint support, causing vanishing generator gradients and contributing to training instability and mode collapse. Wasserstein GANs replace JSD with the earth mover's distance, which provides meaningful gradients for non-overlapping distributions and improves mode coverage. The Lipschitz constraint required by the Wasserstein formulation is enforced by gradient penalty (WGAN-GP) or spectral normalization. Conditional GANs inject class or text conditioning into both generator and discriminator. StyleGAN's $w$-space mapping disentangles style attributes and enables semantic image editing.


Conceptual questions

  1. Derive that the GAN value function $V(\theta, D^*)$ with the optimal discriminator equals $2 D_\text{JS}(p_\text{data} \| p_G) - \log 4$. Then show that when $p_\text{data}$ and $p_G$ have disjoint support, $D_\text{JS} = \log 2$ regardless of the geometric distance between the distributions. Why does this imply that the generator's gradient is zero in this regime?

  2. The Wasserstein-1 distance between two Gaussians $\mathcal{N}(\mu_1, \sigma^2 I)$ and $\mathcal{N}(\mu_2, \sigma^2 I)$ in $\mathbb{R}^d$ equals $\|\mu_1 - \mu_2\|_2$. Show this using the optimal transport interpretation. How does this compare to the KL divergence between the same two Gaussians, and what does this imply about which distance provides better gradients early in GAN training?

  3. A WGAN-GP critic is trained on images from two modes of a distribution: one mode at high contrast and one at low contrast. After training, the generator produces only high-contrast images (mode collapse). Analyze whether WGAN-GP should theoretically prevent this behavior, and if it does not, identify the failure mechanism. What modification to the training procedure would encourage the generator to cover both modes?

  4. Spectral normalization divides each weight matrix by its spectral norm to enforce Lipschitz continuity. If a discriminator has $L$ layers each with spectral norm exactly 1 after normalization, show that the Lipschitz constant of the full network is bounded by $1^L = 1$. Now suppose during training, one layer's weight matrix develops a spectral norm of 1.5 before renormalization at the next gradient step. What is the actual Lipschitz constant of the network during this interval, and does it satisfy the WGAN requirement?

  5. StyleGAN injects style codes $w$ into each layer via adaptive instance normalization: the feature map $x$ at each layer is normalized and then rescaled by $(\gamma_w, \beta_w)$ derived from $w$. Explain how this allows the same generator architecture to produce images spanning many styles without catastrophic interference between styles. What failure mode would you expect if $w$ were injected only at the first layer rather than at every layer?

✦Solutions
  1. With disjoint supports, at every $x$ only one of $p_\text{data}, p_G$ is nonzero, so $m=(p_\text{data}+p_G)/2$ equals $p_\text{data}/2$ on the data manifold and $p_G/2$ on the generator manifold. Then $D_\text{KL}(p_\text{data}\|m)=\log 2$ and $D_\text{KL}(p_G\|m)=\log 2$, giving $D_\text{JS}=\log 2$ for any separation. Because the value is constant in the geometric distance, $\nabla_\theta V = 0$: the generator receives no directional signal.
  2. For equal-covariance Gaussians the optimal transport plan is a rigid translation, so $W_1=\|\mu_1-\mu_2\|_2$ (the gradient is a constant unit direction). The KL is $\|\mu_1-\mu_2\|^2/(2\sigma^2)$, which blows up as $\sigma\to 0$ and whose gradient is scaled by $1/\sigma^2$, so it explodes or vanishes depending on overlap. $W_1$ gives a stable, well-scaled gradient even with little or no overlap, which is exactly the early-training regime.
  3. WGAN-GP only mitigates collapse: $W_1$ is estimated by a finite-capacity critic trained for finite $n_\text{critic}$ steps, so the estimate is imperfect and may not "see" a missing mode if both modes map to similar critic values. The mechanism is that perfecting one mode can lower the critic loss faster than covering both. Fixes: more critic steps, minibatch discrimination or unrolled GAN, instance noise, or multiple generators.
  4. The Lipschitz constant of a composition is bounded by the product of layer Lipschitz constants: $1^L=1$. If one layer reaches spectral norm $1.5$ before the next renormalization, the network's Lipschitz bound is temporarily $1.5$, violating the 1-Lipschitz requirement: the critic is briefly too steep and the $W_1$ estimate is biased upward until renormalization restores it.
  5. AdaIN re-applies style at every scale: each layer normalizes away the previous style and rescales by $(\gamma_w,\beta_w)$, while the (style-independent) conv weights carry content, so styles compose across scales without interference. Injecting $w$ only at layer 1 lets later layers' normalizations wash the style out, costing fine-scale control: you get coarse-style-only generation and weak coarse/fine disentanglement.
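
Solution 2 can be checked numerically in one dimension, where the empirical $W_1$ between two equal-size samples is the mean absolute difference of their sorted values. The sketch below assumes illustrative means, sample size, and a few values of $\sigma$; the helper names are hypothetical.

```python
import numpy as np

def w1_empirical_1d(a, b):
    """Empirical W1 between two equal-size 1D samples via sorted matching."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

def kl_equal_var_gaussians(mu1, mu2, sigma):
    """Closed-form KL between N(mu1, sigma^2) and N(mu2, sigma^2)."""
    return (mu1 - mu2) ** 2 / (2.0 * sigma ** 2)

rng = np.random.default_rng(0)
mu1, mu2, n = 0.0, 2.0, 200_000
results = {}
for sigma in (1.0, 0.1, 0.01):
    a = rng.normal(mu1, sigma, size=n)
    b = rng.normal(mu2, sigma, size=n)
    results[sigma] = (w1_empirical_1d(a, b), kl_equal_var_gaussians(mu1, mu2, sigma))

for sigma, (w1, kl) in results.items():
    print(f"sigma={sigma:<5} W1~{w1:.3f}  KL={kl:.1f}")
```

As $\sigma$ shrinks, the $W_1$ estimate stays near $|\mu_1 - \mu_2| = 2$ while the KL grows like $1/\sigma^2$, which is the scaling contrast the solution describes.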

Looking ahead

GANs produce sharp samples but suffer from training instability and mode collapse. The next two model families take a completely different approach: defining distributions through explicit energy functions or score functions rather than implicit generators.

Week 4: Energy-Based Models and Score Matching. We examine how to define $p_\theta(x) \propto e^{-E_\theta(x)}$, why the partition function makes direct MLE intractable, and how score matching and denoising score matching enable learning without computing normalizing constants.


Further reading

  • Goodfellow, I. J., et al. (2014). Generative Adversarial Nets. NeurIPS. (The seminal GAN paper).
  • Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. ICML. (Wasserstein reformulation addressing vanishing gradients and training instability).
  • Karras, T., Laine, S., & Aila, T. (2019). A Style-Based Generator Architecture for Generative Adversarial Networks (StyleGAN). CVPR.
  • Gulrajani, I., et al. (2017). Improved Training of Wasserstein GANs (WGAN-GP). NeurIPS.
  • Heusel, M., et al. (2017). GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (FID, TTUR). NeurIPS.