Week 3: Generative Adversarial Networks

Purpose of this lecture#

Generative adversarial networks (GANs; Goodfellow et al., 2014) are known for producing very sharp image samples, but they are also among the most difficult generative models to train. Understanding GANs deeply requires understanding the divergence geometry that underlies their objective, the mode collapse and training instability that follow from that geometry, and the Wasserstein reformulation that partially resolves both. GAN training dynamics also provide the conceptual foundation for understanding adversarial examples, discriminative fine-tuning of diffusion models, and RLHF for image generation.

The min-max objective#

A GAN consists of a generator $G_\theta: \mathcal{Z} \to \mathcal{X}$ that maps noise $z \sim p(z)$ to samples, and a discriminator (or critic) $D_\phi: \mathcal{X} \to [0, 1]$ that classifies inputs as real (from $p_\text{data}$ ) or generated (from $p_G$ ). The training objective is:

\min_\theta \max_\phi \; V(\theta, \phi) = \mathbb{E}_{x \sim p_\text{data}}\!\left[\log D_\phi(x)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log(1 - D_\phi(G_\theta(z)))\right]

The discriminator maximizes $V$ — it wants to assign high $D(x)$ to real data and low $D(G(z))$ to generated samples. The generator minimizes $V$ — it wants $D(G(z))$ to be high, fooling the discriminator.

The optimal discriminator: for fixed $G$ , the discriminator that maximizes $V$ is:

D^*(x) = \frac{p_\text{data}(x)}{p_\text{data}(x) + p_G(x)}

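This is a monotone function of the density ratio $r(x) = p_\text{data}(x)/p_G(x)$, namely $D^* = r/(1+r)$. The derivation: writing $V$ as an integral over $x$, the integrand at each $x$ is $p_\text{data}(x) \log D + p_G(x) \log(1-D)$. Setting the derivative with respect to $D$ to zero gives $p_\text{data}(x)/D - p_G(x)/(1-D) = 0$, and solving for $D$ yields the expression above. The integrand is concave in $D$, so this stationary point is a maximum.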
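As a quick sanity check of the closed form, the pointwise integrand can be maximized by brute force. This is a minimal sketch: the density values `p_data` and `p_g` at a single point are made-up numbers, and `argmax_on_grid` is a hypothetical helper, not part of any library.

```python
import math

def pointwise_value(d, p_data, p_g):
    """Integrand of V at one x: p_data * log(D) + p_g * log(1 - D)."""
    return p_data * math.log(d) + p_g * math.log(1.0 - d)

def argmax_on_grid(p_data, p_g, n=9999):
    """Brute-force the maximizing D over an evenly spaced grid inside (0, 1)."""
    grid = [(i + 1) / (n + 1) for i in range(n)]
    return max(grid, key=lambda d: pointwise_value(d, p_data, p_g))

# Assumed density values at a single point x (illustrative only).
p_data, p_g = 0.3, 0.1
d_star = p_data / (p_data + p_g)      # closed form from the text
d_grid = argmax_on_grid(p_data, p_g)  # numerical maximizer
print(d_star, d_grid)
```

The grid maximizer should agree with $p_\text{data}/(p_\text{data} + p_G)$ up to the grid resolution.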

Substituting $D^*$ into $V$ recovers the Jensen-Shannon divergence:

V(\theta, D^*) = \mathbb{E}_{p_\text{data}}\!\left[\log \frac{p_\text{data}}{p_\text{data} + p_G}\right] + \mathbb{E}_{p_G}\!\left[\log \frac{p_G}{p_\text{data} + p_G}\right]

Add and subtract $\log 2$ inside each expectation:

= \mathbb{E}_{p_\text{data}}\!\left[\log \frac{2\, p_\text{data}}{p_\text{data} + p_G}\right] + \mathbb{E}_{p_G}\!\left[\log \frac{2\, p_G}{p_\text{data} + p_G}\right] - \log 4

The mixture distribution $m = (p_\text{data} + p_G)/2$ appears in both KL terms:

= D_\text{KL}(p_\text{data} \| m) + D_\text{KL}(p_G \| m) - \log 4 = 2\, D_\text{JS}(p_\text{data} \| p_G) - \log 4

where the Jensen-Shannon divergence is $D_\text{JS}(p \| q) = \frac{1}{2}D_\text{KL}(p \| m) + \frac{1}{2}D_\text{KL}(q \| m)$ with $m = (p+q)/2$ . At the global optimum $p_G = p_\text{data}$ , the JSD equals zero and $V = -\log 4$ . GAN training thus minimizes the JSD between the data and generator distributions — but only when the discriminator is simultaneously trained to optimality at each generator update step.
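The identity $V(\theta, D^*) = 2\, D_\text{JS} - \log 4$ can be checked numerically on small discrete distributions. The sketch below assumes both distributions live on the same finite set of points; the probability vectors are arbitrary illustrative values.

```python
import math

def kl(p, q):
    """KL divergence between discrete distributions (terms with p_i = 0 contribute 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence with mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_d(p_data, p_g):
    """V evaluated with D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd + pg == 0:
            continue  # no mass from either distribution at this point
        d = pd / (pd + pg)
        if pd > 0:
            total += pd * math.log(d)
        if pg > 0:
            total += pg * math.log(1 - d)
    return total

p_data = [0.5, 0.3, 0.2]   # illustrative values
p_g = [0.1, 0.2, 0.7]      # illustrative values
lhs = value_at_optimal_d(p_data, p_g)
rhs = 2 * jsd(p_data, p_g) - math.log(4)
print(lhs, rhs)
```

Setting `p_g` equal to `p_data` should give $-\log 4$, the value at the global optimum.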

Training instability and the JSD failure mode#

The JSD interpretation reveals a fundamental difficulty. When $p_\text{data}$ and $p_G$ have disjoint support — which is likely in high dimensions, where both distributions concentrate on low-dimensional manifolds — the JSD equals its maximum value $\log 2$ regardless of the distance between the manifolds. The gradient of JSD with respect to the generator parameters is therefore zero almost everywhere: the discriminator can perfectly classify real from generated samples, and the generator receives no useful training signal.

In practice, early GAN training alternates between discriminator and generator updates without waiting for the discriminator to converge. This keeps the discriminator imperfect, providing non-zero gradients. But the balance is delicate: an overly strong discriminator saturates its output to 0 for all generated samples, zeroing gradients; an overly weak discriminator provides misleading signal. This tension is the root cause of GAN training instability.

Mode collapse is the failure mode where the generator produces only a subset of the data distribution's modes. If $G$ learns to produce samples that fool the current discriminator, and the discriminator then updates to reject them, the generator may shift to a different mode, cycling through modes without converging. In principle the JSD does penalize missing modes, since it is zero only when $p_G = p_\text{data}$. In practice, however, the generator is optimized against the current, imperfect discriminator rather than against the JSD itself, and that discriminator scores samples one at a time, so nothing in its signal directly rewards covering every mode of $p_\text{data}$.

Wasserstein GANs#

Wasserstein GANs (WGAN; Arjovsky et al., 2017) replace the JSD with the Wasserstein-1 (Earth Mover's) distance, which has gradients even when the distributions have disjoint support:

W_1(p_\text{data}, p_G) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim p_\text{data}}[f(x)] - \mathbb{E}_{x \sim p_G}[f(x)]

where the supremum is over 1-Lipschitz functions $f$. The Wasserstein distance measures the minimum "work" required to transform $p_G$ into $p_\text{data}$, treating each distribution as a pile of mass that must be transported. Unlike JSD, $W_1(p_\text{data}, p_G)$ is continuous in the generator parameters, and differentiable almost everywhere under mild regularity assumptions on $G_\theta$, even when the supports are disjoint. It therefore provides meaningful gradients throughout training.

Geometric intuition: why the Earth Mover's distance (EMD) provides gradients while JSD does not. Consider two 1D distributions: $p_\text{data} = \delta(x)$ (a point mass at 0) and $p_G = \delta(x - a)$ (a point mass at $a$). The JSD between any two non-overlapping distributions equals $\log 2$ regardless of $a$, so $\nabla_a D_\text{JS} = 0$ for all $a \neq 0$ and the generator gets no signal about which direction to move. The Wasserstein-1 distance between these same distributions is $W_1 = |a|$, with gradient $\nabla_a W_1 = \text{sign}(a)$: a constant-magnitude signal, so gradient descent moves $a$ toward 0 from either side. For continuous distributions with partially overlapping supports the contrast is less extreme, but the principle holds: the EMD measures "how far" the mass must move, while the JSD measures only "how much" the distributions overlap. In high dimensions, where the data and generator distributions tend to concentrate on low-dimensional manifolds that barely intersect, this advantage applies through most of training.
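The point-mass example can be reproduced on a discrete grid. The sketch below is an illustration under simplifying assumptions: both distributions are histograms on a shared integer grid, `offset` is the position of the generator's point mass, and `w1_1d` uses the one-dimensional identity that $W_1$ equals the integral of the absolute difference between the two CDFs.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1_1d(p, q, spacing=1.0):
    """W1 between histograms on a shared evenly spaced 1D grid: sum of |CDF_p - CDF_q| * spacing."""
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q) * spacing
    return total

def point_mass(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

size = 11
p_data = point_mass(0, size)
results = {}
for offset in (1, 5, 10):
    p_g = point_mass(offset, size)
    results[offset] = (jsd(p_data, p_g), w1_1d(p_data, p_g))
    print(offset, results[offset])
```

The JSD column stays at $\log 2$ for every offset, while the $W_1$ column grows linearly with the offset, which is the distance information the generator needs.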

The WGAN critic is no longer a binary classifier: it outputs an unbounded real-valued score rather than a probability, and it approximates the 1-Lipschitz function $f$ that maximizes the expected difference. The Lipschitz constraint is enforced either by weight clipping (original WGAN: clip all critic weights to $[-c, c]$, which bounds the Lipschitz constant by some constant rather than exactly 1) or by a gradient penalty (WGAN-GP; Gulrajani et al., 2017: add $\lambda \cdot (\|\nabla_x f(x)\|_2 - 1)^2$ evaluated at interpolations between real and generated samples).

WGAN-GP is the standard WGAN variant:

\mathcal{L}_\text{WGAN-GP} = \underbrace{\mathbb{E}_{p_G}[f_\phi(x)] - \mathbb{E}_{p_\text{data}}[f_\phi(x)]}_{\text{Wasserstein estimate (critic)}} + \lambda \mathbb{E}_{\hat{x}}\!\left[(\|\nabla_{\hat{x}} f_\phi(\hat{x})\|_2 - 1)^2\right]

where $\hat{x} = \alpha x_\text{real} + (1-\alpha) x_\text{fake}$ for $\alpha \sim \text{Uniform}(0,1)$. This typically produces more stable training and better mode coverage than standard GAN training, sometimes at the cost of peak sample quality, and it requires an extra input-gradient computation at every critic step.
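To make the terms of the critic loss concrete, here is a minimal sketch with a toy critic whose input gradient is available in closed form. Everything in it is an assumption for illustration: the critic $f(x) = \tanh(w \cdot x)$ stands in for a neural network, `lam=10.0` is a commonly used default rather than a value given in this lecture, and a real implementation would obtain the gradient from an autodiff framework.

```python
import math
import random

def critic(x, w):
    """Toy critic f(x) = tanh(w . x), a stand-in for a neural network."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))

def critic_input_grad(x, w):
    """Analytic gradient of the toy critic w.r.t. its input: (1 - tanh^2(w . x)) * w."""
    scale = 1.0 - critic(x, w) ** 2
    return [scale * wi for wi in w]

def wgan_gp_critic_loss(real, fake, w, lam=10.0, rng=random):
    """E_fake[f] - E_real[f] + lam * E[(||grad f(x_hat)|| - 1)^2]; real and fake have equal length."""
    wasserstein = (sum(critic(x, w) for x in fake) / len(fake)
                   - sum(critic(x, w) for x in real) / len(real))
    penalty = 0.0
    for x_real, x_fake in zip(real, fake):
        alpha = rng.random()  # alpha ~ Uniform(0, 1)
        x_hat = [alpha * a + (1 - alpha) * b for a, b in zip(x_real, x_fake)]
        grad_norm = math.sqrt(sum(g * g for g in critic_input_grad(x_hat, w)))
        penalty += (grad_norm - 1.0) ** 2
    return wasserstein + lam * penalty / len(real)

rng = random.Random(0)
real = [[1.0, 1.0], [1.2, 0.8]]    # illustrative "real" batch
fake = [[-1.0, 0.0], [0.0, -1.0]]  # illustrative "generated" batch
loss = wgan_gp_critic_loss(real, fake, w=[0.6, 0.8], rng=rng)
print(loss)
```

The critic minimizes this loss; the penalty pulls the gradient norm toward 1 at the interpolated points rather than clipping weights.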

GAN training dynamics in practice#

Practical GAN training requires balancing the discriminator and generator update rates carefully. Several heuristics have become standard.

Number of critic steps per generator step: in WGAN, the critic should be near-optimal at each generator step; the original WGAN setup approximates this by training the critic for $n_\text{critic} = 5$ steps per generator step. Standard GAN training typically uses $n_\text{critic} = 1$. With too few critic steps, the discriminator may be too weak and provide misleading gradients; with too many, a standard (sigmoid-output) discriminator may saturate, outputting near-0/1 everywhere, which also kills gradients.

Learning rates and optimizer choices: GANs are notoriously sensitive to learning rate. A common modern recommendation is Adam with $\beta_1 = 0.0$ (no first-moment momentum) and $\beta_2 = 0.9$, with separate learning rates for the generator ($\text{lr}_G = 10^{-4}$) and the discriminator ($\text{lr}_D = 4 \times 10^{-4}$). The asymmetric learning rates reflect that the discriminator task is often easier than the generator task. Using $\beta_1 = 0$ rather than the default $\beta_1 = 0.9$ reduces gradient oscillations that arise from the adversarial dynamics.
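The update schedule and hyperparameters above can be collected in a small skeleton. This is a sketch of the control flow only: `GanTrainConfig`, `train`, and the two callbacks are hypothetical names, and the callbacks here merely count calls instead of running real optimizer steps.

```python
from dataclasses import dataclass

@dataclass
class GanTrainConfig:
    n_critic: int = 5                # critic steps per generator step (use 1 for a standard GAN)
    lr_generator: float = 1e-4
    lr_discriminator: float = 4e-4
    adam_beta1: float = 0.0          # no first-moment momentum
    adam_beta2: float = 0.9

def train(n_generator_steps, config, critic_step, generator_step):
    """Alternate config.n_critic critic updates with one generator update."""
    for _ in range(n_generator_steps):
        for _ in range(config.n_critic):
            critic_step(config.lr_discriminator)
        generator_step(config.lr_generator)

# Stand-in callbacks that only record how often each player is updated.
calls = {"critic": 0, "generator": 0}

def counting_critic_step(lr):
    calls["critic"] += 1

def counting_generator_step(lr):
    calls["generator"] += 1

train(3, GanTrainConfig(), counting_critic_step, counting_generator_step)
print(calls)  # {'critic': 15, 'generator': 3}
```

In a real training loop each callback would draw a batch, compute its player's loss, and apply one Adam step with the configured betas.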

Minibatch discrimination: a countermeasure against collapse that, in simplified forms, has become a common network component. The discriminator receives information about batch statistics (diversity across the mini-batch) in addition to individual samples, allowing it to detect when the generator is producing near-identical samples. Without this, a generator can collapse to a single point that achieves low loss, because a discriminator that sees one sample at a time has no way to penalize reduced diversity.
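One simple way to expose batch diversity to the discriminator is to append a batch-level standard deviation to every sample's features. The sketch below shows that simplified variant, not the original learned-tensor formulation of minibatch discrimination; the function names and toy batches are assumptions for illustration.

```python
import math

def minibatch_stddev(batch):
    """Average over feature dimensions of the per-feature standard deviation across the batch."""
    n, dim = len(batch), len(batch[0])
    total = 0.0
    for j in range(dim):
        column = [x[j] for x in batch]
        mean = sum(column) / n
        total += math.sqrt(sum((c - mean) ** 2 for c in column) / n)
    return total / dim

def append_batch_statistic(batch):
    """Give every sample one extra feature that summarizes diversity across the batch."""
    stat = minibatch_stddev(batch)
    return [list(x) + [stat] for x in batch]

collapsed = [[0.5, 0.5]] * 4                                   # generator emits one point
diverse = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.8], [0.9, 0.1]]     # varied samples
print(append_batch_statistic(collapsed)[0])
print(append_batch_statistic(diverse)[0])
```

A collapsed batch yields a statistic of zero, which gives the discriminator an easy cue that real batches never produce.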

Two-timescale update rule (TTUR; Heusel et al., 2017): provides a theoretical foundation for separate learning rates. Under suitable assumptions, the GAN game converges to a local Nash equilibrium when the discriminator uses a larger learning rate than the generator, so that it converges faster to its local best response. The analysis parallels the two-timescale convergence theory for actor-critic methods (Course 1), and the parallel is not coincidental: both are saddle-point optimization problems where stability requires one player to adapt faster than the other.

Spectral normalization#

Spectral normalization (Miyato et al., 2018) enforces the Lipschitz constraint on the discriminator by dividing each weight matrix $W$ by its spectral norm (largest singular value $\sigma_\text{max}(W)$ ):

\bar{W} = W / \sigma_\text{max}(W)

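This gives each linear layer a spectral norm of 1. Because the Lipschitz constant of a composition is at most the product of the per-layer constants, the full network is at most 1-Lipschitz when its activations are themselves 1-Lipschitz (as ReLU and leaky ReLU are). Spectral normalization is computationally cheaper than gradient penalty and is widely used as a stabilization method in GAN architectures such as SAGAN and BigGAN.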
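The largest singular value is usually estimated by power iteration rather than a full SVD. The sketch below is a plain-Python illustration under simplifying assumptions: it uses a fixed all-ones starting vector and many iterations, whereas practical implementations typically start from a random vector and reuse it across training steps.

```python
import math

def matvec(W, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def spectral_norm(W, n_iters=50):
    """Estimate sigma_max(W) by power iteration on W^T W."""
    Wt = transpose(W)
    v = normalize([1.0] * len(W[0]))   # fixed start; assumed not orthogonal to the top direction
    for _ in range(n_iters):
        u = normalize(matvec(W, v))
        v = normalize(matvec(Wt, u))
    return math.sqrt(sum(x * x for x in matvec(W, v)))

def spectral_normalize(W, n_iters=50):
    """Return W / sigma_max(W)."""
    sigma = spectral_norm(W, n_iters)
    return [[wij / sigma for wij in row] for row in W]

W = [[3.0, 0.0], [0.0, 1.0]]           # illustrative weight matrix with sigma_max = 3
W_bar = spectral_normalize(W)
print(spectral_norm(W), spectral_norm(W_bar))
```

After normalization the estimated spectral norm is 1, so the layer cannot stretch any input direction by more than a factor of 1.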

Conditional GANs#

Conditional GANs (cGANs) augment both generator and discriminator with a conditioning signal $y$ (class label, text description, or image):

\min_\theta \max_\phi \; V(\theta, \phi) = \mathbb{E}_{(x, y) \sim p_\text{data}}\!\left[\log D_\phi(x, y)\right] + \mathbb{E}_{z, y}\!\left[\log(1 - D_\phi(G_\theta(z, y), y))\right]

The generator learns to produce samples that match conditioning $y$ ; the discriminator learns to assess whether $x$ is a plausible sample for the given $y$ , not merely whether $x$ looks real. Conditioning $y$ can be injected through concatenation, conditional batch normalization, or cross-attention in the generator and discriminator.

Projection discriminator (Miyato and Koyama, 2018) injects label information through an inner product: the discriminator's pre-sigmoid logit is $h(x)^T v_y + \psi(h(x))$, where $h(x)$ is a learned feature vector, $v_y$ is a learned class embedding, and $\psi$ is a learned scalar-valued function. This separates the class-conditional and class-unconditional components of the discriminator.
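The projection form is easy to see in code: one term depends on the label through an inner product with the class embedding, the other does not depend on the label at all. This is a minimal sketch with made-up numbers; the feature vector, embeddings, and the linear unconditional head are all illustrative assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def projection_logit(features, class_embedding, head_weights, head_bias=0.0):
    """Logit = <features, class embedding> + unconditional scalar head (linear here)."""
    conditional = dot(features, class_embedding)             # depends on the label y
    unconditional = dot(features, head_weights) + head_bias  # label-independent realism score
    return conditional + unconditional

features = [1.0, 2.0]                                        # assumed feature vector for one image
embeddings = {"cat": [0.5, 0.0], "dog": [0.0, -0.5]}         # assumed class embeddings
head_weights = [0.1, 0.1]

logit_cat = projection_logit(features, embeddings["cat"], head_weights)
logit_dog = projection_logit(features, embeddings["dog"], head_weights)
print(logit_cat, logit_dog)
```

Both logits share the same unconditional term, so the difference between them comes entirely from the class embeddings.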

Progressive growing and StyleGAN#

Progressive GAN (Karras et al., 2018) trains GANs at increasing resolutions, starting at $4\times 4$ and progressively adding layers for higher resolution. This stabilizes training because low-resolution stages are easy and provide well-behaved gradients.

StyleGAN (Karras et al., 2019) introduces an explicit style latent $w = f(z)$ (a learned mapping of the noise) injected into each layer through adaptive instance normalization (AdaIN), enabling disentangled control over coarse (pose, shape) and fine (texture, color) attributes. StyleGAN produces some of the sharpest GAN samples and a comparatively interpretable latent space: many directions in $w$-space correspond to semantic edits.
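AdaIN itself is a small operation: normalize each feature channel per instance, then apply a style-dependent scale and shift. The sketch below works on a single flattened channel and takes `gamma` and `beta` directly as arguments; in StyleGAN these would be produced from $w$ by a learned affine map, which is omitted here.

```python
import math

def adain(channel, gamma, beta, eps=1e-8):
    """Normalize one feature channel to zero mean and unit std, then rescale by (gamma, beta)."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((c - mean) ** 2 for c in channel) / n
    std = math.sqrt(var + eps)
    return [gamma * (c - mean) / std + beta for c in channel]

channel = [1.0, 2.0, 3.0, 4.0]              # illustrative activations for one channel
styled = adain(channel, gamma=2.0, beta=5.0)
styled_mean = sum(styled) / len(styled)
styled_std = math.sqrt(sum((s - styled_mean) ** 2 for s in styled) / len(styled))
print(styled_mean, styled_std)
```

Whatever statistics the incoming channel had, the output has mean `beta` and standard deviation `gamma`, which is why each layer's style overrides the statistics left by earlier layers.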

Evaluating GANs: Fréchet Inception Distance#

Unlike likelihood-based models, GANs cannot be evaluated by test-set log-likelihood. The standard evaluation metric is Fréchet Inception Distance (FID; Heusel et al., 2017), which measures the distance between the feature distributions of real and generated images using statistics extracted from a pretrained InceptionV3 network.

Let $\mu_r, \Sigma_r$ be the mean and covariance of InceptionV3 features over real images, and $\mu_g, \Sigma_g$ over generated images. The FID is:

\text{FID} = \|\mu_r - \mu_g\|_2^2 + \text{tr}\!\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)

Lower FID is better: zero means that the Gaussian fits to the two feature distributions (their means and covariances) coincide. FID is sensitive to both sample quality (unrealistic samples distort the feature statistics) and mode coverage (missing modes shift the mean and change the covariance). It does not decompose these contributions, which is why Precision and Recall metrics (Kynkäänniemi et al., 2019) are increasingly used alongside FID: Precision measures the fraction of generated samples that fall within the real distribution's support (sample quality), while Recall measures the fraction of real data covered by the generated distribution (mode coverage). A model with high Precision but low Recall is mode-dropping; a model with high Recall but low Precision is generating blurry or unrealistic samples.
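The structure of the Fréchet distance is easiest to see in a simplified setting. The sketch below assumes diagonal covariances, in which case the trace term reduces to $\sum_i (\sigma_{r,i} - \sigma_{g,i})^2$. It also uses tiny hand-written feature vectors in place of InceptionV3 features, so it illustrates the formula and is not a substitute for a real FID implementation.

```python
import math

def mean_and_std(features):
    """Per-dimension mean and (population) standard deviation."""
    n, dim = len(features), len(features[0])
    means = [sum(f[j] for f in features) / n for j in range(dim)]
    stds = [math.sqrt(sum((f[j] - means[j]) ** 2 for f in features) / n)
            for j in range(dim)]
    return means, stds

def frechet_distance_diagonal(real, generated):
    """Frechet distance between Gaussian fits, assuming diagonal covariances."""
    mu_r, sd_r = mean_and_std(real)
    mu_g, sd_g = mean_and_std(generated)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum((a - b) ** 2 for a, b in zip(sd_r, sd_g))
    return mean_term + cov_term

real = [[0.0, 0.0], [0.0, 0.0], [4.0, 4.0], [4.0, 4.0]]   # two modes (illustrative)
mode_dropping = [[0.0, 0.0]] * 4                           # generator covers one mode only
print(frechet_distance_diagonal(real, real))               # 0.0
print(frechet_distance_diagonal(real, mode_dropping))      # 16.0
```

Dropping a mode moves both the mean and the spread of the features, so both terms contribute to the score.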

GenAI context: GANs beyond image generation#

The GAN framework's influence extends well beyond image generation. RLHF for language models is structurally analogous: a reward model trained on human preferences plays the role of the discriminator, and policy gradient updates act as the generator update. The policy is trained to produce outputs that the reward model rates highly, just as the generator is trained to produce samples that the discriminator cannot distinguish from real data. One difference is that the reward model is usually held fixed during policy optimization, whereas a GAN discriminator is trained alongside the generator. There is a divergence-minimization connection too: the KL penalty in RLHF is a reverse KL from the policy to a reference model, and reverse KL is mode-seeking, which parallels the mode-seeking behavior of GAN training.

The adversarial game also appears in diffusion model distillation: adversarial diffusion distillation (ADD; Sauer et al., 2023) uses a discriminator to provide gradients that sharpen the samples of a single-step (or few-step) distilled diffusion model, aiming to recover GAN-level sharpness while retaining much of the stability of diffusion-based training. This hybrid approach, which uses the discriminator's gradient signal without relying on the JSD objective end-to-end, is among the strongest current methods for fast, high-quality image generation. In robotic imitation learning, generative adversarial imitation learning (GAIL; Ho and Ermon, 2016) applies the GAN framework directly to trajectory matching: the discriminator distinguishes expert from policy trajectories, and the policy is trained with the discriminator's signal as a reward, eliminating the need for manual reward design.

Key takeaways#

The GAN objective minimizes the Jensen-Shannon divergence between the data and generator distributions when the discriminator is trained to optimality. JSD saturates at its maximum value $\log 2$ when the distributions have disjoint support, causing vanishing generator gradients and contributing to training instability and mode collapse. Wasserstein GANs replace JSD with the earth mover's distance, which provides meaningful gradients for non-overlapping distributions and improves mode coverage. The Lipschitz constraint required by the Wasserstein formulation is enforced by gradient penalty (WGAN-GP) or spectral normalization. Conditional GANs inject class or text conditioning into both generator and discriminator. StyleGAN's $w$-space mapping disentangles style attributes and enables semantic image editing.

Conceptual questions#

Derive that the GAN value function $V(\theta, D^*)$ with the optimal discriminator equals $2 D_\text{JS}(p_\text{data} \| p_G) - \log 4$ . Then show that when $p_\text{data}$ and $p_G$ have disjoint support, $D_\text{JS} = \log 2$ regardless of the geometric distance between the distributions. Why does this imply that the generator's gradient is zero in this regime?
The Wasserstein-1 distance between two Gaussians $\mathcal{N}(\mu_1, \sigma^2 I)$ and $\mathcal{N}(\mu_2, \sigma^2 I)$ in $\mathbb{R}^d$ equals $\|\mu_1 - \mu_2\|_2$ . Show this using the optimal transport interpretation. How does this compare to the KL divergence between the same two Gaussians, and what does this imply about which distance provides better gradients early in GAN training?
A WGAN-GP critic is trained on images from two modes of a distribution: one mode at high contrast and one at low contrast. After training, the generator produces only high-contrast images (mode collapse). Analyze whether WGAN-GP should theoretically prevent this behavior, and if it does not, identify the failure mechanism. What modification to the training procedure would encourage the generator to cover both modes?
Spectral normalization divides each weight matrix by its spectral norm to enforce Lipschitz continuity. If a discriminator has $L$ layers each with spectral norm exactly 1 after normalization, show that the Lipschitz constant of the full network is bounded by $1^L = 1$ . Now suppose during training, one layer's weight matrix develops a spectral norm of 1.5 before renormalization at the next gradient step. What is the actual Lipschitz constant of the network during this interval, and does it satisfy the WGAN requirement?
StyleGAN injects style codes $w$ into each layer via adaptive instance normalization: the feature map $x$ at each layer is normalized and then rescaled by $(\gamma_w, \beta_w)$ derived from $w$ . Explain how this allows the same generator architecture to produce images spanning many styles without catastrophic interference between styles. What failure mode would you expect if $w$ were injected only at the first layer rather than at every layer?

Solutions

With disjoint supports, at every $x$ only one of $p_\text{data}, p_G$ is nonzero, so $m=(p_\text{data}+p_G)/2$ equals $p_\text{data}/2$ on the data manifold and $p_G/2$ on the generator manifold. Then $D_\text{KL}(p_\text{data}\|m)=\log 2$ and $D_\text{KL}(p_G\|m)=\log 2$ , giving $D_\text{JS}=\log 2$ for any separation. Because the value is constant in the geometric distance, $\nabla_\theta V = 0$ — the generator receives no directional signal.
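For equal-covariance Gaussians the optimal transport plan is a rigid translation by $\mu_2-\mu_1$, which costs $\|\mu_1-\mu_2\|_2$ per unit mass. The 1-Lipschitz test function $f(x) = \langle x, u\rangle$ with $u = (\mu_1-\mu_2)/\|\mu_1-\mu_2\|_2$ attains the same value in the dual, so $W_1=\|\mu_1-\mu_2\|_2$, and its gradient with respect to $\mu_2$ is a unit-norm direction. The KL is $\|\mu_1-\mu_2\|^2/(2\sigma^2)$, which blows up as $\sigma\to 0$; its gradient with respect to $\mu_2$ is $(\mu_2-\mu_1)/\sigma^2$, which grows with the separation and is scaled by $1/\sigma^2$, so it is badly scaled when the distributions barely overlap. $W_1$ gives a stable, well-scaled gradient even with little or no overlap, which is exactly the early-training regime.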
WGAN-GP only mitigates collapse: $W_1$ is estimated by a finite-capacity critic trained for finite $n_\text{critic}$ steps, so the estimate is imperfect and may not "see" a missing mode if both modes map to similar critic values. The mechanism is that perfecting one mode can reduce the critic's estimated distance faster than spreading mass over both modes. Fixes: more critic steps, minibatch discrimination or an unrolled GAN, instance noise, or multiple generators.
The Lipschitz constant of a composition is bounded by the product of the layer Lipschitz constants: $1^L=1$. If one layer reaches spectral norm $1.5$ before the next renormalization, the product bound becomes $1.5$. The actual Lipschitz constant may be anywhere up to that bound, so the 1-Lipschitz requirement is no longer guaranteed. If the critic does exceed slope 1, the $W_1$ estimate is inflated (by at most a factor of $1.5$) until renormalization restores the constraint.
AdaIN re-applies style at every scale: each layer normalizes away the previous style and rescales by $(\gamma_w,\beta_w)$ , while the (style-independent) conv weights carry content — so styles compose across scales without interference. Injecting $w$ only at layer 1 lets later layers' normalizations wash the style out, costing fine-scale control: you get coarse-style-only generation and weak coarse/fine disentanglement.
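The comparison in the second solution (equal-variance Gaussians) can be checked numerically in one dimension. The sketch below relies on two assumptions: the empirical $W_1$ between equal-size 1D samples is the mean absolute difference of the sorted samples, and both sample sets reuse the same noise draws, which makes the translation coupling exact instead of approximate.

```python
import random

def empirical_w1_1d(xs, ys):
    """W1 between two equal-size 1D samples: mean |x_(i) - y_(i)| over sorted samples."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

def gaussian_kl_equal_variance(mu1, mu2, sigma):
    """KL(N(mu1, sigma^2) || N(mu2, sigma^2)) = (mu1 - mu2)^2 / (2 sigma^2)."""
    return (mu1 - mu2) ** 2 / (2 * sigma ** 2)

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
mu1, mu2 = 0.0, 3.0
table = {}
for sigma in (1.0, 0.1, 0.01):
    xs = [mu1 + sigma * e for e in noise]   # samples from N(mu1, sigma^2)
    ys = [mu2 + sigma * e for e in noise]   # same noise, shifted to mu2
    table[sigma] = (empirical_w1_1d(xs, ys), gaussian_kl_equal_variance(mu1, mu2, sigma))
    print(sigma, table[sigma])
```

As $\sigma$ shrinks, $W_1$ stays at the distance between the means while the KL grows like $1/\sigma^2$. With independent samples the $W_1$ estimate would carry some sampling error but behave the same way.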

Looking ahead#

GANs produce sharp samples but suffer from training instability and mode collapse. The next two model families take a completely different approach: defining distributions through explicit energy functions or score functions rather than implicit generators.

Week 4: Energy-Based Models and Score Matching. We examine how to define $p_\theta(x) \propto e^{-E_\theta(x)}$ , why the partition function makes direct MLE intractable, and how score matching and denoising score matching enable learning without computing normalizing constants.
