Purpose of this lecture
Generative adversarial networks (GANs; Goodfellow et al., 2014) have produced some of the sharpest image samples of any generative model family, but they are also among the most difficult to train. Understanding GANs deeply requires understanding the divergence geometry that underlies their objective, the mode collapse and training instability that follow from that geometry, and the Wasserstein reformulation that partially resolves both. GAN training dynamics also provide the conceptual foundation for understanding adversarial examples, discriminative fine-tuning of diffusion models, and reinforcement learning from human feedback (RLHF) for image generation.
The min-max objective
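A GAN consists of a generator $G_\theta$ that maps noise $z \sim p_z$ to samples, and a discriminator (or critic) $D$ that classifies inputs as real (from $p_{\text{data}}$) or generated (from $p_g$, the distribution of $G_\theta(z)$). The training objective is:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
The discriminator maximizes $V$: it wants to assign high $D(x)$ to real data and low $D(G(z))$ to generated samples. The generator minimizes $V$: it wants $D(G(z))$ to be high, fooling the discriminator.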
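The optimal discriminator: for fixed $G$, the discriminator that maximizes $V(D, G)$ is:
$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$
This is a monotone function of the density ratio $r(x) = p_{\text{data}}(x)/p_g(x)$, namely $D^*(x) = r(x)/(1 + r(x))$. The derivation: treating $V$ as a functional integral over $x$, the integrand at each $x$ is $p_{\text{data}}(x)\log D(x) + p_g(x)\log(1 - D(x))$. Setting the derivative with respect to $D(x)$ to zero gives $p_{\text{data}}(x)/D(x) - p_g(x)/(1 - D(x)) = 0$, yielding the expression above.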
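Substituting $D^*$ into $V$ recovers the Jensen-Shannon divergence. Substituting:
$$V(D^*, G) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}\right]$$
Add and subtract $\log 2$ inside each expectation:
$$V(D^*, G) = -\log 4 + \mathbb{E}_{x \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{(p_{\text{data}}(x) + p_g(x))/2}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{(p_{\text{data}}(x) + p_g(x))/2}\right]$$
The mixture distribution $m = (p_{\text{data}} + p_g)/2$ appears in both KL terms:
$$V(D^*, G) = -\log 4 + \mathrm{KL}(p_{\text{data}} \,\|\, m) + \mathrm{KL}(p_g \,\|\, m) = -\log 4 + 2\,\mathrm{JSD}(p_{\text{data}} \,\|\, p_g)$$
where the Jensen-Shannon divergence is $\mathrm{JSD}(p \,\|\, q) = \tfrac{1}{2}\mathrm{KL}(p \,\|\, m) + \tfrac{1}{2}\mathrm{KL}(q \,\|\, m)$ with $m = (p + q)/2$. At the global optimum $p_g = p_{\text{data}}$, the JSD equals zero and $V = -\log 4$. GAN training thus minimizes the JSD between the data and generator distributions, but only when the discriminator is simultaneously trained to optimality at each generator update step.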
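The identity above can be checked numerically on a small discrete sample space. The following is a minimal sketch, not part of the original derivation: the two three-outcome distributions are arbitrary illustrative choices, and the helper names (`kl`, `jsd`, `value`) are hypothetical.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions; outcomes with p == 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence with mixture m = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value(d, p_data, p_g):
    """V(D, G) = E_data[log D] + E_g[log(1 - D)] on a finite sample space."""
    return float(np.sum(p_data * np.log(d) + p_g * np.log(1.0 - d)))

# Arbitrary illustrative distributions over three outcomes.
p_data = np.array([0.5, 0.3, 0.2])
p_g = np.array([0.2, 0.2, 0.6])

# Closed-form optimal discriminator and the value it attains.
d_star = p_data / (p_data + p_g)
v_star = value(d_star, p_data, p_g)

# Any other discriminator should score no higher than D*.
rng = np.random.default_rng(0)
d_perturbed = np.clip(d_star + 0.05 * rng.standard_normal(3), 1e-6, 1 - 1e-6)

print("V(D*, G)            :", v_star)
print("-log 4 + 2 JSD      :", -np.log(4.0) + 2.0 * jsd(p_data, p_g))
print("V(perturbed D, G)   :", value(d_perturbed, p_data, p_g))
```

The first two printed values should agree to floating-point precision, and the perturbed discriminator should attain a value no larger than $V(D^*, G)$, since the integrand is concave in $D(x)$ at every $x$.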
Training instability and the JSD failure mode
The JSD interpretation reveals a fundamental difficulty. When $p_{\text{data}}$ and $p_g$ have disjoint support (which is likely in high dimensions, where both distributions concentrate on low-dimensional manifolds), the JSD equals its maximum value $\log 2$ regardless of the distance between the manifolds. The gradient of the JSD with respect to the generator parameters $\theta$ is therefore zero almost everywhere: the discriminator can perfectly classify real from generated samples, and the generator receives no useful training signal.
In practice, early GAN training alternates between discriminator and generator updates without waiting for the discriminator to converge. This keeps the discriminator imperfect, providing non-zero gradients. But the balance is delicate: an overly strong discriminator saturates its output to 0 for all generated samples, zeroing gradients; an overly weak discriminator provides misleading signal. This tension is the root cause of GAN training instability.
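Mode collapse is the failure mode where the generator produces only a subset of the data distribution's modes. If $G$ learns to produce samples that fool the current discriminator, and the discriminator then updates to reject them, the generator may shift to a different mode, cycling through modes without converging. Although the exact JSD is zero only when $p_g = p_{\text{data}}$, the generator's loss is evaluated only on its own samples, so alternating updates against an imperfect discriminator give it no direct signal about modes of $p_{\text{data}}$ that it never visits: the loss reflects how distinguishable the generated samples are from real data, not how many modes of $p_{\text{data}}$ are covered.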
Wasserstein GANs
Wasserstein GANs (WGAN; Arjovsky et al., 2017) replace the JSD with the Wasserstein-1 (Earth Mover's) distance, which has gradients even when the distributions have disjoint support:
$$W_1(p_{\text{data}}, p_g) = \sup_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim p_{\text{data}}}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]$$
where the supremum is over 1-Lipschitz functions $f$. The Wasserstein distance measures the minimum "work" required to transform $p_g$ into $p_{\text{data}}$, treating the distribution as a mass that must be transported. Unlike the JSD, $W_1$ is continuous in $\theta$ and differentiable almost everywhere (under mild regularity assumptions on the generator) even when the supports are disjoint, providing meaningful gradients throughout training.
Geometric intuition: why EMD provides gradients while JSD does not. Consider two 1D distributions: $p = \delta_0$ (mass at 0) and $q_\theta = \delta_\theta$ (mass at $\theta$). The JSD between any two non-overlapping distributions equals $\log 2$ regardless of $\theta$: the gradient $\partial\,\mathrm{JSD}/\partial\theta = 0$ for all $\theta \neq 0$. The generator gets no signal about which direction to move. The Wasserstein-1 distance between these same distributions is $W_1 = |\theta|$, with gradient $\mathrm{sign}(\theta)$: a constant-magnitude signal whose descent direction points toward $\theta = 0$ everywhere. For continuous distributions with partially overlapping supports, the contrast is less extreme but the principle holds: the EMD counts "how far" the mass must move, while the JSD counts only "whether" the distributions overlap. In high dimensions, where data and generator distributions concentrate on disjoint low-dimensional manifolds, the EMD advantage is essentially universal throughout training.
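The point-mass example can be reproduced numerically. The sketch below places two point masses on a 1D grid (the grid range and spacing are arbitrary choices for illustration) and computes the JSD and the Wasserstein-1 distance, using the standard 1D identity that $W_1$ equals the integral of the absolute difference between the two CDFs. Function names are hypothetical.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence between discrete distributions on the same grid."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1_1d(p, q, grid):
    """W1 on a sorted 1D grid: sum of |CDF_p - CDF_q| times the grid spacing."""
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
    return float(np.sum(cdf_gap * np.diff(grid)))

grid = np.linspace(-2.0, 2.0, 41)  # spacing 0.1, an arbitrary illustrative grid

def point_mass(location):
    """Discrete distribution with all mass on the grid point nearest `location`."""
    p = np.zeros_like(grid)
    p[np.argmin(np.abs(grid - location))] = 1.0
    return p

p = point_mass(0.0)
results = []
for theta in (0.5, 1.0, 2.0):
    q = point_mass(theta)
    results.append((theta, jsd(p, q), w1_1d(p, q, grid)))
    print(f"theta={theta:.1f}  JSD={results[-1][1]:.4f}  W1={results[-1][2]:.4f}")
```

The JSD column should stay constant at $\log 2$ while the $W_1$ column tracks $|\theta|$, which is the contrast the paragraph describes.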
The WGAN critic (no longer a binary classifier) approximates the 1-Lipschitz function that maximizes the expected difference. The 1-Lipschitz constraint is enforced either by weight clipping (WGAN original: clip all critic weights to $[-c, c]$ for a small constant $c$) or gradient penalty (WGAN-GP: add $\lambda\,\mathbb{E}_{\hat{x}}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]$ to the critic loss, evaluated at interpolations $\hat{x}$ between real and generated samples).
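WGAN-GP is the standard WGAN variant. Its critic loss is:
$$\mathcal{L}_D = \mathbb{E}_{\tilde{x} \sim p_g}[D(\tilde{x})] - \mathbb{E}_{x \sim p_{\text{data}}}[D(x)] + \lambda\,\mathbb{E}_{\hat{x}}\left[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\right]$$
where $\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}$ for $\epsilon \sim U[0, 1]$. This produces more stable training and better mode coverage than standard GAN training, at the cost of reduced peak sample quality.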
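To make the gradient-penalty term concrete, here is a minimal NumPy sketch of a WGAN-GP critic loss. It is an illustration under stated assumptions, not a training implementation: the critic is a hypothetical tiny network $D(x) = v^\top \tanh(Wx)$, chosen only because its input gradient has a closed form (a real implementation would use automatic differentiation), and $\lambda = 10$ is the default reported in the WGAN-GP paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny critic D(x) = v . tanh(W x) with 2D inputs and 8 hidden units.
W = rng.standard_normal((8, 2))
v = rng.standard_normal(8)

def critic(x):
    """Critic scores for a batch. x: (batch, 2) -> (batch,)."""
    return np.tanh(x @ W.T) @ v

def critic_input_grad(x):
    """Closed-form gradient of the critic with respect to its input. -> (batch, 2)."""
    h = np.tanh(x @ W.T)               # (batch, 8)
    return ((1.0 - h ** 2) * v) @ W    # chain rule through tanh

def wgan_gp_critic_loss(x_real, x_fake, lam=10.0):
    """Critic loss: E[D(fake)] - E[D(real)] + lam * E[(||grad D(x_hat)|| - 1)^2]."""
    eps = rng.uniform(size=(x_real.shape[0], 1))
    x_hat = eps * x_real + (1.0 - eps) * x_fake        # random interpolates
    grad_norm = np.linalg.norm(critic_input_grad(x_hat), axis=1)
    penalty = float(np.mean((grad_norm - 1.0) ** 2))
    wasserstein_term = float(np.mean(critic(x_fake)) - np.mean(critic(x_real)))
    return wasserstein_term + lam * penalty, penalty

# Toy batches standing in for real and generated samples.
x_real = rng.standard_normal((64, 2)) + np.array([2.0, 0.0])
x_fake = rng.standard_normal((64, 2))
loss, penalty = wgan_gp_critic_loss(x_real, x_fake)
print("critic loss:", loss, " gradient penalty:", penalty)
```

The penalty is two-sided: it pushes the gradient norm toward 1 at the interpolates rather than merely capping it, which is the formulation used in the loss above.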
GAN training dynamics in practice
Practical GAN training requires balancing the discriminator and generator update rates carefully. Several heuristics have become standard.
Number of critic steps per generator step: in WGAN, the critic should be near-optimal at each generator step, which requires training it for $n_{\text{critic}} = 5$ steps per generator step. Standard GAN training uses $n_{\text{critic}} = 1$. With only one critic step, the discriminator may be too weak, providing misleading gradients; with too many, the discriminator may saturate (outputting near-0/1 everywhere), also killing gradients.
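The update schedule itself is simple to express. The sketch below shows only the control flow of running several critic updates per generator update; `critic_step` and `generator_step` are hypothetical callbacks standing in for whatever optimizer steps a real training loop would perform, and the counters exist purely to make the ratio visible.

```python
from collections import Counter

def train_schedule(n_generator_steps, n_critic, critic_step, generator_step):
    """Run `n_critic` critic updates before each generator update."""
    for _ in range(n_generator_steps):
        for _ in range(n_critic):
            critic_step()
        generator_step()

calls = Counter()

def critic_step():
    calls["critic"] += 1       # placeholder for one critic optimizer step

def generator_step():
    calls["generator"] += 1    # placeholder for one generator optimizer step

train_schedule(n_generator_steps=100, n_critic=5,
               critic_step=critic_step, generator_step=generator_step)
print(dict(calls))
```

Setting `n_critic=1` recovers the one-to-one alternation of standard GAN training.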
Learning rates and optimizer choices: GANs are notoriously sensitive to learning rate. The standard modern recommendation is Adam with $\beta_1 = 0$ (no first-moment momentum) and a reduced $\beta_2$ (for example $\beta_2 = 0.9$, as in WGAN-GP), with separate learning rates for generator ($\eta_G$) and discriminator ($\eta_D$). The asymmetric learning rates reflect that the discriminator task is often easier than the generator task. Using $\beta_1 = 0$ rather than the default $0.9$ reduces gradient oscillations that arise from the adversarial dynamics.
Minibatch discrimination: a countermeasure against collapse that has become a standard network component. The discriminator receives information about the batch statistics (diversity across the mini-batch) in addition to individual samples, so that it can detect when the generator is producing identical samples. Without this, a generator can collapse to a single point that achieves low loss because the single-sample discriminator has no way to penalize reduced diversity.
Two-timescale update rule (TTUR): provides a theoretical foundation for separate learning rates by showing that the GAN game converges to a local Nash equilibrium when the discriminator uses a larger learning rate than the generator (converging faster to its local optimal response), analogous to the two-timescale asymptotic theory in actor-critic reinforcement learning (RL). This connection to the dual timescale convergence analysis from actor-critic methods (Course 1) is not coincidental: both are saddle-point optimization problems where stability requires one player to adapt faster than the other.
Spectral normalization
Spectral normalization (Miyato et al., 2018) enforces the Lipschitz constraint on the discriminator by dividing each weight matrix $W$ by its spectral norm (largest singular value $\sigma(W)$):
$$\bar{W}_{\text{SN}} = \frac{W}{\sigma(W)}$$
Each normalized layer then has spectral norm 1. Because the Lipschitz constant of the full network is bounded by the product of the per-layer spectral norms (given 1-Lipschitz activations such as ReLU), the whole discriminator is at most 1-Lipschitz. Spectral normalization is computationally cheaper than gradient penalty and has become the default stabilization method in modern GAN architectures (BigGAN, StyleGAN).
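The spectral norm is usually estimated by power iteration rather than a full SVD. The following NumPy sketch shows that estimate and the resulting normalization; the function names are hypothetical, and it runs many iterations from scratch for clarity, whereas practical implementations typically run a single iteration per training step and reuse the vector from the previous step.

```python
import numpy as np

def spectral_norm_estimate(W, n_iters=200, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(W.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)   # approximate right singular vector
        u = W @ v
        u /= np.linalg.norm(u)   # approximate left singular vector
    return float(u @ W @ v)

def spectral_normalize(W, n_iters=200):
    """Return W divided by its estimated spectral norm."""
    return W / spectral_norm_estimate(W, n_iters)

W = np.random.default_rng(1).standard_normal((6, 4))
W_sn = spectral_normalize(W)

print("estimated sigma(W):", spectral_norm_estimate(W))
print("exact sigma(W)    :", np.linalg.svd(W, compute_uv=False)[0])
print("sigma(W_sn)       :", np.linalg.svd(W_sn, compute_uv=False)[0])
```

After normalization the largest singular value should be approximately 1, so a layer with this weight matrix scales the norm of any input difference by a factor of at most about 1.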
Conditional GANs
Conditional GANs (cGANs) augment both generator and discriminator with a conditioning signal $y$ (class label, text description, or image):
$$\min_G \max_D \; \mathbb{E}_{(x, y) \sim p_{\text{data}}}[\log D(x, y)] + \mathbb{E}_{z \sim p_z,\, y \sim p(y)}[\log(1 - D(G(z, y), y))]$$
The generator learns to produce samples that match the conditioning $y$; the discriminator learns to assess whether $x$ is a plausible sample for the given $y$, not merely whether $x$ looks real. Conditioning can be injected through concatenation, conditional batch normalization, or cross-attention in the generator and discriminator.
Projection discriminator (Miyato and Koyama, 2018) injects label information through an inner product, $D(x, y) = e_y^\top \phi(x) + \psi(\phi(x))$, where $\phi(x)$ is the discriminator's feature vector for $x$, $e_y$ is a learned class embedding, and $\psi(\phi(x))$ is a learned scalar. This separates the class-conditional and class-unconditional components of the discriminator.
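The projection form can be sketched in a few lines. In the example below the class embeddings and the unconditional head are random stand-ins for learned parameters, the unconditional head $\psi$ is assumed to be linear for simplicity, and the feature vectors play the role of $\phi(x)$; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, feat_dim = 3, 4

class_embed = rng.standard_normal((n_classes, feat_dim))  # stand-in for learned e_y
psi_weights = rng.standard_normal(feat_dim)               # stand-in for a linear psi

def projection_logit(features, labels):
    """Discriminator logit: e_y . phi(x) + psi(phi(x)).

    features: (batch, feat_dim) array of phi(x); labels: (batch,) integer classes.
    """
    conditional = np.sum(class_embed[labels] * features, axis=1)  # label-dependent
    unconditional = features @ psi_weights                        # label-independent
    return conditional + unconditional

features = rng.standard_normal((2, feat_dim))
print(projection_logit(features, np.array([0, 2])))
```

Only the inner-product term changes when the label changes, which is the separation between conditional and unconditional components described above.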
Progressive growing and StyleGAN
Progressive GAN (Karras et al., 2018) trains GANs at increasing resolutions, starting at $4 \times 4$ and progressively adding layers for higher resolution. This stabilizes training because low-resolution stages are easy and provide well-behaved gradients.
StyleGAN (Karras et al., 2019) introduces an explicit style latent $w = f(z)$ (a learned mapping of the noise $z$) injected into each layer through adaptive instance normalization (AdaIN), enabling disentangled control over coarse (pose, shape) and fine (texture, color) attributes. StyleGAN set the state of the art for GAN sample quality at its release and offers a comparatively interpretable latent space: directions in $\mathcal{W}$-space often correspond to semantic edits.
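The AdaIN operation itself is short: normalize each channel of a feature map to zero mean and unit variance, then rescale and shift it with style-dependent parameters. The NumPy sketch below assumes a single image with layout (channels, height, width) and a hypothetical affine map (`A`, `b`) from the style vector to per-channel scale and shift; it illustrates the operation, not StyleGAN's actual layer code.

```python
import numpy as np

def style_from_w(w, A, b):
    """Hypothetical learned affine map from style vector w to (gamma, beta)."""
    gamma, beta = np.split(A @ w + b, 2)
    return gamma, beta

def adain(feature_map, gamma, beta, eps=1e-5):
    """Adaptive instance normalization.

    feature_map: (channels, H, W); gamma, beta: (channels,).
    """
    mean = feature_map.mean(axis=(1, 2), keepdims=True)
    std = feature_map.std(axis=(1, 2), keepdims=True)
    normalized = (feature_map - mean) / (std + eps)   # per-channel statistics removed
    return gamma[:, None, None] * normalized + beta[:, None, None]

rng = np.random.default_rng(0)
channels, w_dim = 3, 8
A = rng.standard_normal((2 * channels, w_dim))
b = rng.standard_normal(2 * channels)

w = rng.standard_normal(w_dim)
gamma, beta = style_from_w(w, A, b)
x = 5.0 + 2.0 * rng.standard_normal((channels, 16, 16))
y = adain(x, gamma, beta)

print("target beta :", beta)
print("output mean :", y.mean(axis=(1, 2)))
```

After AdaIN, each channel's mean equals its $\beta$ and its standard deviation is approximately $|\gamma|$, so the incoming feature statistics are replaced by the ones the style dictates at that layer.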
Evaluating GANs: Fréchet Inception Distance
Unlike likelihood-based models, GANs cannot be evaluated by test-set log-likelihood. The standard evaluation metric is Fréchet Inception Distance (FID; Heusel et al., 2017), which measures the distance between the feature distributions of real and generated images using statistics extracted from a pretrained InceptionV3 network.
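Let $(\mu_r, \Sigma_r)$ be the mean and covariance of InceptionV3 features over real images, and $(\mu_g, \Sigma_g)$ over generated images. The FID is:
$$\mathrm{FID} = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$$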
Lower FID is better: zero would mean the feature distributions are identical. FID is sensitive to both sample quality (incorrect feature statistics) and mode coverage (missing modes shift the mean). It does not decompose these contributions, which is why Precision and Recall metrics (Kynkäänniemi et al., 2019) are increasingly used alongside FID: Precision measures the fraction of generated samples that fall within the real distribution's support (sample quality), while Recall measures the fraction of real data covered by the generated distribution (mode coverage). A model with high Precision but low Recall is mode-dropping; a model with high Recall but low Precision is generating blurry or unrealistic samples.
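Given two arrays of feature vectors, the Fréchet distance between their Gaussian fits can be computed directly. The sketch below uses random vectors as stand-ins for InceptionV3 features (extracting real features is outside its scope), and it evaluates the trace of the matrix square root through the eigenvalues of $\Sigma_r \Sigma_g$, which are real and non-negative for covariance matrices. The function name is hypothetical.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussian fits to two feature sets of shape (n, dim)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # Tr((Sigma_r Sigma_g)^{1/2}) = sum of square roots of the product's eigenvalues.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))

    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.standard_normal((200, 5))      # stand-in for real-image features
shift = np.array([1.0, 0.0, 0.0, 0.0, 2.0])
shifted = real + shift                    # same covariance, different mean

print("identical sets:", frechet_distance(real, real))
print("shifted set   :", frechet_distance(real, shifted), "(||shift||^2 = 5)")
```

With identical covariances the trace term vanishes, so the shifted case isolates the mean term $\|\mu_r - \mu_g\|_2^2$; differences in covariance, such as those caused by missing modes, enter through the trace term.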
GenAI context: GANs beyond image generation
The GAN framework's influence extends well beyond image generation. Reinforcement learning from human feedback (RLHF) for language models is structurally analogous: a reward model trained on human preferences acts as the discriminator, and policy gradient updates act as the generator update. The policy is trained to produce outputs that the reward model rates highly, just as the generator is trained to produce samples that the discriminator cannot distinguish from real data. The divergence minimization connection holds too: RLHF with a KL penalty is equivalent to minimizing a reverse KL divergence between the policy and a reward-tilted version of the reference policy, which parallels the mode-seeking behavior of GAN training.
The adversarial game also appears in diffusion model distillation: adversarial diffusion distillation (ADD; Sauer et al., 2023) uses a discriminator to provide gradients that sharpen the samples of a single-step distilled diffusion model, recovering GAN-level sharpness while retaining the stability of diffusion-based training. This hybrid approach, which uses the discriminator's gradient signal without using the JSD objective end-to-end, is among the leading approaches to fast, high-quality image generation. In robotic imitation learning, generative adversarial imitation learning (GAIL; Ho and Ermon, 2016) applies the GAN framework directly to trajectory matching: the discriminator distinguishes expert from policy trajectories, and the policy is trained with the discriminator's signal as a reward, eliminating the need for manual reward design.
Key takeaways
The GAN objective minimizes the Jensen-Shannon divergence between the data and generator distributions when the discriminator is trained to optimality. The JSD saturates at its maximum value, $\log 2$, when the distributions have disjoint support, causing vanishing generator gradients and contributing to mode collapse. Wasserstein GANs replace JSD with the earth mover's distance, which provides meaningful gradients for non-overlapping distributions and improves mode coverage. The Lipschitz constraint required by the Wasserstein formulation is enforced by gradient penalty (WGAN-GP) or spectral normalization. Conditional GANs inject class or text conditioning into both generator and discriminator. StyleGAN's $\mathcal{W}$-space mapping disentangles style attributes and enables semantic image editing.
Conceptual questions
- Derive that the GAN value function with the optimal discriminator equals $-\log 4 + 2\,\mathrm{JSD}(p_{\text{data}} \,\|\, p_g)$. Then show that when $p_{\text{data}}$ and $p_g$ have disjoint support, $\mathrm{JSD}(p_{\text{data}} \,\|\, p_g) = \log 2$ regardless of the geometric distance between the distributions. Why does this imply that the generator's gradient is zero in this regime?
- The Wasserstein-1 distance between two Gaussians $\mathcal{N}(\mu_1, \sigma^2 I)$ and $\mathcal{N}(\mu_2, \sigma^2 I)$ in $\mathbb{R}^d$ equals $\|\mu_1 - \mu_2\|_2$. Show this using the optimal transport interpretation. How does this compare to the KL divergence between the same two Gaussians, and what does this imply about which distance provides better gradients early in GAN training?
- A WGAN-GP critic is trained on images from two modes of a distribution: one mode at high contrast and one at low contrast. After training, the generator produces only high-contrast images (mode collapse). Analyze whether WGAN-GP should theoretically prevent this behavior, and if it does not, identify the failure mechanism. What modification to the training procedure would encourage the generator to cover both modes?
- Spectral normalization divides each weight matrix by its spectral norm to enforce Lipschitz continuity. If a discriminator has $L$ layers each with spectral norm exactly 1 after normalization (and 1-Lipschitz activations), show that the Lipschitz constant of the full network is bounded by $1$. Now suppose during training, one layer's weight matrix develops a spectral norm of 1.5 before renormalization at the next gradient step. What is the actual Lipschitz constant of the network during this interval, and does it satisfy the WGAN requirement?
- StyleGAN injects style codes into each layer via adaptive instance normalization: the feature map at each layer is normalized and then rescaled by $(\gamma, \beta)$ derived from $w$. Explain how this allows the same generator architecture to produce images spanning many styles without catastrophic interference between styles. What failure mode would you expect if $w$ were injected only at the first layer rather than at every layer?
Looking ahead
GANs produce sharp samples but suffer from training instability and mode collapse. The next two model families take a completely different approach: defining distributions through explicit energy functions or score functions rather than implicit generators.
Week 4: Energy-Based Models and Score Matching. We examine how to define $p_\theta(x) = \exp(-E_\theta(x))/Z(\theta)$, why the partition function $Z(\theta)$ makes direct MLE intractable, and how score matching and denoising score matching enable learning without computing normalizing constants.
Further reading
- Goodfellow, I. J., et al. (2014). Generative Adversarial Nets. NeurIPS. (The seminal GAN paper).
- Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. ICML. (Addresses vanishing gradients by replacing the JSD with the Wasserstein distance).
- Karras, T., Laine, S., & Aila, T. (2019). A Style-Based Generator Architecture for Generative Adversarial Networks (StyleGAN). CVPR.
- Gulrajani, I., et al. (2017). Improved Training of Wasserstein GANs (WGAN-GP). NeurIPS.
- Heusel, M., et al. (2017). GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium (FID, TTUR). NeurIPS.