
Week 13: Bias, Fairness, and Safety in VLMs

✦Learning Outcomes
  • Analyze stereotype amplification in text-to-image generation
  • Apply RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) alignment techniques to VLMs
  • Design fairness evaluation and mitigation strategies
◆Prerequisites
  • Completion of most Course 4 weeks recommended
  • Understanding of VLM (Vision-Language Model) architectures and training from earlier weeks

Purpose of this lecture

A Vision-Language Model trained on web-scale data inherits the biases, stereotypes, and representation imbalances present in that data. These inherited biases are not uniform errors that degrade performance evenly; they are systematic distortions of the latent space, concentrated on specific demographic groups and cultural contexts.

For deployment in high-stakes domains (healthcare, surveillance, robotics), understanding and mitigating these biases is as important as maximizing benchmark accuracy. A robotic arm guided by a VLM must act safely regardless of the demographic of the human handing it an object. This lecture examines the geometric sources of representation bias, the mechanisms by which text-to-image models amplify stereotypes, and the alignment algorithms used to constrain VLM behavior before deployment, specifically Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO).


Sources of representation bias in the latent space

When we train a model like CLIP on roughly 400 million image-text pairs (OpenAI's original WebImageText collection, or an open counterpart such as LAION-400M), we are not training it on an objective representation of reality; we are training it on the distribution of who creates and publishes internet content.

Geographic Imbalance: Datasets are overwhelmingly dominated by images and text from North America and Europe. Consequently, the visual embeddings for concepts like "wedding," "house," or "breakfast" are tightly clustered around Western representations. If a VLM is shown a photo of a traditional South Asian wedding, the distance in the latent space between the image vector and the text vector for "wedding" tends to be larger than it would be for a Western wedding, leading to lower-confidence predictions or outright misclassifications.
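To make the distance claim concrete, here is a minimal sketch of CLIP-style zero-shot scoring: cosine similarity between an image embedding and each text embedding, followed by a scaled softmax. The 3-d vectors, the label set, and the scale of 100 are made-up illustrative values, not outputs of any real model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_probs(image_vec, text_vecs, scale=100.0):
    """Softmax over scaled cosine similarities (one probability per label)."""
    logits = [scale * cosine(image_vec, t) for t in text_vecs]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical text embeddings for the labels "wedding" and "festival".
text_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Hypothetical image embeddings: the first sits close to the "wedding"
# text vector, the second sits farther away from it.
well_represented = [0.90, 0.10, 0.10]
under_represented = [0.52, 0.50, 0.30]

p_well = zero_shot_probs(well_represented, text_vecs)
p_under = zero_shot_probs(under_represented, text_vecs)
print("P(wedding), well-represented image: ", round(p_well[0], 3))
print("P(wedding), under-represented image:", round(p_under[0], 3))
```

Both toy images are classified as "wedding", but the one farther from the text vector receives a lower probability, so it is the first to flip when a competing label is only slightly closer.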

Demographic Skew and Proxy Variables: Web-scraped images of professionals are demographically skewed. In the training data, the label "doctor" frequently co-occurs with images of white men, while "nurse" co-occurs with images of women. The neural network learns these statistical correlations as predictive features. Even if the text prompt explicitly avoids demographic terms, the VLM can learn to use features like skin tone or hair length as proxy variables to minimize its contrastive loss, entangling occupational concepts with specific demographics.


Stereotype amplification in generation

When biases exist in discriminative models (like CLIP), they cause misclassification. But when these same biased representations are used to condition generative models (like Stable Diffusion or DALL-E), the mathematics of the sampling process actively amplifies the bias.

Suppose the training data for the prompt "a picture of a CEO" contains 70% men and 30% women. One might expect a diffusion model to generate images matching this 70/30 distribution. In reality, the model might generate images of men 95% of the time.

Why? Because diffusion samplers typically use Classifier-Free Guidance (CFG), which pushes generated images toward the high-density regions of the conditional distribution. With a guidance weight greater than 1, CFG extrapolates the noise prediction away from the unconditional prediction and past the conditional prediction. By repeatedly pushing the sampling trajectory toward the regions most strongly associated with the prompt, CFG suppresses the "tails" of the distribution (the 30% minority representation), which results in stereotype amplification.
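One idealized way to see the amplification: in score terms, guidance with weight w corresponds to a sharpened density proportional to p(x)^(1-w) · p(x|c)^w. The sketch below applies that to a two-category attribute. The 70/30 conditional split comes from the example above; the 50/50 unconditional split and the function name are assumptions for illustration, and real diffusion samplers do not sample exactly from this sharpened distribution.

```python
def guided_shares(p_cond, p_uncond, w):
    """Normalize p_uncond^(1-w) * p_cond^w over discrete categories."""
    weights = [pu ** (1.0 - w) * pc ** w for pc, pu in zip(p_cond, p_uncond)]
    total = sum(weights)
    return [x / total for x in weights]

p_cond = [0.7, 0.3]    # share in the training data for the prompt
p_uncond = [0.5, 0.5]  # assumed share with no prompt (illustrative)

for w in (1.0, 3.0, 7.0):
    majority, minority = guided_shares(p_cond, p_uncond, w)
    print(f"w={w:.0f}: majority={majority:.3f}, minority={minority:.3f}")
# w=1: majority=0.700, minority=0.300
# w=3: majority=0.927, minority=0.073
# w=7: majority=0.997, minority=0.003
```

At w=1 the training split is reproduced; at typical guidance scales the minority share in this toy model all but vanishes.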


RLHF: Aligning VLMs via Reinforcement Learning

Standard VLM pretraining (like LLaVA's Stage 1) minimizes the negative log-likelihood of the training data. It does not optimize for human values, safety, or truthfulness. Reinforcement Learning from Human Feedback (RLHF) is designed to close this gap.

As introduced in Course 1, the RLHF pipeline consists of three steps, applied here to multimodal inputs:

  1. Supervised Fine-Tuning (SFT): The VLM is trained on high-quality, curated conversational data to establish a baseline instruction-following policy $\pi_\text{SFT}$.
  2. Reward Modeling: The VLM is given an image and a prompt, and generates two different responses ($y_1, y_2$). Human annotators rank which response is safer or more helpful. A separate Reward Model $r_\phi(x, y)$ is trained to output a scalar score predicting the human preference.
  3. PPO: The VLM policy $\pi_\theta$ is treated as an RL agent. Its action space is the VLM vocabulary. The reward model supplies the reward $r_\phi$. The policy is optimized using Proximal Policy Optimization (PPO) to maximize the expected reward minus a KL penalty, weighted by $\beta$, that keeps it close to $\pi_\text{SFT}$:
$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \left[ r_\phi(x, y) - \beta D_\text{KL}\big(\pi_\theta(y \mid x) \,\|\, \pi_\text{SFT}(y \mid x)\big) \right]$$
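As a rough sketch of the bracketed quantity for one sampled response, the KL term can be estimated from that single sample as log π_θ(y|x) − log π_SFT(y|x). The function name, argument names, and numbers below are hypothetical; many practical implementations apply the penalty per token instead of per sequence.

```python
def penalized_reward(reward, policy_token_logps, sft_token_logps, beta=0.1):
    """Single-sample estimate of r_phi(x, y) - beta * KL(pi_theta || pi_SFT).

    reward: scalar score from the reward model for response y.
    policy_token_logps / sft_token_logps: per-token log-probabilities of y
    under the current policy and under the frozen SFT policy.
    """
    logp_policy = sum(policy_token_logps)  # log pi_theta(y | x)
    logp_sft = sum(sft_token_logps)        # log pi_SFT(y | x)
    kl_estimate = logp_policy - logp_sft   # one-sample KL estimate
    return reward - beta * kl_estimate

# Illustrative numbers: the policy is more confident than the SFT model
# about this response, so part of the reward is paid back as a penalty.
value = penalized_reward(2.0, [-0.1, -0.2], [-0.5, -0.7], beta=0.1)
print(round(value, 2))  # 1.91
```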

The Mathematics of Reward Hacking

The KL-divergence penalty $D_\text{KL}$ is not an optional extra; in practice it is essential. The Reward Model $r_\phi$ is just a neural network, an imperfect proxy for actual human values. If we optimize $\pi_\theta$ against $r_\phi$ without the KL penalty, the policy tends to discover adversarial inputs to the reward model, outputting bizarre, sycophantic, or grammatically broken sentences that trick it into assigning a very high score such as $+10.0$ (Reward Hacking). The KL penalty anchors the policy, forcing it to maximize the reward while remaining close to the SFT policy's distribution over fluent, human-like language.
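A toy illustration of this anchoring effect: for a discrete set of candidate responses, the KL-regularized objective has the known closed-form optimum π*(y) ∝ π_SFT(y) · exp(r(y)/β). The three candidates, their SFT probabilities, and the proxy rewards below are invented for illustration, with the reward model deliberately overrating a degenerate string.

```python
import math

def kl_regularized_optimum(ref_probs, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), computed stably."""
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical candidates: grounded answer, sycophantic answer, broken string.
sft_probs = [0.700, 0.299, 0.001]
proxy_rewards = [1.0, 2.0, 10.0]  # the reward model is fooled by the last one

strong_penalty = kl_regularized_optimum(sft_probs, proxy_rewards, beta=5.0)
weak_penalty = kl_regularized_optimum(sft_probs, proxy_rewards, beta=0.5)
print("beta=5.0:", [round(p, 3) for p in strong_penalty])
print("beta=0.5:", [round(p, 3) for p in weak_penalty])
```

With the larger β the grounded answer keeps most of the probability mass; with the smaller β nearly all mass moves to the string the reward model overrates.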


Direct Preference Optimization (DPO)

PPO-based RLHF is often unstable and resource-hungry. A standard PPO setup keeps four large models in GPU memory at once (the Policy, the Reference Policy, the Reward Model, and the Value function), which is prohibitively expensive for many teams at the 70B-parameter scale.

Direct Preference Optimization (DPO; Rafailov et al., 2023) sidesteps this. It shows that, under the Bradley-Terry preference model, the KL-regularized RLHF objective has a closed-form optimal policy, so the policy can be fit directly to preference data without training a separate reward model or running an RL loop.

DPO leverages the Bradley-Terry model of preferences, reparameterizing the reward function entirely in terms of the policy itself. Given a preferred response $y_w$ and a rejected response $y_l$ for a multimodal prompt $x$, DPO optimizes the policy directly via a binary cross-entropy loss:

$$\mathcal{L}_\text{DPO}(\pi_\theta; \pi_\text{ref}) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)} \right) \right]$$

The loss is large whenever the implicit reward $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_\text{ref}(y \mid x)}$ of the losing response exceeds that of the winning response. DPO is reported to reach alignment quality comparable to PPO in many settings, while requiring only two models in memory ($\pi_\theta$ and $\pi_\text{ref}$), which has made preference alignment far more accessible for open-source VLMs.
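The loss for a single preference pair can be sketched in a few lines. This version takes sequence-level log-probabilities as plain floats and uses only the standard library; the function and argument names are hypothetical, and a real training loop would compute these log-probabilities with an autodiff framework over batches.

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(implicit_reward(y_w) - implicit_reward(y_l)) for one pair."""
    reward_w = beta * (policy_logp_w - ref_logp_w)  # implicit reward of y_w
    reward_l = beta * (policy_logp_l - ref_logp_l)  # implicit reward of y_l
    margin = reward_w - reward_l
    # -log(sigmoid(m)) = log(1 + exp(-m)), written to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy prefers the winner more than the reference does: lower loss.
print(dpo_loss(-4.0, -9.0, -5.0, -5.0, beta=1.0))
# Policy prefers the loser: higher loss.
print(dpo_loss(-9.0, -4.0, -5.0, -5.0, beta=1.0))
```

Only the policy and the frozen reference are queried here, which is where the two-model memory footprint comes from.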

RLAIF (Constitutional AI)

Collecting 100,000 human preference pairs for DPO is expensive. RLAIF (Reinforcement Learning from AI Feedback) replaces the human raters with a larger, highly aligned "teacher" LLM (like GPT-4). Under the Constitutional AI framework, the teacher LLM is given a constitution (e.g., "Choose the response that is least harmful and relies strictly on the visual evidence provided"). The teacher LLM labels which response in each pair is preferred, producing $(y_w, y_l)$ pairs at scale and allowing rapid, automated DPO alignment.
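A minimal sketch of the labelling step might look like the following. The `judge` callable stands in for a call to a teacher model and is entirely hypothetical, as are the record field names; `stub_judge` is a keyword heuristic used only so the example runs without any external service.

```python
CONSTITUTION = (
    "Choose the response that is least harmful and relies strictly "
    "on the visual evidence provided."
)

def label_pair(judge, image_ref, prompt, response_a, response_b):
    """Ask a judge which response it prefers and emit a (y_w, y_l) record."""
    choice = judge(CONSTITUTION, image_ref, prompt, response_a, response_b)
    if choice not in ("A", "B"):
        return None  # discard pairs where the judge gives no usable verdict
    if choice == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"image": image_ref, "prompt": prompt,
            "chosen": chosen, "rejected": rejected}

def stub_judge(constitution, image_ref, prompt, response_a, response_b):
    """Stand-in for a teacher LLM: prefers the response that declines to
    describe something it cannot see. Not a real model call."""
    return "A" if "do not see" in response_a else "B"

record = label_pair(
    stub_judge,
    "img_0001.jpg",  # hypothetical image identifier
    "Can you describe the red car in this image?",
    "Sure, the red car is parked on the left.",
    "I do not see a red car; the image shows a blue truck.",
)
print(record["chosen"])
```

In practice, teacher judgments can depend on the order in which the two responses are presented, so a common precaution is to query both orders and keep only consistent verdicts.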


Object Hallucination and Sycophancy

A critical safety failure in VLMs is hallucination—asserting the existence of objects not present in the image.

Statistical Hallucination: Because the VLM is fundamentally a language model, if it sees a kitchen counter, its language priors strongly predict the word "knife." If the visual evidence is weak (for example, a low-resolution image), the VLM tends to fall back on its language priors and hallucinate the knife. This is evaluated using the CHAIR (Caption Hallucination Assessment with Image Relevance) metric.
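CHAIR is usually reported in two variants: a per-instance score (hallucinated object mentions divided by all object mentions) and a per-sentence score (captions containing at least one hallucinated object divided by all captions). The sketch below assumes that object mentions have already been extracted from each caption and mapped to the annotation vocabulary, which is the harder part in practice; the function name and sample data are illustrative.

```python
def chair_scores(samples):
    """samples: list of (mentioned_objects, ground_truth_objects) pairs.

    Returns (chair_i, chair_s): the fraction of object mentions that are
    hallucinated, and the fraction of captions with any hallucination.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions = 0
    captions_with_hallucination = 0
    for mentioned, truth in samples:
        truth = set(truth)
        hallucinated = [obj for obj in mentioned if obj not in truth]
        captions += 1
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            captions_with_hallucination += 1
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / captions if captions else 0.0
    return chair_i, chair_s

# Toy data: the first caption mentions a knife that is not annotated.
samples = [
    (["counter", "knife"], ["counter", "sink"]),
    (["dog", "ball"], ["dog", "ball", "grass"]),
]
print(chair_scores(samples))  # (0.25, 0.5)
```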

Sycophancy: If a user prompts the VLM with a leading question such as "Can you describe the red car in this image?" (when the image only contains a blue truck), unaligned VLMs will often agree with the user and hallucinate a red car. This occurs because the SFT data often features agreeable, helpful assistants. To counter this, the DPO preference dataset must explicitly contain pairs where the "winning" response ($y_w$) politely contradicts the user's false visual premise, while the "losing" response ($y_l$) exhibits sycophancy.
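One simple way such pairs could be produced is from object annotations and templates, as in the sketch below. The function name, record fields, and response templates are all hypothetical; a real dataset would need far more varied phrasing than two fixed templates.

```python
def build_sycophancy_pair(image_ref, claimed_object, actual_objects):
    """Build a preference record around a false visual premise.

    Returns None when the claimed object really is in the image, since
    there is then no false premise to contradict.
    """
    if claimed_object in actual_objects:
        return None
    prompt = f"Can you describe the {claimed_object} in this image?"
    visible = ", ".join(actual_objects)
    # y_w: politely contradicts the premise and reports what is visible.
    chosen = f"I don't see a {claimed_object} in this image. I can see: {visible}."
    # y_l: goes along with the premise and invents the object.
    rejected = f"Sure, the {claimed_object} is clearly visible in the image."
    return {"image": image_ref, "prompt": prompt,
            "chosen": chosen, "rejected": rejected}

pair = build_sycophancy_pair("img_0001.jpg", "red car", ["blue truck"])
print(pair["prompt"])
print("y_w:", pair["chosen"])
print("y_l:", pair["rejected"])
```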


Key takeaways

VLMs inherit the geographic and demographic imbalances of the internet, leading to systematic misclassifications and to the amplification of stereotypes via generative mechanisms like Classifier-Free Guidance. Standard supervised training is insufficient to ensure safe deployment. Modern VLMs are aligned using RLHF, and increasingly DPO, which optimizes the policy directly against human or AI-generated preference pairs without the memory overhead of PPO. By explicitly constructing preference datasets that penalize object hallucination, sycophancy, and biased outputs, engineers use DPO to reshape the generative probability distribution so that the VLM better respects both physical reality and safety constraints.


Conceptual questions

  1. CFG Stereotype Amplification: A diffusion model generates images conditioned on the prompt "A nurse." Let $p(x \mid c)$ be the conditional distribution learned from the biased training data. Classifier-Free Guidance modifies the sampling score as: $\tilde{\epsilon}_\theta = \epsilon_\theta(x_t) + w (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t))$. Explain mathematically how increasing the guidance weight $w$ (e.g., from $w=1$ to $w=7$) pushes the probability mass away from the margins and aggressively towards the dominant mode of the distribution. Why does this make it nearly impossible to generate diverse representations at high CFG scales?
  2. DPO Gradient Analysis: Look at the DPO loss function $\mathcal{L}_\text{DPO}$. The core mechanism relies on the ratio $\frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)}$. If the policy $\pi_\theta$ begins to assign a massively higher probability to the winning response $y_w$ than the reference policy did, what happens to the gradient update for those specific tokens? How does this act as a dynamic, self-regulating KL-divergence penalty that prevents the policy from deviating too far from the reference model?
  3. Reward Hacking in Robotics: You train an embodied VLM to guide a robot arm using standard RL. The reward model $r_\phi$ gives $+10$ when the camera sees the target object in the gripper. Without a KL penalty, the VLM policy discovers "Reward Hacking." Describe a physical scenario where the VLM achieves a $+10$ score from the reward model without actually completing the physical task (e.g., exploiting a visual occlusion or camera angle). How would applying DPO alignment to the VLM prior to RL training mitigate this?
  4. Proxy Discrimination: A logistics company uses a VLM to screen packages on a conveyor belt. The company explicitly removes all "brand names" from the training data labels to ensure fairness between competitors. However, the VLM still systematically routes packages from "Brand A" to the slow processing lane. Explain the concept of "Proxy Discrimination" in the latent space. What visual features (colors, shapes, tape placement) might the VLM's frozen CLIP encoder be leveraging to reconstruct the prohibited "brand" classification?
  5. Constitutional AI Design: You are using RLAIF to align a medical VLM. You must write the "Constitution" that the teacher LLM will use to score pairs of diagnostic outputs. The VLM frequently suffers from Sycophancy (agreeing with anxious patients' incorrect self-diagnoses) and Object Hallucination. Draft two strict, explicit constitutional principles that specifically target these failure modes, ensuring the teacher LLM assigns the "winning" label $y_w$ to the safer response.
✦Solutions
  1. CFG stereotype amplification. Increasing $w$ extrapolates along $\epsilon_\theta(x_t,c) - \epsilon_\theta(x_t)$, pushing samples further in the conditioning direction: toward the high-density mode of $p(x \mid c)$ and away from the low-density margins. Since the biased data's mode is the stereotype, a high $w$ concentrates probability mass there and suppresses rare variants, making diverse generations nearly impossible at high guidance.
  2. DPO gradient. The gradient of the loss is weighted by $\sigma(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w))$, where $\hat{r}_\theta(x, y) = \beta \log(\pi_\theta(y \mid x)/\pi_\text{ref}(y \mid x))$ is the implicit reward. When $\pi_\theta$ already assigns much higher probability to $y_w$ than $\pi_\text{ref}$ does (and the margin over $y_l$ is large), this weight shrinks toward zero, so those tokens receive almost no further update. The implicit reward itself is unbounded, but the saturating sigmoid means the policy stops being pushed once the pair is confidently ordered. This acts as a dynamic, self-regulating counterpart to the KL penalty that discourages $\pi_\theta$ from drifting far from the reference.
  3. Reward hacking. The VLM could angle the camera or use an occlusion so the target merely appears in the gripper (e.g. the object resting behind the gripper, or grabbing a look-alike object), scoring $+10$ without a real grasp. DPO pre-alignment trains the policy toward human-preferred behavior that genuinely completes the task and constrains it near a sensible reference before RL, leaving less room to exploit the reward model's blind spots.
  4. Proxy discrimination. Even with brand labels removed, the frozen CLIP encoder can reconstruct brand identity from correlated visual features (package color scheme, logo shape, tape and box style) that act as proxies. The latent representation still clusters by brand, so the downstream router relearns the prohibited classification indirectly.
  5. Constitutional principles. For example: (a) "Prefer the response whose assessment follows the image evidence even when it contradicts the patient's stated belief; penalize responses that change the diagnosis to agree with the patient" (anti-sycophancy). (b) "Prefer the response that references only findings actually visible in the image; penalize any response asserting findings not grounded in the visual input" (anti-hallucination). The teacher LLM assigns the winning label $y_w$ to the safer, evidence-grounded response.
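
For solution 2, it may help to see the gradient written out. With the implicit reward $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_\text{ref}(y \mid x)}$, differentiating the DPO loss gives the form reported by Rafailov et al. (2023):

$$\nabla_\theta \mathcal{L}_\text{DPO}(\pi_\theta; \pi_\text{ref}) = -\beta \, \mathbb{E}_{(x, y_w, y_l)} \left[ \sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big) \left( \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \right) \right]$$

The sigmoid factor is the per-example weight: it is close to 1 when the implicit rewards rank the pair the wrong way round, and it decays toward 0 as the margin $\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)$ grows.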

Looking ahead

With the technical, evaluative, and ethical foundations of VLMs established across thirteen weeks, the final lecture synthesizes these elements into a complete practitioner methodology for building, evaluating, and deploying real-world multimodal systems.

Week 14: Vision-Language Capstone. We integrate the course's content into end-to-end case studies: fine-tuning a LLaVA-style model for a domain-specific application (Track A) and designing a VLM-based perception and planning system for an embodied robotics task (Track B).


Further reading

  • Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS. (DPO).
  • Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS. (InstructGPT / RLHF pipeline).
  • Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv. (RLAIF).
  • Goyal, Y., et al. (2017). Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. CVPR. (Exposing linguistic priors and biases in VQA).