Week 13: Bias, Fairness, and Safety in VLMs

Purpose of this lecture#

A Vision-Language Model (VLM) trained on web-scale data inherits the biases, stereotypes, and representation imbalances present in that data. These inherited biases are not uniform errors that degrade performance evenly; they are systematic distortions of the latent space, concentrated on specific demographic groups and cultural contexts.

For deployment in high-stakes domains (healthcare, surveillance, robotics), understanding and mitigating these biases is as important as maximizing benchmark accuracy. A robotic arm guided by a VLM must act safely regardless of the demographics of the human handing it an object. This lecture examines the geometric sources of representation bias, the mechanisms by which text-to-image models amplify stereotypes, and the alignment algorithms, specifically Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), used to constrain VLM behavior before deployment.

Sources of representation bias in the latent space#

When we train a model like CLIP on roughly 400 million web-scraped image-text pairs (or an open counterpart on LAION-400M or the much larger LAION-5B, which contains billions of pairs), we are not training it on an objective representation of reality; we are training it on the distribution of who creates and publishes internet content.

Geographic Imbalance: Datasets are overwhelmingly dominated by images and text from North America and Europe. Consequently, the visual embeddings for concepts like "wedding," "house," or "breakfast" are tightly clustered around Western representations. If a VLM is shown a photo of a traditional South Asian wedding, the distance in the latent space between the image vector and the text vector for "wedding" tends to be larger than it would be for a Western wedding, leading to lower-confidence predictions or outright misclassifications.
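To make the geometric claim concrete, the following minimal sketch shows how a smaller cosine similarity between an image embedding and the "wedding" text embedding turns into a lower zero-shot confidence. The 3-dimensional vectors are made-up toy values rather than real CLIP embeddings, the label set is hypothetical, and the logit scale of 100 is an assumption modeled on the temperature that CLIP-style models typically use.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_probs(image_vec, text_vecs, logit_scale=100.0):
    """Softmax over scaled image-text cosine similarities (CLIP-style zero-shot)."""
    logits = [logit_scale * cosine(image_vec, t) for t in text_vecs]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy embeddings (NOT real CLIP outputs).
labels = ["wedding", "festival"]
text_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
western_wedding = [0.9, 0.1, 0.1]       # close to the "wedding" text vector
south_asian_wedding = [0.6, 0.58, 0.1]  # farther from it, nearer "festival"

p_western = zero_shot_probs(western_wedding, text_vecs)[0]
p_south_asian = zero_shot_probs(south_asian_wedding, text_vecs)[0]
print(f"P(wedding | western image)     = {p_western:.3f}")
print(f"P(wedding | south asian image) = {p_south_asian:.3f}")
```

In this toy setup both images are still labeled "wedding", but the second one sits much closer to the decision boundary, which is the lower-confidence behavior described above.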

Demographic Skew and Proxy Variables: Web-scraped images of professionals are demographically skewed. In the training data, the label "doctor" frequently co-occurs with images of white men, while "nurse" co-occurs with images of women. The neural network learns these statistical correlations as predictive features. Even if the text prompt explicitly avoids demographic terms, the VLM learns to use features like skin tone or hair length as proxy variables because they help minimize its contrastive loss, which entangles occupational concepts with specific demographics.

Stereotype amplification in generation#

When biases exist in discriminative models (like CLIP), they cause misclassification. But when these same biased representations are used to condition generative models (like Stable Diffusion or DALL-E), the mathematics of the sampling process actively amplifies the bias.

Suppose the training data for the prompt "a picture of a CEO" contains 70% men and 30% women. One might expect a diffusion model to generate images matching this 70/30 distribution. In reality, the model might generate images of men 95% of the time.

Why? Because diffusion models are typically sampled with Classifier-Free Guidance (CFG), which pushes the generated image toward the modes of the conditional distribution. CFG extrapolates the noise prediction away from the unconditional prediction and past the conditional prediction, in the direction that makes the sample look more like the prompt. With a guidance weight $w > 1$, this sharpens the distribution around its highest-probability regions and suppresses its "tails" (the 30% minority representation), resulting in severe stereotype amplification.
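The sharpening effect can be checked with a small worked example. The sketch below is an idealization: it assumes guided sampling draws exactly from the distribution proportional to $p(x)^{1-w}\, p(x \mid c)^{w}$ (the density whose score matches the guided noise prediction), which real diffusion samplers only approximate, and it collapses the image to a single binary attribute. The uniform unconditional prior is also an assumption made for illustration.

```python
def guided_distribution(p_cond, p_uncond, w):
    """Renormalize p_uncond^(1 - w) * p_cond^w over a discrete attribute.

    w = 1 recovers the conditional distribution; w > 1 sharpens it.
    """
    unnorm = [pu ** (1.0 - w) * pc ** w for pc, pu in zip(p_cond, p_uncond)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# The 70/30 split from the "CEO" example; uniform prior is an assumption.
p_cond = [0.7, 0.3]    # p(attribute | "a picture of a CEO")
p_uncond = [0.5, 0.5]  # p(attribute) with no prompt

for w in (1.0, 3.0, 7.0):
    majority_share = guided_distribution(p_cond, p_uncond, w)[0]
    print(f"w = {w}: majority share = {majority_share:.3f}")
```

Under these assumptions the majority share grows from 0.700 at $w=1$ to about 0.927 at $w=3$ and about 0.997 at $w=7$, so a 70/30 imbalance in the data becomes a near-total imbalance in the samples.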

RLHF: Aligning VLMs via Reinforcement Learning#

Standard VLM pretraining (like LLaVA's Stage 1) optimizes the negative log-likelihood of the training data. It does not optimize for human values, safety, or truthfulness. Reinforcement Learning from Human Feedback (RLHF) is designed to bridge this gap.

As introduced in Course 1, the RLHF pipeline consists of three steps, applied here to multimodal inputs:

Supervised Fine-Tuning (SFT): The VLM is trained on high-quality, curated conversational data to establish a baseline instruction-following policy $\pi_\text{SFT}$ .
Reward Modeling: The VLM is given an image and a prompt, and generates two different responses ( $y_1, y_2$ ). Human annotators rank which response is safer or more helpful. A separate Reward Model $r_\phi(x, y)$ is trained to output a scalar score predicting the human preference.
PPO Optimization: The VLM policy $\pi_\theta$ is treated as an RL agent. Its action space is the VLM vocabulary, and the reward model supplies the reward $r_\phi$ . The model is optimized using Proximal Policy Optimization (PPO) to maximize the expected reward, minus a KL-divergence penalty that keeps the policy close to $\pi_\text{SFT}$ :

$$
\max_\theta \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(\cdot \mid x)} \left[ r_\phi(x, y) - \beta D_\text{KL}(\pi_\theta(y \mid x) \| \pi_\text{SFT}(y \mid x)) \right]
$$
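As a rough illustration of the quantity inside the expectation above, the sketch below computes a KL-penalized reward for one sampled response. It assumes the common single-sample estimate of the KL term, namely the summed per-token log-ratio $\log \pi_\theta(y \mid x) - \log \pi_\text{SFT}(y \mid x)$ for a response drawn from $\pi_\theta$. The function name, the example numbers, and the default $\beta = 0.1$ are hypothetical.

```python
def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Return r(x, y) - beta * [log pi_theta(y|x) - log pi_SFT(y|x)].

    policy_logprobs / ref_logprobs hold per-token log-probabilities of the
    SAME sampled response y under the policy and the frozen SFT model.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("both models must score the same token sequence")
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * log_ratio

# Hypothetical numbers: the policy has become far more confident than the
# SFT model on this response, so part of its reward is taxed away.
shaped = kl_penalized_reward(
    reward=2.0,
    policy_logprobs=[-0.1, -0.2, -0.1],
    ref_logprobs=[-1.0, -1.5, -0.9],
    beta=0.1,
)
print(f"shaped reward = {shaped:.2f}")
```

The summed log-ratio here is 3.0, so with $\beta = 0.1$ the raw reward of 2.0 is reduced to 1.7; a larger $\beta$ would tax the same drift more heavily.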

The Mathematics of Reward Hacking#

The KL-divergence penalty $D_\text{KL}$ is not optional. The Reward Model $r_\phi$ is just a neural network, an imperfect proxy for actual human values. If we optimize $\pi_\theta$ against $r_\phi$ without the KL penalty, the policy will discover adversarial regions where the reward model is badly calibrated, outputting bizarre, sycophantic, or grammatically broken sentences that trick the reward model into outputting a high score such as $+10.0$ (Reward Hacking). The KL penalty anchors the policy, forcing it to maximize the reward while remaining close to the SFT policy's distribution over fluent human language.
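The anchoring effect can be seen in the known closed-form solution of the KL-regularized objective, $\pi^*(y \mid x) \propto \pi_\text{SFT}(y \mid x) \exp(r_\phi(x, y)/\beta)$. The sketch below evaluates it on a toy set of three candidate responses; the candidates, their reference probabilities, and their reward scores are all invented for illustration, with the third standing in for a degenerate output that the reward model mistakenly scores at $+10$.

```python
import math

def kl_regularized_optimum(ref_probs, rewards, beta):
    """Optimal policy: pi*(y) proportional to pi_SFT(y) * exp(r(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical candidates for one prompt.
responses = ["fluent and helpful", "fluent but unhelpful", "gibberish exploit"]
ref_probs = [0.600, 0.399, 0.001]  # probabilities under pi_SFT
rewards = [1.0, 0.0, 10.0]         # reward-model scores; the last is a flaw

strong_anchor = kl_regularized_optimum(ref_probs, rewards, beta=5.0)
weak_anchor = kl_regularized_optimum(ref_probs, rewards, beta=0.5)

print(f"beta = 5.0: P(exploit) = {strong_anchor[2]:.4f}")
print(f"beta = 0.5: P(exploit) = {weak_anchor[2]:.4f}")
```

With a strong penalty the exploit stays below one percent of the probability mass because the SFT model almost never produces it; with a weak penalty the same mis-scored response takes over nearly the whole distribution.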

Direct Preference Optimization (DPO)#

PPO-based RLHF is notoriously unstable and memory-hungry. PPO requires four large models in GPU memory simultaneously (the policy, the reference policy, the reward model, and the value model), which is prohibitively expensive for 70B-parameter VLMs.

Direct Preference Optimization (DPO; Rafailov et al., 2023) bypasses this by showing that the KL-constrained RLHF objective has a closed-form optimal policy, so the policy can be fit to preference data directly, without training a separate reward model or running an RL loop.

DPO leverages the Bradley-Terry model of preferences, reparameterizing the reward function entirely in terms of the policy and the reference policy. Given a preferred response $y_w$ and a rejected response $y_l$ for a multimodal prompt $x$ , DPO optimizes the policy directly via binary cross-entropy:

$$
\mathcal{L}_\text{DPO}(\pi_\theta; \pi_\text{ref}) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)} \right) \right]
$$

The loss is large when the implicit reward $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_\text{ref}(y \mid x)}$ of the losing response exceeds that of the winning response. DPO is reported to reach alignment quality comparable to PPO on many tasks while requiring only two models in memory ( $\pi_\theta$ and $\pi_\text{ref}$ ), which has made it a popular choice for the open-source alignment of VLMs.
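The DPO loss for a single preference pair can be sketched in a few lines. The version below assumes the four sequence-level log-probabilities have already been computed (summed over response tokens) by the policy and the frozen reference model; the function names and example values are hypothetical. It also returns $\sigma(\hat r_l - \hat r_w)$, the scalar that weights the DPO gradient for the pair, where $\hat r$ denotes the implicit reward.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss and gradient weight for one (x, y_w, y_l) triple.

    Inputs are sequence log-probabilities log pi(y | x).
    """
    reward_w = beta * (policy_logp_w - ref_logp_w)  # implicit reward of y_w
    reward_l = beta * (policy_logp_l - ref_logp_l)  # implicit reward of y_l
    margin = reward_w - reward_l
    loss = -log_sigmoid(margin)
    grad_weight = math.exp(log_sigmoid(-margin))    # sigma(reward_l - reward_w)
    return loss, grad_weight

# At initialization the policy equals the reference: margin is zero.
loss_start, weight_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After training the policy favors y_w and disfavors y_l relative to the reference.
loss_learned, weight_learned = dpo_loss(-5.0, -40.0, -12.0, -15.0)

print(f"start:   loss = {loss_start:.4f}, gradient weight = {weight_start:.4f}")
print(f"learned: loss = {loss_learned:.4f}, gradient weight = {weight_learned:.4f}")
```

At initialization the loss is $\log 2$ and the gradient weight is 0.5; once the pair is ordered with a comfortable margin, both the loss and the gradient weight shrink, so well-learned pairs contribute little further update.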

RLAIF (Constitutional AI)#

Generating 100,000 human preference pairs for DPO is incredibly expensive. RLAIF (RL from AI Feedback) replaces the human raters with a larger, highly aligned "teacher" LLM (like GPT-4). Under the Constitutional AI framework, the teacher LLM is given a constitution (e.g., "Choose the response that is least harmful and relies strictly on the visual evidence provided"). The teacher LLM automatically generates preference labels $(y_w, y_l)$ at scale, allowing rapid, automated DPO alignment.
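The labeling loop described above might look roughly like the following sketch. Everything here is hypothetical scaffolding: `judge` stands in for a call to a teacher model that can see the image and returns "A" or "B", and `toy_judge` is a trivial placeholder used only so the example runs without any external service. No real API is being modeled.

```python
from typing import Callable, Dict, List, Tuple

CONSTITUTION = (
    "Choose the response that is least harmful and relies strictly "
    "on the visual evidence provided."
)

# (constitution, image_id, prompt, response_a, response_b) -> "A" or "B"
Judge = Callable[[str, str, str, str, str], str]

def label_pairs(
    samples: List[Tuple[str, str, str, str]],
    judge: Judge,
    constitution: str = CONSTITUTION,
) -> List[Dict[str, str]]:
    """Turn (image_id, prompt, response_a, response_b) samples into DPO pairs."""
    pairs = []
    for image_id, prompt, resp_a, resp_b in samples:
        verdict = judge(constitution, image_id, prompt, resp_a, resp_b)
        if verdict not in ("A", "B"):
            continue  # skip unparseable verdicts rather than guess a label
        chosen, rejected = (resp_a, resp_b) if verdict == "A" else (resp_b, resp_a)
        pairs.append(
            {"image_id": image_id, "prompt": prompt,
             "chosen": chosen, "rejected": rejected}
        )
    return pairs

def toy_judge(constitution, image_id, prompt, resp_a, resp_b):
    """Placeholder teacher: prefers the response that admits missing evidence."""
    return "A" if "do not see" in resp_a.lower() else "B"

samples = [(
    "img_001",
    "Describe the knife on the counter.",
    "There is a sharp steel knife next to the sink.",
    "I do not see a knife on the counter; it holds a cutting board and a bowl.",
)]
pairs = label_pairs(samples, toy_judge)
print(pairs[0]["chosen"])
```

In practice the teacher's verdicts are themselves imperfect, so a common precaution is to spot-check a sample of the generated labels against human judgment before training on them.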

Object Hallucination and Sycophancy#

A critical safety failure in VLMs is hallucination—asserting the existence of objects not present in the image.

Statistical Hallucination: Because the VLM is fundamentally a language model, if it sees a kitchen counter, its language priors strongly predict the word "knife." When the visual evidence is weak (for example, at low image resolution), the VLM tends to fall back on its language priors and hallucinate the knife. Object hallucination in captions is commonly measured with the CHAIR (Caption Hallucination Assessment with Image Relevance) metric, which counts mentioned objects that are absent from the image.
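A minimal sketch of the two CHAIR variants follows. CHAIR_i is the fraction of mentioned objects that are not in the image, and CHAIR_s is the fraction of captions containing at least one such object. The sketch assumes object mentions have already been extracted from each caption and mapped to the same vocabulary as the ground-truth annotations; the real metric handles that step with a fixed object category list and synonym matching, which is omitted here. The example data is invented.

```python
def chair_scores(mentioned_objects, ground_truth_objects):
    """Compute (CHAIR_i, CHAIR_s) over a set of captions.

    mentioned_objects:    list of lists, objects named in each caption
    ground_truth_objects: list of sets, objects actually present in each image
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for mentioned, present in zip(mentioned_objects, ground_truth_objects):
        hallucinated = [obj for obj in mentioned if obj not in present]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            captions_with_hallucination += 1
    n_captions = len(mentioned_objects)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / n_captions if n_captions else 0.0
    return chair_i, chair_s

# Hypothetical example: the first caption invents a knife.
mentioned = [["counter", "knife"], ["truck"]]
present = [{"counter", "sink"}, {"truck"}]
chair_i, chair_s = chair_scores(mentioned, present)
print(f"CHAIR_i = {chair_i:.3f}, CHAIR_s = {chair_s:.3f}")
```

Here one of three object mentions is hallucinated (CHAIR_i of one third) and one of two captions contains a hallucination (CHAIR_s of one half); lower is better for both.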

Sycophancy: If a user prompts the VLM with a leading question such as "Can you describe the red car in this image?" (when the image only contains a blue truck), unaligned VLMs will often agree with the user and hallucinate a red car. This occurs because the SFT data often features agreeable, helpful assistants. To counter this, the DPO preference dataset must explicitly contain pairs where the "winning" response ( $y_w$ ) politely contradicts the user's false visual premise, while the "losing" response ( $y_l$ ) exhibits sycophancy.
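One such false-premise preference pair could be represented as in the sketch below. The record layout, field names, and response templates are assumptions chosen for illustration, not a standard dataset schema; real datasets would use varied, human-written or model-written responses rather than fixed templates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    image_id: str
    prompt: str
    chosen: str    # y_w: corrects the false premise
    rejected: str  # y_l: sycophantic agreement

def make_false_premise_pair(image_id, false_object, true_object):
    """Build one anti-sycophancy pair from a known false visual premise."""
    prompt = f"Can you describe the {false_object} in this image?"
    chosen = (
        f"I don't see a {false_object} in this image. "
        f"I do see a {true_object}, which I can describe instead."
    )
    rejected = f"Sure! The {false_object} is clearly visible in the center of the image."
    return PreferencePair(image_id, prompt, chosen, rejected)

pair = make_false_premise_pair("img_042", "red car", "blue truck")
print(pair.prompt)
print("y_w:", pair.chosen)
print("y_l:", pair.rejected)
```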

Key takeaways#

VLMs inherit the geographic and demographic imbalances of the internet, leading to systematic misclassifications and the amplification of stereotypes via generative mechanisms like Classifier-Free Guidance. Standard supervised training is insufficient to ensure safe deployment. Modern VLMs are aligned using RLHF, and increasingly DPO, which optimizes the policy directly against human or AI-generated preference pairs without the memory overhead of PPO. By explicitly constructing preference datasets that penalize object hallucination, sycophancy, and biased outputs, engineers use DPO to reshape the generative probability distribution so that the VLM better respects both physical reality and safety constraints.

Conceptual questions#

CFG Stereotype Amplification: A diffusion model generates images conditioned on the prompt "A nurse." Let $p(x|c)$ be the conditional distribution learned from the biased training data. Classifier-Free Guidance modifies the sampling score mathematically as: $\tilde{\epsilon}_\theta = \epsilon_\theta(x_t) + w (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t))$ . Explain mathematically how increasing the guidance weight $w$ (e.g., from $w=1$ to $w=7$ ) pushes the probability mass away from the margins and aggressively towards the dominant mode of the distribution. Why does this make it nearly impossible to generate diverse representations at high CFG scales?
DPO Gradient Analysis: Look at the DPO loss function $\mathcal{L}_\text{DPO}$ . The core mechanism relies on the ratio $\frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)}$ . If the policy $\pi_\theta$ begins to assign a massively higher probability to the winning response $y_w$ than the reference policy did, what happens to the gradient update for those specific tokens? How does this act as a dynamic, self-regulating KL-divergence penalty that prevents the policy from deviating too far from the reference model?
Reward Hacking in Robotics: You train an embodied VLM to guide a robot arm using standard RL. The reward model $r_\phi$ gives $+10$ when the camera sees the target object in the gripper. Without a KL penalty, the VLM policy discovers "Reward Hacking." Describe a physical scenario where the VLM achieves a $+10$ score from the reward model without actually completing the physical task (e.g., exploiting a visual occlusion or camera angle). How would applying DPO alignment to the VLM prior to RL training mitigate this?
Proxy Discrimination: A logistics company uses a VLM to screen packages on a conveyor belt. The company explicitly removes all "brand names" from the training data labels to ensure fairness between competitors. However, the VLM still systematically routes packages from "Brand A" to the slow processing lane. Explain the concept of "Proxy Discrimination" in the latent space. What visual features (colors, shapes, tape placement) might the VLM's frozen CLIP encoder be mathematically leveraging to reconstruct the prohibited "brand" classification?
Constitutional AI Design: You are using RLAIF to align a medical VLM. You must write the "Constitution" that the teacher LLM will use to score pairs of diagnostic outputs. The VLM frequently suffers from Sycophancy (agreeing with anxious patients' incorrect self-diagnoses) and Object Hallucination. Draft two strict, explicit constitutional principles that specifically target these mathematical failure modes, ensuring the teacher LLM assigns the "winning" label $y_w$ to the safer response.

Solutions

CFG stereotype amplification. Increasing $w$ extrapolates along $\epsilon_\theta(x_t,c) - \epsilon_\theta(x_t)$ , pushing samples further in the conditioning direction — toward the high-density mode of $p(x\mid c)$ and away from the low-density margins. Since the biased data's mode is the stereotype, a high $w$ concentrates probability mass there and suppresses rare variants, making diverse generations nearly impossible at high guidance.
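DPO gradient. The gradient of the DPO loss is weighted by $\sigma(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w))$ , where $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_\text{ref}(y \mid x)}$ is the implicit reward. When $\pi_\theta$ already assigns much higher probability to $y_w$ than $\pi_\text{ref}$ does (and has not raised $y_l$ by a similar amount), the implicit reward margin is large, the sigmoid saturates, and the gradient for those tokens shrinks toward zero. The implicit reward itself is not bounded, but the update that would increase it vanishes once the pair is confidently ordered, so the policy stops being pushed further from the reference. This acts as a dynamic, self-regulating counterpart to the KL penalty.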
Reward hacking. The VLM could angle the camera or use an occlusion so the target merely appears in the gripper — e.g. the object resting behind the gripper, or grabbing a look-alike object — scoring $+10$ without a real grasp. DPO pre-alignment trains the policy toward human-preferred, genuinely-completing behavior and constrains it near a sensible reference before RL, leaving less room to exploit the reward model's blind spots.
Proxy discrimination. Even with brand labels removed, the frozen CLIP encoder reconstructs brand identity from correlated visual features — package color scheme, logo shape, tape and box style — that act as proxies. The latent representation still clusters by brand, so the downstream router relearns the prohibited classification indirectly.
Constitutional principles. For example: (a) "Prefer the response whose assessment follows the image evidence even when it contradicts the patient's stated belief; penalize responses that change the diagnosis to agree with the patient" (anti-sycophancy). (b) "Prefer the response that references only findings actually visible in the image; penalize any response asserting findings not grounded in the visual input" (anti-hallucination). The teacher LLM assigns the winning label $y_w$ to the safer, evidence-grounded response.

Looking ahead#

With the technical, evaluative, and ethical foundations of VLMs established across thirteen weeks, the final lecture synthesizes these elements into a complete practitioner methodology for building, evaluating, and deploying real-world multimodal systems.

Week 14: Vision-Language Capstone. We integrate the course's content into end-to-end case studies: fine-tuning a LLaVA-style model for a domain-specific application (Track A) and designing a VLM-based perception and planning system for an embodied robotics task (Track B).
