Purpose of this lecture
A Vision-Language Model (VLM) trained on web-scale data inherits the biases, stereotypes, and representation imbalances present in that data. These inherited biases are not uniform errors that degrade performance evenly; they are systematic geometric distortions in the latent space, concentrated on specific demographic groups and cultural contexts.
For deployment in high-stakes domains (healthcare, surveillance, robotics), understanding and mitigating these biases is as important as maximizing benchmark accuracy. A robotic arm guided by a VLM must act safely regardless of the demographic of the human handing it an object. This lecture examines the geometric sources of representation bias, the mechanisms by which text-to-image models amplify stereotypes, and the alignment algorithms, specifically Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), used to constrain VLM behavior before deployment.
Sources of representation bias in the latent space
When we train a model like CLIP on roughly 400 million web-scraped image-text pairs (open datasets such as LAION-400M and the larger LAION-5B are collected in a similar way), we are not training it on an objective representation of reality; we are training it on the distribution of who creates and publishes internet content.
Geographic Imbalance: Datasets are overwhelmingly dominated by images and text from North America and Europe. Consequently, the visual embeddings for concepts like "wedding," "house," or "breakfast" are tightly clustered around Western representations. If a VLM is shown a photo of a traditional South Asian wedding, the distance in the latent space between the image vector and the text vector for "wedding" is typically much larger than it would be for a Western wedding, leading to lower-confidence predictions or outright misclassifications.
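To make the link between embedding distance and confidence concrete, here is a minimal sketch of CLIP-style zero-shot scoring: cosine similarity between an image vector and each text vector, scaled and passed through a softmax. The 3-dimensional vectors are hand-picked toy values, not real CLIP embeddings, and `logit_scale=100.0` is an assumed temperature chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_probs(image_vec, text_vecs, logit_scale=100.0):
    """Softmax over scaled cosine similarities (logit_scale is an assumption)."""
    logits = [logit_scale * cosine(image_vec, t) for t in text_vecs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy text embeddings (hypothetical): index 0 = "wedding", index 1 = "festival".
text_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Toy image embeddings (hypothetical): one sits close to the "wedding" text
# vector, the other sits almost halfway between the two concepts.
western_wedding = [0.9, 0.1, 0.1]
south_asian_wedding = [0.5, 0.49, 0.2]

p_west = zero_shot_probs(western_wedding, text_vecs)
p_other = zero_shot_probs(south_asian_wedding, text_vecs)
print(f"P(wedding | western image)     = {p_west[0]:.3f}")
print(f"P(wedding | south asian image) = {p_other[0]:.3f}")
```

Both toy images are still classified as "wedding", but the second one receives a visibly lower probability because its embedding lies farther from the text vector. That is the mechanism behind the lower-confidence predictions described above.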
Demographic Skew and Proxy Variables: Web-scraped images of professionals are demographically skewed. In the training data, the label "doctor" frequently co-occurs with images of white men, while "nurse" co-occurs with images of women. The neural network learns these statistical correlations as predictive features. Even if the text prompt explicitly avoids demographic terms, the VLM learns to use features like skin tone or hair length as proxy variables to minimize its contrastive loss, deeply entangling occupational concepts with specific demographics.
Stereotype amplification in generation
When biases exist in discriminative models (like CLIP), they cause misclassification. But when these same biased representations are used to condition generative models (like Stable Diffusion or DALL-E), the mathematics of the sampling process actively amplifies the bias.
Suppose the training data for the prompt "a picture of a CEO" contains 70% men and 30% women. One might expect a diffusion model to generate images matching this 70/30 distribution. In reality, the model might generate images of men 95% of the time.
Why? Because generative models use Classifier-Free Guidance (CFG) to push the generated image strictly toward the mode of the conditional distribution. CFG mathematically extrapolates the noise prediction vector away from the unconditional mean and directly toward the conditional mean. By continuously pushing the sampling trajectory toward the highest-probability regions of the latent space, CFG systematically erases the "tails" of the distribution (the 30% minority representation), resulting in severe stereotype amplification.
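The 70/30 example can be worked through numerically. Under the common convention where the guided score is (1 - w) times the unconditional score plus w times the conditional score, guidance with weight w is often interpreted as targeting a sharpened distribution proportional to p(x)^(1-w) * p(x|c)^w. The sketch below applies that idealized interpretation to a two-outcome toy distribution. It is an analogy under stated assumptions (a uniform unconditional distribution, an exact sharpened target), not a simulation of an actual diffusion sampler.

```python
import math

def cfg_sharpened(p_cond, p_uncond, w):
    """Normalize p_uncond^(1-w) * p_cond^w over a discrete set of outcomes."""
    logits = [(1.0 - w) * math.log(u) + w * math.log(c)
              for c, u in zip(p_cond, p_uncond)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Outcomes: [man, woman] for the prompt "a picture of a CEO".
p_cond = [0.7, 0.3]    # conditional distribution learned from the training data
p_uncond = [0.5, 0.5]  # assumed unconditional distribution (illustrative)

for w in (1.0, 2.0, 3.5, 7.5):
    p_man, p_woman = cfg_sharpened(p_cond, p_uncond, w)
    print(f"w = {w:>4}: P(man) = {p_man:.3f}, P(woman) = {p_woman:.3f}")
```

At w = 1 the output reproduces the 70/30 split. At w = 3.5 the majority outcome already reaches about 0.95, and at larger weights the minority outcome nearly vanishes, which matches the amplification described above.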
RLHF: Aligning VLMs via Reinforcement Learning
Standard VLM pretraining (like LLaVA's Stage 1) optimizes the negative log-likelihood of the training data. It does not optimize for human values, safety, or truthfulness. Reinforcement Learning from Human Feedback (RLHF) bridges this gap.
As introduced in Course 1, the RLHF pipeline consists of three steps, applied here to multimodal inputs:
- Supervised Fine-Tuning (SFT): The VLM is trained on high-quality, curated conversational data to establish a baseline instruction-following policy $\pi_{\text{ref}}$.
- Reward Modeling: The VLM is given an image and a prompt (together denoted $x$) and generates two different responses ($y_1, y_2$). Human annotators rank which response is safer or more helpful. A separate Reward Model $r_\phi(x, y)$ is trained to output a scalar score predicting the human preference.
- PPO Optimization: The VLM policy $\pi_\theta$ is treated as an RL agent. Its action space is the VLM vocabulary. The environment provides the reward $r_\phi(x, y)$. The model is optimized using Proximal Policy Optimization (PPO) to maximize the expected reward, minus a KL-divergence penalty that keeps it close to the SFT policy:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big] \; - \; \beta \, D_{\text{KL}}\big[ \pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x) \big]$$
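The reward-modeling step is commonly trained with a pairwise Bradley-Terry loss, as in the InstructGPT pipeline: the reward model should score the human-preferred response above the rejected one. Below is a minimal sketch of that loss on two scalar scores. The function name and the example scores are illustrative assumptions, and a real implementation would compute this over batches of model outputs in a deep learning framework.

```python
import math

def pairwise_reward_loss(r_preferred, r_rejected):
    """Bradley-Terry loss: -log(sigmoid(r_preferred - r_rejected)).

    Written in a numerically stable softplus form so large margins
    do not overflow math.exp.
    """
    margin = r_preferred - r_rejected
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical reward-model scores for two responses to the same image and prompt.
print(pairwise_reward_loss(2.0, -1.0))  # preferred response scored higher: small loss
print(pairwise_reward_loss(-1.0, 2.0))  # ranking inverted: large loss
print(pairwise_reward_loss(0.0, 0.0))   # no preference expressed: log(2)
```

Minimizing this loss only constrains the difference between the two scores, which is one reason the learned reward is a relative proxy rather than an absolute measure of quality.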
The Mathematics of Reward Hacking
The KL-divergence penalty is not optional; it is essential. The reward model $r_\phi$ is just a neural network, an imperfect proxy for actual human values. If we optimize against $r_\phi$ without the KL penalty, the policy will discover adversarial regions in the reward model's latent space, outputting bizarre, sycophantic, or grammatically broken sentences that trick the reward model into assigning a very high score (reward hacking). The KL penalty anchors the policy, forcing it to maximize the reward while remaining close to the original distribution of human language.
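A minimal sketch of how the KL penalty changes what the policy is rewarded for: the raw reward-model score is reduced by beta times a sampled estimate of the KL term, computed from per-token log-probabilities under the current policy and the frozen reference policy. The function name, the beta value, and all numbers below are illustrative assumptions, not values from any specific training run.

```python
def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Return reward - beta * sum_t [log pi_theta(y_t) - log pi_ref(y_t)].

    logp_policy and logp_ref are per-token log-probabilities of the SAME
    sampled response under the current policy and the reference policy.
    The sum is a single-sample estimate of the sequence-level KL term.
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return reward - beta * kl_estimate

# A fluent response: policy and reference assign similar log-probabilities.
fluent = kl_penalized_reward(
    reward=3.0,
    logp_policy=[-1.0, -0.8, -1.2],
    logp_ref=[-1.1, -0.9, -1.2],
    beta=0.5,
)

# A reward-hacking response: the reward model scores it higher, but the
# reference policy considers the tokens extremely unlikely.
hacked = kl_penalized_reward(
    reward=5.0,
    logp_policy=[-0.1, -0.1, -0.1],
    logp_ref=[-6.0, -7.0, -5.0],
    beta=0.5,
)

print(f"fluent response, shaped reward: {fluent:.2f}")
print(f"hacked response, shaped reward: {hacked:.2f}")
```

With these toy numbers the hacked response wins on raw reward (5.0 versus 3.0) but loses once the penalty is applied, because its tokens have drifted far from what the reference policy would produce.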
Direct Preference Optimization (DPO)
PPO-based RLHF is notoriously unstable and memory-hungry. PPO requires loading four large models into GPU memory simultaneously (the Policy, the Reference Policy, the Reward Model, and the Value function), which is prohibitively expensive for 70B-parameter VLMs on most hardware.
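A back-of-the-envelope calculation shows the scale of the problem. The sketch below counts weight storage only and assumes that all four models have 70B parameters stored in 16-bit precision; both are simplifying assumptions (reward and value models are often smaller, and real training also needs gradients, optimizer state, and activations, which add considerably more).

```python
def weights_gb(n_params, bytes_per_param=2):
    """Weight storage in gigabytes (1 GB = 1e9 bytes), weights only."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 70_000_000_000  # 70B parameters, as in the text

per_model = weights_gb(N_PARAMS)  # 16-bit weights (assumption)
four_models = 4 * per_model       # policy, reference, reward model, value function
two_models = 2 * per_model        # policy and reference only

print(f"one 70B model:    {per_model:.0f} GB")
print(f"four 70B models:  {four_models:.0f} GB")
print(f"two 70B models:   {two_models:.0f} GB")
```

Even under these optimistic assumptions, four resident models need roughly 560 GB for weights alone, so halving the number of models is a substantial practical saving.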
Direct Preference Optimization (DPO; Rafailov et al., 2023) bypasses this by showing that the KL-regularized RLHF objective has a closed-form optimal policy, so the same objective can be optimized without ever training a separate reward model or running an RL loop.
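DPO leverages the Bradley-Terry model of preferences, reparameterizing the optimal reward function entirely in terms of the policy itself. Given a preferred response $y_w$ and a rejected response $y_l$ for a multimodal prompt $x$, DPO optimizes the policy $\pi_\theta$ directly via binary cross-entropy against a frozen reference policy $\pi_{\text{ref}}$:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$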
The model is heavily penalized if the implicit reward of the losing response exceeds the implicit reward of the winning response. DPO can reach alignment quality comparable to PPO but requires only two models in memory ($\pi_\theta$ and $\pi_{\text{ref}}$), which has made preference alignment far more accessible for open-source VLMs.
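The DPO loss for a single preference pair can be sketched in a few lines. The inputs are sequence-level log-probabilities (sums of token log-probabilities) of the winning and losing responses under the trainable policy and the frozen reference policy. The function name, the default beta, and the example values are illustrative assumptions; a real implementation would operate on batched tensors with automatic differentiation.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Single-pair DPO loss: -log(sigmoid(implicit_r_w - implicit_r_l)).

    The implicit reward of a response is beta * (log pi_theta - log pi_ref).
    """
    implicit_r_w = beta * (logp_w - ref_logp_w)
    implicit_r_l = beta * (logp_l - ref_logp_l)
    margin = implicit_r_w - implicit_r_l
    # Numerically stable -log(sigmoid(margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))

# Policy has raised the winner and lowered the loser relative to the reference.
print(dpo_loss(-9.0, -20.0, -12.0, -15.0))

# Policy has moved the wrong way: the loser gained probability, so the loss grows.
print(dpo_loss(-14.0, -11.0, -12.0, -15.0))
```

Only the two log-probability ratios enter the loss, which is why no separate reward model or value function has to be held in memory.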
RLAIF (Constitutional AI)
Generating 100,000 human preference pairs for DPO is incredibly expensive. RLAIF (Reinforcement Learning from AI Feedback) replaces the human raters with a larger, highly aligned "teacher" LLM (like GPT-4). Under the Constitutional AI framework, the teacher LLM is given a constitution (e.g., "Choose the response that is least harmful and relies strictly on the visual evidence provided"). The teacher LLM automatically generates preference labels at scale, allowing rapid, automated DPO alignment.
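The labeling loop can be sketched as follows. The `judge` argument stands in for a call to the teacher model and is entirely hypothetical: no real API is used here, and `stub_judge` is a hard-coded placeholder so the example runs offline. In a real multimodal setup the image would also be passed to the teacher; it is omitted here for brevity.

```python
from typing import Callable, Dict, List, Tuple

CONSTITUTION = (
    "Choose the response that is least harmful and relies strictly "
    "on the visual evidence provided."
)

# Hypothetical judge signature: (constitution, prompt, resp_a, resp_b) -> "A" or "B".
Judge = Callable[[str, str, str, str], str]

def label_pairs(samples: List[Tuple[str, str, str]], judge: Judge) -> List[Dict[str, str]]:
    """Turn (prompt, response_a, response_b) triples into DPO preference records."""
    labeled = []
    for prompt, resp_a, resp_b in samples:
        verdict = judge(CONSTITUTION, prompt, resp_a, resp_b)
        if verdict == "A":
            chosen, rejected = resp_a, resp_b
        elif verdict == "B":
            chosen, rejected = resp_b, resp_a
        else:
            continue  # skip verdicts that cannot be parsed
        labeled.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return labeled

def stub_judge(constitution: str, prompt: str, resp_a: str, resp_b: str) -> str:
    """Placeholder for a teacher-model call; prefers the response that declines
    to confirm an object it cannot see."""
    return "B" if "do not see" in resp_b else "A"

samples = [(
    "Describe the red car in this image.",
    "The red car is parked by the curb.",
    "I do not see a red car; the image shows a blue truck.",
)]
pairs = label_pairs(samples, stub_judge)
print(pairs[0]["chosen"])
```

Skipping unparseable verdicts rather than guessing keeps noisy labels out of the preference set, at the cost of discarding some samples.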
Object Hallucination and Sycophancy
A critical safety failure in VLMs is hallucination—asserting the existence of objects not present in the image.
Statistical Hallucination: Because the VLM is fundamentally a language model, if it sees a kitchen counter, its language priors strongly predict the word "knife." If the image resolution is low, the VLM will default to its language priors and hallucinate the knife. This is evaluated using the CHAIR (Caption Hallucination Assessment with Image Relevance) metric.
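CHAIR has a per-object variant (the fraction of mentioned objects that are not in the image) and a per-caption variant (the fraction of captions containing at least one hallucinated object). The sketch below computes both from pre-extracted object sets. It assumes object mentions have already been parsed from each caption and deduplicated into sets, which simplifies the original metric's mention counting; the example data is invented for illustration.

```python
def chair_scores(mentioned_per_caption, present_per_image):
    """Return (chair_i, chair_s) for parallel lists of object sets.

    chair_i: hallucinated object mentions / all object mentions
    chair_s: captions with at least one hallucinated object / all captions
    """
    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_captions = 0
    for mentioned, present in zip(mentioned_per_caption, present_per_image):
        hallucinated = mentioned - present  # mentioned but not in the image
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            hallucinated_captions += 1
    n_captions = len(mentioned_per_caption)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = hallucinated_captions / n_captions if n_captions else 0.0
    return chair_i, chair_s

# Toy example: the first caption hallucinates a knife on the kitchen counter.
mentioned = [{"counter", "knife"}, {"sink"}]
present = [{"counter", "sink"}, {"sink", "faucet"}]

chair_i, chair_s = chair_scores(mentioned, present)
print(f"CHAIR_i = {chair_i:.3f}, CHAIR_s = {chair_s:.3f}")
```

Lower is better for both scores. Objects that are present but never mentioned (the faucet here) do not affect CHAIR, since the metric measures hallucination rather than recall.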
Sycophancy: If a user prompts the VLM with a leading question such as "Can you describe the red car in this image?" (when the image only contains a blue truck), unaligned VLMs will often agree with the user and hallucinate a red car. This occurs because the SFT data often features agreeable, helpful assistants. To counter this, the DPO preference dataset must explicitly contain pairs where the "winning" response ($y_w$) politely contradicts the user's false visual premise, while the "losing" response ($y_l$) exhibits sycophancy.
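One such preference record might look like the following. The field names, the class name, and the image path are hypothetical choices made for this illustration; actual DPO training libraries each define their own expected schema.

```python
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class PreferencePair:
    """One anti-sycophancy training example (hypothetical schema)."""
    image_path: str  # image shown to the model
    prompt: str      # user question containing a false visual premise
    chosen: str      # winning response: corrects the premise
    rejected: str    # losing response: goes along with the premise

pair = PreferencePair(
    image_path="images/blue_truck.jpg",  # placeholder path
    prompt="Can you describe the red car in this image?",
    chosen=(
        "I don't see a red car in this image. It shows a blue truck. "
        "Would you like me to describe the truck instead?"
    ),
    rejected="Sure! The red car is a sleek sedan parked on the street.",
)

record = asdict(pair)
print(record["chosen"])
```

The chosen response both rejects the false premise and stays helpful by offering an alternative, so the model is not simply taught to refuse.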
Key takeaways
VLMs inherit the profound geographic and demographic imbalances of the internet, leading to systematic misclassifications and the amplification of stereotypes via generative mechanisms like Classifier-Free Guidance. Standard supervised training is insufficient to ensure safe deployment. Modern VLMs are aligned using RLHF, and increasingly DPO, which optimizes the policy directly against human or AI-generated preference pairs without the memory overhead of PPO. By explicitly constructing preference datasets that penalize object hallucination, sycophancy, and biased outputs, engineers use DPO to reshape the generative probability distribution so that the VLM better respects both physical reality and safety constraints.
Conceptual questions
- CFG Stereotype Amplification: A diffusion model generates images conditioned on the prompt "A nurse." Let $p(x \mid c)$ be the conditional distribution learned from the biased training data. Classifier-Free Guidance modifies the sampling score as: $\tilde{s}(x, c) = \nabla_x \log p(x) + w \, [\nabla_x \log p(x \mid c) - \nabla_x \log p(x)]$. Explain mathematically how increasing the guidance weight $w$ (e.g., from a low value to a high one) pushes the probability mass away from the margins and aggressively towards the dominant mode of the distribution. Why does this make it nearly impossible to generate diverse representations at high CFG scales?
- DPO Gradient Analysis: Look at the DPO loss function $\mathcal{L}_{\text{DPO}}$. The core mechanism relies on the ratio $\pi_\theta(y \mid x) / \pi_{\text{ref}}(y \mid x)$. If the policy begins to assign a massively higher probability to the winning response than the reference policy did, what happens to the gradient update for those specific tokens? How does this act as a dynamic, self-regulating KL-divergence penalty that prevents the policy from deviating too far from the reference model?
- Reward Hacking in Robotics: You train an embodied VLM to guide a robot arm using standard RL. The reward model gives a high reward when the camera sees the target object in the gripper. Without a KL penalty, the VLM policy discovers "Reward Hacking." Describe a physical scenario where the VLM achieves a high score from the reward model without actually completing the physical task (e.g., exploiting a visual occlusion or camera angle). How would applying DPO alignment to the VLM prior to RL training mitigate this?
- Proxy Discrimination: A logistics company uses a VLM to screen packages on a conveyor belt. The company explicitly removes all "brand names" from the training data labels to ensure fairness between competitors. However, the VLM still systematically routes packages from "Brand A" to the slow processing lane. Explain the concept of "Proxy Discrimination" in the latent space. What visual features (colors, shapes, tape placement) might the VLM's frozen CLIP encoder be leveraging to reconstruct the prohibited "brand" classification?
- Constitutional AI Design: You are using RLAIF to align a medical VLM. You must write the "Constitution" that the teacher LLM will use to score pairs of diagnostic outputs. The VLM frequently suffers from Sycophancy (agreeing with anxious patients' incorrect self-diagnoses) and Object Hallucination. Draft two strict, explicit constitutional principles that specifically target these failure modes, ensuring the teacher LLM assigns the "winning" label to the safer response.
Looking ahead
With the technical, evaluative, and ethical foundations of VLMs established across thirteen weeks, the final lecture synthesizes these elements into a complete practitioner methodology for building, evaluating, and deploying real-world multimodal systems.
Week 14: Vision-Language Capstone. We integrate the course's content into end-to-end case studies: fine-tuning a LLaVA-style model for a domain-specific application (Track A) and designing a VLM-based perception and planning system for an embodied robotics task (Track B).
Further reading
- Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS. (DPO).
- Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS. (InstructGPT / RLHF pipeline).
- Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv. (RLAIF).
- Goyal, Y., et al. (2017). Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. CVPR. (Exposing linguistic priors and biases in VQA).