
Week 1: Modern Vision Backbones

✦Learning Outcomes
  • Derive the multi-head self-attention mechanism for image patches
  • Compare ViT scaling behavior with CNNs on data-limited vs. data-rich regimes
  • Analyze hierarchical variants (Swin Transformer) and 3D representations (NeRF, 3DGS)
◆Prerequisites

Related concepts from other courses:

  • Course 1 (Reinforcement Learning): Attention mechanisms (Week 8), which use a similar mathematical framework

Purpose of this lecture

Vision–language models (VLMs) inherit nearly all of their visual understanding from a pretrained vision backbone. Before studying how images and language are aligned, how captions are generated, or how robots are instructed through language, it is necessary to understand the architectural principles that govern how modern vision models represent the physical world. This lecture traces the evolution from convolutional neural networks to Vision Transformers, formalizes the mathematics of self-attention for image patches, and examines why the transition to transformers occurred. We also look beyond 2D pixels to explore the frontier of native 3D spatial representations—Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS)—which are increasingly critical for grounding VLMs in physical environments.


Convolutional neural networks and their inductive biases

Convolutional neural networks (CNNs) dominated computer vision from the AlexNet breakthrough in 2012 through approximately 2021. Their success rested on three structural inductive biases baked directly into the convolution operation:

  1. Local connectivity: Each neuron connects only to a small spatial neighborhood of the previous layer, not the entire image, reflecting the physical reality that neighboring pixels are highly correlated.
  2. Weight sharing: The exact same filter (kernel) is applied at every spatial position across the image, enforcing translation equivariance—a cat in the top-left corner is processed by the same feature extractors as a cat in the bottom-right.
  3. Hierarchical feature composition: By stacking convolutions interleaved with pooling layers, the network builds a spatial hierarchy. Early layers detect low-level primitives (edges, gradients), middle layers detect textures and object parts (wheels, eyes), and deep layers aggregate these into semantic object representations.

These strong inductive biases are well matched to the statistics of natural images. They also make CNNs parameter-efficient and data-efficient on smaller datasets (like ImageNet-1K with its 1.28 million images): a model can generalize from fewer examples because its hypothesis space is already constrained by the architectural structure.

However, the core limitation of locality is that long-range dependencies require deep stacks of layers. A single $3 \times 3$ convolution has a receptive field of $3 \times 3$ pixels; after $L$ stride-1 layers without pooling, the theoretical receptive field is only $(2L+1) \times (2L+1)$, so its side length grows linearly, as $O(L)$. Modeling the relationship between two distant image regions (for instance, a person's hand and the object they are reaching for on the other side of a table) therefore requires very deep stacks, which are harder to optimize (vanishing gradients, unless mitigated by residual connections) and slow at inference. Attention mechanisms, by contrast, can connect any two positions in the image in $O(1)$ layers.
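The linear growth can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names `receptive_field` and `layers_to_cover` are invented for this example, and it assumes a plain stack of identical convolutions with no dilation or pooling.

```python
def receptive_field(num_layers: int, kernel_size: int = 3, stride: int = 1) -> int:
    """Theoretical receptive field (side length in pixels) of a stack of
    identical convolutions, via the standard recurrence
    rf <- rf + (k - 1) * jump, jump <- jump * stride."""
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


def layers_to_cover(image_side: int, kernel_size: int = 3) -> int:
    """Smallest number of stride-1 layers whose receptive field spans the image."""
    layers = 0
    while receptive_field(layers, kernel_size) < image_side:
        layers += 1
    return layers


for depth in (1, 2, 10, 50):
    # 3x3, stride-1 layers give a side length of 2 * depth + 1
    print(depth, receptive_field(depth))  # 3, 5, 21, 101

print(layers_to_cover(224))  # 112
```

Under these assumptions, 112 stacked $3 \times 3$ layers are needed before a single output unit can see a full 224-pixel-wide image, whereas one self-attention layer connects all positions at once.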


Vision Transformers: Formalizing Patch-based Attention

Vision Transformers (ViT; Dosovitskiy et al., 2021) achieved a paradigm shift by discarding convolutions entirely and applying the standard NLP transformer architecture directly to images. To do this, the spatial image must be converted into a sequence of discrete tokens.

Given an image $x \in \mathbb{R}^{H \times W \times C}$, the image is divided into a grid of $N = \frac{H \cdot W}{P^2}$ non-overlapping patches, where each patch has size $P \times P$ pixels. Each patch is flattened into a vector of dimension $P^2 C$ and linearly projected to the model's hidden dimension $D$:

$$z_i^{(0)} = x_i E + e_i, \quad E \in \mathbb{R}^{P^2 C \times D}, \quad e_i \in \mathbb{R}^D$$

where $x_i$ is the flattened $i$-th patch vector, $E$ is the learned linear patch embedding matrix, and $e_i$ is a learned positional embedding. A special class token $x_\text{cls}$ (borrowed from BERT) is prepended to the sequence; its embedding at the final layer serves as the aggregated global image representation.
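The tokenization step can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `patchify`, the ViT-B/16-style sizes ($224 \times 224$ input, $P = 16$, $D = 768$), and the random matrices standing in for the learned parameters $E$, $e_i$, and $x_\text{cls}$ are all assumptions made for the example.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into N = (H/P) * (W/P) flattened patches,
    each of length P * P * C, in row-major grid order."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, P, P, C)
    return x.reshape(gh * gw, patch * patch * C)


rng = np.random.default_rng(0)
H = W = 224
C, P, D = 3, 16, 768

image = rng.standard_normal((H, W, C))
patches = patchify(image, P)                           # (N, P*P*C) = (196, 768)

# Random stand-ins for learned parameters (assumption for illustration).
E = rng.standard_normal((P * P * C, D)) * 0.02         # patch embedding matrix
pos = rng.standard_normal((patches.shape[0], D)) * 0.02  # e_i, one per patch
cls = rng.standard_normal((1, D)) * 0.02               # class token x_cls

# z^(0) = [x_cls; x_1 E + e_1; ...; x_N E + e_N]
z0 = np.concatenate([cls, patches @ E + pos], axis=0)  # (N + 1, D)
print(patches.shape, z0.shape)
```

The sketch follows the formula above and adds positional embeddings to the patch tokens only; implementations commonly learn one additional positional embedding for the class token as well.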

The resulting sequence $z^{(0)} = [x_\text{cls}; z_1^{(0)}; \ldots; z_N^{(0)}]$ is processed by $L$ standard transformer blocks. Each block comprises Multi-Head Self-Attention (MHSA) and a Feedforward Network (FFN), with Layer Normalization (LN) applied before each operation (pre-norm configuration):

$$z'^{(\ell)} = \text{MHSA}(\text{LN}(z^{(\ell-1)})) + z^{(\ell-1)}$$

$$z^{(\ell)} = \text{FFN}(\text{LN}(z'^{(\ell)})) + z'^{(\ell)}$$

Multi-Head Self-Attention (MHSA) Math

To understand why ViTs capture global context so effectively, we must look at the MHSA operation. For a single attention head $h$, the input sequence $Z \in \mathbb{R}^{(N+1) \times D}$ is projected into Queries ($Q$), Keys ($K$), and Values ($V$):

$$Q_h = Z W_Q^h, \quad K_h = Z W_K^h, \quad V_h = Z W_V^h$$

where $W_Q^h, W_K^h, W_V^h \in \mathbb{R}^{D \times d_k}$ and $d_k = D / H_\text{heads}$. The attention weights are computed using the scaled dot-product:

$$\text{Attention}(Q_h, K_h, V_h) = \text{softmax}\left(\frac{Q_h K_h^T}{\sqrt{d_k}}\right) V_h$$

The matrix $Q_h K_h^T \in \mathbb{R}^{(N+1) \times (N+1)}$ holds the unnormalized similarity score between every token and every other token (the $N$ patches plus the class token), all computed at once. The $\frac{1}{\sqrt{d_k}}$ scaling factor prevents the dot products from growing too large and pushing the softmax into regions with vanishing gradients. The outputs of all heads are concatenated and linearly projected to form the final MHSA output:

$$\text{MHSA}(Z) = \text{Concat}(\text{head}_1, \dots, \text{head}_{H_\text{heads}}) W_O, \quad \text{head}_h = \text{Attention}(Q_h, K_h, V_h), \quad W_O \in \mathbb{R}^{D \times D}$$

This mechanism enables every patch to attend to every other patch in a single layer, achieving global context routing without the depth requirements of CNN receptive fields.
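The equations above translate almost line for line into NumPy. The following is a minimal sketch under stated assumptions: the function names `softmax` and `mhsa` are chosen for this example, the per-head matrices $W_Q^h$, $W_K^h$, $W_V^h$ are stacked column-wise into single $D \times D$ matrices, biases are omitted, and random weights stand in for learned ones.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mhsa(Z, W_q, W_k, W_v, W_o, num_heads):
    """Multi-head self-attention over a (T, D) token sequence.
    Returns the (T, D) output and the (num_heads, T, T) attention weights."""
    T, D = Z.shape
    assert D % num_heads == 0
    d_k = D // num_heads

    def split(X):
        # (T, D) -> (num_heads, T, d_k): each head gets its own slice of columns
        return X.reshape(T, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(Z @ W_q), split(Z @ W_k), split(Z @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (heads, T, T)
    weights = softmax(scores, axis=-1)                # each row sums to 1
    heads = weights @ V                               # (heads, T, d_k)
    concat = heads.transpose(1, 0, 2).reshape(T, D)   # Concat(head_1, ..., head_H)
    return concat @ W_o, weights


rng = np.random.default_rng(0)
T, D, num_heads = 197, 64, 4  # 196 patches + class token; small D for speed
Z = rng.standard_normal((T, D))
W_q, W_k, W_v, W_o = (rng.standard_normal((D, D)) * D ** -0.5 for _ in range(4))

out, attn = mhsa(Z, W_q, W_k, W_v, W_o, num_heads)
print(out.shape, attn.shape)
```

Each head materializes a full $T \times T$ weight matrix, which is the source of the quadratic cost discussed later in the lecture.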


Positional encodings

Self-attention is permutation-equivariant by construction: it computes pairwise similarities without any inherent concept of sequence order or spatial geometry. Without positional information, shuffling the input patches would simply shuffle the outputs in the same way, and each patch would receive exactly the same attention weights over the same set of patches. Spatial structure must therefore be encoded explicitly through the positional embeddings $e_i$.
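This order-blindness is easy to verify numerically. In the sketch below (single-head attention with random weights; the function name `self_attention` and all sizes are assumptions for the example), permuting the input rows permutes the output rows identically, and adding position-dependent embeddings breaks that symmetry.

```python
import numpy as np


def self_attention(Z, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention on a (T, D) sequence."""
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    s = s - s.max(axis=-1, keepdims=True)
    w = np.exp(s)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V


rng = np.random.default_rng(0)
T, D = 10, 16
Z = rng.standard_normal((T, D))
W_q, W_k, W_v = (rng.standard_normal((D, D)) * D ** -0.5 for _ in range(3))
perm = np.roll(np.arange(T), 1)  # a fixed, non-identity permutation

# Without positional embeddings: shuffled input -> identically shuffled output.
out = self_attention(Z, W_q, W_k, W_v)
out_shuffled = self_attention(Z[perm], W_q, W_k, W_v)
equivariant = np.allclose(out_shuffled, out[perm])

# With positional embeddings tied to the slot index, order now matters.
pos = rng.standard_normal((T, D))
out_pos = self_attention(Z + pos, W_q, W_k, W_v)
out_pos_shuffled = self_attention(Z[perm] + pos, W_q, W_k, W_v)
equivariant_with_pos = np.allclose(out_pos_shuffled, out_pos[perm])

print(equivariant, equivariant_with_pos)  # True False
```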

In standard ViT, these are learned 1D positional embeddings, one per token position, optimized during training. Some 2D-aware models use factored 2D positional encodings, separating the row and column indices.

Interpolation for High-Resolution Fine-Tuning: ViT models are frequently pretrained at a lower resolution (e.g., $224 \times 224$) and fine-tuned or deployed at higher resolutions (e.g., $384 \times 384$). Because the positional embeddings $e_i$ are learned for a fixed grid of $N$ patches, increasing the resolution at a fixed patch size increases the number of patches (from $14 \times 14 = 196$ to $24 \times 24 = 576$ at $P = 16$), leaving the original embedding table too small. To solve this, the pretrained positional embeddings are reshaped into their 2D spatial grid, and a 2D bicubic interpolation is applied to stretch the grid to the new resolution. This lets the network reuse the spatial geometry learned at lower resolution, at only a modest initial quality cost that fine-tuning typically recovers.
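The resizing step can be sketched as below. Note the substitution: to stay dependency-free, this example uses bilinear interpolation with aligned corners rather than the bicubic interpolation described above, so it illustrates the reshape-and-resample procedure, not the exact kernel. The function name `resize_pos_embed` is hypothetical, and the class-token embedding is assumed to be handled separately.

```python
import numpy as np


def resize_pos_embed(pos: np.ndarray, old_grid: int, new_grid: int) -> np.ndarray:
    """Resample (old_grid**2, D) patch positional embeddings to (new_grid**2, D).
    Uses bilinear interpolation (a stand-in for bicubic) with aligned corners."""
    D = pos.shape[1]
    grid = pos.reshape(old_grid, old_grid, D)  # recover the 2D layout

    # Where each new grid index falls in the old grid's index space.
    coords = np.linspace(0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo

    # Interpolate along rows, then along columns.
    rows = grid[lo] * (1 - frac)[:, None, None] + grid[hi] * frac[:, None, None]
    out = rows[:, lo] * (1 - frac)[None, :, None] + rows[:, hi] * frac[None, :, None]
    return out.reshape(new_grid * new_grid, D)


rng = np.random.default_rng(0)
old = rng.standard_normal((14 * 14, 8))  # 224 / 16 = 14 patches per side
new = resize_pos_embed(old, 14, 24)      # 384 / 16 = 24 patches per side
print(old.shape, "->", new.shape)        # (196, 8) -> (576, 8)
```

Because the new embeddings are weighted averages of their pretrained neighbors, they vary smoothly across the grid, which is why a short period of fine-tuning is usually needed for the model to re-adapt.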


Scaling behavior of ViTs

The most important empirical result for ViTs is how strongly their performance depends on the scale of the pretraining data.

On smaller datasets (like ImageNet-1K with ~1.28M images), ViTs trained from scratch typically underperform well-tuned ResNets. Because the ViT lacks the strong local inductive biases of a CNN, it must learn locality and translation equivariance entirely from the data, which requires far more training examples.

However, when pretrained on massive datasets (like JFT-300M with roughly 300 million images, or LAION-5B), ViTs match or outperform the best CNNs, and the advantage tends to grow as model size scales. This is the data-hungry but highly scalable property of attention-based models.

ViT scaling is also notably smooth. Performance on downstream classification improves predictably, approximately as a power law in model parameter count and pretraining compute, with few architecture-specific engineering tricks required. A ViT-G (1.8B parameters) follows essentially the same architectural recipe as a ViT-B (86M parameters), differing mainly in width, depth, and number of heads.


Hierarchical and multi-scale variants (Swin Transformer)

Plain ViTs produce representations at a single, fixed spatial scale (one set of $N$ tokens propagating through all $L$ layers). This single-scale design creates two major problems for dense prediction tasks like robotic manipulation, bounding-box detection, or pixel-level segmentation.

First, dense tasks benefit from multi-scale feature pyramids (low-resolution maps for coarse semantic localization, high-resolution maps for fine edge detail). Second, processing $N = (H/P)^2$ tokens (for a square $H \times H$ image) with full global self-attention incurs an $O(N^2)$ computational cost, which scales quartically with the image side length and becomes prohibitive for high-resolution robotic perception.
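The quartic scaling is easiest to see by counting. The helpers below (`num_tokens` and `attention_entries`, names invented for this sketch) assume a square image, patch size 16, and ignore the class token.

```python
def num_tokens(side: int, patch: int = 16) -> int:
    """Number of patch tokens N for a square image (class token ignored)."""
    return (side // patch) ** 2


def attention_entries(side: int, patch: int = 16) -> int:
    """Entries in one head's N x N attention score matrix."""
    n = num_tokens(side, patch)
    return n * n


for side in (224, 448, 896):
    n = num_tokens(side)
    print(f"{side}px: N={n:5d}, score-matrix entries={attention_entries(side):,}")

# Doubling the side length quadruples N and multiplies the matrix size by 16.
print(attention_entries(448) // attention_entries(224))  # 16
```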

The Swin Transformer (Liu et al., 2021) directly addresses these flaws by re-introducing the hierarchical inductive biases of CNNs into the transformer architecture through three structural innovations:

  1. Window-based attention: Instead of global attention, self-attention is computed strictly within local windows of $M \times M$ patches (e.g., $M = 7$). This drops the computational complexity from $O(N^2)$ to $O(N M^2)$, which is linear in the number of patches $N$ for a fixed window size $M$.
  2. Shifted windows: If attention is only computed inside isolated windows, the model loses the ability to route information globally. Swin solves this by shifting the window partitions by $(\lfloor M/2 \rfloor, \lfloor M/2 \rfloor)$ patches between consecutive transformer layers. This cross-window connection allows information to propagate across window boundaries layer by layer. In practice, this is implemented using a cyclic shift of the feature map, followed by a masked self-attention operation that prevents tokens from attending to patches from non-adjacent regions that wrapped around the image edges.
  3. Patch merging (Hierarchical stages): To build a feature pyramid, Swin periodically merges groups of $2 \times 2$ neighboring patches. This halves the spatial resolution in each dimension ($H/2 \times W/2$) and doubles the channel dimension ($2C$), mimicking the downsampling strides of a CNN like ResNet.
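Window partitioning and patch merging are both simple tensor reshapes. The sketch below shows the shapes involved, assuming a $56 \times 56$ token grid with $C = 96$ channels and $M = 7$ (illustrative sizes). The function names are invented for this example, the reduction matrix is random rather than learned, and the normalization that a full implementation would apply before the reduction is omitted.

```python
import numpy as np


def window_partition(x: np.ndarray, M: int) -> np.ndarray:
    """(H, W, C) feature map -> (num_windows, M*M, C) non-overlapping windows."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)


def patch_merging(x: np.ndarray, W_reduce: np.ndarray) -> np.ndarray:
    """(H, W, C) -> (H/2, W/2, 2C): concatenate each 2x2 neighborhood
    into a 4C vector, then project 4C -> 2C with a linear layer."""
    x0, x1 = x[0::2, 0::2], x[1::2, 0::2]
    x2, x3 = x[0::2, 1::2], x[1::2, 1::2]
    merged = np.concatenate([x0, x1, x2, x3], axis=-1)  # (H/2, W/2, 4C)
    return merged @ W_reduce


rng = np.random.default_rng(0)
side, C, M = 56, 96, 7
x = rng.standard_normal((side, side, C))

windows = window_partition(x, M)                         # (64, 49, 96)
W_reduce = rng.standard_normal((4 * C, 2 * C)) * 0.02    # random stand-in
merged = patch_merging(x, W_reduce)                      # (28, 28, 192)

N = side * side
global_entries = N * N                                   # one N x N score matrix
windowed_entries = windows.shape[0] * (M * M) ** 2       # num_windows * (M^2)^2 = N * M^2
print(windows.shape, merged.shape)
print(global_entries, windowed_entries, global_entries // windowed_entries)
```

With these sizes the windowed score matrices hold 64 times fewer entries than a single global one, and the ratio $N / M^2$ keeps growing with resolution because $M$ stays fixed.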

This hybrid design successfully recovers the hierarchical inductive bias of CNNs while retaining the representational expressivity of attention, making Swin-based architectures a popular choice for high-resolution, dense visual tasks.
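The masking used with shifted windows can be sketched as follows. After the cyclic shift, the windows along the bottom and right edges contain patches that were not neighbors in the original image; labeling each contiguous region with an id and masking every pair of tokens with different ids blocks those spurious connections. The function name `shifted_window_mask`, the tiny $8 \times 8$ grid, and the constant $-100$ (standing in for $-\infty$) are assumptions for illustration.

```python
import numpy as np


def shifted_window_mask(H: int, W: int, M: int, shift: int) -> np.ndarray:
    """Additive attention mask of shape (num_windows, M*M, M*M) for a feature
    map that has been cyclically shifted by `shift` patches. Entries are 0
    where attention is allowed and -100 where the two tokens come from
    regions that were not contiguous before the shift."""
    region = np.zeros((H, W), dtype=int)
    h_slices = (slice(0, H - M), slice(H - M, H - shift), slice(H - shift, H))
    w_slices = (slice(0, W - M), slice(W - M, W - shift), slice(W - shift, W))
    idx = 0
    for hs in h_slices:
        for ws in w_slices:
            region[hs, ws] = idx  # one id per contiguous source region
            idx += 1

    # Partition the id map into windows: (num_windows, M*M)
    win = region.reshape(H // M, M, W // M, M).transpose(0, 2, 1, 3)
    win = win.reshape(-1, M * M)
    same_region = win[:, :, None] == win[:, None, :]
    return np.where(same_region, 0.0, -100.0)


H = W = 8
M, shift = 4, 2
feature_ids = np.arange(H * W).reshape(H, W)
shifted = np.roll(feature_ids, shift=(-shift, -shift), axis=(0, 1))  # cyclic shift

mask = shifted_window_mask(H, W, M, shift)
print(mask.shape)                                    # (4, 16, 16)
print([(m == 0).sum() for m in mask])                # allowed pairs per window
```

The top-left window contains a single region, so all $16 \times 16$ pairs are allowed. The bottom-right window mixes four regions of four tokens each, so only 64 of its 256 pairs survive. The mask is added to the attention scores before the softmax.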

| Property | CNN | ViT (plain) | Swin |
|---|---|---|---|
| Long-range context | $O(L)$ layers | $O(1)$ layers | $O(1)$ layers within a window; several layers across windows |
| Multi-scale features | Native | Requires adaptation | Native |
| Attention cost | N/A | $O(N^2)$ | $O(N M^2)$ |
| Multimodal fusion | Awkward | Natural | Natural |
| Foundation model use | Limited | Dominant | Common |


Beyond 2D: Native 3D Spatial Representations (NeRFs & 3DGS)

While 2D Vision Transformers are the backbone of most current VLMs, physical AI and robotics fundamentally operate in three dimensions. A robot querying a VLM with "where is the mug?" requires 3D coordinates $(x, y, z)$ to plan a trajectory, not just a 2D pixel bounding box. Consequently, the frontier of vision backbones involves native 3D scene representations.

Neural Radiance Fields (NeRFs)

A NeRF represents a continuous 3D scene not as a grid of voxels, but as the weights of a multilayer perceptron (MLP). The network takes a continuous 3D coordinate $\mathbf{x} = (x, y, z)$ and a 2D viewing direction $\mathbf{d} = (\theta, \phi)$ as inputs, and outputs a volume density $\sigma$ and a view-dependent color $\mathbf{c} = (r, g, b)$.

To render a 2D image from a specific camera pose, rays are cast through the scene, and the MLP is queried at sampled points along each ray. The expected color of a pixel is computed using classical volume rendering integrals. By optimizing the MLP weights to minimize the photometric error between the rendered rays and a set of real 2D training images, the network implicitly memorizes the 3D geometry of the entire scene.
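The volume rendering integral is approximated in practice by a discrete sum over samples along each ray: each sample contributes its color weighted by its own opacity and by the transmittance of everything in front of it. The sketch below assumes the densities and colors have already been produced by the MLP; the function name `render_ray` and the toy values are hypothetical.

```python
import numpy as np


def render_ray(sigmas: np.ndarray, colors: np.ndarray, t_vals: np.ndarray):
    """Discrete volume rendering along one ray.
    sigmas: (S,) densities, colors: (S, 3) RGB, t_vals: (S + 1,) segment bounds.
    Returns the composited RGB and the per-sample weights."""
    deltas = np.diff(t_vals)                   # segment lengths
    alphas = 1.0 - np.exp(-sigmas * deltas)    # opacity of each segment
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights


t_vals = np.linspace(0.0, 1.0, 5)              # 4 segments along the ray
colors = np.array([[0, 1, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)

# Empty space, then a dense red surface that hides the blue sample behind it.
sigmas = np.array([0.0, 0.0, 1e3, 1e3])
rgb, weights = render_ray(sigmas, colors, t_vals)
print(np.round(rgb, 3), np.round(weights, 3))

# A ray through empty space accumulates no color at all.
empty_rgb, empty_weights = render_ray(np.zeros(4), colors, t_vals)
```

Because every operation here is differentiable, the photometric error between `rgb` and the observed pixel can be backpropagated to the network that produced `sigmas` and `colors`.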

3D Gaussian Splatting (3DGS)

While NeRFs produce photorealistic novel views, querying an MLP millions of times per frame is computationally expensive, making real-time 50 Hz robotic control difficult. 3D Gaussian Splatting has recently emerged as a faster, explicit alternative.

Instead of a continuous neural network, 3DGS represents a scene as a large set (often millions) of discrete, anisotropic 3D Gaussians. Each Gaussian has a center position $\mu$, a 3D covariance matrix $\Sigma$ (determining its scale and rotation), an opacity $\alpha$, and spherical harmonic coefficients encoding its view-dependent color. Because these are explicit geometric primitives, they can be projected (splatted) onto a 2D image plane via highly optimized GPU rasterization pipelines, achieving rendering speeds that can be orders of magnitude faster than the original NeRF while maintaining comparable fidelity.
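Two pieces of this representation are compact enough to sketch directly. The covariance is typically parameterized as $\Sigma = R S S^T R^T$, with $S$ a diagonal scale matrix and $R$ a rotation (stored as a quaternion), which keeps $\Sigma$ positive semi-definite during optimization. Per pixel, depth-sorted splats are then blended front to back. The function names and toy values below are assumptions for illustration, and the projection to the image plane is not shown.

```python
import numpy as np


def quat_to_rotmat(q: np.ndarray) -> np.ndarray:
    """Rotation matrix from a (w, x, y, z) quaternion (normalized internally)."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])


def covariance(scale, quat) -> np.ndarray:
    """Sigma = R S S^T R^T for per-axis scales and a rotation quaternion."""
    R = quat_to_rotmat(np.asarray(quat, dtype=float))
    M = R @ np.diag(np.asarray(scale, dtype=float))
    return M @ M.T


def composite(colors: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    """Front-to-back alpha blending of depth-sorted splats at one pixel."""
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    return ((trans * alphas)[:, None] * colors).sum(axis=0)


# A Gaussian stretched along x, then rotated 90 degrees about the z-axis,
# ends up stretched along y.
half = np.sqrt(0.5)
sigma = covariance(scale=[3.0, 1.0, 0.5], quat=[half, 0.0, 0.0, half])
print(np.round(sigma, 3))

# A half-transparent red splat in front of an opaque blue one.
pixel = composite(np.array([[1.0, 0, 0], [0, 0, 1.0]]), np.array([0.5, 1.0]))
print(pixel)  # [0.5 0.  0.5]
```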

Grounding VLMs in 3D

The crucial bridge to vision-language models occurs when these 3D representations are augmented with semantic embeddings. In systems like LERF (Language Embedded Radiance Fields) or ConceptGraphs, elements of the 3D representation (points, Gaussians, or object-level nodes) are enriched with a high-dimensional semantic feature vector extracted from a VLM (like CLIP). This allows the robot to execute open-vocabulary spatial queries directly in the 3D environment, bypassing the limitations of purely 2D vision backbones.
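Conceptually, an open-vocabulary 3D query reduces to a nearest-neighbor search in embedding space. The sketch below is a toy under strong assumptions: the per-point features and the text embedding are random vectors rather than real CLIP outputs, one point is planted to resemble the query, and `query_3d` is a hypothetical helper, not the API of LERF or ConceptGraphs.

```python
import numpy as np


def query_3d(points: np.ndarray, features: np.ndarray, text_embedding: np.ndarray):
    """Return the 3D position whose semantic feature has the highest cosine
    similarity to the text embedding, plus all similarity scores."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    t = text_embedding / np.linalg.norm(text_embedding)
    sims = f @ t
    return points[int(np.argmax(sims))], sims


rng = np.random.default_rng(0)
num_points, dim = 1000, 512

points = rng.uniform(-5.0, 5.0, size=(num_points, 3))  # (x, y, z) in meters
features = rng.standard_normal((num_points, dim))      # stand-in semantic features
text_embedding = rng.standard_normal(dim)              # stand-in for an encoded text query

# Plant one point whose feature nearly matches the query (hypothetical "mug").
target = 42
features[target] = text_embedding + 0.1 * rng.standard_normal(dim)

location, sims = query_3d(points, features, text_embedding)
print(location, float(sims.max()))
```

The result is a metric 3D coordinate that a motion planner can consume directly, in contrast to a 2D bounding box, which still needs depth to be lifted into the scene.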


Why ViTs dominate vision–language models

Despite the promise of 3D, the 2D Vision Transformer remains the dominant perceptual engine for multimodal systems. The architectural symmetry between transformers for vision and transformers for language is the primary reason. When both vision and language are represented as sequences of tokens processed by attention:

Shared infrastructure: The exact same mathematical operations (MHSA, FFN, LN) serve both modalities, simplifying engineering and hardware optimization (e.g., highly optimized FlashAttention kernels apply equally to images and text).

Natural fusion: Cross-attention between visual patch tokens and language tokens is a direct application of standard attention equations. No modality-specific architectural bridges or complex convolutional ROI-pooling mechanisms are required.

Joint scaling: Both modalities benefit from the same scaling laws, allowing simultaneous improvement of vision and language representations simply by increasing the compute budget and dataset scale.

Sequence compatibility: Text tokens and image patch tokens are dimensionally compatible and can be concatenated or interleaved into a single, massive context window. This sequence uniformity enables the multimodal in-context learning that underpins models like Flamingo, RT-2, and LLaVA.

CLIP's image encoder, the visual encoder in BLIP-2, and the perception towers in cutting-edge robotics VLAs are nearly all variants of the plain ViT—the architectural choice that unlocked scalable multimodal AI.


Key takeaways

CNNs achieve strong performance through local connectivity, weight sharing, and hierarchical composition, but struggle to model long-range dependencies efficiently. The Vision Transformer (ViT) treats images as a sequence of flattened patch tokens, adding learned positional embeddings and processing them via Multi-Head Self-Attention. This grants global context in $O(1)$ layers at the cost of requiring massive pretraining datasets. ViTs scale predictably and, given enough pretraining data, outperform CNNs as model size grows. Hierarchical variants like the Swin Transformer re-introduce multi-scale features and reduce attention cost using shifted windows. At the frontier of physical AI, native 3D representations like NeRFs and 3DGS are being enriched with VLM features to enable spatial reasoning in 3D. Ultimately, the architectural symmetry between ViTs and LLMs makes transformer-based vision the dominant foundation for modern multimodal systems.


Conceptual questions

  1. Self-Attention Math Debugging: You are implementing a custom ViT-B/16 for a specialized satellite imagery pipeline. Your input images are high-resolution ($1024 \times 1024$), meaning $N = 4096$ patch tokens. During your first training run, the model runs out of memory (OOM) during the forward pass. Mathematically define the shape of the unnormalized attention score matrix $Q_h K_h^T$ inside a single attention head. Based on this dimensionality, explain precisely why the memory footprint scales quadratically with $N$, and propose an architectural drop-in replacement (e.g., from the Swin family) to resolve this.
  2. Positional Interpolation Tradeoffs: Your team downloads a ViT-L/14 that was pretrained on $224 \times 224$ images. You need to fine-tune it for a medical diagnostics task using $448 \times 448$ scans. Describe the exact process of 2D bicubic interpolation applied to the learned positional embeddings $e_i$. If the medical condition relies heavily on high-frequency, single-pixel anomalies, explain what information is fundamentally blurred during this interpolation step and hypothesize whether the model will struggle during early fine-tuning epochs.
  3. Swin Window Masking: In a Swin Transformer, the cyclic shift allows cross-window information flow. However, shifting the image grid means that patches from the far left edge of the image are temporarily placed adjacent to patches from the far right edge. If uncorrected, the attention mechanism will compute similarity scores between these physically distant regions as if they were neighbors. Explain how Swin uses a masked self-attention matrix to prevent this invalid attention routing.
  4. CNN vs ViT Scaling: A junior engineer trains a ResNet-50 and a ViT-B/16 from scratch on a proprietary dataset of only 50,000 images. The ResNet achieves 82% accuracy, while the ViT stalls at 64%. The engineer concludes the ViT is a flawed architecture for this domain. Provide a rigorous counter-argument referencing the specific inductive biases (local connectivity, translation equivariance) that CNNs possess and ViTs lack. What pretraining strategy must be employed before the ViT can become competitive on this small dataset?
  5. 3D Spatial Grounding: A robotics startup is using a 2D VLM to command a drone. The prompt is "Fly towards the red car." The VLM successfully returns a 2D bounding box of the red car in the drone's camera feed, but the drone cannot plan a stable trajectory because it lacks depth perception. Explain how incorporating a Language Embedded Radiance Field (LERF) or a semantically-enriched 3D Gaussian Splatting (3DGS) pipeline would solve this limitation, explicitly comparing the outputs of a 2D bounding box versus a 3D semantic coordinate query.
✦Solutions
  1. The score matrix $Q_h K_h^T$ is $N \times N = 4096 \times 4096$ per head (or $4097 \times 4097$ once the class token is included). Every token attends to every other token, so the scores and their softmax cost $O(N^2)$ in both time and memory, about 16.8 million entries per head, per layer, per image. Going from 256 tokens (a $256 \times 256$ input) to 4096 tokens multiplies the token count by 16 and the matrix size by $16^2 = 256\times$. Drop-in fix: Swin's windowed attention restricts attention to local $M \times M$ windows, making the cost $O(N M^2)$, i.e. linear in $N$.
  2. The learned patch positional grid (e.g. $16 \times 16$ tokens at $224 \times 224$ with patch size 14) is reshaped to 2D and bicubically upsampled to the new grid ($32 \times 32$ at $448 \times 448$); a class-token embedding, if present, is typically left unchanged. The interpolation acts only on the positional embeddings, not on the pixels, so the image content itself is not blurred. What is smoothed is positional information: bicubic interpolation is a low-pass operation, so the new in-between positions receive blended versions of neighboring embeddings. The model's fine-scale position discrimination is therefore initially imprecise, and it will likely struggle in the early fine-tuning epochs until the embeddings re-adapt.
  3. After the cyclic shift a single window can contain patches from opposite image edges. Swin computes attention within each window but adds a large negative value (effectively $-\infty$) to score entries between tokens originating from different (non-contiguous) regions, so the softmax drives those weights to zero. This blocks invalid long-range attention while keeping the efficient batched window computation.
  4. CNNs encode strong inductive biases (locality and translation equivariance via weight-shared convolutions) that act as priors and make them data-efficient on 50k images; ViTs lack these and must learn them from data, needing far more samples. The engineer's conclusion is wrong: pretrain the ViT first (large-scale supervised pretraining such as ImageNet-21k/JFT, or self-supervised pretraining such as MAE/DINO), then fine-tune. With pretraining, ViTs match or exceed CNNs.
  5. A 2D bounding box gives only image-plane coordinates with no depth, so there is no metric 3D goal to plan toward. A LERF or semantic 3DGS pipeline embeds language features into a 3D scene, so the query "red car" returns a 3D semantic coordinate $(x, y, z)$ in metric space, which a planner can use directly. The 2D box is depth-ambiguous; the 3D semantic query yields a stable spatial target.

Looking ahead

With the vision backbone established, the next question is how to train it without human labels at the scale required for multimodal foundation models.

Week 2: Self-Supervised Representation Learning for Vision. We examine the three dominant self-supervised paradigms (contrastive learning, masked autoencoding, and self-distillation) and analyze how each shapes the properties of the visual representations that downstream VLMs depend on.


Further reading

  • Dosovitskiy, A., et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR. (The original Vision Transformer / ViT paper; arXiv preprint 2020).
  • Liu, Z., et al. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV. (Introduced local windows and hierarchical scaling for ViTs).
  • Mildenhall, B., et al. (2020). NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. ECCV.
  • Kerbl, B., et al. (2023). 3D Gaussian Splatting for Real-Time Radiance Field Rendering. SIGGRAPH.