Purpose of this lecture
Vision–language models (VLMs) inherit nearly all of their visual understanding from a pretrained vision backbone. Before studying how images and language are aligned, how captions are generated, or how robots are instructed through language, it is necessary to understand the architectural principles that govern how modern vision models represent the physical world. This lecture traces the evolution from convolutional neural networks to Vision Transformers, formalizes the mathematics of self-attention for image patches, and examines why the transition to transformers occurred. We also look beyond 2D pixels to explore the frontier of native 3D spatial representations—Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS)—which are increasingly critical for grounding VLMs in physical environments.
Convolutional neural networks and their inductive biases
Convolutional neural networks (CNNs) dominated computer vision from the AlexNet breakthrough in 2012 through approximately 2021. Their success rested on three structural inductive biases baked directly into the convolution operation:
- Local connectivity: Each neuron connects only to a small spatial neighborhood of the previous layer, not the entire image, reflecting the physical reality that neighboring pixels are highly correlated.
- Weight sharing: The exact same filter (kernel) is applied at every spatial position across the image, enforcing translation equivariance—a cat in the top-left corner is processed by the same feature extractors as a cat in the bottom-right.
- Hierarchical feature composition: By stacking convolutions interleaved with pooling layers, the network builds a spatial hierarchy. Early layers detect low-level primitives (edges, gradients), middle layers detect textures and object parts (wheels, eyes), and deep layers aggregate these into semantic object representations.
These strong inductive biases are well matched to the statistics of natural images. They also make CNNs highly parameter-efficient on smaller datasets (like ImageNet-1K with its 1.28 million images); a model can generalize from fewer examples because its hypothesis space is already constrained by the architectural structure.
However, the core limitation of locality is that long-range dependencies require deep, stacked layers. A single $3 \times 3$ convolution has a receptive field of 3 pixels per side; after $L$ such layers without pooling, the receptive field grows only linearly, as $2L + 1$. Modeling the relationship between two distant image regions (for instance, a person's hand and the object they are reaching for on the other side of a table) therefore requires extremely deep stacks, which are prone to vanishing gradients and are slow at inference. Attention mechanisms, by contrast, can connect any two positions in the image in $O(1)$ layers.
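The linear growth can be checked with a few lines of arithmetic. The sketch below assumes stride-1 convolutions with no pooling or dilation, for which each layer adds `kernel_size - 1` pixels to the receptive field (so $3 \times 3$ kernels give $2L + 1$ after $L$ layers). The function names are illustrative, not taken from any library.

```python
def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Receptive field (pixels per side) of stacked stride-1 convolutions."""
    rf = 1  # with zero layers, an output unit sees exactly one pixel
    for _ in range(num_layers):
        rf += kernel_size - 1  # each stride-1 layer widens the view by (k - 1)
    return rf


def layers_to_span(distance_px: int, kernel_size: int = 3) -> int:
    """Smallest depth whose receptive field covers `distance_px` pixels."""
    layers = 0
    while receptive_field(layers, kernel_size) < distance_px:
        layers += 1
    return layers


for depth in (1, 2, 10):
    print(depth, receptive_field(depth))  # 3, 5, 21: matches 2L + 1

print(layers_to_span(224))  # 112 layers to span a 224-pixel side
```

Pooling and strided convolutions shrink this depth considerably in practice, which is one reason real CNNs interleave them with the convolutions.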
Vision Transformers: Formalizing Patch-based Attention
Vision Transformers (ViT; Dosovitskiy et al., 2021) achieved a paradigm shift by discarding convolutions entirely and applying the standard NLP transformer architecture directly to images. To do this, the spatial image must be converted into a sequence of discrete tokens.
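Given an image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$, the image is divided into a grid of $N = HW / P^2$ non-overlapping patches, where each patch has size $P \times P$ pixels. Each patch is flattened into a vector of dimension $P^2 C$ and linearly projected to the model's hidden dimension $D$:
$$\mathbf{z}_0^{(i)} = \mathbf{x}_p^{(i)} \mathbf{E} + \mathbf{e}_{\text{pos}}^{(i)}$$
where $\mathbf{x}_p^{(i)} \in \mathbb{R}^{P^2 C}$ is the flattened $i$-th patch vector, $\mathbf{E} \in \mathbb{R}^{(P^2 C) \times D}$ is the learned linear patch embedding matrix, and $\mathbf{e}_{\text{pos}}^{(i)} \in \mathbb{R}^{D}$ is a learned positional embedding. A special class token $\mathbf{x}_{\text{cls}}$ (borrowed from BERT) is prepended to the sequence; its embedding at the final layer serves as the aggregated global image representation.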
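The patch tokenization described above can be sketched in NumPy. This is a toy illustration under stated assumptions: the sizes (a $32 \times 32$ image, patch size 8, hidden width 64) are chosen only to keep shapes readable, and the embedding matrix, class token, and positional embeddings are random or zero stand-ins for parameters that would normally be learned.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into N = HW / P^2 flattened patches of length P^2 * C."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image size must be divisible by patch size"
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (grid rows, grid cols, P, P, C)
    return x.reshape(gh * gw, patch * patch * C)


rng = np.random.default_rng(0)
H, W, C, P, D = 32, 32, 3, 8, 64                 # toy sizes (assumptions)
image = rng.standard_normal((H, W, C))

patches = patchify(image, P)                     # (N, P^2 C) = (16, 192)
E = rng.standard_normal((P * P * C, D)) * 0.02   # stand-in for the learned projection
cls_token = np.zeros((1, D))                     # stand-in for the learned class token
pos_emb = rng.standard_normal((patches.shape[0] + 1, D)) * 0.02

# Prepend the class token, then add one positional embedding per token.
tokens = np.concatenate([cls_token, patches @ E], axis=0) + pos_emb
print(patches.shape, tokens.shape)               # (16, 192) (17, 64)
```

The reshape-and-transpose keeps each patch's pixels contiguous before flattening; many implementations get the same result with a strided convolution whose kernel size and stride both equal $P$.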
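The resulting sequence is processed by $L$ standard transformer blocks. Each block comprises Multi-Head Self-Attention (MHSA) and a Feedforward Network (FFN), with Layer Normalization (LN) applied before each operation (pre-norm configuration):
$$\mathbf{z}'_\ell = \text{MHSA}(\text{LN}(\mathbf{z}_{\ell-1})) + \mathbf{z}_{\ell-1}$$
$$\mathbf{z}_\ell = \text{FFN}(\text{LN}(\mathbf{z}'_\ell)) + \mathbf{z}'_\ell, \qquad \ell = 1, \dots, L$$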
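The pre-norm wiring is easiest to see when the two sublayers are passed in as plain functions. The sketch below is a minimal illustration, not a full ViT block: the layer norm omits the learned scale and shift, the FFN uses ReLU where ViT uses GELU, and `uniform_mix` is a hypothetical placeholder standing in for MHSA.

```python
import numpy as np


def layer_norm(z, eps=1e-6):
    """Normalize each token to zero mean and unit variance (no learned scale/shift)."""
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)


def pre_norm_block(z, attn_fn, ffn_fn):
    """Pre-norm block: LN feeds each sublayer; the residual adds the un-normalized input."""
    z = z + attn_fn(layer_norm(z))
    z = z + ffn_fn(layer_norm(z))
    return z


rng = np.random.default_rng(0)
z_in = rng.standard_normal((17, 64))             # 17 tokens of width 64 (toy sizes)
W1 = rng.standard_normal((64, 256)) * 0.02
W2 = rng.standard_normal((256, 64)) * 0.02


def ffn(x):
    return np.maximum(x @ W1, 0.0) @ W2          # ReLU stand-in for GELU


def uniform_mix(x):
    # Placeholder for MHSA: every token receives the mean of all tokens.
    return np.broadcast_to(x.mean(axis=0, keepdims=True), x.shape)


z_out = pre_norm_block(z_in, uniform_mix, ffn)
print(z_out.shape)                               # (17, 64)
```

Because the residual path bypasses the normalization, a block whose sublayers output zeros reduces to the identity, which is part of why pre-norm stacks tend to be easier to train at depth.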
Multi-Head Self-Attention (MHSA) Math
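To understand why ViTs capture global context so effectively, we must look at the MHSA operation. For a single attention head $h$, the input sequence $\mathbf{Z} \in \mathbb{R}^{(N+1) \times D}$ is projected into Queries ($\mathbf{Q}_h$), Keys ($\mathbf{K}_h$), and Values ($\mathbf{V}_h$):
$$\mathbf{Q}_h = \mathbf{Z}\mathbf{W}_h^Q, \qquad \mathbf{K}_h = \mathbf{Z}\mathbf{W}_h^K, \qquad \mathbf{V}_h = \mathbf{Z}\mathbf{W}_h^V$$
where $\mathbf{W}_h^Q, \mathbf{W}_h^K, \mathbf{W}_h^V \in \mathbb{R}^{D \times d_k}$ and $d_k = D / n_h$ for $n_h$ heads. The attention weights are computed using the scaled dot-product:
$$\text{head}_h = \text{softmax}\!\left(\frac{\mathbf{Q}_h \mathbf{K}_h^\top}{\sqrt{d_k}}\right)\mathbf{V}_h$$
The matrix $\mathbf{Q}_h \mathbf{K}_h^\top \in \mathbb{R}^{(N+1) \times (N+1)}$ computes the unnormalized similarity score between every patch and every other patch simultaneously. The scaling factor $1/\sqrt{d_k}$ prevents the dot products from growing too large and pushing the softmax into regions with vanishing gradients. The outputs of all heads are concatenated and linearly projected to form the final MHSA output:
$$\text{MHSA}(\mathbf{Z}) = \text{Concat}(\text{head}_1, \dots, \text{head}_{n_h})\,\mathbf{W}^O, \qquad \mathbf{W}^O \in \mathbb{R}^{D \times D}$$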
This mechanism enables every patch to attend to every other patch in a single layer, achieving global context routing without the depth requirements of CNN receptive fields.
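A compact NumPy sketch of multi-head self-attention makes the shapes explicit. It assumes the per-head projection matrices are stacked side by side into single $D \times D$ matrices (a common implementation choice, not something the lecture prescribes), omits biases and dropout, and uses random weights as stand-ins for learned parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mhsa(Z, Wq, Wk, Wv, Wo, num_heads):
    """Z: (T, D) tokens; Wq, Wk, Wv, Wo: (D, D). Returns (output, attention weights)."""
    T, D = Z.shape
    d_k = D // num_heads

    def split_heads(X):
        # (T, D) -> (num_heads, T, d_k): head h takes columns h*d_k .. (h+1)*d_k
        return X.reshape(T, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(Z @ Wq), split_heads(Z @ Wk), split_heads(Z @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (num_heads, T, T)
    A = softmax(scores, axis=-1)                       # each row sums to 1
    heads = A @ V                                      # (num_heads, T, d_k)
    concat = heads.transpose(1, 0, 2).reshape(T, D)    # concatenate heads per token
    return concat @ Wo, A


rng = np.random.default_rng(0)
T, D, n_heads = 17, 64, 4                              # toy sizes (assumptions)
Z = rng.standard_normal((T, D))
Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) * 0.1 for _ in range(4))

out, attn = mhsa(Z, Wq, Wk, Wv, Wo, n_heads)
print(out.shape, attn.shape)                           # (17, 64) (4, 17, 17)
```

The `(num_heads, T, T)` attention tensor is the object whose size grows quadratically with the number of tokens.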
Positional encodings
Self-attention is permutation-equivariant by construction: it computes pairwise similarities without any inherent concept of sequence order or spatial geometry. Shuffling the input patches would produce the exact same attention weights for those patches, merely reordered. Therefore, spatial structure must be encoded explicitly through the positional embeddings $\mathbf{e}_{\text{pos}}^{(i)}$.
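This property is easy to verify numerically. The sketch below uses a single attention head with random weights (an assumption made for brevity) and checks that shuffling the input tokens merely shuffles the outputs in the same way, and that this symmetry breaks once a position-dependent embedding is added.

```python
import numpy as np


def self_attention(Z, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over the rows of Z."""
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V


rng = np.random.default_rng(0)
T, D = 6, 8
Z = rng.standard_normal((T, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))
perm = np.roll(np.arange(T), 1)                  # a fixed, non-trivial reordering

# Without positional information: outputs are the same vectors, reordered.
out = self_attention(Z, Wq, Wk, Wv)
out_shuffled = self_attention(Z[perm], Wq, Wk, Wv)
print(np.allclose(out_shuffled, out[perm]))      # True

# With positional embeddings tied to slots: shuffling the content changes the result.
pos = rng.standard_normal((T, D))
out_pos = self_attention(Z + pos, Wq, Wk, Wv)
out_pos_shuffled = self_attention(Z[perm] + pos, Wq, Wk, Wv)
print(np.allclose(out_pos_shuffled, out_pos[perm]))  # False
```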
In standard ViT, these are learned 1D positional embeddings optimized during training for each of the $N$ patch positions. Some 2D-aware models use factored 2D positional encodings, separating the row and column indices.
Interpolation for High-Resolution Fine-Tuning: ViT models are frequently pretrained at a lower resolution (e.g., $224 \times 224$) and fine-tuned or deployed at higher resolutions (e.g., $384 \times 384$). Because the positional embeddings are learned for a fixed grid of patches, increasing the resolution increases the number of patches (at patch size 16, from a $14 \times 14$ grid to a $24 \times 24$ grid), rendering the original embedding array too small. To solve this, the pretrained positional embeddings are treated as a 2D spatial grid, and a 2D bicubic interpolation is applied to stretch the grid to the new resolution. This allows the network to adapt to higher-density inputs using the spatial geometry learned at lower resolutions, at only a modest initial quality cost that fine-tuning quickly overcomes.
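The grid-resizing step can be sketched as follows. To stay dependency-free, this example swaps in bilinear interpolation written in plain NumPy rather than the bicubic interpolation named above; the reshaping logic is the same and only the interpolation kernel differs. The grid sizes (14 to 24, corresponding to 224-pixel and 384-pixel inputs at patch size 16) and the embedding width are illustrative assumptions, and `resize_pos_embed` is a hypothetical helper name.

```python
import numpy as np


def resize_pos_embed(pos_emb, old_grid, new_grid):
    """Resize (1 + old_grid^2, D) positional embeddings to (1 + new_grid^2, D).

    Row 0 is the class-token embedding and is carried over unchanged.
    Uses bilinear interpolation with aligned corners (stand-in for bicubic).
    """
    cls_emb, grid_emb = pos_emb[:1], pos_emb[1:]
    D = grid_emb.shape[1]
    grid = grid_emb.reshape(old_grid, old_grid, D)    # recover the 2D layout

    coords = np.linspace(0, old_grid - 1, new_grid)   # new positions in old-grid units
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo

    # Interpolate along rows, then along columns.
    rows = grid[lo] * (1 - frac)[:, None, None] + grid[hi] * frac[:, None, None]
    out = rows[:, lo] * (1 - frac)[None, :, None] + rows[:, hi] * frac[None, :, None]
    return np.concatenate([cls_emb, out.reshape(new_grid * new_grid, D)], axis=0)


rng = np.random.default_rng(0)
old_grid, new_grid, D = 14, 24, 32
pos_emb = rng.standard_normal((1 + old_grid**2, D))

resized = resize_pos_embed(pos_emb, old_grid, new_grid)
print(pos_emb.shape, "->", resized.shape)             # (197, 32) -> (577, 32)
```

In a PyTorch pipeline the same effect is usually obtained by reshaping the grid embeddings to `(1, D, old_grid, old_grid)` and calling `torch.nn.functional.interpolate` with `mode="bicubic"`.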
Scaling behavior of ViTs
The most consequential empirical result for ViTs is how strongly their performance depends on data scale, and how slowly it saturates as data grows.
On smaller datasets (like ImageNet-1K with ~1.28M images), ViTs trained from scratch typically underperform well-tuned ResNets. Because the ViT lacks the strong local inductive biases of a CNN, it must learn the concepts of translation equivariance and locality entirely from the data, which requires a massive amount of supervision.
However, on massive datasets (like JFT-300M with 300 million images, or LAION-5B), ViTs significantly outperform the best CNNs, and the performance gap tends to widen as the model size scales. This is the data-hungry but highly scalable property of attention-based models.
Scaling laws for ViTs are exceptionally smooth compared to CNNs. Performance on downstream classification improves predictably as a power law in both model parameter count and pretraining compute, with far fewer architecture-specific engineering tricks required. A ViT-G (1.8B parameters) achieves state-of-the-art results with almost identical hyperparameter configurations to a ViT-B (86M parameters).
Hierarchical and multi-scale variants (Swin Transformer)
Plain ViTs produce representations at a single, static spatial scale (one set of tokens propagating through all layers). This single-scale nature creates two major problems for dense prediction tasks like robotic manipulation, bounding-box detection, or pixel-level segmentation.
First, dense tasks require multi-scale feature pyramids (coarse, low-resolution features for semantic location; fine, high-resolution features for edge detail). Second, processing $N$ tokens with full global self-attention incurs an $O(N^2)$ computational cost; because $N$ grows with the image area, this scales quartically with the image side length and becomes prohibitive for high-resolution robotic perception.
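The quartic scaling follows from simple arithmetic: doubling the image side quadruples the token count and multiplies the number of attention entries by 16. The sketch below assumes square images, patch size 16, and 4-byte floats, and counts the score matrix for a single head and a single image only; the helper name is illustrative.

```python
def attention_matrix_stats(image_side: int, patch: int = 16, bytes_per_el: int = 4):
    """Return (tokens N, score-matrix entries N^2, MiB for one head of one image)."""
    n_tokens = (image_side // patch) ** 2
    entries = n_tokens ** 2
    return n_tokens, entries, entries * bytes_per_el / 2**20


for side in (224, 448, 896):
    n, entries, mib = attention_matrix_stats(side)
    print(f"{side:4d}px  N={n:5d}  N^2={entries:>10,d}  {mib:6.1f} MiB")
```

Multiplying the last column by the number of heads, layers, and batch size gives a rough sense of why naive global attention becomes impractical at high resolution.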
The Swin Transformer (Liu et al., 2021) directly addresses these flaws by re-introducing the hierarchical inductive biases of CNNs into the transformer architecture through three structural innovations:
- Window-based attention: Instead of global attention, self-attention is computed strictly within local windows of $M \times M$ patches (e.g., $M = 7$). This drops the computational complexity from $O(N^2)$ to $O(M^2 N)$, which is linear with respect to the image size $N$ (the number of patches).
- Shifted windows: If attention is only computed inside isolated windows, the model loses the ability to route information globally. Swin solves this by shifting the window partitions by $\lfloor M/2 \rfloor$ patches between consecutive transformer layers. This cross-window connection allows information to bleed across boundaries layer by layer. Mathematically, this is implemented using a cyclic shift of the feature map, followed by a masked self-attention operation to prevent tokens from attending to non-adjacent regions that wrapped around the image edges.
- Patch merging (Hierarchical stages): To build a feature pyramid, Swin periodically merges groups of $2 \times 2$ neighboring patches. This halves the spatial resolution ($H \times W \to \frac{H}{2} \times \frac{W}{2}$) and doubles the channel dimension ($C \to 2C$), mimicking the downsampling strides of a CNN like ResNet.
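The three mechanisms above can be illustrated at the level of tensor shapes. The following NumPy sketch uses a $56 \times 56$ token grid with 96 channels and $7 \times 7$ windows as illustrative sizes. It shows only the partitioning, the cyclic shift, and the $2 \times 2$ concatenation; the attention mask for wrapped-around tokens and the learned $4C \to 2C$ projection that follows patch merging are deliberately omitted.

```python
import numpy as np


def window_partition(x, M):
    """(H, W, C) feature map -> (num_windows, M*M, C) groups of tokens."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)


def patch_merge(x):
    """Concatenate each 2x2 neighborhood along channels: (H, W, C) -> (H/2, W/2, 4C)."""
    return np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]], axis=-1
    )


rng = np.random.default_rng(0)
H, W, C, M = 56, 56, 96, 7                        # illustrative sizes (assumptions)
x = rng.standard_normal((H, W, C))

windows = window_partition(x, M)                  # (64, 49, 96)

# Cyclic shift by floor(M/2) so the next layer's windows straddle old boundaries.
shifted = np.roll(x, shift=(-(M // 2), -(M // 2)), axis=(0, 1))
shifted_windows = window_partition(shifted, M)    # (64, 49, 96)

merged = patch_merge(x)                           # (28, 28, 384)

N = H * W
global_pairs = N * N                              # full attention: N^2 token pairs
window_pairs = windows.shape[0] * (M * M) ** 2    # windowed attention: M^2 * N pairs
print(windows.shape, merged.shape, global_pairs // window_pairs)
```

With these sizes the windowed variant evaluates 64 times fewer token pairs than global attention, and the ratio grows with resolution because the window size stays fixed.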
This hybrid design successfully recovers the hierarchical inductive bias of CNNs while retaining the representational expressivity of attention, making Swin-based architectures a popular choice for high-resolution, dense visual tasks.
| Property | CNN | ViT (plain) | Swin |
|---|---|---|---|
| Long-range context | $O(d)$ layers for distance $d$ | $O(1)$ layer | $O(1)$ within window |
| Multi-scale features | Native | Requires adaptation | Native |
| Attention cost | N/A | $O(N^2)$ | $O(M^2 N)$ |
| Multimodal fusion | Awkward | Natural | Natural |
| Foundation model use | Limited | Dominant | Common |
Beyond 2D: Native 3D Spatial Representations (NeRFs & 3DGS)
While 2D Vision Transformers are the backbone of most current VLMs, physical AI and robotics fundamentally operate in three dimensions. A robot querying a VLM with "where is the mug?" requires 3D coordinates to plan a trajectory, not just a 2D pixel bounding box. Consequently, the frontier of vision backbones involves native 3D scene representations.
Neural Radiance Fields (NeRFs)
A NeRF represents a continuous 3D scene not as a grid of voxels, but as the weights of a multilayer perceptron (MLP). The network takes a continuous 3D coordinate $(x, y, z)$ and a 2D viewing direction $(\theta, \phi)$ as inputs, and outputs volume density $\sigma$ and view-dependent color $\mathbf{c} = (r, g, b)$.
To render a 2D image from a specific camera pose, rays are cast through the scene, and the MLP is queried at sampled points along each ray. The expected color of a pixel is computed using classical volume rendering integrals. By optimizing the MLP weights to minimize the photometric error between the rendered rays and a set of real 2D training images, the network implicitly memorizes the 3D geometry of the entire scene.
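In practice the volume rendering integral is approximated by a sum over samples along each ray: every sample contributes its color weighted by its own opacity and by the transmittance of everything in front of it. The NumPy sketch below implements that quadrature for one ray. The densities and colors are hand-written stand-ins for MLP outputs, and the function name is illustrative.

```python
import numpy as np


def render_ray(sigmas, colors, deltas):
    """Composite samples along one ray.

    sigmas: (S,) densities, colors: (S, 3) RGB, deltas: (S,) distances between samples.
    Returns (pixel color, per-sample weights).
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)             # opacity of each ray segment
    # Transmittance: fraction of light that survives all earlier segments.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    return weights @ colors, weights


deltas = np.full(4, 0.1)
colors = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=float)

# Empty space: zero density everywhere, so nothing contributes.
empty_pixel, _ = render_ray(np.zeros(4), colors, deltas)

# A dense surface starting at the third sample hides everything behind it.
wall_pixel, wall_weights = render_ray(np.array([0.0, 0.0, 1e3, 1e3]), colors, deltas)
print(empty_pixel, wall_pixel.round(3), wall_weights.round(3))
```

Because every step is differentiable, the photometric error on the rendered pixel can be backpropagated to the densities and colors, and from there to the MLP weights.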
3D Gaussian Splatting (3DGS)
While NeRFs produce stunning geometry, querying an MLP millions of times per frame is computationally expensive, making real-time 50Hz robotic control difficult. 3D Gaussian Splatting has recently emerged as a faster, explicit alternative.
Instead of a continuous neural network, 3DGS represents a scene as millions of discrete, anisotropic 3D Gaussians. Each Gaussian has a center position $\boldsymbol{\mu}$, a 3D covariance matrix $\boldsymbol{\Sigma}$ (determining its scale and rotation), an opacity $\alpha$, and spherical harmonics encoding its color. Because these are explicit geometric primitives, they can be projected (splatted) onto a 2D image plane via highly optimized GPU rasterization pipelines, achieving rendering speeds orders of magnitude faster than NeRFs while maintaining comparable fidelity.
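One way to see how a covariance matrix encodes both scale and rotation is the factorization $\boldsymbol{\Sigma} = \mathbf{R}\mathbf{S}\mathbf{S}^\top\mathbf{R}^\top$ used in the 3DGS paper, with a diagonal scale matrix $\mathbf{S}$ and a rotation $\mathbf{R}$ stored as a unit quaternion. The sketch below builds $\boldsymbol{\Sigma}$ from those parameters; the function names and example values are illustrative.

```python
import numpy as np


def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)               # normalize to a unit quaternion
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])


def gaussian_covariance(scale, quat):
    """Sigma = R S S^T R^T: symmetric and positive semi-definite by construction."""
    R = quat_to_rotmat(np.asarray(quat, dtype=float))
    S = np.diag(scale)
    return R @ S @ S.T @ R.T


scale = np.array([3.0, 1.0, 1.0])                    # elongated along the local x-axis

axis_aligned = gaussian_covariance(scale, [1.0, 0.0, 0.0, 0.0])
# Rotate 90 degrees about z: the long axis now points along world y.
half = np.pi / 4
rotated = gaussian_covariance(scale, [np.cos(half), 0.0, 0.0, np.sin(half)])
print(np.diag(axis_aligned), np.diag(rotated).round(6))
```

Optimizing scale and rotation separately, rather than the six free entries of $\boldsymbol{\Sigma}$ directly, keeps the matrix a valid covariance throughout gradient descent.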
Grounding VLMs in 3D
The crucial bridge to vision-language models occurs when these 3D representations are augmented with semantic embeddings. In systems like LERF (Language Embedded Radiance Fields) or ConceptGraphs, each 3D point (or Gaussian) is enriched with a high-dimensional semantic feature vector extracted from a VLM (like CLIP). This allows the robot to execute open-vocabulary, spatial queries directly in the 3D environment, bypassing the limitations of purely 2D vision backbones.
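At query time, the core operation is a similarity search between a text embedding and the per-point features. The sketch below is a schematic illustration only: the features are random vectors standing in for CLIP-style embeddings, the "text embedding" is a perturbed copy of one point's feature so that a correct answer exists, and `query_points` is a hypothetical helper rather than the API of LERF or ConceptGraphs.

```python
import numpy as np


def query_points(point_xyz, point_feats, text_feat, top_k=1):
    """Return the 3D positions whose features best match the text embedding."""
    pf = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    tf = text_feat / np.linalg.norm(text_feat)
    sims = pf @ tf                                   # cosine similarity per point
    best = np.argsort(-sims)[:top_k]                 # indices of the highest scores
    return point_xyz[best], sims[best]


rng = np.random.default_rng(0)
num_points, feat_dim = 100, 32                       # toy sizes (assumptions)
point_xyz = rng.uniform(-1.0, 1.0, size=(num_points, 3))
point_feats = rng.standard_normal((num_points, feat_dim))

# Stand-in for encoding a prompt such as "the mug": close to point 42's feature.
text_feat = point_feats[42] + 0.05 * rng.standard_normal(feat_dim)

xyz_hit, score = query_points(point_xyz, point_feats, text_feat)
print(xyz_hit[0], float(score[0]))
```

The result is a 3D coordinate rather than a 2D box, which is what a motion planner needs to compute a trajectory.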
Why ViTs dominate vision–language models
Despite the promise of 3D, the 2D Vision Transformer remains the dominant perceptual engine for multimodal systems. The architectural symmetry between transformers for vision and transformers for language is the primary reason. When both vision and language are represented as sequences of tokens processed by attention:
- Shared infrastructure: The exact same mathematical operations (MHSA, FFN, LN) serve both modalities, simplifying engineering and hardware optimization (e.g., highly optimized FlashAttention kernels apply equally to images and text).
- Natural fusion: Cross-attention between visual patch tokens and language tokens is a direct application of standard attention equations. No modality-specific architectural bridges or complex convolutional ROI-pooling mechanisms are required.
- Joint scaling: Both modalities benefit from the same scaling laws, allowing simultaneous improvement of vision and language representations simply by increasing the compute budget and dataset scale.
- Sequence compatibility: Text tokens and image patch tokens are dimensionally compatible and can be concatenated or interleaved into a single, massive context window. This sequence uniformity enables the multimodal in-context learning that underpins models like Flamingo, RT-2, and LLaVA.
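Sequence compatibility can be shown with shapes alone. In the sketch below, patch tokens from a vision encoder are mapped to the language model's width by a projection matrix and then concatenated with text token embeddings. The projector is an assumption in the style of LLaVA-like designs, needed whenever the two encoders have different widths; all tensors are random stand-ins and the sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_model = 64, 128                             # toy widths (assumptions)

image_tokens = rng.standard_normal((16, d_vision))      # patch tokens from a ViT
text_tokens = rng.standard_normal((5, d_model))         # embedded text tokens
W_proj = rng.standard_normal((d_vision, d_model)) * 0.02  # hypothetical learned projector

# Map visual tokens into the language model's embedding space.
visual_in_lm_space = image_tokens @ W_proj              # (16, 128)

# One flat sequence: the language model's attention treats both modalities alike.
sequence = np.concatenate([visual_in_lm_space, text_tokens], axis=0)

# Modality ids (0 = image, 1 = text) are handy for masking or bookkeeping.
modality = np.array([0] * len(image_tokens) + [1] * len(text_tokens))
print(sequence.shape, modality.tolist())
```

Interleaving several images with text follows the same pattern, with the concatenation order encoding where each image appears in the prompt.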
CLIP's image encoder, the visual encoder in BLIP-2, and the perception towers in cutting-edge robotics VLAs are nearly all variants of the plain ViT—the architectural choice that unlocked scalable multimodal AI.
Key takeaways
CNNs achieve strong performance through local connectivity, weight sharing, and hierarchical composition, but struggle to model long-range dependencies efficiently. The Vision Transformer (ViT) treats images as a sequence of flattened patch tokens, adding learned positional embeddings and processing them via Multi-Head Self-Attention. This grants global context in $O(1)$ layers at the cost of requiring massive pretraining datasets. ViTs scale predictably and, given enough data, outperform CNNs at the billion-parameter scale. Hierarchical variants like the Swin Transformer re-introduce multi-scale features and reduce attention cost using shifted windows. At the frontier of physical AI, native 3D representations like NeRFs and 3DGS are being enriched with VLM features to enable open-vocabulary spatial reasoning. Ultimately, the architectural symmetry between ViTs and LLMs makes transformer-based vision the dominant foundation for modern multimodal systems.
Conceptual questions
- Self-Attention Math Debugging: You are implementing a custom ViT-B/16 for a specialized satellite imagery pipeline. Your input images are high-resolution ($H \times W$ pixels), meaning $N = HW / 16^2$ patch tokens. During your first training run, the model OOMs (Out of Memory) during the forward pass. Mathematically define the shape of the unnormalized attention score matrix inside a single attention head. Based on this dimensionality, explain precisely why the memory footprint scales quadratically with $N$, and propose an architectural drop-in replacement (e.g., from the Swin family) to resolve this.
- Positional Interpolation Tradeoffs: Your team downloads a ViT-L/14 that was pretrained on lower-resolution images. You need to fine-tune it for a medical diagnostics task using higher-resolution scans. Describe the exact process of 2D bicubic interpolation applied to the learned positional embeddings $\mathbf{e}_{\text{pos}}$. If the medical condition relies heavily on high-frequency, single-pixel anomalies, explain what information is fundamentally blurred during this interpolation step and hypothesize whether the model will struggle during early fine-tuning epochs.
- Swin Window Masking: In a Swin Transformer, the cyclic shift allows cross-window information flow. However, shifting the image grid means that patches from the far left edge of the image are temporarily placed adjacent to patches from the far right edge. If uncorrected, the attention mechanism will compute high similarity scores between these physically distant objects. Explain how Swin uses a masked self-attention matrix to prevent this invalid attention routing.
- CNN vs ViT Scaling: A junior engineer trains a ResNet-50 and a ViT-B/16 from scratch on a proprietary dataset of only 50,000 images. The ResNet achieves 82% accuracy, while the ViT stalls at 64%. The engineer concludes the ViT is a flawed architecture for this domain. Provide a rigorous counter-argument referencing the specific inductive biases (local connectivity, translation equivariance) that CNNs possess and ViTs lack. What pretraining strategy must be employed before the ViT can become competitive on this small dataset?
- 3D Spatial Grounding: A robotics startup is using a 2D VLM to command a drone. The prompt is "Fly towards the red car." The VLM successfully returns a 2D bounding box of the red car in the drone's camera feed, but the drone cannot plan a stable trajectory because it lacks depth perception. Explain how incorporating a Language Embedded Radiance Field (LERF) or a semantically-enriched 3D Gaussian Splatting (3DGS) pipeline would solve this limitation, explicitly comparing the outputs of a 2D bounding box versus a 3D semantic coordinate query.
Looking ahead
With the vision backbone established, the next question is how to train it without human labels at the scale required for multimodal foundation models.
Week 2: Representation Learning for Vision. We examine the three dominant self-supervised paradigms—contrastive learning, masked autoencoding, and self-distillation—and analyze how each mathematically shapes the properties of the resulting visual representations that downstream VLMs depend on.
Further reading
- Dosovitskiy, A., et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR. (The original Vision Transformer / ViT paper).
- Liu, Z., et al. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV. (Introduced local windows and hierarchical scaling for ViTs).
- Mildenhall, B., et al. (2020). NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. ECCV.
- Kerbl, B., et al. (2023). 3D Gaussian Splatting for Real-Time Radiance Field Rendering. SIGGRAPH.