Week 7: Alternative VLM Architectures

Purpose of this lecture

The projector-based design (BLIP-2, LLaVA) dominates the current VLM landscape because it is simple and can be trained on relatively modest hardware. It has a structural limitation, however: visual tokens are injected exactly once, at the input, and the LLM must carry that visual context forward through all of its self-attention layers with no fresh access to the image. For tasks involving multiple images, long video sequences, or continuous streaming perception, this design degrades badly: the visual tokens either crowd out the text in the LLM's fixed context window or exceed it outright.

This lecture explores VLM architectures designed to bypass these bottlenecks. We examine Flamingo's depth-distributed gated cross-attention, the Perceiver's latent bottleneck for very long sequences, PaLI's unified multilingual encoder-decoder, and finally how these alternative architectures extend beyond vision to incorporate tactile and other non-visual modalities for physical AI.

Flamingo: Cross-attention at depth

Flamingo (Alayrac et al., 2022) addresses the context-dilution problem by structurally intertwining the visual and linguistic pathways. Instead of appending visual tokens to the input prompt, Flamingo interleaves cross-attention layers throughout the frozen language model.

Architecture

Flamingo starts with a frozen vision encoder (NFNet-F6 in the original paper; open reimplementations such as OpenFlamingo use CLIP) to produce raw visual features for each image in the context. Because an image might have $1024$ spatial tokens, and a prompt might contain $K$ images, passing $1024 \times K$ tokens into the LLM scales poorly.

To solve this, Flamingo introduces the Perceiver Resampler, which compresses a variable-length visual feature sequence into a fixed number of $M = 64$ visual tokens per image. These $64$ tokens are stored in a standalone "visual memory" cache.

The LLM (e.g., Chinchilla-70B) is kept strictly frozen. However, new, randomly initialized Gated Cross-Attention layers are inserted before every $k$-th transformer block (typically every 4th or 7th block). During generation, at these specific depths, the language tokens act as Queries that attend to the Keys and Values stored in the visual memory cache.
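The placement rule above can be sketched as a small helper. This is only an illustration: the 0-based indexing, the choice to start at block 0, and the 32-block model are assumptions made for the example, not details taken from the Flamingo paper.

```python
def cross_attention_positions(num_blocks: int, k: int) -> list[int]:
    """Indices of frozen LM blocks that get a gated cross-attention layer in front.

    Assumes 0-based block indices and that insertion starts at block 0.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return [i for i in range(num_blocks) if i % k == 0]


# Hypothetical 32-block language model with a new layer before every 4th block.
positions = cross_attention_positions(num_blocks=32, k=4)
print(positions)       # [0, 4, 8, 12, 16, 20, 24, 28]
print(len(positions))  # 8
```

Only the layers at these positions are new and trainable; every block of the language model itself stays frozen.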

The Mathematics of Gated Cross-Attention

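Inserting newly created, randomly initialized attention layers into a massive pretrained LLM usually destabilizes it: the random outputs of the new layers are added to the residual stream, so the frozen blocks receive hidden states unlike anything seen during pretraining and the model's language ability collapses at the start of training. Flamingo solves this using a zero-initialized gate: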

$$h' = h + \tanh(\alpha) \cdot \text{CrossAttn}(Q=h, K=V=F_\text{visual})$$

Here, $\alpha$ is a learnable scalar parameter that is explicitly initialized to $0$. At step $t=0$ of training, $\tanh(0) = 0$, so the cross-attention output is completely zeroed out, $h' = h$, and the model behaves identically to the original frozen LLM. During training, the gradients gradually push $\alpha$ away from zero, allowing the model to incorporate visual information smoothly without destabilizing the linguistic priors.
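A minimal sketch of the gate, written in plain Python to keep it dependency-free. It assumes a single attention head with identity query/key/value projections and tiny hypothetical dimensions; a real layer would have learned projections and multiple heads. The point is only that $\alpha = 0$ leaves the hidden states untouched.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attn(h, visual):
    """Single-head dot-product cross-attention with identity projections (a simplification)."""
    d = len(h[0])
    out = []
    for q in h:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in visual]
        w = softmax(scores)
        out.append([sum(wj * v[c] for wj, v in zip(w, visual)) for c in range(d)])
    return out


def gated_cross_attn(h, visual, alpha):
    """h' = h + tanh(alpha) * CrossAttn(Q=h, K=V=visual)."""
    gate = math.tanh(alpha)
    attn = cross_attn(h, visual)
    return [[hi + gate * ai for hi, ai in zip(hr, ar)] for hr, ar in zip(h, attn)]


random.seed(0)
h = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]       # 3 text tokens
visual = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]  # 5 visual tokens

at_init = gated_cross_attn(h, visual, alpha=0.0)
print(at_init == h)   # True: the gate is exactly zero, so h' = h

later = gated_cross_attn(h, visual, alpha=0.5)
print(later == h)     # False: once alpha moves away from zero, visual features flow in
```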

Multi-image and temporal modeling in Flamingo

The main architectural advantage of Flamingo's depth-distributed cross-attention is its native support for multi-image interleaved contexts and video.

In a projector-based model, processing $K=5$ images at 256 tokens each adds $5 \times 256 = 1280$ visual tokens to the input sequence. These tokens consume the context window linearly, and because self-attention costs $O(L^2)$ in the sequence length $L$, they inflate compute quadratically. In Flamingo, the LLM's self-attention processes only the text tokens. When the text tokens reach a gated cross-attention layer, they attend to a visual memory bank of $K \times 64$ tokens, whose size is fixed by the number of images and does not depend on image resolution.
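To make the comparison concrete, the sketch below counts attention-score entries per layer for both designs. It is a rough back-of-the-envelope model under stated assumptions: it ignores heads, hidden size, and feed-forward cost, the 200-token prompt is a hypothetical value, and the Flamingo count is an upper bound that ignores the cross-attention mask.

```python
def projector_cost(text_tokens, images, tokens_per_image):
    """Sequence length and self-attention score count when visual tokens join the input."""
    seq = text_tokens + images * tokens_per_image
    return seq, seq * seq


def flamingo_cost(text_tokens, images, tokens_per_image=64):
    """Sequence length and score count when visual tokens live in a separate memory bank."""
    self_scores = text_tokens * text_tokens
    cross_scores = text_tokens * images * tokens_per_image  # upper bound, no masking
    return text_tokens, self_scores + cross_scores


# Hypothetical prompt: 200 text tokens and K = 5 images.
print(projector_cost(200, 5, 256))  # (1480, 2190400)
print(flamingo_cost(200, 5))        # (200, 104000)
```

Under these assumptions the language model's own sequence stays at 200 tokens in the Flamingo case, while the projector design pushes it to 1480.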

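To handle interleaved image-text formats (e.g., a web page with text, an image, more text, another image), Flamingo applies a mask to the cross-attention layers. The mask is determined by where the images sit in the sequence rather than learned: a text token may attend only to the visual memory of the image that most recently preceded it. Information about earlier images still reaches later tokens indirectly, through the LLM's causal self-attention over the text.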
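The masking rule can be sketched as follows. The `"img"`/`"txt"` layout encoding and the function name are hypothetical conveniences for this example; the rule itself is the one described above, where each text token sees only the most recent preceding image.

```python
def preceding_image_mask(kinds):
    """Build the cross-attention mask for an interleaved sequence.

    kinds: list of "img" / "txt" entries in sequence order.
    Returns one row per text token; row[j] is True if that token may attend
    to the visual memory of image j.
    """
    num_images = kinds.count("img")
    mask = []
    last_image = -1  # index of the most recent image seen so far
    for kind in kinds:
        if kind == "img":
            last_image += 1
        else:
            mask.append([j == last_image for j in range(num_images)])
    return mask


layout = ["txt", "img", "txt", "txt", "img", "txt"]
for row in preceding_image_mask(layout):
    print(row)
# [False, False]   text before any image sees no visual memory
# [True, False]
# [True, False]
# [False, True]
```

A row that is entirely `False` would leave the softmax with nothing to attend to, so a real implementation needs to handle text that appears before the first image as a special case.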

For video, frames are sampled and encoded independently by the vision encoder, and learned temporal position embeddings are added to each frame's features. In the original Flamingo, the resulting spatio-temporal features are flattened and passed through the Perceiver Resampler together, so a whole clip is still summarized by a fixed set of 64 visual tokens. The language model can then describe the video by attending to this visual memory at every gated cross-attention layer and at every generation step, rather than relying on visual context injected once at the input.

Perceiver-style architectures (The Latent Bottleneck)

As the sequence length $N$ of inputs grows (e.g., high-resolution 4K video or multi-camera arrays on a robot), standard self-attention becomes impractical because its memory and compute scale as $O(N^2)$.

Perceiver (Jaegle et al., 2021) and its derivatives solve this by introducing an asymmetric latent bottleneck. The architecture relies on two distinct operations:

1. Cross-attention from Latents to Inputs (The Bottleneck): Instead of the inputs attending to themselves, the network initializes $M$ latent queries $Q \in \mathbb{R}^{M \times D}$ (where $M \ll N$, e.g., $M=256$). These queries attend to the massive input sequence $X \in \mathbb{R}^{N \times D_\text{in}}$:

$$L = \text{CrossAttn}(Q=\text{latents}, K=V=X)$$

The mathematical complexity of this operation is $O(M \cdot N)$. Because $M$ is small and fixed, this operation scales strictly linearly with the input length $N$.

2. Self-attention within the Latent Array: After the cross-attention step distills the massive input $X$ into the compact $M$ latents, all subsequent transformer blocks operate only on the $M$ latent tokens. The self-attention cost is $O(M^2)$, which is entirely independent of the raw input length $N$.
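The first operation can be sketched in a few lines of plain Python. This is a simplified illustration, not the Perceiver implementation: it assumes one head, identity projections, $D_\text{in} = D$, and small hypothetical sizes. What it shows is that the output always has $M$ rows, whatever the input length $N$.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def latent_cross_attn(latents, inputs):
    """Each of the M latent queries attends over all N inputs (M * N scores in total)."""
    d = len(latents[0])
    out = []
    for q in latents:
        scores = [sum(qi * xi for qi, xi in zip(q, x)) / math.sqrt(d) for x in inputs]
        w = softmax(scores)
        out.append([sum(wj * x[c] for wj, x in zip(w, inputs)) for c in range(d)])
    return out


random.seed(0)
D, M = 8, 4  # hypothetical latent width and latent count
latents = [[random.gauss(0, 1) for _ in range(D)] for _ in range(M)]

for n in (16, 256):
    inputs = [[random.gauss(0, 1) for _ in range(D)] for _ in range(n)]
    out = latent_cross_attn(latents, inputs)
    print(n, len(out), len(out[0]))
# 16 4 8
# 256 4 8
```

Every later block would operate on this $M \times D$ array only, which is why its cost no longer depends on $N$.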

The total attention cost drops from $O(N^2)$ to $O(M \cdot N + B \cdot M^2)$, where $B$ is the number of latent self-attention blocks. This is essentially the mechanism inside Flamingo's Perceiver Resampler, and it is what makes it practical for VLMs to process long video sequences without running out of GPU memory.
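A quick numerical check of this scaling, counting attention-score entries only. The input length, latent count, and block count below are hypothetical values chosen for illustration, and the full-attention baseline is assumed to use the same number of blocks.

```python
def full_self_attention_scores(n, blocks):
    """Score entries when every block runs self-attention over all n inputs."""
    return blocks * n * n


def perceiver_scores(n, m, blocks):
    """One latent-to-input cross-attention, then self-attention over m latents per block."""
    return m * n + blocks * m * m


n, m, blocks = 10_000, 256, 6
full = full_self_attention_scores(n, blocks)
perc = perceiver_scores(n, m, blocks)
print(full)                # 600000000
print(perc)                # 2953216
print(round(full / perc))  # 203

# Doubling the input length only adds m * n more scores to the bottleneck cost.
print(perceiver_scores(2 * n, m, blocks) - perc)  # 2560000
```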

PaLI: Unified Vision-Language Encoder-Decoder

PaLI (Pathways Language and Image model; Chen et al., 2023) represents the philosophical opposite of BLIP-2 and LLaVA. Rather than keeping vision and language in distinct, modular, frozen backbones connected by a tiny learned interface, PaLI trains a large, unified encoder-decoder architecture on both modalities. It does not start from scratch: it is initialized from pretrained unimodal checkpoints (a ViT image encoder and an mT5 text model) and then trained on a large multimodal, multilingual mixture.

Architecture

PaLI uses a 4B-parameter ViT-e image encoder to extract dense visual tokens. Crucially, it does not use a Q-Former or Perceiver. It passes the visual tokens directly into the encoder of an mT5-based encoder-decoder text model (mT5-XXL, about 13B parameters, for roughly 17B parameters in total in the largest PaLI). The text encoder processes the visual tokens and text tokens as one sequence, and the text decoder autoregressively generates the response. Unlike the frozen LLMs of BLIP-2 and Flamingo, the language model weights are updated during multimodal training.

Multilingual Multitask Generalization

PaLI is trained simultaneously on a mixture of over 100 languages across tasks like VQA, OCR, captioning, and image classification. This unified multi-task, multilingual objective encourages a shared alignment: the visual representation of an "apple" is aligned simultaneously with the English token "apple," the French token "pomme," and the Spanish token "manzana." Consequently, PaLI exhibits zero-shot transfer across languages: if it is fine-tuned to perform a complex visual task purely in English, it can often perform that task when prompted in Japanese, relying on the shared semantic space it built during pretraining.

Beyond Vision: Tactile and Non-Visual Modalities

Current VLMs are predominantly "Vision-Language." However, physical AI and robotics are fundamentally about contact. A robot cannot effectively screw in a bolt, fold a deformable cloth, or grasp a fragile egg relying solely on an RGB camera feed that gets occluded by the robot's own arm.

Alternative architectures are now being expanded into Multimodal Transformers that process force, touch, and audio time-series alongside vision:

1. Tactile Sensors (GelSight / DIGIT): Modern robotic fingertips use elastomer-based cameras that capture high-resolution deformation topography when pressing into an object. Architecturally, these readings are processed as localized video streams. A ViT processes the tactile frames, and a Perceiver bottleneck compresses them into "tactile tokens" that are interleaved with the standard visual tokens in the LLM's context window.

2. Force-Torque (F/T) Sequences: A robot's wrist sensors output continuous 6-DoF data $(F_x, F_y, F_z, \tau_x, \tau_y, \tau_z)$ . Rather than treating this as vision, it is treated as a 1D sequence. A 1D Convolutional network or an LSTM encodes the temporal history of the forces into a distinct set of "proprioceptive tokens."

3. Architectural Fusion: To fuse these modalities, networks often use the Perceiver architecture. The fixed latent array $Q$ cross-attends to the concatenated sequence $X = [X_\text{vision}, X_\text{tactile}, X_\text{force}, X_\text{audio}]$. This allows the attention mechanism to dynamically weight whichever modality currently provides the strongest signal (e.g., relying mostly on vision while reaching for an object, then shifting most of the attention weight to the tactile and force tokens as soon as the gripper makes contact).
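The fusion step in item 3 can be illustrated by summing softmax attention weights per modality for a single latent query. All numbers below are hand-picked, hypothetical key vectors meant to mimic an occluded camera and a fingertip in contact; they are not measurements, and a trained model would learn its own queries and keys.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def modality_attention_mass(query, tokens, labels):
    """Total softmax attention weight that one latent query assigns to each modality."""
    d = len(query)
    scores = [sum(q * x for q, x in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    weights = softmax(scores)
    mass = {}
    for w, label in zip(weights, labels):
        mass[label] = mass.get(label, 0.0) + w
    return mass


query = [1.0, 0.0]                    # hypothetical latent query
vision = [[0.1, 0.3], [0.0, -0.2]]    # occluded camera: keys weakly aligned with the query
tactile = [[3.0, 0.1], [2.5, -0.4]]   # in contact: keys strongly aligned with the query

tokens = vision + tactile
labels = ["vision"] * len(vision) + ["tactile"] * len(tactile)

mass = modality_attention_mass(query, tokens, labels)
print(mass)  # most of the weight lands on the tactile tokens
```

Because the weights come from one softmax over the concatenated sequence, attention gained by one modality is necessarily taken from the others.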

Architectural Comparison

| Architecture | Visual injection | Multi-image support | Trainable parameters | Compute cost |
|---|---|---|---|---|
| LLaVA-style | Once at input | Limited (input sequence grows with every image) | Projector + LLM LoRA | Low |
| Flamingo | At every $k$-th LLM layer | Native (via visual memory bank) | Gated cross-attention layers + Perceiver Resampler | Medium |
| Perceiver | Latent compression | Very long inputs (linear scaling in $N$) | Latent queries + processor | Medium |
| PaLI | Unified joint training | Via interleaved tokens | Backbone weights themselves (about 17B in the largest PaLI) | Very high |

There is no single "best" architecture. The optimal choice depends entirely on deployment constraints: whether the application requires streaming video support (Flamingo/Perceiver), multilingual spatial reasoning (PaLI), or fast, single-image inference on a localized edge-compute board (LLaVA).

Key takeaways

Alternative VLM architectures bypass the limitations of simple linear projectors. Flamingo inserts zero-initialized, gated cross-attention layers throughout a frozen LLM, creating a persistent "visual memory" that scales to multi-image and video reasoning without lengthening the LLM's input sequence. Perceiver architectures decouple computational complexity from input length by forcing massive multimodal inputs (vision, tactile, audio) through a small, fixed-size latent bottleneck, achieving $O(M \cdot N)$ linear scaling. PaLI discards frozen modularity, training a large encoder-decoder that binds visual features to over 100 languages simultaneously. Ultimately, moving physical AI into the real world requires architectures capable of fusing high-frequency, non-visual modalities (like force and touch) into the same semantic latent space.

Conceptual questions

1. Gated Attention Initialization: Flamingo initializes its cross-attention gate with $\alpha = 0$, meaning $h' = h + 0$. Explain how severe the disruption would be if $\alpha$ were instead initialized from a standard Gaussian distribution $\mathcal{N}(0, 1)$. Specifically, how would the introduction of random visual matrices into the residual stream immediately impact the LLM's cross-entropy loss on language modeling during the first 100 steps of training?
2. Perceiver Resampler Mathematics: Consider a robotics application streaming 5 cameras at 30fps. Over a 3-second window, this generates 450 frames. If each frame produces 256 ViT patch tokens, calculate the total input sequence length $N$. If this is fed into a Perceiver Resampler with $M=64$ latent queries, calculate the ratio of operations required for the cross-attention bottleneck $O(MN)$ compared to attempting full self-attention $O(N^2)$. Why is the Perceiver mathematically mandatory for continuous embodied agents?
3. PaLI Multilingual Grounding: PaLI trains jointly on 100+ languages. Suppose the model is trained on image-text pairs of "a red truck" in English, and on purely text-text translation pairs linking English "red truck" to German "roter Lastwagen". Explain the geometric mechanism in the shared embedding space that allows PaLI to accurately answer questions in German about an image of a red truck (zero-shot visual transfer), despite never seeing a German caption paired with an image during training.
4. Tactile Modality Fusion: You are designing a multimodal transformer for a robotic hand. The prompt is: "Does this fabric feel like silk or burlap?" The visual camera is completely blocked by the robot's own fingers. Using the Perceiver latent fusion paradigm $L = \text{CrossAttn}(Q, [X_\text{vision}, X_\text{tactile}])$, describe what specific shifts you expect to observe in the softmax attention weights over the $[X_\text{vision}, X_\text{tactile}]$ sequence right as the robot makes physical contact with the fabric. How does the architecture prevent the blocked visual tokens from poisoning the prediction?
5. Architectural System Design: Construct a decision framework for a startup building AI for autonomous drones inspecting wind turbines. The drones capture continuous 4K video, record audio (listening for blade cracks), and must generate detailed, multi-paragraph engineering reports in 3 languages. Evaluate LLaVA, Flamingo, Perceiver, and PaLI against these constraints. Which specific combination of architectural components (e.g., a Perceiver frontend attached to a PaLI backend) provides the best balance of long-context handling and multilingual generation?

Solutions

1. Gated-attention init. With $\alpha=0$ the cross-attention is a no-op at start ($h'=h$), so the LLM begins exactly as the pretrained model and nothing is disrupted. Initializing $\alpha \sim \mathcal{N}(0,1)$ gives gates $\tanh(\alpha)$ of substantial magnitude, so the outputs of randomly initialized cross-attention layers are added to the residual stream from step 0. The frozen LLM then sees hidden states far from anything it was pretrained on, and language-modeling cross-entropy spikes in the first steps. The frozen weights themselves are not overwritten, but the new layers must first learn to suppress their own noise before any useful visual signal can form, which makes early training unstable.
2. Perceiver math. $N = 450 \times 256 = 115{,}200$ tokens. Cross-attention costs $O(MN) = 64 \times 115{,}200$ versus self-attention $O(N^2) = 115{,}200^2$, a ratio of $M/N = 1/1800$. Self-attention is 1800× more expensive and scales quadratically, so the fixed-size latent bottleneck is mandatory for continuous streaming perception.
3. PaLI multilingual grounding. Joint training places English "red truck" and German "roter Lastwagen" near each other in the shared space (via the text-text link), while image-text training anchors the truck image near the English phrase. By transitivity the image also lands near the German phrase, so a German question retrieves the right visual concept zero-shot: both languages share one visual anchor geometrically.
4. Tactile fusion. While vision is blocked the visual tokens carry little task signal; on contact the tactile tokens become highly informative, so the softmax attention mass shifts from the visual to the tactile tokens. Because attention is a learned soft selection, the low-similarity blocked visual tokens receive low weight and contribute little to the latent, so the model routes around them.
5. Drone system design. Continuous 4K video plus audio plus long multilingual reports call for a Perceiver/Resampler frontend (compress unbounded streams into fixed latents for effectively unbounded context) feeding a PaLI-style multilingual encoder-decoder backend (native multilingual generation), with Flamingo-style gated cross-attention for interleaved streaming. LLaVA's prepend-all-tokens design alone exceeds the context window. Optimal: Perceiver frontend → PaLI backend.

Looking ahead

With the major VLM architectural families established, a practical engineering question emerges: how do we adapt these massive 7B to 70B parameter models to highly specific, proprietary downstream tasks without the millions of dollars required to retrain them?

Week 8: Fine-Tuning and Parameter-Efficient Methods. We derive the mathematics behind Low-Rank Adaptation (LoRA), examine QLoRA's 4-bit quantization approach that enables single-GPU fine-tuning, and analyze the strict tradeoffs between full fine-tuning, LoRA, and adapter methods across physical AI domains.
