
Week 7: Alternative VLM Architectures

✦Learning Outcomes
  • Implement Perceiver resampler for variable-to-fixed token compression
  • Analyze tradeoffs between context window limitations and computational cost
  • Apply alternative architectures for multi-image and video VLM (Vision-Language Model) tasks
◆Prerequisites
  • Week 5: BLIP-2 - Bridging frozen encoders
  • Week 6: LLaVA - Projector architecture

Purpose of this lecture

The projector-based design (BLIP-2, LLaVA) dominates the current VLM landscape due to its simplicity and ease of training on consumer hardware. However, it has a structural limitation: visual tokens are injected exactly once at the input, and the LLM (Large Language Model) must carry this visual context forward through all of its self-attention layers without any fresh access to the image. For tasks involving multiple images, long video sequences, or continuous streaming perception, this design degrades sharply because the visual information becomes diluted or the visual tokens exceed the LLM's fixed context window.

This lecture explores VLM architectures designed to bypass these bottlenecks. We examine Flamingo's depth-distributed gated cross-attention, the Perceiver's latent bottleneck for very long input sequences, PaLI's unified multilingual encoder-decoder, and finally how these alternative architectures extend beyond vision to incorporate tactile and other non-visual modalities for physical AI.


Flamingo: Cross-attention at depth

Flamingo (Alayrac et al., 2022) addresses the context-dilution problem by structurally intertwining the visual and linguistic pathways. Instead of appending visual tokens to the input prompt, Flamingo interleaves cross-attention layers throughout the frozen language model.

Architecture

Flamingo starts with a frozen vision encoder (NFNet-F6 in the original paper; open reimplementations such as OpenFlamingo use a CLIP ViT) to produce raw visual features for each image in the context. Because an image might yield $1024$ spatial tokens, and a prompt might contain $K$ images, passing $1024 \times K$ tokens into the LLM does not scale.

To solve this, Flamingo introduces the Perceiver Resampler, which compresses any variable-length visual feature sequence into a fixed number of $M = 64$ visual tokens. These $64$ tokens are stored in a standalone "visual memory" cache.

The LLM (e.g., Chinchilla-70B) is kept frozen. New, randomly initialized Gated Cross-Attention layers are inserted before every $k$-th transformer block (for example every 4th or 7th block, depending on model size). During generation, at these specific depths, the language tokens act as Queries that attend to the Keys and Values stored in the visual memory cache.

The Mathematics of Gated Cross-Attention

Inserting new, randomly initialized attention layers into a large pretrained LLM usually destabilizes it: even with the LLM weights frozen, the random layer outputs are added to the residual stream and corrupt the hidden states that the pretrained blocks expect, so the model initially behaves as if it had forgotten its language skills. Flamingo solves this using a zero-initialized gate:

$$h' = h + \tanh(\alpha) \cdot \text{CrossAttn}(Q=h,\; K=V=F_\text{visual})$$

Here, $\alpha$ is a learnable scalar parameter that is explicitly initialized to $0$. At step $t=0$ of training, $\tanh(0) = 0$, meaning the cross-attention output is completely zeroed out. With $h' = h$, the model behaves identically to the original frozen LLM. During training, the gradients slowly push $\alpha$ away from zero, allowing the model to gradually learn to incorporate visual information without destabilizing the linguistic priors.
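
The gating equation can be checked numerically. The following is a minimal NumPy sketch, not Flamingo's actual implementation: it assumes a single attention head, omits layer normalization and the gated feed-forward sublayer, and uses hypothetical names (`gated_cross_attn`, `w_q`, `w_k`, `w_v`) with random weights chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attn(h, f_visual, w_q, w_k, w_v):
    """Single-head cross-attention: text states h query the visual memory."""
    q, k, v = h @ w_q, f_visual @ w_k, f_visual @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def gated_cross_attn(h, f_visual, alpha, w_q, w_k, w_v):
    """h' = h + tanh(alpha) * CrossAttn(Q=h, K=V=F_visual)."""
    return h + np.tanh(alpha) * cross_attn(h, f_visual, w_q, w_k, w_v)

rng = np.random.default_rng(0)
d = 16                                   # illustrative hidden size (assumption)
h = rng.normal(size=(10, d))             # 10 text-token hidden states
f_visual = rng.normal(size=(64, d))      # 64 resampled visual tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out_init = gated_cross_attn(h, f_visual, 0.0, w_q, w_k, w_v)  # alpha at initialization
out_open = gated_cross_attn(h, f_visual, 0.5, w_q, w_k, w_v)  # alpha after some training

print(np.array_equal(out_init, h))       # gate closed: the layer is an identity map
print(np.abs(out_open - h).max() > 0)    # gate open: visual content enters the stream
```

Even though the attention weights are random, the closed gate returns the text states unchanged, which is why the frozen LLM's behaviour is preserved at the start of training.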


Multi-image and temporal modeling in Flamingo

The architectural genius of Flamingo's depth-distributed cross-attention is its native support for multi-image interleaved contexts and video.

In a projector-based model, processing $K=5$ images at 256 tokens each requires concatenating $5 \times 256 = 1280$ visual tokens into the input sequence. These tokens consume the LLM's context window linearly, while self-attention cost grows quadratically with the total sequence length $L$, i.e. $O(L^2)$. In Flamingo, the LLM's self-attention processes only the text tokens. When the text tokens reach a gated cross-attention layer, they attend over a visual memory bank of $K \times 64$ tokens, a cost that grows only linearly with the number of images.

To handle interleaved image-text formats (e.g., a web page with text, an image, more text, another image), Flamingo applies a mask to the cross-attention layers. The mask is not learned; it is determined by where the images sit in the sequence. A text token is only allowed to attend to the visual memory of the image that immediately preceded it. Information about earlier images still reaches later tokens indirectly, through the LLM's causal self-attention over the text.

For video, frames are sampled and encoded independently by the vision encoder, learned temporal position embeddings are added to each frame's features, and the features of all $T$ frames are flattened into one sequence. In the original Flamingo, the Perceiver Resampler then compresses this whole spatio-temporal sequence into the same fixed set of 64 visual tokens, so a video costs the language model no more than a single image. (A per-frame variant that keeps $T \times 64$ tokens is also possible, at proportionally higher cross-attention cost.) At every gated cross-attention layer, each generated text token can attend to this visual memory, so any step of generation can draw on information from any part of the video.
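
The difference in scaling can be made concrete with a rough count of attention score pairs per layer. This is a back-of-the-envelope sketch only: the 200 text tokens are an assumed prompt length, the helper names are hypothetical, and the count ignores heads, hidden size, and the fact that Flamingo applies cross-attention only at some depths.

```python
def projector_sequence_length(num_images, tokens_per_image, text_tokens):
    """Sequence length when all visual tokens are prepended to the prompt."""
    return num_images * tokens_per_image + text_tokens

def projector_pairs(num_images, tokens_per_image, text_tokens):
    """Query-key pairs for self-attention over the combined sequence."""
    length = projector_sequence_length(num_images, tokens_per_image, text_tokens)
    return length * length

def flamingo_pairs(num_images, text_tokens, memory_tokens=64):
    """Text-only self-attention plus one cross-attention over the memory bank."""
    self_attn = text_tokens * text_tokens
    cross_attn = text_tokens * num_images * memory_tokens
    return self_attn + cross_attn

K, TEXT = 5, 200  # 5 images as in the text; 200 text tokens is an assumption

print(projector_sequence_length(K, 256, TEXT))  # 1480 tokens in the context window
print(projector_pairs(K, 256, TEXT))            # 2190400 pairs
print(flamingo_pairs(K, TEXT))                  # 104000 pairs
```

Under these assumptions the projector design scores about 21 times more pairs per layer, and the gap widens as more images are added because only the projector term is quadratic in the visual token count.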
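
The rule that a text token sees only the most recent image can be written as a boolean mask. The sketch below is one way to build it, assuming each image is represented in the text sequence by a single marker position; the function name and the toy sequence are illustrative, not taken from the Flamingo codebase.

```python
import numpy as np

def preceding_image_mask(is_image_marker, tokens_per_image):
    """Return a (T, K * tokens_per_image) boolean cross-attention mask.

    is_image_marker[t] is 1 where an image is inserted in the sequence.
    Entry [t, j] is True if position t may attend to memory slot j.
    """
    is_image_marker = np.asarray(is_image_marker, dtype=int)
    num_images = int(is_image_marker.sum())
    # 1-based index of the most recent image at each position (0 = no image yet).
    latest_image = np.cumsum(is_image_marker)
    # 1-based index of the image that owns each slot of the visual memory bank.
    memory_owner = np.repeat(np.arange(1, num_images + 1), tokens_per_image)
    return latest_image[:, None] == memory_owner[None, :]

# Toy sequence: text, <image 1>, text, text, <image 2>, text
markers = [0, 1, 0, 0, 1, 0]
mask = preceding_image_mask(markers, tokens_per_image=2)
print(mask.astype(int))
```

In this toy case the position before any image has an all-False row, so a real implementation has to handle it explicitly (for example by zeroing the cross-attention output) rather than applying a softmax over an empty set.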


Perceiver-style architectures (The Latent Bottleneck)

As the sequence length $N$ of inputs grows (e.g., high-resolution 4K video or multi-camera arrays on a robot), standard self-attention becomes impractical due to its $O(N^2)$ memory and compute scaling.

Perceiver (Jaegle et al., 2021) and its derivatives solve this by introducing an asymmetric latent bottleneck. The architecture relies on two distinct operations:

1. Cross-attention from Latents to Inputs (The Bottleneck): Instead of the inputs attending to themselves, the network initializes $M$ latent queries $Q \in \mathbb{R}^{M \times D}$ (where $M \ll N$, e.g., $M=256$). These queries attend to the massive input sequence $X \in \mathbb{R}^{N \times D_\text{in}}$:

$$L = \text{CrossAttn}(Q=\text{latents},\; K=V=X)$$

The complexity of this operation is $O(M \cdot N)$. Because $M$ is small and fixed, this operation scales linearly with the input length $N$.

2. Self-attention within the Latent Array: After the cross-attention step distills the massive input $X$ into the compact $M$ latents, all subsequent transformer blocks operate only on the $M$ latent tokens. The self-attention cost is $O(M^2)$ per block, which is independent of the raw input length $N$.

The total computational complexity drops from $O(N^2)$ to $O(M \cdot N + L \cdot M^2)$, where $L$ in this expression counts the latent self-attention blocks (it is not the latent array $L$ from the equation above). This is the same mechanism used inside Flamingo's Perceiver Resampler, and it is one of the main ways current VLMs keep memory bounded when processing long videos.
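
The two operations can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the Perceiver or Flamingo implementation: single-head attention, no layer normalization or feed-forward sublayers, random weights, and hypothetical names (`init_params`, `perceiver_resample`). Its purpose is to show that the output size depends only on $M$, never on $N$.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, context, w_q, w_k, w_v):
    """Single-head attention; cost is len(queries) * len(context) score pairs."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def init_params(d, num_self_layers, rng):
    def proj():
        return tuple(rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    return {"cross": proj(), "self": [proj() for _ in range(num_self_layers)]}

def perceiver_resample(x, latents, params):
    """Compress x of shape (N, D) into latents of shape (M, D)."""
    # Step 1: latents query the inputs, O(M * N).
    z = latents + attention(latents, x, *params["cross"])
    # Step 2: self-attention among latents only, O(M^2) per block.
    for layer in params["self"]:
        z = z + attention(z, z, *layer)
    return z

rng = np.random.default_rng(0)
d, m = 32, 64                            # illustrative sizes (assumptions)
latents = rng.normal(size=(m, d))        # would be learned parameters in practice
params = init_params(d, num_self_layers=2, rng=rng)

short_out = perceiver_resample(rng.normal(size=(300, d)), latents, params)
long_out = perceiver_resample(rng.normal(size=(5000, d)), latents, params)
print(short_out.shape, long_out.shape)   # same fixed shape for both input lengths
```

Whether the input has 300 or 5000 tokens, the downstream model receives the same number of latent tokens, which is what makes the bottleneck usable as a variable-to-fixed compressor.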


PaLI: Unified Vision-Language Encoder-Decoder

PaLI (Pathways Language and Image model; Chen et al., 2023) represents the philosophical opposite of BLIP-2 and LLaVA. Rather than keeping vision and language as distinct, modular, frozen backbones connected by a tiny learned interface, PaLI scales both components together and trains a large, unified encoder-decoder on a joint vision-language mixture. Its components are initialized from pretrained ViT and mT5 checkpoints rather than from random weights.

Architecture

PaLI uses a 4B-parameter ViT-e image encoder to extract dense visual tokens. Crucially, it does not use a Q-Former or Perceiver. It passes the visual tokens directly into an mT5-based encoder-decoder text model (mT5-XXL at about 13B parameters in the largest configuration, for roughly 17B parameters in total). The text encoder processes the visual tokens and text tokens as one sequence, and the text decoder autoregressively generates the response. The language model itself is trained on the multimodal mixture instead of being kept frozen behind an adapter.

Multilingual Multitask Generalization

PaLI is trained simultaneously on a mixture of over 100 languages across tasks like VQA, OCR, captioning, and image classification. This unified multi-task, multilingual objective encourages a shared alignment: the visual representation of an "apple" is aligned simultaneously with the English token "apple," the French token "pomme," and the Spanish token "manzana." Consequently, PaLI exhibits zero-shot transfer across languages: if it is fine-tuned to perform a complex visual task purely in English, it can often perform that task when prompted in Japanese, relying on the shared semantic space it built during pretraining.


Beyond Vision: Tactile and Non-Visual Modalities

Current VLMs are predominantly "Vision-Language." However, physical AI and robotics are fundamentally about contact. A robot cannot effectively screw in a bolt, fold a deformable cloth, or grasp a fragile egg relying solely on an RGB camera feed that gets occluded by the robot's own arm.

Alternative architectures are now being expanded into Multimodal Transformers that process force, touch, and audio time-series alongside vision:

1. Tactile Sensors (GelSight / DIGIT): Modern robotic fingertips use elastomer-based cameras that capture high-resolution deformation topography when pressing into an object. Architecturally, these readings are processed as localized video streams. A ViT processes the tactile frames, and a Perceiver bottleneck compresses them into "tactile tokens" that are interleaved with the standard visual tokens in the LLM's context window.

2. Force-Torque (F/T) Sequences: A robot's wrist sensors output continuous 6-DoF data $(F_x, F_y, F_z, \tau_x, \tau_y, \tau_z)$. Rather than treating this as vision, it is treated as a 1D sequence. A 1D convolutional network or an LSTM encodes the temporal history of the forces into a distinct set of "proprioceptive tokens."

3. Architectural Fusion: To fuse these modalities, networks often use the Perceiver architecture. The fixed latent array $Q$ cross-attends to the concatenated sequence $X = [X_\text{vision}, X_\text{tactile}, X_\text{force}, X_\text{audio}]$. This allows the attention mechanism to dynamically weight whichever modality currently provides the strongest signal (e.g., relying mostly on vision while reaching for an object, then shifting most of the attention weight to the tactile and force tokens once the gripper makes contact).
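
The dynamic weighting described in item 3 can be observed by summing the softmax attention weights over each modality's slice of the concatenated sequence. The sketch below is a toy under loud assumptions: the features are hand-constructed rather than sensor data, there are no learned projections, and "contact" is simulated by making the tactile tokens align with the latent queries. The helper `modality_attention_mass` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def modality_attention_mass(latents, modalities):
    """Average share of attention weight that the latents give each modality."""
    names = list(modalities)
    x = np.concatenate([modalities[n] for n in names], axis=0)
    weights = softmax(latents @ x.T / np.sqrt(latents.shape[-1]))  # rows sum to 1
    mass, start = {}, 0
    for name in names:
        stop = start + len(modalities[name])
        mass[name] = float(weights[:, start:stop].sum(axis=-1).mean())
        start = stop
    return mass

d = 8
latents = 2.0 * np.eye(4, d)             # 4 toy latent queries
vision = np.full((6, d), 0.1)            # weak, uninformative visual tokens
tactile_free = np.zeros((4, d))          # no contact: tactile signal is flat
tactile_contact = 3.0 * latents          # contact: tactile tokens match the queries

before = modality_attention_mass(latents, {"vision": vision, "tactile": tactile_free})
after = modality_attention_mass(latents, {"vision": vision, "tactile": tactile_contact})
print(before)
print(after)
```

In this constructed case the tactile share rises from under half to over 90% of the attention weight, without any explicit switching logic: the shift follows from the softmax over similarity scores alone.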


Architectural Comparison

| Architecture | Visual injection | Multi-image support | Trainable parameters | Compute cost |
|---|---|---|---|---|
| LLaVA-style | Once at input | Limited (context grows with every image, attention cost quadratic) | Projector + LLM LoRA | Low |
| Flamingo | At every $k$-th LLM layer | Native (via visual memory bank) | Gated cross-attention layers + Perceiver Resampler | Medium |
| Perceiver | Latent compression | Very long inputs (linear scaling) | Latent queries + processor | Medium |
| PaLI | Unified joint training | Via interleaved tokens | Most or all parameters (about 17B for the largest PaLI) | Very High |

There is no single "best" architecture. The optimal choice depends entirely on deployment constraints: whether the application requires streaming video support (Flamingo/Perceiver), multilingual spatial reasoning (PaLI), or fast, single-image inference on a localized edge-compute board (LLaVA).


Key takeaways

Alternative VLM architectures bypass the limitations of simple linear projectors. Flamingo inserts zero-initialized, gated cross-attention layers throughout a frozen LLM, creating a persistent "visual memory" that scales to multi-image and video reasoning without quadratic growth in the LLM's context. Perceiver architectures decouple computational complexity from input length by forcing massive multimodal inputs (vision, tactile, audio) through a small, fixed-size latent bottleneck, achieving $O(M \cdot N)$ linear scaling. PaLI drops frozen modularity, training a single large encoder-decoder that binds visual features to over 100 languages simultaneously. Ultimately, moving physical AI into the real world requires architectures capable of fusing high-frequency, non-visual modalities (like force and touch) into the same semantic latent space.


Conceptual questions

  1. Gated Attention Initialization: Flamingo initializes its cross-attention gate with $\alpha = 0$, meaning $h' = h + 0$. Explain how severe the disruption would be if $\alpha$ were instead initialized from a standard Gaussian distribution $\mathcal{N}(0, 1)$. Specifically, how would the introduction of random visual matrices into the residual stream immediately impact the LLM's cross-entropy loss on language modeling during the first 100 steps of training?
  2. Perceiver Resampler Mathematics: Consider a robotics application streaming 5 cameras at 30fps. Over a 3-second window, this generates 450 frames. If each frame produces 256 ViT patch tokens, calculate the total input sequence length $N$. If this is fed into a Perceiver Resampler with $M=64$ latent queries, calculate the ratio of operations required for the cross-attention bottleneck $O(MN)$ compared to attempting full self-attention $O(N^2)$. Why is a Perceiver-style bottleneck effectively required for continuous embodied agents?
  3. PaLI Multilingual Grounding: PaLI trains jointly on 100+ languages. Suppose the model is trained on image-text pairs of "a red truck" in English, and on purely text-text translation pairs linking English "red truck" to German "roter Lastwagen". Explain the geometric mechanism in the shared embedding space that allows PaLI to accurately answer questions about an image of a red truck in German (zero-shot visual transfer), despite never seeing a German caption paired with an image during training.
  4. Tactile Modality Fusion: You are designing a multimodal transformer for a robotic hand. The prompt is: "Does this fabric feel like silk or burlap?" The visual camera is completely blocked by the robot's own fingers. Using the Perceiver latent fusion paradigm $L = \text{CrossAttn}(Q, [X_\text{vision}, X_\text{tactile}])$, describe what specific shifts you expect to observe in the softmax attention weights over the $[X_\text{vision}, X_\text{tactile}]$ sequence right as the robot makes physical contact with the fabric. How does the architecture prevent the blocked visual tokens from poisoning the prediction?
  5. Architectural System Design: Construct a decision framework for a startup building AI for autonomous drones inspecting wind turbines. The drones capture continuous 4K video, record audio (listening for blade cracks), and must generate detailed, multi-paragraph engineering reports in 3 languages. Evaluate LLaVA, Flamingo, Perceiver, and PaLI against these constraints. Which specific combination of architectural components (e.g., a Perceiver frontend attached to a PaLI backend) provides the best balance of long-context handling and multilingual generation?
✦Solutions
  1. Gated-attention init. With $\alpha=0$ the cross-attention is a no-op at start ($h'=h$), so the LLM begins exactly as the pretrained model and nothing is disrupted. Initializing $\alpha \sim \mathcal{N}(0,1)$ makes $\tanh(\alpha)$ nonzero, which injects the output of randomly initialized attention into the residual stream from step 0 and corrupts the hidden states. Language-modeling cross-entropy spikes in the first steps, and the new layers must first learn to undo their own noise before any useful visual signal can form.
  2. Perceiver math. $N = 450 \times 256 = 115{,}200$ tokens. Cross-attention costs $O(MN) = 64 \times 115{,}200$ versus self-attention $O(N^2) = 115{,}200^2$, a ratio of $M/N = 1/1800$. Self-attention is 1800× more expensive and keeps growing quadratically with the window length, so a fixed-size latent bottleneck is effectively required for continuous streaming perception.
  3. PaLI multilingual grounding. Joint training places the English and German phrases for "red truck" near each other in the shared space (via the text-text link), while image-text training anchors the truck image near the English phrase. By transitivity the image also lands near the German phrase, so a German question retrieves the right visual concept zero-shot: the languages share one visual anchor.
  4. Tactile fusion. While vision is blocked the visual tokens carry little task signal; on contact the tactile tokens become highly informative, so the softmax attention mass shifts from the visual to the tactile tokens. Because attention is a learned soft selection, the low-similarity blocked visual tokens receive low weight and contribute little to the latent, so the model routes around them.
  5. Drone system design. Continuous 4K video plus audio plus long multilingual reports call for a Perceiver/Resampler frontend (compressing unbounded streams into fixed latents so memory stays bounded) feeding a PaLI-style multilingual encoder-decoder backend (native multilingual generation), with Flamingo-style gated cross-attention for interleaved streaming. LLaVA's prepend-all-tokens design alone exhausts the context window. Recommended: Perceiver frontend → PaLI backend.
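
The arithmetic in solution 2 can be reproduced directly. The short script below uses only the numbers given in question 2; the variable names are illustrative, and the counts are query-key pairs rather than exact FLOPs.

```python
cameras, fps, seconds = 5, 30, 3
patches_per_frame = 256
num_latents = 64                                  # M

frames = cameras * fps * seconds                  # frames in the 3-second window
seq_len = frames * patches_per_frame              # N

cross_attn_pairs = num_latents * seq_len          # O(M * N)
self_attn_pairs = seq_len * seq_len               # O(N^2)
ratio = self_attn_pairs // cross_attn_pairs       # equals N / M

print(frames)             # 450
print(seq_len)            # 115200
print(cross_attn_pairs)   # 7372800
print(self_attn_pairs)    # 13271040000
print(ratio)              # 1800
```

Doubling the window to 6 seconds doubles the cross-attention count but quadruples the self-attention count, so the ratio itself grows with the stream length.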

Looking ahead

With the major VLM architectural families established, a practical engineering question emerges: how do we adapt these massive 7B to 70B parameter models to highly specific, proprietary downstream tasks without the millions of dollars required to retrain them?

Week 8: Fine-Tuning and Parameter-Efficient Methods. We derive the mathematics behind Low-Rank Adaptation (LoRA), examine QLoRA's 4-bit quantization approach that enables single-GPU fine-tuning, and analyze the strict tradeoffs between full fine-tuning, LoRA, and adapter methods across physical AI domains.


Further reading

  • Alayrac, J.-B., et al. (2022). Flamingo: a Visual Language Model for Few-Shot Learning. NeurIPS. (Gated cross-attention and visual memory).
  • Jaegle, A., et al. (2021). Perceiver: General Perception with Iterative Attention. ICML. (The latent bottleneck architecture).
  • Chen, X., et al. (2023). PaLI: A Jointly-Scaled Multilingual Language-Image Model. ICLR. (Unified encoder-decoder).