Purpose of this lecture
The projector-based design (BLIP-2, LLaVA) dominates the current vision-language model (VLM) landscape due to its simplicity and ease of training on consumer hardware. It has a structural limitation, however: visual tokens are injected exactly once, at the input, and the large language model (LLM) must carry that visual context forward through all of its self-attention layers without any fresh access to the image. For tasks involving multiple images, long videos, or continuous streaming perception, this design degrades sharply, because the visual information is either diluted within a long context or exceeds the LLM's fixed context window.
This lecture explores VLM architectures designed to bypass these bottlenecks. We examine Flamingo's depth-distributed gated cross-attention, the Perceiver's latent bottleneck for very long sequences, PaLI's unified multilingual encoder-decoder, and finally how these alternative architectures extend beyond vision to incorporate tactile and other non-visual modalities for physical AI.
Flamingo: Cross-attention at depth
Flamingo (Alayrac et al., 2022) addresses the context-dilution problem by structurally intertwining the visual and linguistic pathways. Instead of appending visual tokens to the input prompt, Flamingo interleaves cross-attention layers throughout the frozen language model.
Architecture
Flamingo starts with a frozen vision encoder (NFNet-F6 in the original paper; open reimplementations such as OpenFlamingo use a CLIP ViT) to produce raw visual features for each image in the context. Because an image might have $N$ spatial tokens and a prompt might contain $M$ images, passing all $N \times M$ tokens into the LLM does not scale.
To solve this, Flamingo introduces the Perceiver Resampler, which compresses any variable-length visual feature sequence into a strict, fixed number of visual tokens. These tokens are stored in a standalone "visual memory" cache.
The LLM (e.g., Chinchilla-70B) is kept strictly frozen. New, randomly initialized Gated Cross-Attention layers are inserted before every $k$-th transformer block (typically every 4th or 7th block). During generation, at these depths, the language tokens act as Queries that attend to the Keys and Values stored in the visual memory cache.
The Mathematics of Gated Cross-Attention
Inserting randomly initialized attention layers into a large pretrained LLM usually causes behavior resembling catastrophic forgetting: even with the LLM weights frozen, the new layers write random signals into the residual stream, which corrupts the representations the pretrained model relies on. Flamingo solves this using a zero-initialized gate:
$$y = x + \tanh(\alpha) \cdot \mathrm{CrossAttn}(x, V)$$
Here, $x$ is the language hidden state entering the layer, $V$ is the visual memory, and $\alpha$ is a learnable scalar parameter that is explicitly initialized to $0$. At step $0$ of training, $\tanh(\alpha) = 0$, meaning the cross-attention output is completely zeroed out and the model behaves identically to the original frozen LLM. During training, the gradients gradually push $\alpha$ away from zero, allowing the model to incorporate visual information smoothly without destabilizing the linguistic priors.
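The effect of the zero-initialized gate can be checked numerically. The following is a minimal NumPy sketch, not Flamingo's implementation: it uses a single attention head, random projection matrices, and hypothetical names (`gated_cross_attention`, `alpha`, `w_q`, `w_k`, `w_v`) chosen for illustration. It only shows that a gate scalar of zero leaves the language hidden states untouched, while a nonzero gate mixes in visual information.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text, visual, w_q, w_k, w_v):
    """Single-head cross-attention: text tokens query the visual memory."""
    q = text @ w_q                              # (T, d)
    k = visual @ w_k                            # (V, d)
    v = visual @ w_v                            # (V, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (T, V)
    return softmax(scores) @ v                  # (T, d)

def gated_cross_attention(text, visual, w_q, w_k, w_v, alpha):
    """Residual update scaled by tanh(alpha); alpha = 0 leaves text unchanged."""
    return text + np.tanh(alpha) * cross_attention(text, visual, w_q, w_k, w_v)

rng = np.random.default_rng(0)
d = 16                                   # hidden size (illustrative)
text = rng.normal(size=(5, d))           # 5 language-token hidden states
visual = rng.normal(size=(64, d))        # 64 visual-memory tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out_init = gated_cross_attention(text, visual, w_q, w_k, w_v, alpha=0.0)
out_open = gated_cross_attention(text, visual, w_q, w_k, w_v, alpha=0.5)

identity_at_init = np.allclose(out_init, text)
changed_when_open = not np.allclose(out_open, text)
print(identity_at_init, changed_when_open)   # True True
```

Because the gate multiplies the entire cross-attention output, the random projection matrices have no influence on the forward pass until training moves the gate away from zero.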
Multi-image and temporal modeling in Flamingo
The architectural genius of Flamingo's depth-distributed cross-attention is its native support for multi-image interleaved contexts and video.
In a projector-based model, processing $M$ images of $N$ tokens each requires concatenating $M \times N$ visual tokens into the input sequence. This consumes the LLM's context window linearly, and the self-attention cost grows quadratically with the combined length ($O((T + MN)^2)$ for $T$ text tokens). In Flamingo, the LLM's self-attention only processes the text tokens. When the text tokens reach a gated cross-attention layer, they attend over a fixed-size visual memory bank.
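To make the difference concrete, the sketch below counts attention scores per layer under both designs. The function names are hypothetical, and the numbers are illustrative assumptions rather than measurements: 576 visual tokens per image (the count LLaVA-1.5 uses), 64 resampled tokens per image (the count Flamingo uses), and a 512-token text prompt. The Flamingo-style count is an upper bound that ignores per-image masking.

```python
def projector_attention_pairs(text_tokens, images, tokens_per_image):
    """Self-attention score count when visual tokens share the LLM context."""
    seq_len = text_tokens + images * tokens_per_image
    return seq_len ** 2

def flamingo_attention_pairs(text_tokens, images, resampled_tokens):
    """Text self-attention plus one cross-attention over the visual memory.

    Simplification: every text token is counted against every image's
    resampled tokens, so this is an upper bound that ignores masking.
    """
    self_attn = text_tokens ** 2
    cross_attn = text_tokens * images * resampled_tokens
    return self_attn + cross_attn

for m in (1, 8, 32):
    p = projector_attention_pairs(text_tokens=512, images=m, tokens_per_image=576)
    f = flamingo_attention_pairs(text_tokens=512, images=m, resampled_tokens=64)
    print(f"images={m:2d}  projector={p:,}  flamingo-style={f:,}  ratio={p / f:.1f}")
```

The projector count grows with the square of the number of images, while the cross-attention count grows linearly, so the gap widens as more images are added.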
To handle interleaved image-text formats (e.g., a web page with text, an image, more text, another image), Flamingo applies a per-image attention mask in the cross-attention layer. The mask is not learned; it is determined entirely by where the images sit in the sequence. A text token is only allowed to attend to the visual memory of the image that immediately preceded it in the sequence.
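The masking rule can be sketched in a few lines. This is an illustrative construction, not Flamingo's code: `image_attention_mask` and the `"text"` / `"image"` markers are hypothetical, and each image is assumed to contribute a fixed number of visual-memory tokens.

```python
import numpy as np

def image_attention_mask(token_types, tokens_per_image):
    """Build a (num_text, num_images * tokens_per_image) boolean mask.

    token_types lists "text" / "image" markers in document order.
    A text token may attend only to the most recent preceding image.
    """
    num_images = sum(1 for t in token_types if t == "image")
    rows = []
    last_image = -1                       # -1 means no image seen yet
    for t in token_types:
        if t == "image":
            last_image += 1
            continue
        row = np.zeros(num_images * tokens_per_image, dtype=bool)
        if last_image >= 0:
            start = last_image * tokens_per_image
            row[start:start + tokens_per_image] = True
        rows.append(row)
    return np.stack(rows)

layout = ["text", "image", "text", "text", "image", "text"]
mask = image_attention_mask(layout, tokens_per_image=2)
print(mask.astype(int))
# [[0 0 0 0]
#  [1 1 0 0]
#  [1 1 0 0]
#  [0 0 1 1]]
```

The first row is all zeros because that text token has no preceding image. A real implementation has to handle such rows explicitly (for example by zeroing the cross-attention output), since a softmax over a fully masked row is undefined.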
For video, frames are sampled (at 1 FPS in the original paper) and encoded independently by the vision encoder. Learned temporal position embeddings are added to each frame's features, and the flattened features are passed through the Perceiver Resampler, which returns a fixed set of visual tokens for the clip. Because every gated cross-attention layer can attend to this visual memory, any text generation step can draw on information from any part of the video, whether it comes from the beginning or the end.
Perceiver-style architectures (The Latent Bottleneck)
As the sequence length of inputs grows (e.g., high-resolution 4K video or multi-camera arrays on a robot), standard self-attention becomes impractical because its memory and compute scale as $O(N^2)$ in the sequence length $N$.
Perceiver (Jaegle et al., 2021) and its derivatives solve this by introducing an asymmetric latent bottleneck. The architecture relies on two distinct operations:
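1. Cross-attention from Latents to Inputs (The Bottleneck): Instead of the inputs attending to themselves, the network initializes $L$ learned latent queries $Z \in \mathbb{R}^{L \times d}$ (where $L \ll N$, e.g., $L = 64$ in Flamingo's Perceiver Resampler). These queries attend to the massive input sequence $X \in \mathbb{R}^{N \times d}$:
$$Z' = \mathrm{softmax}\left(\frac{(Z W_Q)(X W_K)^\top}{\sqrt{d}}\right) X W_V$$
The complexity of this operation is $O(N L)$. Because $L$ is small and fixed, this operation scales linearly with the input length $N$.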
2. Self-attention within the Latent Array: After the cross-attention step distills the massive input into the compact latents, all subsequent transformer blocks operate only on the latent tokens. With $L$ latents, the self-attention cost is $O(L^2)$, which is entirely independent of the raw input length $N$.
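The two operations above can be traced through their attention-score shapes. The sketch below is a simplified NumPy illustration under stated assumptions: learned projections are omitted for brevity, the latents are random rather than trained, and the sizes (`n_inputs`, `n_latents`, `dim`) are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    """Scaled dot-product attention without learned projections (for brevity)."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ keys_values, scores.shape

rng = np.random.default_rng(0)
n_inputs, n_latents, dim = 10_000, 64, 32
inputs = rng.normal(size=(n_inputs, dim))
latents = rng.normal(size=(n_latents, dim))   # learned parameters in a real model

# Step 1: latents query the inputs; the score matrix is (n_latents, n_inputs).
latents, cross_shape = attend(latents, inputs)
# Step 2: latents attend to themselves; the score matrix is (n_latents, n_latents).
latents, self_shape = attend(latents, latents)

print(cross_shape, self_shape, latents.shape)   # (64, 10000) (64, 64) (64, 32)
```

No score matrix of shape `(n_inputs, n_inputs)` is ever formed, which is where the memory saving comes from.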
The total attention cost drops from $O(N^2)$ to $O(N L + L^2)$, where $N$ is the input length and $L$ is the number of latents. This is the mechanism behind Flamingo's Perceiver Resampler, and it is one of the main ways current VLMs keep long video inputs within GPU memory.
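A quick count of attention scores shows the size of the saving. The helper names below are hypothetical, and the values (100,000 input tokens, 64 latents) are illustrative assumptions; constant factors and the feature dimension are ignored.

```python
def full_self_attention_scores(n_inputs):
    """Number of pairwise scores when every input attends to every input."""
    return n_inputs ** 2

def perceiver_scores(n_inputs, n_latents, latent_layers=1):
    """One latent-to-input cross-attention plus latent self-attention layers."""
    return n_inputs * n_latents + latent_layers * n_latents ** 2

n, l = 100_000, 64
full = full_self_attention_scores(n)
bottleneck = perceiver_scores(n, l)
print(f"full={full:,}  bottleneck={bottleneck:,}  ratio={full / bottleneck:.0f}x")

# Doubling the input length roughly doubles the bottleneck cost
# but quadruples the full self-attention cost.
growth_bottleneck = perceiver_scores(2 * n, l) / bottleneck
growth_full = full_self_attention_scores(2 * n) / full
print(f"growth: bottleneck={growth_bottleneck:.2f}x  full={growth_full:.2f}x")
```

The second printout is the practical meaning of linear versus quadratic scaling: longer inputs cost proportionally more with the bottleneck, not disproportionately more.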
PaLI: Unified Vision-Language Encoder-Decoder
PaLI (Pathways Language and Image model; Chen et al., 2023) takes a different approach from BLIP-2 and LLaVA. Rather than connecting frozen vision and language backbones through a small learned interface, PaLI feeds visual tokens directly into a unified encoder-decoder and trains the language model's full weights jointly on image and text data. Both components start from pretrained checkpoints rather than from scratch.
Architecture
PaLI uses a 4B-parameter ViT-e image encoder to extract dense visual tokens. Crucially, it does not use a Q-Former or Perceiver. It passes the visual tokens directly into an mT5-based encoder-decoder text model (mT5-XXL, about 13B parameters, for roughly 17B parameters in total in the largest variant). The text encoder processes the visual tokens and text tokens in a single sequence, and the text decoder autoregressively generates the response. In the original paper, the vision encoder is kept frozen for most of pretraining, and all parameters are updated in a final high-resolution stage.
Multilingual Multitask Generalization
PaLI is trained simultaneously on a mixture covering over 100 languages across tasks like VQA, OCR, captioning, and image classification. This unified multi-task, multilingual objective encourages a shared alignment: the visual representation of an apple is aligned simultaneously with the English token "apple," the French token "pomme," and the Spanish token "manzana." Consequently, PaLI exhibits zero-shot transfer across languages: if it is fine-tuned to perform a visual task purely in English, it can often perform that task when prompted in Japanese, relying on the shared semantic space it built during pretraining.
Beyond Vision: Tactile and Non-Visual Modalities
Current VLMs are predominantly "Vision-Language." However, physical AI and robotics are fundamentally about contact. A robot cannot effectively screw in a bolt, fold a deformable cloth, or grasp a fragile egg relying solely on an RGB camera feed that gets occluded by the robot's own arm.
Alternative architectures are now being expanded into Multimodal Transformers that process force, touch, and audio time-series alongside vision:
1. Tactile Sensors (GelSight / DIGIT): Modern robotic fingertips use elastomer-based cameras that capture high-resolution deformation topography when pressing into an object. Architecturally, these readings are processed as localized video streams. A ViT processes the tactile frames, and a Perceiver bottleneck compresses them into "tactile tokens" that are interleaved with the standard visual tokens in the LLM's context window.
2. Force-Torque (F/T) Sequences: A robot's wrist sensors output continuous 6-DoF data $(F_x, F_y, F_z, \tau_x, \tau_y, \tau_z)$. Rather than treating this as vision, it is treated as a 1D sequence. A 1D convolutional network or an LSTM encodes the temporal history of the forces into a distinct set of "proprioceptive tokens."
3. Architectural Fusion: To fuse these modalities, networks often use the Perceiver architecture. The fixed latent array cross-attends to the concatenated sequence $[X_{\text{vision}}; X_{\text{tactile}}; X_{\text{force}}]$. This lets the attention mechanism dynamically weight whichever modality currently provides the strongest signal (e.g., relying mostly on vision while reaching for an object, then shifting most of the attention weight to the tactile and force tokens once the gripper makes contact).
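One way to inspect this kind of fusion is to measure how much of each latent's attention mass lands on each modality. The sketch below is illustrative only: the token counts (196 vision, 32 tactile, 8 force), the function `modality_attention_share`, and the random features are all assumptions, and learned projections are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def modality_attention_share(latents, tokens, modality_ids, num_modalities):
    """Fraction of each latent's attention mass that lands on each modality."""
    scores = latents @ tokens.T / np.sqrt(latents.shape[-1])
    weights = softmax(scores)                       # (L, N), rows sum to 1
    share = np.zeros((latents.shape[0], num_modalities))
    for m in range(num_modalities):
        share[:, m] = weights[:, modality_ids == m].sum(axis=1)
    return share

rng = np.random.default_rng(0)
dim = 16
vision = rng.normal(size=(196, dim))    # e.g., ViT patch tokens
tactile = rng.normal(size=(32, dim))    # e.g., compressed tactile tokens
force = rng.normal(size=(8, dim))       # e.g., encoded force-torque history
tokens = np.concatenate([vision, tactile, force])
modality_ids = np.repeat([0, 1, 2], [196, 32, 8])   # 0=vision, 1=tactile, 2=force
latents = rng.normal(size=(4, dim))

share = modality_attention_share(latents, tokens, modality_ids, 3)
print(share.round(3))          # one row per latent, one column per modality
print(share.sum(axis=1))       # each row sums to 1
```

In a trained model, logging these per-modality shares over time is a simple diagnostic for whether attention actually moves toward tactile and force tokens at contact.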
Architectural Comparison
| Architecture | Visual injection | Multi-image support | Trainable parameters | Compute cost |
|---|---|---|---|---|
| LLaVA-style | Once at input | Limited (attention cost grows quadratically with context) | Projector + LLM LoRA | Low |
| Flamingo | At every $k$-th LLM layer | Native (via visual memory bank) | Gated cross-attention layers + Perceiver Resampler | Medium |
| Perceiver | Latent compression | Very long inputs (linear scaling) | Latent queries + processor | Medium |
| PaLI | Unified joint training | Via interleaved tokens | All parameters (about 17B in the largest variant) | Very High |
There is no single "best" architecture. The optimal choice depends entirely on deployment constraints: whether the application requires streaming video support (Flamingo/Perceiver), multilingual spatial reasoning (PaLI), or fast, single-image inference on a localized edge-compute board (LLaVA).
Key takeaways
Alternative VLM architectures bypass the limitations of simple linear projectors. Flamingo inserts zero-initialized, gated cross-attention layers throughout a frozen LLM, creating a persistent "visual memory" that scales to multi-image and video reasoning without quadratic context explosion. Perceiver architectures decouple computational complexity from input length by forcing massive multimodal inputs (vision, tactile, audio) through a small, fixed-size latent bottleneck, achieving linear scaling. PaLI drops the small learned interface entirely, training a unified encoder-decoder that binds visual features to over 100 languages simultaneously. Ultimately, moving physical AI into the real world requires architectures capable of fusing high-frequency, non-visual modalities (like force and touch) into the same semantic latent space.
Conceptual questions
- Gated Attention Initialization: Flamingo initializes its cross-attention gate with $\alpha = 0$, meaning $\tanh(\alpha) = 0$. Explain the mathematical severity of catastrophic forgetting if $\alpha$ was instead initialized from a standard Gaussian distribution $\mathcal{N}(0, 1)$. Specifically, how would the introduction of random visual matrices into the residual stream immediately impact the LLM's cross-entropy loss on language modeling during the first 100 steps of training?
- Perceiver Resampler Mathematics: Consider a robotics application streaming 5 cameras at 30fps. Over a 3-second window, this generates 450 frames. If each frame produces 256 ViT patch tokens, calculate the total input sequence length $N$. If this is fed into a Perceiver Resampler with $L$ latent queries (take $L = 64$, the number used in Flamingo), calculate the ratio of operations required for the cross-attention bottleneck, $O(N L)$, compared to attempting full self-attention, $O(N^2)$. Why is the Perceiver so well suited to continuous embodied agents?
- PaLI Multilingual Grounding: PaLI trains jointly on 100+ languages. Suppose the model is trained on image-text pairs of "a red truck" in English, and purely text-text translation pairs linking English "red truck" to German "roter Lastwagen". Explain the geometric mechanism in the shared embedding space that allows PaLI to accurately answer questions about an image of a red truck in German (zero-shot visual transfer), despite never seeing a German caption paired with an image during training.
- Tactile Modality Fusion: You are designing a multimodal transformer for a robotic hand. The prompt is: "Does this fabric feel like silk or burlap?" The visual camera is completely blocked by the robot's own fingers. Using the Perceiver latent fusion paradigm $Z' = \mathrm{CrossAttn}(Z, [X_{\text{vision}}; X_{\text{tactile}}; X_{\text{force}}])$, describe what specific shifts you expect to observe in the softmax attention weights over the concatenated sequence right as the robot makes physical contact with the fabric. How does the architecture prevent the blocked visual tokens from poisoning the prediction?
- Architectural System Design: Construct a decision framework for a startup building AI for autonomous drones inspecting wind turbines. The drones capture continuous 4K video, record audio (listening for blade cracks), and must generate detailed, multi-paragraph engineering reports in 3 languages. Evaluate LLaVA, Flamingo, Perceiver, and PaLI against these constraints. Which specific combination of architectural components (e.g., a Perceiver frontend attached to a PaLI backend) provides the optimal balance of infinite-context handling and multilingual generation?
Looking ahead
With the major VLM architectural families established, a practical engineering question emerges: how do we adapt these massive 7B to 70B parameter models to highly specific, proprietary downstream tasks without the millions of dollars required to retrain them?
Week 8: Fine-Tuning and Parameter-Efficient Methods. We derive the mathematics behind Low-Rank Adaptation (LoRA), examine QLoRA's 4-bit quantization approach that enables single-GPU fine-tuning, and analyze the strict tradeoffs between full fine-tuning, LoRA, and adapter methods across physical AI domains.
Further reading
- Alayrac, J.-B., et al. (2022). Flamingo: a Visual Language Model for Few-Shot Learning. NeurIPS. (Gated cross-attention and visual memory).
- Jaegle, A., et al. (2021). Perceiver: General Perception with Iterative Attention. ICML. (The latent bottleneck architecture).
- Chen, X., et al. (2023). PaLI: A Jointly-Scaled Multilingual Language-Image Model. ICLR. (Unified encoder-decoder).