Multimodal AI — Deep Dive

CLIP contrastive loss mathematics, visual tokenization, cross-attention fusion vs. prefix token approaches, hallucination in vision-language models, and native vs. added multimodality.

CLIP Training: Mathematical Details

CLIP’s contrastive objective is a symmetric cross-entropy loss over cosine similarities. Given $N$ image-text pairs:

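Compute normalized image embeddings $I_i = f_\theta(x_i) / \|f_\theta(x_i)\|$
Compute normalized text embeddings $T_i = g_\phi(c_i) / \|g_\phi(c_i)\|$
Compute similarity matrix $S_{ij} = (I_i \cdot T_j) \exp(\tau)$

Here $\tau$ is a learned scalar, the log of the inverse temperature, so $\exp(\tau)$ scales the cosine similarities before the softmax. In the CLIP paper the temperature is initialized to the equivalent of 0.07, and the scale is clipped so that logits are never multiplied by more than 100. The loss: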

$$\mathcal{L} = -\frac{1}{2N}\left[\sum_i \log \frac{e^{S_{ii}}}{\sum_j e^{S_{ij}}} + \sum_i \log \frac{e^{S_{ii}}}{\sum_j e^{S_{ji}}}\right]$$

The two terms enforce: (1) for each image, its paired text should score highest; (2) for each text, its paired image should score highest.
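The objective above can be sketched in a few lines of plain Python. This is a minimal illustration, not CLIP's training code: the function names (`l2_normalize`, `log_softmax`, `clip_loss`) and the list-of-lists feature format are assumptions made for the example, and `log_scale` plays the role of $\tau$.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def clip_loss(image_feats, text_feats, log_scale):
    """Symmetric cross-entropy over scaled cosine similarities.

    image_feats[i] and text_feats[i] are unnormalized encoder outputs for
    the i-th pair (hypothetical inputs for this sketch).
    """
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    n = len(imgs)
    scale = math.exp(log_scale)
    # S[i][j] = cosine(image i, text j) * exp(tau)
    sim = [[scale * sum(a * b for a, b in zip(imgs[i], txts[j]))
            for j in range(n)] for i in range(n)]
    # Image-to-text term: softmax over each row, pick the diagonal entry.
    i2t = sum(log_softmax(sim[i])[i] for i in range(n))
    # Text-to-image term: softmax over each column, pick the diagonal entry.
    t2i = sum(log_softmax([sim[j][i] for j in range(n)])[i] for i in range(n))
    return -(i2t + t2i) / (2 * n)
```

When every similarity in the batch is identical, both softmaxes are uniform and the loss equals $\log N$, which is a useful sanity check for an untrained model.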

Why large batch sizes matter: The other examples in each batch serve as negatives. With $N=32,768$ (CLIP’s actual batch size), each sample is compared against 32,767 negatives. A larger batch is more likely to contain hard negatives (captions or images that are close to the correct match), which gives a stronger training signal. This is one reason CLIP required significant compute: it was trained on 400M pairs with very large batches.

OpenCLIP reproduced CLIP results at scale (ViT-L/14 on LAION-2B) and showed that training compute and dataset scale matter more than architecture details for contrastive pretraining.

Visual Tokenization: How Images Become Tokens

Patch-based tokenization (ViT approach): A $224 \times 224$ image divided into $16 \times 16$ patches gives 196 patch tokens. Each patch is linearly projected to the model’s embedding dimension. These 196 tokens are fed into the transformer alongside a [CLS] token for classification.

The resolution-quality tradeoff: larger patches (larger stride) → fewer tokens → faster but coarser spatial detail. At $224 \times 224$, ViT-B/16 produces a $14 \times 14$ grid ($224/16 = 14$), i.e. 196 tokens; ViT-B/32 produces a $7 \times 7$ grid, i.e. 49 tokens.

For high-resolution inputs, the number of visual tokens grows quadratically with input resolution — a 1024×1024 image with 16×16 patches gives 4,096 tokens. Feeding these into an LLM alongside text tokens is computationally expensive.
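The token counts quoted above follow directly from the patch grid. A small helper (the name `num_patch_tokens` is made up for this example) makes the arithmetic explicit, assuming non-overlapping square patches and ignoring the [CLS] token:

```python
def num_patch_tokens(height, width, patch_size):
    """Number of non-overlapping patch tokens for an image (excluding [CLS])."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)

for side, patch in [(224, 16), (224, 32), (1024, 16)]:
    tokens = num_patch_tokens(side, side, patch)
    print(f"{side}x{side} image, {patch}x{patch} patches: {tokens} tokens")
```

Doubling the side length at a fixed patch size quadruples the token count, which is the quadratic growth described above.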

Solutions to the token bottleneck:

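Pooling/compression: A plain projection does not reduce the token count. LLaVA-1.5’s MLP connector maps each CLIP patch token to one language model token (576 tokens for its 336-pixel ViT-L/14 encoder). Q-Former (BLIP-2) does compress: a fixed set of learned query tokens (32 in BLIP-2) cross-attends to the image features, so the number of tokens passed to the LLM is the same regardless of input size.
Dynamic resolution: LLaVA-NeXT splits a high-resolution image into a grid of encoder-sized tiles chosen to fit its aspect ratio and adds a downsampled view of the whole image, so small text and fine structures survive encoding.
Native resolution: InternVL and similar models dynamically tile high-resolution images into sub-images and encode each with the vision encoder (InternViT in InternVL’s case, rather than CLIP), enabling high-fidelity OCR.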
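To see why tiling trades fidelity for token count, here is a simplified sketch of a tiling budget. It is not the algorithm of any particular model: the tile size of 448, patch size of 14, cap of 12 tiles, and the extra thumbnail view are all illustrative assumptions, and real systems typically pick the grid by matching the image's aspect ratio against a predefined set.

```python
import math

def tile_grid(width, height, tile=448, max_tiles=12):
    """Pick a (cols, rows) grid of fixed-size tiles covering the image.

    Simplified rule: cover the image, then shrink the longer axis until
    the tile budget is met (hypothetical, for illustration only).
    """
    if max_tiles < 1:
        raise ValueError("max_tiles must be at least 1")
    cols = max(1, math.ceil(width / tile))
    rows = max(1, math.ceil(height / tile))
    while cols * rows > max_tiles:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols, rows

def visual_token_count(width, height, tile=448, patch=14,
                       max_tiles=12, add_thumbnail=True):
    """Total visual tokens if every tile is encoded independently."""
    cols, rows = tile_grid(width, height, tile, max_tiles)
    tokens_per_tile = (tile // patch) ** 2
    views = cols * rows + (1 if add_thumbnail else 0)
    return views * tokens_per_tile
```

Under these assumptions an 896×448 image becomes two tiles plus a thumbnail, or 3,072 tokens before any compression, which is why tiled models usually pair tiling with some form of token reduction.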

Cross-Attention vs. Prefix Token Approaches

Two main architectural philosophies for fusing visual and language information:

Prefix token approach (LLaVA): Image features are projected into the LLM’s token space and prepended to the text token sequence. The LLM processes image tokens and text tokens with the same self-attention. GPT-4V is often assumed to work this way, but OpenAI has not published its architecture.

Pros: Simple, leverages the full power of the pretrained LLM’s attention mechanism. Cons: Image tokens consume context window budget; no architectural distinction between image and text processing.
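A minimal sketch of the prefix idea, using plain lists instead of tensors. The helper names (`project`, `build_prefix_sequence`) and the single linear layer are assumptions for illustration; real connectors are trained and may be deeper.

```python
def project(features, weight, bias):
    """Apply a linear layer W x + b to each visual feature vector."""
    return [[sum(w * x for w, x in zip(row, f)) + b
             for row, b in zip(weight, bias)]
            for f in features]

def build_prefix_sequence(image_feats, text_embeds, weight, bias):
    """Project image features into the LLM embedding space and prepend them.

    The result is one sequence that the LLM's ordinary self-attention
    processes; nothing downstream distinguishes image from text positions.
    """
    image_tokens = project(image_feats, weight, bias)
    return image_tokens + text_embeds
```

The output length is the number of image tokens plus the number of text tokens, which is exactly the context-window cost listed under the cons.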

Cross-attention approach (Flamingo, the original IDEFICS): The LLM architecture is modified with cross-attention layers in which text representations act as queries over the image features. The image features are kept separate from the text token stream.

$$\text{CrossAttn}(Q_{text}, K_{img}, V_{img}) = \text{softmax}\left(\frac{Q_{text} K_{img}^T}{\sqrt{d}}\right) V_{img}$$
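The formula can be written out directly as a single-head sketch in plain Python. It omits the learned query, key, and value projections and any gating that a real layer would include; the function names are chosen for this example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(q_text, k_img, v_img):
    """softmax(Q K^T / sqrt(d)) V with text queries and image keys/values."""
    d = len(q_text[0])
    out = []
    for q in q_text:
        # One score per image token for this text position.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_img]
        weights = softmax(scores)
        # Weighted average of the image value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, v_img))
                    for c in range(len(v_img[0]))])
    return out
```

The output has one row per text token no matter how many image tokens are attended to, so adding images changes the cost of this layer but not the length of the text sequence.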

Pros: Scales better to many images or high-resolution images (image features do not occupy positions in the context window); well suited to interleaved image-text documents. Cons: New cross-attention layers must be inserted into the base LLM and trained, so an off-the-shelf pretrained LLM cannot be used unmodified.

Flamingo (DeepMind, 2022): Used cross-attention for few-shot multimodal learning. The model could process sequences of interleaved images and text, making it effective for prompts like: “here are 3 examples of diagrams with captions, now caption this new diagram.”

Visual Grounding and Spatial Reasoning

A major limitation of current vision-language models: they generate plausible-sounding responses about images but lack reliable spatial grounding.

Visual grounding benchmarks:

RefCOCO: Given “the person in the red shirt on the left”, return a bounding box
Winoground: Tests compositional understanding. Given two images and two captions that use the same words in a different order (e.g. “a dog chasing a ball” vs. “a ball chasing a dog”), match each caption to the correct image
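Bounding-box benchmarks such as RefCOCO are typically scored with intersection over union (IoU): a predicted box counts as correct when its IoU with the ground-truth box reaches a threshold, conventionally 0.5. The sketch below assumes boxes in `(x1, y1, x2, y2)` format; the function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(predictions, targets, threshold=0.5):
    """Fraction of predicted boxes whose IoU with the target meets the threshold."""
    hits = sum(iou(p, t) >= threshold for p, t in zip(predictions, targets))
    return hits / len(targets)
```

A box that is shifted by half its width and height against the target scores an IoU of only 1/7, so the metric is strict about localization rather than merely about naming the right object.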

GPT-4V and similar models struggle with precise spatial reasoning. They can describe what’s in an image accurately but have trouble with:

Counting objects precisely (>5 similar objects)
Identifying exact positions (left vs. right, above vs. below)
Fine-grained attribute binding (which specific object has which attribute)

Grounding-specialized architectures: Grounding DINO (2023) couples a text encoder with the DINO transformer-based object detector for open-set detection: given a text prompt, it returns bounding boxes for the objects the prompt names. GLaMM (Pixel Grounding Large Multimodal Model) generates text responses together with segmentation masks for the objects they mention.

Object Hallucination in VLMs

Vision-language models frequently “hallucinate” objects that aren’t present in the image — generating confident descriptions of things they didn’t actually see. This is distinct from language model hallucination: it’s specifically about generating incorrect visual claims.

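CHAIR metric (Caption Hallucination Assessment with Image Relevance): Measures what fraction of objects mentioned in generated captions are not present in the ground-truth object annotations. It is usually reported per object mention (CHAIR$_i$) and per caption (CHAIR$_s$, the fraction of captions containing at least one hallucinated object); lower is better.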
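In its original formulation CHAIR is reported as a hallucination rate, so lower is better: the per-instance score is hallucinated mentions divided by all mentions, and the per-sentence score is the share of captions with at least one hallucinated object. The sketch below assumes object mentions have already been extracted from each caption and mapped to the annotation vocabulary (the real metric does this with a synonym list); `chair_scores` is a name made up for the example.

```python
def chair_scores(mentioned, ground_truth):
    """Return (CHAIR_i, CHAIR_s).

    mentioned[k]    -- list of object names mentioned in caption k
    ground_truth[k] -- set of object names annotated in image k
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for objects, truth in zip(mentioned, ground_truth):
        bad = [obj for obj in objects if obj not in truth]
        total_mentions += len(objects)
        hallucinated_mentions += len(bad)
        captions_with_hallucination += 1 if bad else 0
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / len(mentioned) if mentioned else 0.0
    return chair_i, chair_s
```

For example, a caption that mentions "stove", "table", and "sink" for an image annotated only with "table" contributes two hallucinated mentions out of three.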

Causes of hallucination:

Language prior dominance: If “kitchen” appears in the context, the model may generate “stove” even if no stove is visible, because stoves commonly appear in kitchens in training data.
Weak visual-text binding: The image features have insufficient influence on the generation — text patterns dominate.
Training data imbalance: Objects that commonly co-occur get wrongly associated.

Mitigations:

RLHF-style feedback with human raters flagging hallucinations
Contrastive decoding: generate with and without image conditioning, penalize outputs that score high without the image
Better visual grounding during fine-tuning (providing bounding box annotations for objects mentioned in captions)
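The contrastive decoding idea in the list above can be sketched for a single decoding step. This is one common way to combine the two distributions, not a specific paper's implementation: the weighting `(1 + alpha) * with_image - alpha * without_image`, the `alpha` value, and the toy probabilities are all assumptions for illustration.

```python
import math

def contrastive_next_token(logp_with_image, logp_without_image, alpha=1.0):
    """Pick the next token after penalizing what the text-only prior favors.

    Both arguments map token -> log-probability over the same vocabulary.
    """
    scores = {
        tok: (1 + alpha) * logp_with_image[tok] - alpha * logp_without_image[tok]
        for tok in logp_with_image
    }
    best = max(scores, key=scores.get)
    return best, scores

# Toy example: a kitchen photo with a sink but no stove (made-up numbers).
with_image = {"stove": math.log(0.45), "sink": math.log(0.40), "table": math.log(0.15)}
without_image = {"stove": math.log(0.70), "sink": math.log(0.20), "table": math.log(0.10)}

greedy = max(with_image, key=with_image.get)
contrastive, _ = contrastive_next_token(with_image, without_image)
print(greedy, contrastive)
```

In this toy case greedy decoding picks "stove" because the language prior is strong, while the contrastive score favors "sink", the token whose probability rises most when the image is actually shown.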

Native vs. Added Multimodality

Added multimodality (LLaVA, InstructBLIP): Start with a pretrained language model, add a visual encoder, connect them with a projection layer or cross-attention module. Train the connector with the LLM frozen; some recipes (LLaVA’s instruction-tuning stage, for example) then also update the LLM.

Pros: Leverages powerful pretrained LLMs; relatively cheap to develop. Cons: Visual processing is a “bolt-on” — the model’s internal representations are optimized for text, and visual features must be crammed into that representation space.

Native multimodality (Gemini, Grok-1.5V): Trained from scratch on interleaved multimodal data. The model’s internal representations are jointly optimized for all modalities.

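The Gemini technical report (Google, Dec 2023) describes the models as natively multimodal, trained jointly on text, image, audio, and video, and reports strong results on a range of multimodal benchmarks, including tasks requiring fine-grained visual understanding. It does not, however, isolate how much of that performance is due to native training as opposed to scale or data. Native multimodal pretraining also requires enormous compute and multimodal datasets at scale.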

OpenAI described GPT-4o (May 2024) as accepting any combination of text, audio, image, and video as input and generating any combination of text, audio, and image as output, using a single model trained end-to-end across text, vision, and audio. This suggests a more deeply native multimodal architecture than GPT-4V.

One thing to remember: The core unsolved problem in multimodal AI is reliable visual grounding — ensuring that model outputs about images are driven by what’s actually in the image rather than statistical patterns from pretraining. Everything else in the field flows from this challenge.
