Multimodal AI — Core Concepts

How CLIP, GPT-4V, and Gemini process images alongside text — the architectural choices that enabled AI to understand the visual world and reason about it in language.

The Alignment Problem Between Modalities

Images and text occupy completely different representational spaces. A 224×224 pixel image is a 150,528-dimensional vector of RGB values (224 × 224 pixels × 3 color channels). A sentence is a sequence of discrete tokens. Making these two types of data “talk to each other” requires finding a shared representational space where similar concepts in different modalities end up nearby.
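To make the mismatch concrete, here is a minimal sketch in Python with NumPy. The all-zero image and the toy whitespace vocabulary are stand-ins invented for illustration; real systems use actual pixel data and a learned tokenizer.

```python
import numpy as np

# A placeholder 224x224 RGB image: height x width x 3 color channels.
image = np.zeros((224, 224, 3), dtype=np.uint8)

# Flattening gives one long vector of pixel intensities.
pixel_vector = image.reshape(-1)
print(pixel_vector.shape)  # (150528,)

# A sentence, by contrast, becomes a short sequence of discrete token IDs.
# This toy whitespace vocabulary is illustrative only, not a real tokenizer.
sentence = "a photo of a dog"
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence.split())))}
token_ids = [vocab[word] for word in sentence.split()]
print(token_ids)  # [0, 3, 2, 0, 1]
```

The image is one dense vector of continuous intensities, while the sentence is five integers drawn from a vocabulary. Nothing about either representation says that the two describe the same thing.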

This is the core challenge of multimodal AI, and different architectures solve it differently.

CLIP: Contrastive Pretraining

OpenAI’s CLIP (Radford et al., 2021) set the template for vision-language alignment. The training approach was elegant: scrape 400 million (image, text caption) pairs from the internet — images with their alt-text, social media posts, image descriptions. These naturally paired examples train two encoders simultaneously:

Image encoder: A ViT or ResNet that produces an image embedding
Text encoder: A transformer that produces a text embedding

Training uses contrastive loss: for a batch of $N$ (image, text) pairs, maximize the similarity between correct pairs while minimizing similarity between incorrect pairs (the $N^2 - N$ mismatches in the batch).

$$\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^N \left[\log \frac{e^{s(I_i, T_i)/\tau}}{\sum_j e^{s(I_i, T_j)/\tau}} + \log \frac{e^{s(T_i, I_i)/\tau}}{\sum_j e^{s(T_i, I_j)/\tau}}\right]$$

Where $s(I, T)$ is cosine similarity between image and text embeddings, and $\tau$ is a learned temperature.
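The symmetric loss can be sketched in a few lines of NumPy. This is an illustrative implementation under simplifying assumptions, not CLIP's actual training code: the function names are made up here, the temperature is fixed at an assumed 0.07 rather than learned, and random vectors stand in for encoder outputs.

```python
import numpy as np

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, tau=0.07):
    # L2-normalize so that dot products equal cosine similarities s(I, T).
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / tau  # logits[i, j] = s(I_i, T_j) / tau
    diag = np.arange(logits.shape[0])
    # Image-to-text term: each image is scored against every text in the batch.
    image_to_text = log_softmax(logits, axis=1)[diag, diag]
    # Text-to-image term: each text is scored against every image in the batch.
    text_to_image = log_softmax(logits, axis=0)[diag, diag]
    return -(image_to_text + text_to_image).mean() / 2

# Random stand-ins for a batch of N=4 paired embeddings of dimension 8.
rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 8))
print(clip_contrastive_loss(batch, batch))        # correctly paired: low loss
print(clip_contrastive_loss(batch, batch[::-1]))  # mispaired: higher loss
```

The two `log_softmax` calls correspond to the two terms inside the sum: one normalizes over texts for a fixed image (rows), the other over images for a fixed text (columns).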

The result: a shared embedding space where “a photo of a dog” and an actual photo of a dog end up near each other. Zero-shot classification becomes possible — classify any image by finding which text description it’s nearest to.
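Zero-shot classification in such a space reduces to a nearest-neighbor lookup. The sketch below assumes the embeddings have already been produced by some image and text encoder. The 3-dimensional vectors are made up for illustration, and `zero_shot_classify` is a hypothetical helper, not part of any library.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    # Cosine similarity between one image embedding and each label embedding.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    similarities = text_embs @ image_emb
    return labels[int(np.argmax(similarities))], similarities

# Toy embeddings standing in for real encoder outputs.
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embs = np.array([[1.0, 0.1, 0.0],
                      [0.1, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
image_emb = np.array([0.9, 0.2, 0.1])

best_label, scores = zero_shot_classify(image_emb, text_embs, labels)
print(best_label)  # a photo of a dog
```

Adding a new class only requires embedding one more text prompt. No retraining is involved.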

From CLIP to GPT-4V: Visual Question Answering

CLIP established a shared embedding space but doesn’t generate text. The next step: combine CLIP-style visual encoding with a large language model capable of generating responses.

LLaVA (Large Language and Vision Assistant, 2023): A simple but effective approach. Use a pretrained CLIP image encoder, project image features into the LLM’s token embedding space, and fine-tune an instruction-following LLM (originally Vicuna, which is itself fine-tuned from LLaMA) to process both image features and text together.

Image → CLIP Encoder → Linear Projection → [Image Tokens]
                                              ↓
Text tokens + Image tokens → LLM → Generated response

LLaVA required only about 158,000 multimodal instruction-following examples (generated with GPT-4) to achieve strong performance on vision-language tasks.
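The projection step in the pipeline above can be sketched with plain NumPy arrays. All sizes and the random weights are illustrative assumptions rather than LLaVA's real dimensions or parameters, and random arrays stand in for the vision encoder output and the embedded text prompt.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only, not the real model's dimensions.
num_patches, vision_dim = 16, 32   # vision encoder: one feature vector per patch
num_text_tokens, llm_dim = 5, 64   # LLM embedding width

patch_features = rng.normal(size=(num_patches, vision_dim))    # stand-in for encoder output
text_embeddings = rng.normal(size=(num_text_tokens, llm_dim))  # stand-in for embedded prompt

# A trainable linear projection maps vision features into the LLM's embedding space.
W = rng.normal(scale=0.02, size=(vision_dim, llm_dim))
b = np.zeros(llm_dim)
image_tokens = patch_features @ W + b

# The LLM then consumes a single sequence containing image tokens and text tokens.
llm_input = np.concatenate([image_tokens, text_embeddings], axis=0)
print(llm_input.shape)  # (21, 64)
```

After projection, the language model sees image patches as extra positions in its input sequence, so its existing attention layers need no architectural change.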

GPT-4V (OpenAI, 2023): The details aren’t fully public, but the broad approach is similar: a powerful vision encoder produces image representations that are fed into the GPT-4 language model alongside text tokens. Key capabilities:

Reading and reasoning about text in images (OCR + understanding)
Analyzing charts, graphs, and diagrams
Multi-image reasoning (comparing images)
Code and math understanding from screenshots

Gemini (Google, 2023): Trained natively multimodal from the start. Rather than adding vision to an existing language model, Gemini was trained on interleaved image, text, audio, and video data from scratch. This lets it process video as a single input (sequences of frames together with audio) rather than handling the frames and the audio track as separate modalities.

Instruction Tuning for Multimodal Models

Raw pretraining produces models that can encode vision and text together but don’t follow instructions well. Multimodal instruction tuning fine-tunes the model on examples like:

[Image: a pie chart of market share]
User: What company has the largest market share?
Assistant: According to the chart, Company A has the largest share at 34%.

Datasets like LLaVA-Instruct and ShareGPT4V, along with the instruction data assembled for InstructBLIP, provide these examples at scale. The quality of instruction tuning data has a large effect on downstream task performance, and synthetic examples generated by GPT-4 have proven effective as training data alongside human-written examples.
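A training record for an example like the one above might be represented as follows. This is a hedged sketch: the `<image>` placeholder string, the `InstructionExample` class, the file path, and the prompt template are all hypothetical, and real datasets and models each define their own formats.

```python
from dataclasses import dataclass
from typing import Tuple

IMAGE_PLACEHOLDER = "<image>"  # hypothetical marker; real models define their own

@dataclass
class InstructionExample:
    image_path: str  # where the image is stored
    question: str    # the user's instruction
    answer: str      # the response the model should learn to produce

def to_prompt_and_target(example: InstructionExample) -> Tuple[str, str]:
    # The prompt holds the image placeholder and the user turn; the target is
    # the assistant turn that the model is trained to generate.
    prompt = f"{IMAGE_PLACEHOLDER}\nUser: {example.question}\nAssistant:"
    target = " " + example.answer
    return prompt, target

example = InstructionExample(
    image_path="charts/market_share.png",  # hypothetical path
    question="What company has the largest market share?",
    answer="According to the chart, Company A has the largest share at 34%.",
)
prompt, target = to_prompt_and_target(example)
print(prompt + target)
```

Separating the prompt from the target makes it straightforward to restrict the training loss to the assistant's response, which is a common design choice in instruction tuning.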

Audio and Video Multimodality

Image-text is the most developed modality pair, but multimodal AI extends further:

Audio-language: Whisper (OpenAI, 2022) transcribes audio using an encoder-decoder transformer architecture similar to those used in machine translation. AudioPaLM combines speech and language models. OpenAI’s GPT-4o (May 2024) processes audio directly: it can detect tone, emotion, and background sounds rather than just transcribing words.

Video: Video is the most challenging modality — a 1-minute video at 30 fps is 1,800 frames plus audio. Approaches:

Sparse frame sampling: Select key frames, treat as multiple images
Video encoders: 3D convolutions or temporal transformers over dense frame sequences
Long video: Retrieval-augmented approaches that index frames and retrieve relevant ones per query
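Two of these approaches are simple enough to sketch directly. The helpers below are hypothetical illustrations: `sample_frame_indices` picks evenly spaced frames, and `retrieve_frames` assumes that frame and query embeddings in a shared space are already available from some encoder.

```python
import numpy as np

def sample_frame_indices(total_frames, num_samples):
    # Evenly spaced frame indices, taking the midpoint of each equal segment.
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]

def retrieve_frames(query_emb, frame_embs, k):
    # Indices of the k frames whose embeddings are most cosine-similar to the query.
    query_emb = query_emb / np.linalg.norm(query_emb)
    frame_embs = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    scores = frame_embs @ query_emb
    return np.argsort(-scores)[:k].tolist()

# Sparse sampling: reduce a 1-minute, 30 fps clip (1,800 frames) to 8 frames.
indices = sample_frame_indices(total_frames=1800, num_samples=8)
print(indices)  # [112, 337, 562, 787, 1012, 1237, 1462, 1687]

# Retrieval: toy 3-d embeddings standing in for indexed frame features.
frame_embs = np.eye(3)
query_emb = np.array([0.1, 0.9, 0.2])
print(retrieve_frames(query_emb, frame_embs, k=2))  # [1, 2]
```

Uniform sampling is cheap but can miss short events. Retrieval spends extra computation on indexing so that only frames relevant to the query reach the model.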

Google’s VideoPoet (2023) generates video from text; Video-LLaMA (Alibaba DAMO Academy, 2023) processes video as sequences of visual tokens.

Common Benchmarks

VQAv2: Visual Question Answering, with about 1.1M questions about images. Reported scores for baselines and for models such as GPT-4V are commonly used to track progress.
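VQA-style benchmarks are usually scored with a soft accuracy that gives full credit when a predicted answer matches at least three of the human annotators. The sketch below implements that commonly cited simplified form. The official evaluation additionally normalizes answer strings and averages over annotator subsets, which is omitted here, and the example answers are invented.

```python
def vqa_accuracy(predicted, human_answers):
    # Simplified VQA accuracy: min(number of matching human answers / 3, 1).
    normalized = predicted.strip().lower()
    matches = sum(1 for answer in human_answers if answer.strip().lower() == normalized)
    return min(matches / 3.0, 1.0)

# Invented annotations: ten human answers to one question.
human_answers = ["red", "red", "dark red", "red", "maroon",
                 "red", "crimson", "red", "red", "burgundy"]
print(vqa_accuracy("red", human_answers))     # 1.0
print(vqa_accuracy("maroon", human_answers))  # one match out of three needed
```

The soft score accounts for questions with more than one reasonable answer, which exact-match accuracy would penalize.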

MMBench: Multi-dimensional multimodal evaluation across perception, reasoning, and domain-specific tasks.

MMMU: Massive Multidisciplinary Multimodal Understanding — college-level questions requiring visual + domain knowledge.

TextVQA: Reading text within images. This is generally harder than VQA on natural images because the model must read the text as well as understand the scene.

By 2024, GPT-4V and Gemini Ultra scored roughly 56–59% on MMMU, still well below the human expert performance reported by the benchmark’s authors. For OCR-heavy tasks, both came much closer to human-level performance.

One thing to remember: The key insight behind multimodal AI is building a shared representational space where concepts in different modalities are aligned — once that alignment exists, reasoning across modalities becomes tractable.
