Embeddings — Core Concepts

From Words to Numbers (And Why It’s Harder Than It Sounds)

Computers are ruthlessly literal. To a computer, “bank” the financial institution and “bank” the river edge are the same string of characters. So are “run” the verb and “run” the noun. Meanwhile, synonyms like “automobile” and “car” look completely unrelated: just two different sequences of letters.

For decades, this was a huge problem. Search engines would miss relevant documents because those documents used different words than the query for the same thing. Spam filters would fail when spammers swapped “free” for “complimentary.” Translation systems couldn’t handle context.

Embeddings are the fix. They’re a way of representing text (or images, or audio) as a list of numbers — a vector — such that similar things have similar vectors. Once you can measure similarity mathematically, a huge range of problems become tractable.

What a Vector Actually Is

A vector is just a list of numbers, such as [0.2, -0.7, 0.1, 0.9, ...]. For modern embeddings, this list might be 768, 1536, or 3072 numbers long, with one number per “dimension.”

Each dimension doesn’t have a clean human label like “how animal-like is this word?” The dimensions emerge from training and are mostly uninterpretable. But collectively, they encode something real: the word’s meaning in context.

Two vectors are similar if they point in the same direction in that high-dimensional space. The standard measure is cosine similarity: it’s 1.0 if two vectors point exactly the same way, 0 if they’re perpendicular, and -1.0 if they’re opposite. In practice, “cat” and “dog” might score 0.85 cosine similarity. “Cat” and “justice” might score 0.12.
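To make the definition concrete, here is a minimal sketch of cosine similarity in plain Python. The function name and the tiny two-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions, and a production system would normally use a numeric library instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Illustrative inputs only (not real embeddings).
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])            # same direction
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # right angle
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])      # opposite direction
```

Note that the first pair differs in length (one vector is twice as long) but still scores close to 1.0: cosine similarity looks only at direction, not magnitude.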

How Embeddings Are Trained

The short version: you show a neural network a huge amount of text, and you train it to predict context.

The original landmark approach was Word2Vec (Google, 2013). The core insight was almost embarrassingly simple: words that appear in similar surrounding contexts probably mean similar things. “I walked my ___” could end with “dog” or “cat” or “llama” — not “democracy.” So dog, cat, and llama should end up nearby.

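Word2Vec trained on one of two mirror-image tasks: given a word, predict the words around it (the “skip-gram” variant), or given the surrounding words, predict the word in the middle (the “continuous bag-of-words” variant). The embeddings were a byproduct of learning to do that task well. This worked so well that the paper spawned a long line of follow-up methods.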
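The “predict the surrounding words” setup can be sketched as a function that turns a sentence into (center word, context word) training pairs. This is only the data-preparation idea under simplifying assumptions: the function name and the window size are illustrative, and the real Word2Vec training procedure involves more than this (for example, negative sampling and a neural network that actually learns the vectors).

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbors."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "i walked my dog".split()
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Swap “dog” for “cat” and the pairs involving “my” barely change, which is why a model trained on such pairs pushes “dog” and “cat” toward similar vectors.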

Modern embedding models (like those from OpenAI or Cohere) use transformer architectures trained on far more data with more sophisticated objectives. Crucially, they’re contextual — the same word gets different embeddings depending on surrounding words, so “bank” in “river bank” and “bank” in “bank account” get different vectors.

The King - Man + Woman = Queen Thing

This is the result that made researchers do a double-take when Word2Vec came out in 2013.

Because embeddings capture relationships, you can do arithmetic on them:

embedding("king") - embedding("man") + embedding("woman") ≈ embedding("queen")

The model was never shown this fact. Nobody hard-coded “king is the male version of queen.” It fell out of the geometry because those relationships are embedded in how those words get used in text.

Other examples that work:

  • Paris - France + Germany ≈ Berlin
  • walked - walk + swim ≈ swam
  • doctor - man + woman ≈ nurse (which also reveals biases in training data — more on that below)
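The arithmetic above can be tried on a toy vocabulary. Everything in this sketch is an assumption made for illustration: the five vectors are hand-made with three dimensions, whereas real embeddings are learned and have no such readable axes. The mechanics (subtract, add, then pick the nearest remaining word by cosine similarity) are the part that carries over.

```python
import math

# Hand-made toy vectors, NOT real embeddings.
toy = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.2, 0.3, 0.3],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c]."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Common practice: the three query words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy)
print(result)
```

The result is only ever the nearest word to the computed point, which is why the relation is written with ≈ and not =.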

Where Embeddings Show Up Today

Semantic search is the most impactful application. Traditional search matches keywords. Semantic search matches meaning. Search for “how to fix a broken pipe” and get results that mention “plumbing repair” even if they never use your exact words. Many modern search engines and AI assistants retrieve information this way, often in combination with keyword matching.

Recommendation systems use embeddings to find similar content. Spotify’s embedding model (they published details in 2022) turns songs into vectors based on co-listening behavior — if millions of people listen to song A and song B back to back, they end up near each other. No audio analysis needed.

RAG (Retrieval-Augmented Generation), the technique behind most “chat with your documents” products, works by embedding both the document chunks and the user’s question, then retrieving the most similar chunks. The LLM never reads your whole company wiki; it reads only the handful of chunks (say, the top three) that are closest to what you asked.
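The retrieval step can be sketched in a few lines. This is a simplified illustration, not how any particular product works: the chunk texts and their three-number vectors are made up, `top_k` is a hypothetical helper, and a real system would get the vectors from an embedding model and usually search them with a vector index instead of a full scan.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=3):
    """Return the k (score, text) pairs most similar to query_vec."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical pre-computed chunk embeddings (tiny made-up vectors).
chunks = [
    ("Vacation policy: 25 days per year.", [0.9, 0.1, 0.0]),
    ("Expense reports are due monthly.",   [0.1, 0.9, 0.1]),
    ("Parental leave lasts 16 weeks.",     [0.8, 0.2, 0.1]),
    ("The office wifi password rotates.",  [0.0, 0.1, 0.9]),
]
# Stands in for the embedding of "How much time off do I get?"
query_vec = [0.85, 0.15, 0.05]

hits = top_k(query_vec, chunks, k=2)
# Only the retrieved text is placed in the prompt sent to the LLM.
context = "\n".join(text for _, text in hits)
print(context)
```

With these toy numbers the two leave-related chunks are retrieved and the expense and wifi chunks are left out, even though the question shares no keywords with them.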

Classification and clustering become easier when you can measure distance between items. Spam detection, topic modeling, and duplicate detection often work better with embeddings than with raw text alone.

Misconception: Bigger Is Always Better

More dimensions ≠ better embeddings. A 3072-dimension OpenAI embedding isn’t automatically better than a 384-dimension one for every task.

What matters is how the embeddings were trained and on what data. A general-purpose embedding model might be mediocre at legal documents. A model fine-tuned on legal text with 384 dimensions might outperform it easily.

There’s also a practical concern: storing and searching billions of high-dimensional vectors is expensive. Entire companies (Pinecone, Weaviate, Qdrant) exist to solve this problem efficiently. Their product category is called a vector database, and it’s one of the hotter infrastructure categories of the mid-2020s.
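A back-of-the-envelope calculation shows why dimension count matters for cost. The sketch below assumes each number is stored as a 32-bit float (4 bytes) and counts only the raw vectors, ignoring index structures and metadata; the function name is made up for illustration.

```python
def index_size_gib(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the vectors alone, in GiB (float32 = 4 bytes per value)."""
    return num_vectors * dimensions * bytes_per_value / 2**30

small = index_size_gib(1_000_000_000, 384)    # one billion 384-dim vectors
large = index_size_gib(1_000_000_000, 3072)   # one billion 3072-dim vectors

print(f"384 dims:  {small:,.0f} GiB")
print(f"3072 dims: {large:,.0f} GiB")
print(f"ratio:     {large / small:.0f}x")
```

Under these assumptions, a billion vectors take roughly 1.4 TiB at 384 dimensions and roughly 11 TiB at 3072, an eightfold difference before any search index is built on top.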

The Bias Problem

Embeddings learn from human language, which means they inherit human biases. The “doctor - man + woman ≈ nurse” example above isn’t a cute quirk — it reflects actual patterns in training corpora where nurses were statistically more associated with women.

A 2016 paper called “Man is to Computer Programmer as Woman is to Homemaker?” (Bolukbasi et al.) documented these biases systematically and proposed debiasing techniques. The debate about whether you can effectively debias embeddings — and whether debiasing makes them less accurate — is still active.
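One of the ideas in that line of work is to find a “bias direction” in the vector space and remove a word’s component along it. The sketch below is a heavily simplified illustration of that projection step, not the paper’s full procedure: the vectors are made up, the helper names are hypothetical, and the direction is taken as a single difference of two toy vectors, which is an assumption made here for brevity.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def neutralize(word_vec, bias_direction):
    """Remove the component of word_vec that lies along bias_direction."""
    g = normalize(bias_direction)
    projection = dot(word_vec, g)
    return [x - projection * gi for x, gi in zip(word_vec, g)]

# Made-up toy vectors; real ones would come from a trained model.
he, she = [0.1, 0.9, 0.2], [0.1, 0.1, 0.9]
nurse = [0.7, 0.2, 0.6]

gender_direction = [a - b for a, b in zip(he, she)]
unit_direction = normalize(gender_direction)

before = dot(nurse, unit_direction)                               # nonzero: leans one way
after = dot(neutralize(nurse, gender_direction), unit_direction)  # approximately zero
print(before, after)
```

Whether removing one direction like this actually removes the bias, as opposed to merely hiding it, is part of the ongoing debate.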

One Thing to Remember

Embeddings turn meaning into geometry. Once your data is a point in space, you can use distance to measure similarity — and that’s the foundation of modern search, recommendations, and almost every system that needs to understand language.

