Word Embeddings and Word2Vec — Core Concepts

Understand how Word2Vec learns word vectors, what Skip-gram and CBOW mean, and when embeddings improve your NLP pipeline.

Word embeddings are dense vector representations of words where geometric relationships between vectors capture semantic relationships between words. Word2Vec, introduced by Tomas Mikolov and colleagues at Google in 2013, popularized this approach and remains one of the best-known static embedding methods.

Why Not One-Hot Encoding?

The naive approach assigns each word a vector with a single 1 and all other positions 0. With a vocabulary of 100,000 words, every vector has 100,000 dimensions. Two problems:

Sparsity — vectors are almost entirely zeros, wasting memory and compute.
No similarity signal — the distance between any two one-hot vectors is identical, so “cat” is equally far from “kitten” and “democracy.”

Word embeddings compress each word into 100-300 dimensions where distances encode meaning.
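The "no similarity signal" problem is easy to check by hand. The sketch below uses a hypothetical four-word vocabulary (a stand-in for the 100,000-word case) and the helper names one_hot and euclidean, which are invented for this example.

```python
import math

def one_hot(index, vocab_size):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical toy vocabulary; a real one would have far more entries.
vocab = ["cat", "kitten", "democracy", "mat"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

d_cat_kitten = euclidean(vectors["cat"], vectors["kitten"])
d_cat_democracy = euclidean(vectors["cat"], vectors["democracy"])

# Both distances are sqrt(2): two distinct one-hot vectors always differ
# in exactly two positions, whatever the words mean.
print(d_cat_kitten, d_cat_democracy)
```

The result does not depend on vocabulary size, which is why one-hot vectors cannot express that "cat" is closer to "kitten" than to "democracy."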

How Word2Vec Learns

Word2Vec trains a shallow neural network on a simple task: predict a word from its context (or vice versa). The word vectors are not the output — they are the internal weights of the network, which encode useful patterns as a byproduct of training.

CBOW (Continuous Bag of Words)

Given surrounding context words, predict the center word.

Input: “The ___ sat on the mat” → Predict: “cat”

CBOW averages the context word vectors to form a single hidden (projection) layer, then uses that average to predict the target word. It trains faster than Skip-gram and works well for frequent words.
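A minimal sketch of one CBOW forward pass may make this concrete. It assumes a full softmax over a five-word toy vocabulary and random, untrained weights; the names cbow_forward, w_in, and w_out are illustrative, and practical implementations typically replace the full softmax with negative sampling or hierarchical softmax.

```python
import math
import random

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_forward(context_ids, w_in, w_out):
    """Average the context vectors, then score every vocabulary word."""
    dim = len(w_in[0])
    # Hidden layer: element-wise mean of the context word vectors.
    hidden = [
        sum(w_in[i][k] for i in context_ids) / len(context_ids)
        for k in range(dim)
    ]
    # Output layer: dot product of the hidden vector with each output vector.
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in w_out]
    return softmax(scores)

vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

# Untrained random weights (assumption: 3 dimensions, fixed seed).
rng = random.Random(0)
dim = 3
w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

context = ["the", "sat", "on", "the", "mat"]  # "the ___ sat on the mat"
probs = cbow_forward([word_to_id[w] for w in context], w_in, w_out)
print(dict(zip(vocab, probs)))
```

Because the weights are untrained, the predicted distribution is arbitrary. Training would adjust w_in and w_out so that "cat" receives a high probability, and the rows of w_in are what end up being used as word vectors.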

Skip-gram

Given a center word, predict the surrounding context words.

Input: “cat” → Predict: “The,” “sat,” “on,” “the,” “mat”

Skip-gram works better with rare words and small datasets because it creates more training examples from each sentence.
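To see where the extra training examples come from, the sketch below counts the examples each formulation produces from the same sentence. It assumes a fixed window of 2 and lowercase whitespace tokenization; cbow_examples and skipgram_examples are hypothetical helper names.

```python
def cbow_examples(tokens, window):
    """One (context_words, center_word) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, center))
    return examples

def skipgram_examples(tokens, window):
    """One (center_word, context_word) example per pair inside the window."""
    return [
        (center, ctx)
        for context, center in cbow_examples(tokens, window)
        for ctx in context
    ]

tokens = "the cat sat on the mat".split()
cbow = cbow_examples(tokens, window=2)
skipgram = skipgram_examples(tokens, window=2)

print(len(cbow), len(skipgram))  # 6 18
print(skipgram[:3])
```

CBOW folds each context into a single averaged example, while Skip-gram treats every (center, context) pair as its own example, so a rare word gets several separate updates each time it appears.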

Window Size

The window parameter controls how many words around the target count as “context.” A window of 2 means the two words before and after the target.

Small window (2-3): tends to capture syntactic relationships (adjective-noun, verb-adverb).
Large window (5-10): tends to capture semantic/topical relationships (words in the same domain).
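As a small illustration of what the window parameter selects, the sketch below extracts the context of one target word at two window sizes. The function context_window and the example sentence are assumptions made for this example.

```python
def context_window(tokens, index, window):
    """Return up to `window` words on each side of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

tokens = "the quick brown fox jumps over the lazy dog".split()
target = tokens.index("fox")

narrow = context_window(tokens, target, window=2)
wide = context_window(tokens, target, window=5)

print(narrow)  # ['quick', 'brown', 'jumps', 'over']
print(wide)    # every other word in this short sentence
```

With the narrow window, only immediate neighbors count as context. With the wide window, the context covers most of the sentence, so words that merely share a topic with the target also contribute.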

Key Properties of Word Vectors

Similarity

Words that appear in similar contexts get similar vectors. Cosine similarity measures how close two vectors are (the values below are illustrative and vary from model to model):

cos(“dog”, “puppy”) ≈ 0.85
cos(“dog”, “computer”) ≈ 0.15
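Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it for three hypothetical 3-dimensional vectors, invented for illustration; real embeddings have 100-300 dimensions and would give different numbers.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from any trained model.
dog = [0.9, 0.8, 0.1]
puppy = [0.8, 0.9, 0.2]
computer = [0.1, 0.0, 0.9]

sim_close = cosine_similarity(dog, puppy)
sim_far = cosine_similarity(dog, computer)
print(round(sim_close, 2), round(sim_far, 2))
```

Because the measure depends only on the angle between vectors, two words can be highly similar even if one vector is much longer than the other.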

Analogies

Vector arithmetic captures relational patterns:

king - man + woman ≈ queen
Paris - France + Germany ≈ Berlin
walked - walk + swim ≈ swam

These work because the model encodes concepts like gender, geography, and tense as consistent directional shifts in the vector space.
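The arithmetic can be sketched with hand-built vectors. The example below assumes a made-up 2-dimensional space in which the first coordinate stands for "royalty" and the second for "gender"; trained embeddings do not have labeled axes like this, and the analogy helper is a hypothetical name.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c).

    The three query words are excluded from the candidates, since the
    nearest vector to the result is otherwise often one of the inputs.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

# Hypothetical toy vectors: [royalty, gender].
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "prince": [0.9, 1.0],
    "apple": [-1.0, 0.0],
}

result = analogy("king", "man", "woman", vectors)
print(result)  # queen
```

Subtracting "man" and adding "woman" flips the gender coordinate while leaving the royalty coordinate untouched, which is the "consistent directional shift" described above.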

Clustering

Words naturally cluster by category. Plotting vectors with dimensionality reduction (t-SNE or UMAP) reveals that countries group together, colors group together, and professions group together — without anyone specifying these categories.

Pre-trained vs. Custom Embeddings

Pre-trained (Google News Word2Vec, GloVe, FastText):

Trained on billions of words.
Good general-purpose vectors.
Free and immediate to use.
May not cover domain-specific vocabulary (medical terms, product codes).

Custom-trained:

Trained on your specific corpus.
Captures domain-specific meaning (“positive” means something different in medical vs. movie contexts).
Requires enough text (typically 10M+ words for decent quality).
Can be initialized from pre-trained vectors and fine-tuned.
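One quick way to judge whether pre-trained vectors fit a domain is to measure how much of your corpus falls outside their vocabulary. The sketch below assumes you already have the pre-trained vocabulary as a Python set and your corpus as a list of tokens; oov_report and the sample data are hypothetical.

```python
from collections import Counter

def oov_report(corpus_tokens, pretrained_vocab):
    """Return (share of tokens missing from the vocab, missing words by frequency)."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0, []
    missing = {w: c for w, c in counts.items() if w not in pretrained_vocab}
    oov_rate = sum(missing.values()) / total
    return oov_rate, sorted(missing, key=missing.get, reverse=True)

# Hypothetical stand-ins for a real pre-trained vocabulary and a domain corpus.
pretrained_vocab = {"the", "patient", "was", "given", "for", "pain"}
corpus = "the patient was given metoprolol for tachycardia".split()

oov_rate, missing_words = oov_report(corpus, pretrained_vocab)
print(f"{oov_rate:.0%} of tokens are out of vocabulary: {missing_words}")
```

A high out-of-vocabulary rate, especially among frequent domain terms, is a sign that custom training or fine-tuning is worth the effort.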

Beyond Word2Vec

Word2Vec has one major limitation: each word gets exactly one vector regardless of context. “Bank” gets the same vector whether it refers to a river bank or a financial institution.

Later methods address this:

GloVe (2014) — uses global word co-occurrence statistics instead of a sliding window. Similar quality, different training approach.
FastText (2016) — represents words as sums of character n-grams. Can generate vectors for words not seen during training (misspellings, rare words).
ELMo (2018) — context-dependent embeddings from a bidirectional LSTM.
BERT (2018) — fully contextualized embeddings from a transformer. “Bank” gets different vectors in different sentences.
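The character n-gram idea behind FastText can be sketched in a few lines. The example assumes the word is wrapped in "<" and ">" boundary markers and split into n-grams of a chosen length range; char_ngrams is a hypothetical helper, and the sketch leaves out how n-grams are mapped to vectors and summed.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of `word`, including boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']

# A misspelling shares most of its n-grams with the correct word.
typo_trigrams = char_ngrams("wher", min_n=3, max_n=3)
shared = set(trigrams) & set(typo_trigrams)
print(sorted(shared))  # ['<wh', 'her', 'whe']
```

Because an unseen word still decomposes into n-grams that were seen during training, a vector can be assembled for it from those pieces.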

For most practical applications today, contextual embeddings (BERT and descendants) have replaced static embeddings. But Word2Vec remains valuable for understanding how embeddings work and for applications where speed and simplicity matter.

Common Misunderstanding

People sometimes think bigger embedding dimensions are always better. In practice, 100-300 dimensions capture most useful information. Going to 1,000 dimensions tends to overfit on small corpora and adds computation cost with diminishing returns.
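The cost side of this trade-off is simple arithmetic. The sketch below assumes the model stores one input matrix and one output matrix, each of shape vocabulary size by dimensions, with 4-byte floats; the function name and the 100,000-word vocabulary are illustrative.

```python
def word2vec_parameter_count(vocab_size, dim):
    """Parameters in an input matrix plus an output matrix, each vocab_size x dim."""
    return 2 * vocab_size * dim

vocab_size = 100_000
bytes_per_float = 4  # assumption: 32-bit floats

params_300 = word2vec_parameter_count(vocab_size, 300)
params_1000 = word2vec_parameter_count(vocab_size, 1000)

for dim, params in ((300, params_300), (1000, params_1000)):
    megabytes = params * bytes_per_float / 1_000_000
    print(f"dim={dim}: {params:,} parameters, about {megabytes:.0f} MB")
```

Under these assumptions, moving from 300 to 1,000 dimensions more than triples the number of parameters that the same corpus has to fit.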

The one thing to remember: Word2Vec learns word vectors by predicting context, producing a space where similar words are close and relationships are encoded as directions. That foundational idea underlies much of modern NLP.
