Word Embeddings and Word2Vec — Core Concepts
Word embeddings are dense vector representations of words where geometric relationships between vectors capture semantic relationships between words. Word2Vec, introduced by Tomas Mikolov and colleagues at Google in 2013, popularized this approach and remains one of the most widely used static embedding methods.
Why Not One-Hot Encoding?
The naive approach assigns each word a vector with a single 1 and all other positions 0. With a vocabulary of 100,000 words, every vector has 100,000 dimensions. Two problems:
- Sparsity — vectors are almost entirely zeros, wasting memory and compute.
- No similarity signal — the distance between any two one-hot vectors is identical, so “cat” is equally far from “kitten” and “democracy.”
Word embeddings compress each word into 100-300 dimensions where distances encode meaning.
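The similarity problem with one-hot vectors can be checked directly. The following is a minimal sketch, assuming a hypothetical four-word vocabulary and helper names (`one_hot`, `euclidean`) chosen only for illustration:

```python
import math

def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy vocabulary, chosen only for illustration.
vocab = ["cat", "kitten", "democracy", "mat"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# Both distances come out the same, because any two distinct
# one-hot vectors differ in exactly two positions.
d_cat_kitten = euclidean(vectors["cat"], vectors["kitten"])
d_cat_democracy = euclidean(vectors["cat"], vectors["democracy"])
print(d_cat_kitten, d_cat_democracy)
```

Whatever the vocabulary size, the distance between two different words is always the square root of 2, so the encoding carries no information about which words are related.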
How Word2Vec Learns
Word2Vec trains a shallow neural network on a simple task: predict a word from its context (or vice versa). The word vectors are not the output — they are the internal weights of the network, which encode useful patterns as a byproduct of training.
CBOW (Continuous Bag of Words)
Given surrounding context words, predict the center word.
Input: “The ___ sat on the mat” → Predict: “cat”
CBOW averages the context word vectors to form the hidden layer, then feeds that average to an output layer that scores every vocabulary word as the possible target. It works well with frequent words and trains faster than skip-gram.
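The CBOW forward pass can be sketched in a few lines. This is an illustrative toy, not a reference implementation: the vocabulary, the 4-dimensional vectors, the random initialization, and the names `w_in`, `w_out`, and `cbow_forward` are all assumptions. It also uses a plain softmax for clarity, whereas practical implementations rely on cheaper approximations.

```python
import math
import random

random.seed(0)

# Hypothetical toy vocabulary and embedding size.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}
dim = 4

# Input vectors (these become the word embeddings) and output vectors,
# both randomly initialized because no training happens in this sketch.
w_in = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

def cbow_forward(context_words):
    """Return one probability per vocabulary word given the context words."""
    ids = [word_to_id[w] for w in context_words]
    # Hidden layer: element-wise average of the context word vectors.
    hidden = [sum(w_in[i][d] for i in ids) / len(ids) for d in range(dim)]
    # Output layer: one score per vocabulary word, normalized with softmax.
    scores = [sum(h * o for h, o in zip(hidden, row)) for row in w_out]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = cbow_forward(["the", "sat", "on", "the", "mat"])
print(dict(zip(vocab, probs)))
```

Training would adjust `w_in` and `w_out` so that the probability assigned to the true center word ("cat" here) rises. The rows of `w_in` are what is kept as the word vectors afterwards.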
Skip-gram
Given a center word, predict the surrounding context words.
Input: “cat” → Predict: “The”, “sat”, “on”, “the”, “mat” (assuming a window wide enough to cover the whole sentence)
Skip-gram works better with rare words and small datasets because every (center word, context word) pair is its own training example, so each sentence yields more examples than it does under CBOW.
Window Size
The window parameter controls how many words around the target count as “context.” A window of 2 means the two words before the target and the two words after it.
- Small window (2-3): tends to capture syntactic relationships (adjective-noun, verb-adverb).
- Large window (5-10): tends to capture semantic/topical relationships (words in the same domain).
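The effect of the window parameter is easiest to see by generating skip-gram training pairs by hand. The sketch below assumes a pre-tokenized, lowercased sentence, and the function name `skipgram_pairs` is hypothetical:

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["the", "cat", "sat", "on", "the", "mat"]

pairs_small = skipgram_pairs(tokens, window=2)
pairs_large = skipgram_pairs(tokens, window=5)

# Context words paired with "cat" when the window is 2.
cat_contexts = [ctx for center, ctx in pairs_small if center == "cat"]
print(cat_contexts)
print(len(pairs_small), len(pairs_large))
```

A wider window produces more pairs per sentence and links words that sit farther apart, which is why larger windows lean toward topical rather than syntactic similarity.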
Key Properties of Word Vectors
Similarity
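Words that appear in similar contexts get similar vectors. Cosine similarity measures how closely two vectors point in the same direction, ranging from -1 (opposite) to 1 (same direction). Illustrative values, which vary by model and corpus: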
- cos(“dog”, “puppy”) ≈ 0.85
- cos(“dog”, “computer”) ≈ 0.15
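Cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a minimal sketch, assuming hand-made 3-dimensional vectors that stand in for real embeddings (the numbers are invented for illustration and do not come from any trained model):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not taken from a trained model.
dog = [0.9, 0.8, 0.1]
puppy = [0.8, 0.9, 0.2]
computer = [0.1, 0.0, 0.9]

sim_dog_puppy = cosine_similarity(dog, puppy)
sim_dog_computer = cosine_similarity(dog, computer)
print(sim_dog_puppy, sim_dog_computer)
```

Because the measure depends only on direction, scaling a vector by a positive constant leaves its similarity to other vectors unchanged.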
Analogies
Vector arithmetic captures relational patterns:
- king - man + woman ≈ queen
- Paris - France + Germany ≈ Berlin
- walked - walk + swim ≈ swam
These work because the model encodes concepts like gender, geography, and tense as consistent directional shifts in the vector space.
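The analogy arithmetic can be sketched with hand-built vectors in which the directions are explicit. Everything here is an assumption made for illustration: the three dimensions (royalty, gender, fruit), the vector values, and the helper name `analogy`. Real embeddings learn such directions only approximately, and they are not aligned with individual axes.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical hand-built vectors; dimensions are (royalty, gender, fruit).
toy_vectors = {
    "king": [0.9, 0.9, 0.0],
    "queen": [0.9, -0.9, 0.0],
    "man": [0.1, 0.9, 0.0],
    "woman": [0.1, -0.9, 0.0],
    "apple": [0.0, 0.0, 1.0],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

Excluding the three input words from the candidates is a common convention in analogy lookups. Without it, the nearest neighbor of the result is often one of the inputs.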
Clustering
Words naturally cluster by category. Plotting vectors with dimensionality reduction (t-SNE or UMAP) reveals that countries group together, colors group together, and professions group together — without anyone specifying these categories.
Pre-trained vs. Custom Embeddings
Pre-trained (Google News Word2Vec, GloVe, FastText):
- Trained on billions of words.
- Good general-purpose vectors.
- Free and immediate to use.
- May not cover domain-specific vocabulary (medical terms, product codes).
Custom-trained:
- Trained on your specific corpus.
- Captures domain-specific meaning (“positive” means something different in medical vs. movie contexts).
- Requires enough text (typically 10M+ words for decent quality).
- Can be initialized from pre-trained vectors and fine-tuned.
Beyond Word2Vec
Word2Vec has one major limitation: each word gets exactly one vector regardless of context. “Bank” gets the same vector whether it refers to a river bank or a financial bank.
Later methods build on or move past Word2Vec, though only the contextual models (ELMo and BERT) address this limitation:
- GloVe (2014): fits vectors to global word co-occurrence counts gathered over the whole corpus rather than predicting words one local window at a time. Similar quality, different training approach.
- FastText (2016): represents each word as the sum of its character n-gram vectors, so it can build vectors for words not seen during training (misspellings, rare words).
- ELMo (2018): context-dependent embeddings from a bidirectional LSTM.
- BERT (2018): contextualized embeddings from a transformer. “Bank” gets different vectors in different sentences.
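The FastText idea of splitting words into character n-grams can be sketched as follows. The function name `char_ngrams` is hypothetical. The `<` and `>` boundary markers and the default n-gram lengths of 3 to 6 follow FastText's documented defaults, but this sketch only extracts the n-grams and does not look up or sum any vectors.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)

# A misspelling still shares many n-grams with the correct word, which is
# what lets a subword model assemble a reasonable vector for it.
shared = set(char_ngrams("where")) & set(char_ngrams("wherre"))
print(sorted(shared))
```

Since an unseen word's vector is assembled from n-grams that were seen during training, words with overlapping spelling end up with related vectors.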
For most practical applications today, contextual embeddings (BERT and descendants) have largely replaced static embeddings. But Word2Vec remains valuable for understanding how embeddings work and for applications where speed and simplicity matter.
Common Misunderstanding
People sometimes think bigger embedding dimensions are always better. In practice, 100-300 dimensions capture most useful information. Going to 1,000 dimensions can overfit on small corpora and adds computation cost with diminishing returns.
The one thing to remember: Word2Vec learns word vectors by predicting context, producing a space where similar words are close and relationships are encoded as directions. That foundational idea underlies much of modern NLP.