Word Embeddings and Word2Vec — Deep Dive
Word2Vec is both a foundational concept in NLP and a practical tool still used in production. This guide covers training, evaluating, and applying word embeddings with Python.
Training Word2Vec with Gensim
Basic Training
from gensim.models import Word2Vec
# sentences: list of lists of tokens
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "played", "in", "the", "yard"],
    # ... thousands more sentences
]
model = Word2Vec(
    sentences,
    vector_size=200,  # embedding dimensions
    window=5,         # context window size
    min_count=5,      # ignore words appearing fewer than 5 times
    workers=4,        # parallel training threads
    sg=1,             # 1 = Skip-gram, 0 = CBOW
    epochs=10,        # training passes over the corpus
    negative=10,      # negative samples per positive example
    sample=1e-4,      # downsample frequent words
)
Streaming Large Corpora
For datasets that do not fit in memory:
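class SentenceIterator:
    """Restartable iterable that yields one tokenized sentence per line."""
    def __init__(self, filepath):
        self.filepath = filepath
    def __iter__(self):
        # Reopen the file on every pass so the corpus can be read repeatedly
        with open(self.filepath, "r", encoding="utf-8") as f:
            for line in f:
                yield line.strip().split()
sentences = SentenceIterator("corpus.txt")
model = Word2Vec(sentences, vector_size=200, window=5, min_count=5, workers=4, sg=1)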
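Gensim reads the corpus more than once: one pass to build the vocabulary, then one pass per training epoch (five by default). Pass a restartable iterable like the class above rather than a plain generator, which would be exhausted after the first pass.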
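Because the corpus is read repeatedly, the difference between a generator and a restartable iterable matters. The following is a minimal sketch in plain Python, with no Gensim involved; `token_generator` and `TokenIterable` are illustrative names, not library classes.

```python
def token_generator(lines):
    """A generator: it can be consumed only once."""
    for line in lines:
        yield line.split()


class TokenIterable:
    """A restartable iterable: every iter() call starts a fresh pass."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.split()


lines = ["the cat sat", "the dog played"]

# Count how many sentences each of three consecutive passes sees
gen = token_generator(lines)
gen_passes = [len(list(gen)) for _ in range(3)]

iterable = TokenIterable(lines)
iterable_passes = [len(list(iterable)) for _ in range(3)]

print(gen_passes)       # [2, 0, 0]
print(iterable_passes)  # [2, 2, 2]
```

The generator yields nothing after its first pass, so a model fed with it would build a vocabulary and then have no data left to train on.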
Key Hyperparameters Explained
vector_size: 100-300 is standard. The pre-trained Google News vectors use 300. For domain-specific models on smaller corpora, 100-150 reduces the risk of overfitting.
window: Controls what counts as “context.” Smaller windows (2-3) learn syntactic relationships. Larger windows (5-10) learn topical/semantic relationships.
negative: Number of negative samples drawn per positive example. Negative sampling approximates the full softmax for training efficiency. 5-20 is typical for small corpora; the original paper reports that 2-5 is enough on large ones. Larger values slow training.
sample: Downsampling threshold for frequent words. Words whose relative frequency exceeds this threshold are randomly dropped during training, and the more frequent the word, the more often it is dropped. This keeps common words like “the” from dominating the training signal. Values between 1e-3 and 1e-5 work well.
min_count: Words below this frequency are dropped from the vocabulary. Rare words have poor vector quality anyway, and removing them reduces model size.
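To make the effect of `sample` concrete, here is a small sketch of the subsampling rule as stated in the original Word2Vec paper, where an occurrence of a word is discarded with probability 1 - sqrt(t / f). Gensim and the reference C code use a closely related variant, so treat the numbers as indicative. The function name and the word frequencies are made up for illustration.

```python
import math


def discard_probability(word_freq, threshold=1e-4):
    """Chance of dropping one occurrence of a word (paper formula).

    word_freq is the word's relative frequency: count / total tokens.
    """
    if word_freq <= threshold:
        return 0.0  # words at or below the threshold are always kept
    return 1.0 - math.sqrt(threshold / word_freq)


# Hypothetical relative frequencies, for illustration only
for word, freq in [("the", 0.05), ("cat", 0.0004), ("axolotl", 0.00001)]:
    print(f"{word}: {discard_probability(freq):.3f}")
```

With these assumed frequencies, most occurrences of “the” are dropped, about half of “cat”, and none of the rare word.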
Loading Pre-trained Models
Google News Vectors
from gensim.models import KeyedVectors
# Download from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(wv.most_similar("python", topn=5))
GloVe Vectors
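GloVe text files lack the header line that the Word2Vec text format expects. Gensim 4.0 and later can load them directly with no_header=True; older versions need a one-time conversion with glove2word2vec first:
from gensim.models import KeyedVectors
# Gensim >= 4.0 reads the headerless GloVe file directly
wv = KeyedVectors.load_word2vec_format("glove.6B.200d.txt", binary=False, no_header=True)
# On older versions, run glove2word2vec() first and load its output without no_header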
FastText Vectors
from gensim.models.fasttext import load_facebook_vectors
ft = load_facebook_vectors("cc.en.300.bin")
# FastText can generate vectors for OOV words
print(ft["misspeling"]) # works despite the typo
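FastText can handle the typo because it represents a word as the sum of its character n-gram vectors. The sketch below only extracts the n-grams, using FastText's default lengths of 3 to 6 and its `<` and `>` boundary markers. The real library also hashes n-grams into buckets, which is omitted here, and `char_ngrams` is an illustrative helper, not a Gensim function.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("cat", 3, 3))  # ['<ca', 'cat', 'at>']

# The misspelling shares many n-grams with the correct spelling
correct = set(char_ngrams("misspelling"))
typo = set(char_ngrams("misspeling"))
shared = correct & typo
print(f"{len(shared)} of {len(typo)} n-grams in the typo also occur in the correct word")
```

The shared n-grams are why the vector for an unseen misspelling tends to land near the vector of the intended word.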
Evaluating Embeddings
Analogy Tests
# "king" - "man" + "woman" = ?
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)
# Illustrative output; exact neighbors and scores depend on the corpus:
# [('queen', 0.85), ('princess', 0.73), ('monarch', 0.71)]
Gensim includes standard analogy test sets:
wv = model.wv
# Returns the overall accuracy plus a list of per-section results
# (capital-country, family, verb-tense, etc.)
score, sections = wv.evaluate_word_analogies("questions-words.txt")
print(f"Overall accuracy: {score:.1%}")
The Google analogy test set contains 19,544 questions across categories. A well-trained 300-dimensional model on a large corpus typically achieves 60-70% accuracy.
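Each analogy question is answered with vector arithmetic followed by a nearest-neighbor lookup. The sketch below shows that idea on hand-made 2-D vectors in plain Python. It is a simplification: Gensim's own `most_similar` normalizes vectors first, and `solve_analogy` and `toy_vectors` are illustrative names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' via vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded, as in the standard evaluation
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hand-made vectors: axis 0 is roughly "royalty", axis 1 roughly "gender"
toy_vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.1],
}

answer = solve_analogy(toy_vectors, "man", "king", "woman")
print(answer)  # queen
```

A question counts as correct only if the top-ranked candidate matches the expected word exactly, which is one reason accuracy figures look modest.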
Similarity Benchmarks
from scipy.stats import spearmanr
# WordSim-353 or SimLex-999 datasets
# Each contains word pairs with human similarity ratings
human_scores = []
model_scores = []
for word1, word2, human_score in benchmark_pairs:
    if word1 in model.wv and word2 in model.wv:
        model_scores.append(model.wv.similarity(word1, word2))
        human_scores.append(human_score)
correlation, pvalue = spearmanr(human_scores, model_scores)
print(f"Spearman correlation: {correlation:.3f}")
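As a rough guide, correlations above 0.65 on WordSim-353 indicate good semantic quality. SimLex-999 is a harder benchmark because it rates strict similarity rather than relatedness, and static embeddings usually score well below that figure on it.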
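The loop above assumes a `benchmark_pairs` sequence of (word1, word2, score) tuples. One possible way to build it is sketched below, assuming a tab-separated file with three columns and an optional header row. Real benchmark files differ in column order and delimiter, so check the format before reusing this. `load_benchmark_pairs` is an illustrative helper and the sample rows are made up.

```python
import csv
import io


def load_benchmark_pairs(handle):
    """Parse rows of word1, word2, score from a tab-separated file object."""
    pairs = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        try:
            score = float(row[2])
        except ValueError:
            continue  # skip a header row or a non-numeric rating
        pairs.append((row[0], row[1], score))
    return pairs


# Made-up sample standing in for a real benchmark file
sample = io.StringIO("word1\tword2\tscore\ncat\tdog\t8.0\ncar\tbanana\t1.5\n")
benchmark_pairs = load_benchmark_pairs(sample)
print(benchmark_pairs)  # [('cat', 'dog', 8.0), ('car', 'banana', 1.5)]
```

For a file on disk, pass `open(path, encoding="utf-8", newline="")` as the handle.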
Practical Applications
Document Similarity with Averaged Vectors
import numpy as np
from numpy.linalg import norm
def document_vector(model, tokens):
    """Average word vectors for a document."""
    vectors = [model.wv[w] for w in tokens if w in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)
doc1_vec = document_vector(model, ["machine", "learning", "neural", "network"])
doc2_vec = document_vector(model, ["deep", "learning", "artificial", "intelligence"])
# Cosine similarity; undefined if either document has no in-vocabulary tokens
similarity = np.dot(doc1_vec, doc2_vec) / (norm(doc1_vec) * norm(doc2_vec))
print(f"Similarity: {similarity:.3f}")  # e.g. ~0.85, depending on the model
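When a document contains no in-vocabulary tokens, `document_vector` returns all zeros and the cosine formula divides by zero. A minimal guard is sketched below in plain Python; it also accepts NumPy arrays. Returning 0.0 for an empty document is an assumption made here, and some pipelines prefer to skip such documents instead.

```python
import math


def safe_cosine(a, b):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for documents with no known words
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


print(safe_cosine([0.0, 0.0], [1.0, 2.0]))  # 0.0
print(safe_cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0
```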
Feature Input for Classifiers
from sklearn.linear_model import LogisticRegression
import numpy as np
# Convert each document to a fixed-size vector
X = np.array([document_vector(model, doc) for doc in tokenized_docs])
y = labels
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
This approach is simple but competitive. On many classification tasks, averaged Word2Vec vectors + Logistic Regression performs within a few percent of TF-IDF approaches while using much smaller feature vectors (300 vs. 50,000+ dimensions).
Finding Domain-Specific Terms
# Train on your domain corpus, then find words similar to seed terms
domain_model = Word2Vec(domain_sentences, vector_size=200, window=5, min_count=3, sg=1)
# Expand a seed list of technical terms
seed = "kubernetes"
related = domain_model.wv.most_similar(seed, topn=20)
for word, score in related:
    print(f"{word}: {score:.3f}")
# Example output: docker: 0.82, helm: 0.78, pod: 0.75, ...
Nearest Neighbor Search
For large-scale similarity search, use approximate nearest neighbor libraries:
from annoy import AnnoyIndex
vector_size = model.vector_size
index = AnnoyIndex(vector_size, 'angular')
words = list(model.wv.key_to_index.keys())
for i, word in enumerate(words):
    index.add_item(i, model.wv[word])
index.build(10)  # 10 trees; more trees improve recall but enlarge the index
# Query: key_to_index gives the position directly, avoiding a linear scan
query_idx = model.wv.key_to_index["python"]
neighbors = index.get_nns_by_item(query_idx, 10, include_distances=True)
for idx, dist in zip(*neighbors):
    print(f"{words[idx]}: {dist:.3f}")  # angular distance: lower is closer
Annoy typically answers queries in well under a millisecond, even with millions of vectors, in exchange for approximate rather than exact results.
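Annoy's `angular` metric reports a distance, not a cosine similarity. According to the Annoy documentation, it is the Euclidean distance between the normalized vectors, sqrt(2 * (1 - cos)). The helpers below convert between the two so that Annoy results can be compared with Gensim's similarity scores; the function names are illustrative.

```python
import math


def angular_to_cosine(distance):
    """Convert an Annoy angular distance to cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0


def cosine_to_angular(cosine):
    """Convert cosine similarity to an Annoy-style angular distance."""
    # max() guards against tiny negative values caused by rounding
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))


print(angular_to_cosine(0.0))  # 1.0 (identical direction)
print(angular_to_cosine(2.0))  # -1.0 (opposite direction)
print(f"{cosine_to_angular(0.82):.3f}")
```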
Updating Models with New Data
# Incremental training (add vocabulary and continue training)
new_sentences = [
    ["new", "domain", "specific", "terms"],
    # ... more tokenized sentences
]
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)
This works for adding new vocabulary but can shift existing vectors. For major domain changes, training from scratch is safer.
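One way to see how far existing vectors shifted is to snapshot a few probe words before the update and compare them afterwards. The sketch below works on plain dictionaries of vectors, and `vector_drift` is an illustrative helper. With Gensim, a snapshot could be taken as `{w: model.wv[w].copy() for w in probe_words}`, since the vectors are modified in place during training.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def vector_drift(before, after, words):
    """Map each word present in both snapshots to 1 - cosine(before, after)."""
    return {
        w: 1.0 - cosine(before[w], after[w])
        for w in words
        if w in before and w in after
    }


# Hypothetical snapshots taken before and after incremental training
before = {"python": [1.0, 0.0], "java": [0.0, 1.0]}
after = {"python": [1.0, 0.0], "java": [1.0, 1.0]}

drift = vector_drift(before, after, ["python", "java", "rust"])
for word, value in sorted(drift.items(), key=lambda item: -item[1]):
    print(f"{word}: {value:.3f}")
```

Words with large drift are the ones whose downstream features or neighbor lists are most likely to change after the update.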
Saving and Loading
# Full model (can continue training)
model.save("word2vec.model")
loaded = Word2Vec.load("word2vec.model")
# Vectors only (smaller, read-only)
model.wv.save("word2vec.kv")
wv = KeyedVectors.load("word2vec.kv")
# Word2Vec text format (interoperable with other tools)
model.wv.save_word2vec_format("vectors.txt", binary=False)
Performance Benchmarks
Training benchmarks on a 4-core CPU:
| Corpus Size | Vocab | Training Time | Model Size |
|---|---|---|---|
| 10M words | 50k | ~2 min | 40 MB |
| 100M words | 200k | ~20 min | 160 MB |
| 1B words | 500k | ~3 hours | 400 MB |
Memory usage during training is approximately 3× the final model size, because the output-layer weights and vocabulary structures are held in memory alongside the word vectors.
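The model sizes in the table are consistent with a single float32 vector matrix at 200 dimensions, that is, vocabulary size × 200 × 4 bytes. That reading is an inference from the numbers, not something the benchmark states. A quick check, using an illustrative helper name:

```python
def vector_matrix_bytes(vocab_size, vector_size, bytes_per_value=4):
    """Size of one dense float32 matrix of word vectors, in bytes."""
    return vocab_size * vector_size * bytes_per_value


for vocab in (50_000, 200_000, 500_000):
    megabytes = vector_matrix_bytes(vocab, 200) / 1e6
    print(f"{vocab:,} words x 200 dims: {megabytes:.0f} MB")
# 50,000 words x 200 dims: 40 MB
# 200,000 words x 200 dims: 160 MB
# 500,000 words x 200 dims: 400 MB
```

The same arithmetic gives a quick estimate for other settings, for example before raising `vector_size` or lowering `min_count`.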
Word2Vec vs. Modern Alternatives
| Feature | Word2Vec | GloVe | FastText | BERT |
|---|---|---|---|---|
| Context-aware | No | No | No | Yes |
| OOV handling | No | No | Yes (subwords) | Yes (subwords) |
| Training speed | Fast | Fast | Fast | Very slow |
| Inference speed | Instant | Instant | Instant | Slow |
| Embedding size | 100-300 | 100-300 | 100-300 | 768-1024 |
| Best for | Similarity, clustering | Similar to Word2Vec | Morphologically rich languages | All NLP tasks |
Common Pitfalls
- Training on too little data. As a rule of thumb, Word2Vec needs roughly 1M words or more for reasonable quality. Below that, use pre-trained vectors.
- Ignoring preprocessing. Lowercasing, removing punctuation, and handling contractions directly affect vector quality. Variants such as “Don’t” and “dont” should be normalized to the same token.
- Thresholding raw cosine similarity with a hand-picked cutoff. Similarity scores are useful for ranking but poorly calibrated as absolute values, and their scale varies between models. Learn the threshold from labeled data instead.
- Averaging vectors for long documents. Averaging dilutes the signal. For documents over 500 words, consider TF-IDF-weighted averaging or Doc2Vec.
- Assuming Word2Vec captures all meaning. Static embeddings encode one sense per word. If polysemy matters (bank = financial vs. river), use contextual embeddings.
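The TF-IDF-weighted averaging mentioned in the list above can be sketched as follows. This version uses plain Python lists and a dictionary standing in for `model.wv`, with the smoothed IDF formula that scikit-learn uses by default. The helper names and toy data are illustrative.

```python
import math
from collections import Counter


def inverse_document_frequencies(tokenized_docs):
    """Smoothed IDF per token: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def weighted_document_vector(tokens, vectors, idf, size):
    """IDF-weighted mean of the vectors of in-vocabulary tokens.

    Repeated tokens contribute once per occurrence, which supplies the TF part.
    """
    total = [0.0] * size
    weight_sum = 0.0
    for token in tokens:
        if token in vectors and token in idf:
            weight = idf[token]
            total = [acc + weight * x for acc, x in zip(total, vectors[token])]
            weight_sum += weight
    if weight_sum == 0.0:
        return total  # no known tokens: all zeros
    return [x / weight_sum for x in total]


# Toy data; `vectors` is a plain dict standing in for model.wv
docs = [["the", "cat"], ["the", "dog"], ["the", "kubernetes"]]
vectors = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
idf = inverse_document_frequencies(docs)
doc_vec = weighted_document_vector(["the", "cat"], vectors, idf, size=2)
print([round(x, 3) for x in doc_vec])
```

Because “the” appears in every toy document, its weight is lower than that of “cat”, so the result leans toward the rarer word instead of sitting halfway between the two.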
The one thing to remember: Word2Vec maps words to dense vectors where geometry encodes meaning — train on your domain for best results, use pre-trained vectors for quick starts, and graduate to contextual models when you need words to mean different things in different sentences.