Word Embeddings and Word2Vec — Deep Dive

Word2Vec is both a foundational concept in NLP and a practical tool still used in production. This guide covers training, evaluating, and applying word embeddings with Python.

Training Word2Vec with Gensim

Basic Training

from gensim.models import Word2Vec

# sentences: list of lists of tokens
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "played", "in", "the", "yard"],
    # ... thousands more sentences
]

model = Word2Vec(
    sentences,
    vector_size=200,    # embedding dimensions
    window=5,           # context window size
    min_count=5,        # ignore words appearing fewer than 5 times
    workers=4,          # parallel training threads
    sg=1,               # 1 = Skip-gram, 0 = CBOW
    epochs=10,          # training passes over the corpus
    negative=10,        # negative samples per positive example
    sample=1e-4,        # downsample frequent words
)

Streaming Large Corpora

For datasets that do not fit in memory:

class SentenceIterator:
    def __init__(self, filepath):
        self.filepath = filepath

    def __iter__(self):
        with open(self.filepath, 'r', encoding='utf-8') as f:
            for line in f:
                yield line.strip().split()

sentences = SentenceIterator("corpus.txt")
model = Word2Vec(sentences, vector_size=200, window=5, min_count=5, workers=4, sg=1)

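Gensim iterates over the corpus several times: once to build the vocabulary, then once per training epoch (5 by default). The corpus must therefore be a restartable iterable like the class above, not a one-shot generator.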
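The difference between a restartable iterable and a generator is easy to miss, so here is a minimal sketch in plain Python. `RestartableCorpus` and `token_generator` are hypothetical stand-ins that read from an in-memory list rather than a file.

```python
def token_generator(lines):
    """A plain generator: it can be consumed only once."""
    for line in lines:
        yield line.split()


class RestartableCorpus:
    """Hypothetical stand-in for SentenceIterator, backed by a list of lines."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        # Each call starts a fresh pass over the data.
        for line in self.lines:
            yield line.split()


lines = ["the cat sat", "the dog played"]

gen = token_generator(lines)
gen_first = list(gen)
gen_second = list(gen)        # empty: the generator is already exhausted

corpus = RestartableCorpus(lines)
corpus_first = list(corpus)
corpus_second = list(corpus)  # full again: __iter__ restarts the pass
```

Passing a bare generator to a trainer that needs multiple passes would leave every pass after the first with no data, which is why the class-based form is used here.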

Key Hyperparameters Explained

vector_size: 100-300 is standard. Google’s pre-trained Google News vectors use 300. For domain-specific models on smaller corpora, 100-150 is usually enough and lowers the risk of overfitting.

window: Controls what counts as “context.” Smaller windows (2-3) learn syntactic relationships. Larger windows (5-10) learn topical/semantic relationships.
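To make the effect of the window concrete, the sketch below lists the (center, context) pairs a given window produces. It assumes a fixed symmetric window; real implementations may randomly shrink the window at each position, so treat `context_pairs` as an illustrative helper rather than Gensim's exact behavior.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs using a fixed symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["the", "cat", "sat", "on", "the", "mat"]
narrow = context_pairs(tokens, window=1)  # only immediate neighbors
wide = context_pairs(tokens, window=3)    # also pairs "sat" with "mat"

print(len(narrow), len(wide))
```

With the narrow window, "sat" is paired only with "cat" and "on"; with the wide window it is also paired with "mat", which is the kind of looser, topical association larger windows pick up.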

negative: Number of negative samples drawn per positive example. Negative sampling approximates the full softmax for training efficiency. 5-20 is typical. Larger values help most on small corpora and slow training; on very large corpora, 2-5 is often enough.
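Negative samples are not drawn uniformly. The original Word2Vec paper draws them from the unigram distribution raised to the power 0.75, which Gensim exposes as the `ns_exponent` parameter. The sketch below computes that distribution for a toy vocabulary; the counts are invented for illustration.

```python
from collections import Counter


def noise_distribution(counts, exponent=0.75):
    """Unigram counts raised to a power, then normalized to sum to 1."""
    weights = {word: count ** exponent for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Toy counts, invented for illustration
counts = Counter({"the": 1000, "cat": 10, "kubernetes": 1})
raw_total = sum(counts.values())
raw = {word: count / raw_total for word, count in counts.items()}
noise = noise_distribution(counts)

for word in counts:
    print(f"{word}: raw={raw[word]:.4f} noise={noise[word]:.4f}")
```

The exponent flattens the distribution: rare words are sampled as negatives more often than their raw frequency suggests, and very frequent words slightly less often.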

sample: Downsampling threshold for frequent words. Words whose relative frequency in the corpus exceeds this threshold are randomly dropped during training, and the more frequent the word, the more often it is dropped. This prevents common words like “the” from dominating the training signal. Values between 1e-3 and 1e-5 work well.
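As a rough illustration of how the threshold acts, the sketch below uses the formula from the original Word2Vec paper, P(discard) = 1 - sqrt(t / f), where t is the threshold and f is the word's relative frequency. Actual implementations use a slightly different expression, so the numbers are indicative only, and the frequencies are invented.

```python
import math


def keep_probability(word_freq, threshold=1e-4):
    """Chance of keeping one occurrence of a word (paper formulation)."""
    if word_freq <= threshold:
        return 1.0  # words at or below the threshold are never dropped
    return math.sqrt(threshold / word_freq)


p_the = keep_probability(0.05)      # a word making up 5% of all tokens
p_rare = keep_probability(0.00005)  # a word below the threshold

print(f"frequent word kept with p={p_the:.3f}, rare word with p={p_rare:.1f}")
```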

min_count: Words below this frequency are dropped from the vocabulary. Rare words have poor vector quality anyway, and removing them reduces model size.
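The filtering step itself is simple to sketch. `filter_vocab` below is a hypothetical helper, not a Gensim function; it counts tokens and keeps those that meet the minimum count.

```python
from collections import Counter


def filter_vocab(sentences, min_count):
    """Count tokens and keep only those seen at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: count for word, count in counts.items() if count >= min_count}


sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "played", "in", "the", "yard"],
]
vocab = filter_vocab(sentences, min_count=2)
print(vocab)
```

On a corpus this small almost everything is filtered out, which is one reason tiny corpora produce unusable models with the default settings.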

Loading Pre-trained Models

Google News Vectors

from gensim.models import KeyedVectors

# Download from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(wv.most_similar("python", topn=5))

GloVe Vectors

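GloVe text files lack the header line (vocabulary size and dimensions) that the Word2Vec text format expects, so they need conversion:

from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# Write a copy of the GloVe file with the Word2Vec header line prepended
glove2word2vec("glove.6B.200d.txt", "glove.6B.200d.w2v.txt")
wv = KeyedVectors.load_word2vec_format("glove.6B.200d.w2v.txt", binary=False)

# Gensim 4.0+ can also skip the conversion and read the raw file directly:
# wv = KeyedVectors.load_word2vec_format("glove.6B.200d.txt", binary=False, no_header=True)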

FastText Vectors

from gensim.models.fasttext import load_facebook_vectors

ft = load_facebook_vectors("cc.en.300.bin")
# FastText can generate vectors for OOV words
print(ft["misspeling"])  # works despite the typo

Evaluating Embeddings

Analogy Tests

# "king" - "man" + "woman" = ?
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
# Example output (exact words and scores depend on the model):
# [('queen', 0.85), ('princess', 0.73), ('monarch', 0.71)]
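The arithmetic behind this query can be sketched with hand-made vectors. The two-dimensional vectors below are invented so that one axis loosely means "royalty" and the other "gender"; the helper adds and subtracts raw vectors, whereas Gensim normalizes them first, so this is a simplified illustration rather than the library's exact procedure.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Exclude the query words themselves from the candidates
    exclude = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors, invented for illustration: (royalty, gender)
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "prince": [0.9, 0.9],
    "apple": [-1.0, 0.2],
}
answer = analogy(vectors, positive=["king", "woman"], negative=["man"])
print(answer)
```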

Gensim includes standard analogy test sets:

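from gensim.test.utils import datapath

# questions-words.txt ships with Gensim's test data
score, sections = model.wv.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"Overall accuracy: {score:.3f}")
# sections holds per-category results (capital-country, family, verb-tense, etc.)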

The Google analogy test set contains 19,544 questions across categories. A well-trained 300-dimensional model on a large corpus typically achieves 60-70% accuracy.

Similarity Benchmarks

from scipy.stats import spearmanr

# WordSim-353 or SimLex-999 datasets
# Each contains word pairs with human similarity ratings
# benchmark_pairs: iterable of (word1, word2, human_score) tuples loaded from the dataset
human_scores = []
model_scores = []

for word1, word2, human_score in benchmark_pairs:
    if word1 in model.wv and word2 in model.wv:
        model_scores.append(model.wv.similarity(word1, word2))
        human_scores.append(human_score)

correlation, pvalue = spearmanr(human_scores, model_scores)
print(f"Spearman correlation: {correlation:.3f}")

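Correlations above roughly 0.65 on WordSim-353 indicate good semantic quality. SimLex-999 is a harder benchmark because it measures strict similarity rather than relatedness, and static embeddings typically score well below that figure on it.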
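Spearman correlation compares rankings rather than raw values, which is why it suits this comparison: the model's cosine scores and the human ratings live on different scales. A minimal sketch of the computation follows, assuming no tied values (scipy's `spearmanr` handles ties; this helper does not). The scores are invented.

```python
def ranks(values):
    """Rank values from 1 to n (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_no_ties(xs, ys):
    """Spearman's rho via the squared rank-difference formula."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Toy scores, invented for illustration
human = [1.0, 2.0, 3.0, 4.0, 5.0]
model_sims = [0.1, 0.3, 0.2, 0.8, 0.9]
rho = spearman_no_ties(human, model_sims)
print(f"rho = {rho:.2f}")
```

Here the model swaps the order of two pairs and otherwise agrees with the human ranking, so the correlation is high but not perfect.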

Practical Applications

Document Similarity with Averaged Vectors

import numpy as np

def document_vector(model, tokens):
    """Average word vectors for a document."""
    vectors = [model.wv[w] for w in tokens if w in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

doc1_vec = document_vector(model, ["machine", "learning", "neural", "network"])
doc2_vec = document_vector(model, ["deep", "learning", "artificial", "intelligence"])

from numpy.linalg import norm
similarity = np.dot(doc1_vec, doc2_vec) / (norm(doc1_vec) * norm(doc2_vec))
print(f"Similarity: {similarity:.3f}")  # e.g. ~0.85; the exact value depends on the model

Feature Input for Classifiers

from sklearn.linear_model import LogisticRegression
import numpy as np

# Convert each document to a fixed-size vector
X = np.array([document_vector(model, doc) for doc in tokenized_docs])
y = labels

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

This approach is simple but often competitive. On many classification tasks, averaged Word2Vec vectors + Logistic Regression performs within a few percent of TF-IDF approaches while using much smaller feature vectors (a few hundred vs. 50,000+ dimensions).

Finding Domain-Specific Terms

# Train on your domain corpus, then find words similar to seed terms
domain_model = Word2Vec(domain_sentences, vector_size=200, window=5, min_count=3, sg=1)

# Expand a seed list of technical terms
seed = "kubernetes"
related = domain_model.wv.most_similar(seed, topn=20)
for word, score in related:
    print(f"{word}: {score:.3f}")
# Example output (depends on the corpus): docker: 0.82, helm: 0.78, pod: 0.75, ...

For large-scale similarity search, use approximate nearest neighbor libraries:

from annoy import AnnoyIndex

vector_size = model.vector_size
index = AnnoyIndex(vector_size, 'angular')

# index_to_key is ordered so that position i holds the word with index i
words = model.wv.index_to_key
for i, word in enumerate(words):
    index.add_item(i, model.wv[word])

index.build(10)  # 10 trees; more trees improve accuracy at the cost of index size

# Query: the first hit is the query word itself
query_idx = model.wv.key_to_index["python"]
neighbors = index.get_nns_by_item(query_idx, 10, include_distances=True)
for idx, dist in zip(*neighbors):
    print(f"{words[idx]}: {dist:.3f}")  # angular distance: lower means more similar

Annoy can answer queries in under a millisecond even with millions of vectors, though actual latency depends on the number of trees and the hardware.
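The distances Annoy reports for the 'angular' metric are not cosine similarities. Annoy defines angular distance as sqrt(2 * (1 - cos)), so a small helper can convert back when a similarity score is needed. `angular_to_cosine` is a hypothetical name used for this sketch.

```python
import math


def angular_to_cosine(distance):
    """Invert d = sqrt(2 * (1 - cos)) to recover cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0


same = angular_to_cosine(0.0)                  # identical direction
orthogonal = angular_to_cosine(math.sqrt(2.0)) # unrelated vectors
opposite = angular_to_cosine(2.0)              # opposite direction

print(same, round(orthogonal, 6), opposite)
```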

Updating Models with New Data

# Incremental training (add vocabulary and continue training)
new_sentences = [
    ["new", "domain", "specific", "terms"],
    # ... more tokenized sentences
]

model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)

This works for adding new vocabulary but can shift existing vectors. New words are only added if they reach min_count in the new data. For major domain changes, training from scratch is safer.
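One way to see how far existing vectors have shifted is to snapshot a few of them before the update (for example with `model.wv[word].copy()`) and compare them afterwards. The sketch below shows the comparison on plain dictionaries of toy vectors; `drift_report` is a hypothetical helper and the numbers are invented.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def drift_report(before, after, words):
    """Similarity between each word's old and new vector (1.0 = unchanged)."""
    return {
        word: cosine(before[word], after[word])
        for word in words
        if word in before and word in after
    }


# Toy snapshots, invented for illustration
before = {"server": [1.0, 0.0], "python": [0.0, 1.0]}
after = {"server": [1.0, 0.0], "python": [1.0, 1.0], "pod": [0.5, 0.5]}

report = drift_report(before, after, ["server", "python", "pod"])
for word, similarity in report.items():
    print(f"{word}: {similarity:.3f}")
```

Words that exist only after the update, like "pod" here, have no earlier vector to compare against and are skipped.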

Saving and Loading

from gensim.models import KeyedVectors, Word2Vec

# Full model (can continue training)
model.save("word2vec.model")
loaded = Word2Vec.load("word2vec.model")

# Vectors only (smaller, read-only)
model.wv.save("word2vec.kv")
wv = KeyedVectors.load("word2vec.kv")

# Word2Vec text format (interoperable with other tools)
model.wv.save_word2vec_format("vectors.txt", binary=False)

Performance Benchmarks

Training benchmarks on a 4-core CPU:

Corpus Size | Vocab | Training Time | Model Size
10M words   | 50k   | ~2 min        | 40 MB
100M words  | 200k  | ~20 min       | 160 MB
1B words    | 500k  | ~3 hours      | 400 MB

Memory usage during training is approximately 3× the final model size due to the neural network weights.
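The model sizes in the table are consistent with a simple estimate: vocabulary size times vector dimensions times 4 bytes per float32 value. The sketch below reproduces them, assuming 200-dimensional vectors as in the training examples above, and applies the 3× rule of thumb for training memory.

```python
def vector_bytes(vocab_size, vector_size, bytes_per_float=4):
    """Size of the word-vector matrix alone, in bytes (float32 assumed)."""
    return vocab_size * vector_size * bytes_per_float


estimates = {}
for vocab_size in (50_000, 200_000, 500_000):
    model_mb = vector_bytes(vocab_size, vector_size=200) / 1_000_000
    estimates[vocab_size] = model_mb
    print(f"{vocab_size:>7} words: ~{model_mb:.0f} MB vectors, "
          f"~{3 * model_mb:.0f} MB while training")
```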

Word2Vec vs. Modern Alternatives

Feature         | Word2Vec               | GloVe               | FastText                       | BERT
Context-aware   | No                     | No                  | No                             | Yes
OOV handling    | No                     | No                  | Yes (subwords)                 | Yes (subwords)
Training speed  | Fast                   | Fast                | Fast                           | Very slow
Inference speed | Instant                | Instant             | Instant                        | Slow
Embedding size  | 100-300                | 100-300             | 100-300                        | 768-1024
Best for        | Similarity, clustering | Similar to Word2Vec | Morphologically rich languages | All NLP tasks

Common Pitfalls

  1. Training on too little data. As a rule of thumb, Word2Vec needs at least 1M words for reasonable quality. Below that, use pre-trained vectors.
  2. Ignoring preprocessing. Lowercasing, removing punctuation, and handling contractions directly affect vector quality. “Don’t” and “dont” should be normalized.
  3. Thresholding raw cosine similarity. Similarity scores are useful for ranking but poorly calibrated as absolute values, so a hand-picked cutoff rarely transfers between models. Learn the threshold from labeled data.
  4. Averaging vectors for long documents. Averaging dilutes the signal. For documents over 500 words, consider TF-IDF-weighted averaging or Doc2Vec.
  5. Assuming Word2Vec captures all meaning. Static embeddings encode one sense per word. If polysemy matters (bank = financial vs. river), use contextual embeddings.
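The TF-IDF-weighted averaging mentioned in pitfall 4 can be sketched in a few lines. This version weights each word by its inverse document frequency only, uses plain Python lists instead of NumPy arrays, and runs on toy vectors invented for illustration; `idf_weights` and `weighted_document_vector` are hypothetical helpers.

```python
import math


def idf_weights(tokenized_docs):
    """Inverse document frequency: log(N / number of docs containing the token)."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for doc in tokenized_docs:
        for token in set(doc):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {token: math.log(n_docs / count) for token, count in doc_freq.items()}


def weighted_document_vector(vectors, tokens, idf):
    """IDF-weighted average of the word vectors in a document."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token in vectors and token in idf:
            weight = idf[token]
            total = [acc + weight * x for acc, x in zip(total, vectors[token])]
            weight_sum += weight
    if weight_sum == 0.0:
        return total  # no informative words: fall back to the zero vector
    return [x / weight_sum for x in total]


# Toy data, invented for illustration
docs = [["the", "cat"], ["the", "dog"]]
vectors = {"the": [1.0, 1.0], "cat": [1.0, 0.0], "dog": [0.0, 1.0]}

idf = idf_weights(docs)
cat_doc = weighted_document_vector(vectors, docs[0], idf)
print(cat_doc)
```

Because "the" appears in every document its weight is zero, so the document vector is driven entirely by "cat". A plain average would instead be pulled halfway toward the vector for "the".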

The one thing to remember: Word2Vec maps words to dense vectors where geometry encodes meaning — train on your domain for best results, use pre-trained vectors for quick starts, and graduate to contextual models when you need words to mean different things in different sentences.
