Word Embeddings and Word2Vec — Deep Dive

Train, evaluate, and deploy Word2Vec models in Python using Gensim — with practical code for custom embeddings, analogy tests, and downstream tasks.

Word2Vec is both a foundational concept in NLP and a practical tool still used in production. This guide covers training, evaluating, and applying word embeddings with Python.

Training Word2Vec with Gensim

Basic Training

from gensim.models import Word2Vec

# sentences: list of lists of tokens
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "played", "in", "the", "yard"],
    # ... thousands more sentences
]

model = Word2Vec(
    sentences,
    vector_size=200,    # embedding dimensions
    window=5,           # context window size
    min_count=5,        # ignore words appearing fewer than 5 times
    workers=4,          # parallel training threads
    sg=1,               # 1 = Skip-gram, 0 = CBOW
    epochs=10,          # training passes over the corpus
    negative=10,        # negative samples per positive example
    sample=1e-4,        # downsample frequent words
)

Streaming Large Corpora

For datasets that do not fit in memory:

class SentenceIterator:
    def __init__(self, filepath):
        self.filepath = filepath

    def __iter__(self):
        with open(self.filepath, 'r', encoding='utf-8') as f:
            for line in f:
                yield line.strip().split()

sentences = SentenceIterator("corpus.txt")
model = Word2Vec(sentences, vector_size=200, window=5, min_count=5, workers=4, sg=1)

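Gensim iterates over the corpus more than once: one pass to build the vocabulary, then one pass per training epoch (5 by default). The corpus object must therefore be restartable, which is why SentenceIterator is a class with __iter__ rather than a one-shot generator.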
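Because the corpus is read more than once, it helps to see why a plain generator is not enough. The sketch below is illustrative only: RestartableCorpus and one_shot are hypothetical names, and the two-line corpus is written to a temporary file just for the demonstration.

```python
import os
import tempfile

class RestartableCorpus:
    """Iterable that reopens the file on every pass (hypothetical helper)."""
    def __init__(self, filepath):
        self.filepath = filepath

    def __iter__(self):
        with open(self.filepath, "r", encoding="utf-8") as f:
            for line in f:
                yield line.split()

def one_shot(filepath):
    """Generator function: the returned generator can be consumed only once."""
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            yield line.split()

# Write a tiny two-line corpus to a temporary file for the demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("the cat sat\nthe dog ran\n")
    path = tmp.name

corpus = RestartableCorpus(path)
restartable_passes = [sum(1 for _ in corpus) for _ in range(3)]

gen = one_shot(path)
generator_passes = [sum(1 for _ in gen) for _ in range(3)]

os.remove(path)
print(restartable_passes)  # [2, 2, 2]
print(generator_passes)    # [2, 0, 0]
```

The generator yields nothing after its first pass, so a model fed with it would see an empty corpus during training.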

Key Hyperparameters Explained

vector_size: 100-300 is standard. Google’s pre-trained Google News vectors use 300 dimensions. For domain-specific models on smaller corpora, 100-150 is usually enough and lowers the risk of overfitting.

window: Controls what counts as “context.” Smaller windows (2-3) learn syntactic relationships. Larger windows (5-10) learn topical/semantic relationships.
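To make the effect of window concrete, here is a minimal sketch that lists the (center, context) pairs a Skip-gram model would see with a fixed symmetric window. The function name skipgram_pairs is hypothetical, and real implementations (Gensim included) typically randomize the effective window per position, so this is a simplification.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs using a fixed symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["the", "cat", "sat", "on", "the", "mat"]
narrow = skipgram_pairs(tokens, window=1)
wide = skipgram_pairs(tokens, window=3)
print(len(narrow), len(wide))  # 10 24
```

With window=1, "cat" is only paired with its immediate neighbors; with window=3 it is also paired with "on", two positions away, which is how wider windows pull in more topical context.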

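negative: Number of noise words sampled per positive example. Negative sampling replaces the full softmax with a handful of binary classification updates, which keeps training efficient. 5-20 is typical; the original paper recommends 5-20 for small corpora and notes that 2-5 can suffice on large ones. Larger values slow training.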
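Noise words are usually drawn from a smoothed unigram distribution, with counts raised to the 0.75 power as in the original paper. The sketch below shows that distribution on made-up counts; negative_sampling_probs is a hypothetical helper, not a Gensim function.

```python
import random

def negative_sampling_probs(counts, exponent=0.75):
    """Smoothed unigram distribution: p(w) is proportional to count(w) ** exponent."""
    weights = {w: c ** exponent for w, c in counts.items()}
    total = sum(weights.values())
    return {w: wt / total for w, wt in weights.items()}

counts = {"the": 10000, "cat": 100, "kubernetes": 1}  # made-up counts
probs = negative_sampling_probs(counts)
raw = {w: c / sum(counts.values()) for w, c in counts.items()}

for w in counts:
    print(f"{w}: raw={raw[w]:.5f} smoothed={probs[w]:.5f}")

# Draw 10 noise words for one positive (center, context) pair.
rng = random.Random(0)
negatives = rng.choices(list(probs), weights=list(probs.values()), k=10)
```

The exponent shifts some probability mass from very frequent words toward rare ones, so rare words are sampled as negatives more often than their raw frequency would suggest.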

sample: Downsampling threshold for frequent words. Words with frequency above this threshold get randomly dropped during training. This prevents common words like “the” from dominating the training signal. Values between 1e-3 and 1e-5 work well.
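As a rough illustration of how the threshold behaves, the sketch below uses the discard rule from the original Word2Vec paper, where a word with relative frequency f is kept with probability sqrt(t / f). Actual implementations use a slightly different variant, and the frequencies here are made up.

```python
import math

def keep_probability(freq, sample=1e-4):
    """Probability of keeping one occurrence of a word with relative frequency `freq`."""
    if freq <= sample:
        return 1.0  # words at or below the threshold are never dropped
    return math.sqrt(sample / freq)

for word, freq in [("the", 0.05), ("learning", 1e-3), ("kubernetes", 1e-5)]:
    print(f"{word}: keep {keep_probability(freq):.3f}")
```

Under this rule a word making up 5% of the corpus keeps only a small fraction of its occurrences, while words rarer than the threshold are untouched.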

min_count: Words below this frequency are dropped from the vocabulary. Rare words have poor vector quality anyway, and removing them reduces model size.
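Before committing to a min_count, it can be useful to check how much vocabulary and how much of the running text survive at a given value. A minimal sketch, assuming the corpus fits in memory as a list of token lists; vocab_size_at is a hypothetical helper.

```python
from collections import Counter

def vocab_size_at(sentences, min_count):
    """Return (vocabulary size, fraction of tokens covered) for a min_count value."""
    counts = Counter(tok for sent in sentences for tok in sent)
    kept = {w for w, c in counts.items() if c >= min_count}
    covered = sum(counts[w] for w in kept)
    return len(kept), covered / sum(counts.values())

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "played", "in", "the", "yard"],
]
for mc in (1, 2):
    size, coverage = vocab_size_at(sentences, mc)
    print(f"min_count={mc}: {size} words, {coverage:.0%} of tokens")
```

On this toy corpus only "the" survives min_count=2, which also shows why the training examples in this guide need a much larger corpus than two sentences.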

Loading Pre-trained Models

Google News Vectors

from gensim.models import KeyedVectors

# Download from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(wv.most_similar("python", topn=5))

GloVe Vectors

GloVe text files lack the “vocab_size vector_size” header line that the Word2Vec text format starts with. In Gensim 4.0 and later, pass no_header=True to load them directly, with no conversion step:

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("glove.6B.200d.txt", binary=False, no_header=True)

FastText Vectors

from gensim.models.fasttext import load_facebook_vectors

ft = load_facebook_vectors("cc.en.300.bin")
# FastText can generate vectors for OOV words
print(ft["misspeling"])  # works despite the typo
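The reason FastText can handle the typo is that it represents a word through its character n-grams, so an unseen word still shares most of its pieces with known words. The sketch below only extracts the n-grams; char_ngrams is a hypothetical helper, the 3-6 length range mirrors FastText's usual defaults, and the real library additionally hashes n-grams into buckets and combines their vectors.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in boundary markers < and >."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

typo = set(char_ngrams("misspeling"))
correct = set(char_ngrams("misspelling"))
shared = typo & correct
print(f"{len(shared)} of {len(typo)} n-grams of the typo also occur in the correct spelling")
```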

Evaluating Embeddings

Analogy Tests

# "king" - "man" + "woman" = ?
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
# [('queen', 0.85), ('princess', 0.73), ('monarch', 0.71)]

Gensim includes standard analogy test sets:

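from gensim.test.utils import datapath

wv = model.wv
# questions-words.txt ships with Gensim's test data
score, sections = wv.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"Overall accuracy: {score:.3f}")
# sections holds one dict per category (capital-country, family, verb-tense, etc.)
# with the lists of correctly and incorrectly answered questions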

The Google analogy test set contains 19,544 questions across categories. A well-trained 300-dimensional model on a large corpus typically achieves 60-70% accuracy.
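To show what an analogy query computes, here is a small sketch of the usual vector-arithmetic approach on hand-made 3-dimensional vectors: add the unit vectors of the positive words, subtract those of the negative words, and pick the closest remaining word by cosine similarity. The toy vectors and the analogy helper are invented for illustration; a real model learns its vectors from data.

```python
import numpy as np

# Toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender", axis 2 ~ unrelated
vectors = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 1.0]),
}

def unit(v):
    return v / np.linalg.norm(v)

def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    target = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    target = unit(target)
    candidates = {
        w: float(np.dot(unit(v), target))
        for w, v in vectors.items()
        if w not in positive and w not in negative
    }
    return max(candidates, key=candidates.get)

answer = analogy(vectors, positive=["king", "woman"], negative=["man"])
print(answer)  # queen
```

Excluding the query words from the candidates matters in practice, because one of the inputs is often the nearest vector to the target.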

Similarity Benchmarks

from scipy.stats import spearmanr

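# WordSim-353 or SimLex-999 datasets
# Each contains word pairs with human similarity ratings
# benchmark_pairs: iterable of (word1, word2, human_score) tuples
# loaded from one of these datasets (loading not shown here)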
human_scores = []
model_scores = []

for word1, word2, human_score in benchmark_pairs:
    if word1 in model.wv and word2 in model.wv:
        model_scores.append(model.wv.similarity(word1, word2))
        human_scores.append(human_score)

correlation, pvalue = spearmanr(human_scores, model_scores)
print(f"Spearman correlation: {correlation:.3f}")

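Scores depend heavily on the benchmark. Word2Vec models trained on large corpora commonly reach Spearman correlations of roughly 0.6-0.7 on WordSim-353, while SimLex-999, which rates strict similarity rather than relatedness, is harder and usually yields considerably lower values. Compare against published baselines for the same benchmark rather than a single fixed cutoff.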
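The evaluation loop skips pairs with out-of-vocabulary words, so it is worth reporting coverage next to the correlation. The sketch below loads pairs and computes coverage. It assumes a hypothetical tab-separated layout of word1, word2, score with no header row (real benchmark files differ and may need their own parsing), and the rows and the vocab set are made up.

```python
import csv
import io

def load_pairs(handle):
    """Parse rows of word1<TAB>word2<TAB>score into (str, str, float) tuples."""
    reader = csv.reader(handle, delimiter="\t")
    return [(w1, w2, float(score)) for w1, w2, score in reader]

# Made-up rows standing in for a benchmark file
sample_file = io.StringIO("car\tautomobile\t9.0\ncup\tmug\t8.0\nking\tcabbage\t1.0\n")
benchmark_pairs = load_pairs(sample_file)

vocab = {"car", "automobile", "cup", "mug", "king"}  # stand-in for the model vocabulary
covered = [(a, b, s) for a, b, s in benchmark_pairs if a in vocab and b in vocab]
coverage = len(covered) / len(benchmark_pairs)
print(f"Evaluated {len(covered)} of {len(benchmark_pairs)} pairs ({coverage:.0%})")
```

A high correlation computed on a small fraction of the pairs says little, so low coverage is a reason to treat the score with caution.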

Practical Applications

Document Similarity with Averaged Vectors

import numpy as np
from numpy.linalg import norm

def document_vector(model, tokens):
    """Average word vectors for a document."""
    vectors = [model.wv[w] for w in tokens if w in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = norm(a) * norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

doc1_vec = document_vector(model, ["machine", "learning", "neural", "network"])
doc2_vec = document_vector(model, ["deep", "learning", "artificial", "intelligence"])

similarity = cosine_similarity(doc1_vec, doc2_vec)
print(f"Similarity: {similarity:.3f}")  # e.g. ~0.85; the exact value depends on the model
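Plain averaging gives every token the same weight, so frequent function words count as much as content words. One common refinement is to weight each word by its inverse document frequency. The following is a minimal sketch with a smoothed IDF; idf_weights and weighted_document_vector are hypothetical helpers, and the vectors dict stands in for model.wv.

```python
import math
from collections import Counter

import numpy as np

def idf_weights(tokenized_docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}

def weighted_document_vector(vectors, idf, tokens, dim):
    """IDF-weighted average of the word vectors of the known tokens."""
    known = [t for t in tokens if t in vectors and t in idf]
    if not known:
        return np.zeros(dim)
    weights = np.array([idf[t] for t in known])
    stacked = np.stack([vectors[t] for t in known])
    return weights @ stacked / weights.sum()

docs = [["the", "cat"], ["the", "dog"], ["the", "neural", "network"]]
idf = idf_weights(docs)
vectors = {"the": np.array([1.0, 0.0]), "cat": np.array([0.0, 1.0])}  # toy vectors
vec = weighted_document_vector(vectors, idf, ["the", "cat"], dim=2)
print(vec)
```

In the toy example "the" appears in every document and "cat" in only one, so the result leans toward the "cat" direction instead of sitting halfway between the two.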

Feature Input for Classifiers

from sklearn.linear_model import LogisticRegression
import numpy as np

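# tokenized_docs: list of token lists; labels: one class label per document
# (both are assumed to exist already)
# Convert each document to a fixed-size vector
X = np.array([document_vector(model, doc) for doc in tokenized_docs])
y = labels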

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

This approach is simple but often competitive. On many classification tasks, averaged Word2Vec vectors with Logistic Regression land within a few percentage points of TF-IDF baselines while using far smaller feature vectors (a few hundred dimensions instead of tens of thousands).

Finding Domain-Specific Terms

# Train on your domain corpus, then find words similar to seed terms
domain_model = Word2Vec(domain_sentences, vector_size=200, window=5, min_count=3, sg=1)

# Expand a seed list of technical terms
seed = "kubernetes"
related = domain_model.wv.most_similar(seed, topn=20)
for word, score in related:
    print(f"{word}: {score:.3f}")
# docker: 0.82, helm: 0.78, pod: 0.75, ...

Nearest Neighbor Search

For large-scale similarity search, use approximate nearest neighbor libraries:

from annoy import AnnoyIndex

vector_size = model.vector_size
index = AnnoyIndex(vector_size, 'angular')

words = model.wv.index_to_key  # position i holds the word with index i
for i, word in enumerate(words):
    index.add_item(i, model.wv[word])

index.build(10)  # 10 trees; more trees improve recall but enlarge the index

# Query (the nearest hit is usually the query word itself, at distance 0)
query_idx = model.wv.key_to_index["python"]
neighbors = index.get_nns_by_item(query_idx, 10, include_distances=True)
for idx, dist in zip(*neighbors):
    # Angular distance is sqrt(2 * (1 - cosine)); smaller means more similar
    print(f"{words[idx]}: {dist:.3f}")

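Annoy trades a small amount of recall for speed: queries typically stay very fast (often around a millisecond or less) even with millions of vectors, depending on the number of trees and the search_k setting.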
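Since approximate search can miss true neighbors, a brute-force baseline on a sample of queries is a simple way to measure recall. The sketch below runs on random vectors; exact_neighbors, recall_at_k, and angular_to_cosine are hypothetical helpers, and the last one relies on Annoy's angular distance being sqrt(2 * (1 - cosine)).

```python
import numpy as np

def exact_neighbors(matrix, query_idx, k):
    """Indices of the k rows most cosine-similar to row query_idx (excluding itself)."""
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf  # never return the query itself
    return np.argsort(-sims)[:k].tolist()

def recall_at_k(approx, exact):
    """Fraction of the exact neighbors that the approximate search also returned."""
    return len(set(approx) & set(exact)) / len(exact)

def angular_to_cosine(distance):
    """Convert an angular distance d = sqrt(2 * (1 - cos)) back to cosine similarity."""
    return 1.0 - distance ** 2 / 2.0

rng = np.random.default_rng(0)
matrix = rng.normal(size=(1000, 50)).astype(np.float32)  # random stand-in vectors
exact = exact_neighbors(matrix, query_idx=0, k=10)
print(exact)
```

With a real index, the approx argument would be the ids returned by the approximate search for the same query, with the query word itself removed.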

Updating Models with New Data

# Incremental training (add vocabulary and continue training)
new_sentences = [
    ["new", "domain", "specific", "terms"],
    # ... more tokenized sentences
]

model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)

This works for adding new vocabulary, but new words are only added if they meet min_count in the new data, and continued training can shift existing vectors. For major domain changes, training from scratch is safer.
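One way to check how far existing vectors moved is to snapshot a few probe words before the update and compare them afterwards. A minimal sketch with toy snapshots; drift_report is a hypothetical helper, and the vectors are invented.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def drift_report(before, after):
    """Cosine between old and new vectors for every word present in both snapshots."""
    return {w: cosine(before[w], after[w]) for w in before if w in after}

# Stand-in snapshots; with a real model, `before` could be built as
# {w: model.wv[w].copy() for w in probe_words} prior to calling train().
before = {"python": np.array([1.0, 0.0]), "java": np.array([0.0, 1.0])}
after = {"python": np.array([1.0, 0.0]), "java": np.array([1.0, 1.0])}

report = drift_report(before, after)
moved = sorted(report, key=report.get)  # most-shifted words first
for word in moved:
    print(f"{word}: cosine to old vector = {report[word]:.3f}")
```

Words whose cosine to their old vector drops sharply are the ones most affected by the update and are worth inspecting before deploying the new model.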

Saving and Loading

from gensim.models import KeyedVectors, Word2Vec

# Full model (can continue training)
model.save("word2vec.model")
loaded = Word2Vec.load("word2vec.model")

# Vectors only (smaller, read-only)
model.wv.save("word2vec.kv")
wv = KeyedVectors.load("word2vec.kv")

# Word2Vec text format (interoperable with other tools)
model.wv.save_word2vec_format("vectors.txt", binary=False)
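The text format is simple enough to read without Gensim, which is what makes it interoperable: a header line with the vocabulary size and dimensionality, then one line per word with its space-separated values. A minimal reader sketch follows; read_word2vec_text is a hypothetical helper, it assumes tokens contain no spaces, and the sample content is made up.

```python
import io

import numpy as np

def read_word2vec_text(handle):
    """Read the Word2Vec text format into a dict of word -> vector."""
    vocab_size, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]], dtype=np.float32)
    # Sanity-check the file against its own header
    assert len(vectors) == vocab_size
    assert all(v.shape == (dim,) for v in vectors.values())
    return vectors

sample = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
loaded_vectors = read_word2vec_text(sample)
print(sorted(loaded_vectors))  # ['cat', 'dog']
```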

Performance Benchmarks

Training benchmarks on a 4-core CPU:

Corpus Size	Vocab	Training Time	Model Size
10M words	50k	~2 min	40 MB
100M words	200k	~20 min	160 MB
1B words	500k	~3 hours	400 MB

Memory usage during training is approximately 3× the final model size, because training keeps the output-layer weights and vocabulary structures in memory alongside the word vectors.
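The model sizes in the table are consistent with a single float32 matrix of vocabulary size × vector size, assuming vector_size=200 as in this guide's examples (the table itself does not state the dimensionality). A back-of-the-envelope sketch, with matrix_megabytes as a hypothetical helper:

```python
def matrix_megabytes(vocab_size, vector_size, bytes_per_float=4):
    """Size of one vocab_size x vector_size float32 matrix in megabytes."""
    return vocab_size * vector_size * bytes_per_float / 1e6

for vocab in (50_000, 200_000, 500_000):
    one = matrix_megabytes(vocab, 200)
    print(f"{vocab:>7} words: {one:.0f} MB per matrix, ~{3 * one:.0f} MB at 3x during training")
```

The same arithmetic gives a quick estimate for other settings: doubling vector_size or the vocabulary doubles the footprint.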

Word2Vec vs. Modern Alternatives

Feature	Word2Vec	GloVe	FastText	BERT
Context-aware	No	No	No	Yes
OOV handling	No	No	Yes (subwords)	Yes (subwords)
Training speed	Fast	Fast	Fast	Very slow
Inference speed	Instant	Instant	Instant	Slow
Embedding size	100-300	100-300	100-300	768-1024
Best for	Similarity, clustering	Similar to Word2Vec	Morphologically rich languages	Context-dependent tasks (e.g. disambiguation, NER, QA)

Common Pitfalls

Training on too little data. Word2Vec needs at least 1M words for reasonable quality. Below that, use pre-trained vectors.
Ignoring preprocessing. Lowercasing, removing punctuation, and handling contractions directly affect vector quality. Variants such as “Don’t” and “dont” should map to the same token.
Thresholding raw cosine similarity. Similarity scores are useful for ranking, but they are poorly calibrated, so a hand-picked cutoff rarely transfers between models or datasets. Learn the threshold from labeled data.
Averaging vectors for long documents. Averaging dilutes the signal. For documents over 500 words, consider TF-IDF-weighted averaging or Doc2Vec.
Assuming Word2Vec captures all meaning. Static embeddings encode one sense per word. If polysemy matters (bank = financial vs. river), use contextual embeddings.
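For the thresholding pitfall, learning a cutoff from labeled data can be as simple as scanning the observed scores and keeping the one with the best accuracy. The sketch below uses made-up scores and labels; best_threshold is a hypothetical helper, and in practice the chosen cutoff should be validated on held-out pairs.

```python
def best_threshold(scores, labels):
    """Pick the cutoff that maximizes accuracy on labeled (score, is_match) pairs.

    Ties are resolved in favor of the lowest threshold.
    """
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]
        acc = sum(p == bool(y) for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Made-up cosine similarities and match labels (1 = same meaning, 0 = different)
scores = [0.91, 0.84, 0.77, 0.62, 0.58, 0.40]
labels = [1, 1, 1, 0, 1, 0]

threshold, accuracy = best_threshold(scores, labels)
print(f"threshold={threshold}, accuracy={accuracy:.2f}")
```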

The one thing to remember: Word2Vec maps words to dense vectors where geometry encodes meaning — train on your domain for best results, use pre-trained vectors for quick starts, and graduate to contextual models when you need words to mean different things in different sentences.
