Text Summarization in Python — Deep Dive

Build extractive and abstractive summarization pipelines in Python — from TextRank implementations through fine-tuned BART models with ROUGE evaluation.

Summarization sits at the intersection of information retrieval and text generation. This guide covers practical implementations at both ends of the complexity spectrum.

Extractive Summarization

TextRank from Scratch

TextRank adapts Google’s PageRank algorithm to text. Each sentence is a node; edge weights represent sentence similarity.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
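import nltk  # sent_tokenize needs the Punkt data: nltk.download('punkt'), or 'punkt_tab' on recent NLTK releases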

def textrank_summarize(text, num_sentences=3):
    sentences = nltk.sent_tokenize(text)
    if len(sentences) <= num_sentences:
        return text

    # Build similarity matrix
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(sentences)
    similarity_matrix = cosine_similarity(tfidf_matrix)

    # Apply PageRank
    scores = _pagerank(similarity_matrix)

    # Select top sentences in original order
    ranked_indices = np.argsort(scores)[::-1][:num_sentences]
    selected = sorted(ranked_indices)
    return " ".join(sentences[i] for i in selected)

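def _pagerank(matrix, damping=0.85, max_iter=100, tol=1e-6):
    n = matrix.shape[0]
    matrix = matrix.copy()
    np.fill_diagonal(matrix, 0.0)  # ignore each sentence's similarity to itself
    row_sums = matrix.sum(axis=1)
    row_sums[row_sums == 0] = 1.0  # isolated sentences: avoid division by zero (NaN scores)
    scores = np.ones(n) / n
    for _ in range(max_iter):
        new_scores = (1 - damping) / n + damping * matrix.T @ (scores / row_sums)
        converged = np.abs(new_scores - scores).sum() < tol
        scores = new_scores
        if converged:
            break
    return scores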
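
To see how the ranking emerges, the same power iteration can be run by hand on a tiny graph. The sketch below uses plain Python lists and made-up similarity values (the function name `pagerank_toy` and the numbers are illustrative assumptions, not part of the implementation above).

```python
def pagerank_toy(weights, damping=0.85, iterations=50):
    """Power iteration over a small weighted graph given as a list of rows."""
    n = len(weights)
    scores = [1.0 / n] * n
    # Total outgoing weight per node; used to normalise each row.
    out_weight = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Each node i passes on its score in proportion to edge weight i -> j.
            incoming = sum(
                scores[i] * weights[i][j] / out_weight[i]
                for i in range(n)
                if out_weight[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Hypothetical similarities between three sentences (diagonal set to 0).
# Sentence 1 overlaps with both neighbours; sentences 0 and 2 share little.
similarity = [
    [0.0, 0.6, 0.1],
    [0.6, 0.0, 0.5],
    [0.1, 0.5, 0.0],
]
scores = pagerank_toy(similarity)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Sentence 1 comes out on top because it receives weight from both neighbours, which is the intuition behind picking "central" sentences for the summary.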

Using the sumy Library

sumy provides multiple extractive algorithms out of the box:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer

parser = PlaintextParser.from_string(text, Tokenizer("english"))

# TextRank
summarizer = TextRankSummarizer()
summary = summarizer(parser.document, sentences_count=3)
print(" ".join(str(s) for s in summary))

# LSA (Latent Semantic Analysis) - often better for longer documents
summarizer = LsaSummarizer()
summary = summarizer(parser.document, sentences_count=3)

# LexRank - similar to TextRank but uses IDF-modified cosine similarity
summarizer = LexRankSummarizer()
summary = summarizer(parser.document, sentences_count=3)
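
The "IDF-modified cosine" mentioned for LexRank weights each shared word by its inverse document frequency, so overlap on rare words counts for more than overlap on common ones. A minimal sketch of that idea, assuming pre-tokenised sentences and a hypothetical `idf` lookup (this is not sumy's own code):

```python
import math
from collections import Counter

def idf_modified_cosine(tokens_x, tokens_y, idf):
    """IDF-modified cosine between two tokenised sentences.

    `idf` maps each word to its inverse document frequency; words
    missing from the map are treated as idf 0 (a simplifying assumption).
    """
    tf_x, tf_y = Counter(tokens_x), Counter(tokens_y)
    shared = set(tf_x) & set(tf_y)
    numerator = sum(tf_x[w] * tf_y[w] * idf.get(w, 0.0) ** 2 for w in shared)
    norm_x = math.sqrt(sum((tf_x[w] * idf.get(w, 0.0)) ** 2 for w in tf_x))
    norm_y = math.sqrt(sum((tf_y[w] * idf.get(w, 0.0)) ** 2 for w in tf_y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return numerator / (norm_x * norm_y)

# Hypothetical IDF values: "the" is common (low idf), "ecb" and "rates" are rare.
idf = {"the": 0.1, "ecb": 2.0, "rates": 2.0, "raised": 1.5, "markets": 1.8, "fell": 1.5}
a = ["the", "ecb", "raised", "rates"]
b = ["the", "ecb", "rates"]
c = ["the", "markets", "fell"]
print(idf_modified_cosine(a, b, idf))  # high: shares rare words
print(idf_modified_cosine(a, c, idf))  # near zero: shares only "the"
```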

Improving Extractive Quality

Raw TextRank often selects redundant sentences. Two techniques help:

Maximal Marginal Relevance (MMR): Penalizes sentences similar to already-selected ones:

def mmr_select(similarity_matrix, query_scores, num_sentences=3, lambda_param=0.5):
    """Select sentences balancing relevance and diversity."""
    selected = []
    candidates = list(range(len(query_scores)))

    for _ in range(min(num_sentences, len(candidates))):
        best_score = float("-inf")  # MMR scores can be negative
        best_idx = -1
        for idx in candidates:
            relevance = query_scores[idx]
            redundancy = max(
                (similarity_matrix[idx][s] for s in selected), default=0
            )
            score = lambda_param * relevance - (1 - lambda_param) * redundancy
            if score > best_score:
                best_score = score
                best_idx = idx
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected
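
The effect is easiest to see on a small case where the two highest-scoring sentences are near-duplicates. The sketch below reimplements the same greedy selection in compact form so it runs on its own; the name `mmr_pick` and all scores are made up for illustration.

```python
def mmr_pick(similarity, relevance, k=2, lambda_param=0.5):
    """Greedy MMR: trade relevance against similarity to earlier picks."""
    selected = []
    candidates = list(range(len(relevance)))
    for _ in range(min(k, len(candidates))):
        def mmr_score(idx):
            redundancy = max((similarity[idx][s] for s in selected), default=0.0)
            return lambda_param * relevance[idx] - (1 - lambda_param) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical scores: sentences 0 and 1 are near-duplicates and both rank
# highly; sentence 2 is less relevant but covers different content.
relevance = [0.9, 0.85, 0.6]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
top_k = sorted(range(3), key=lambda i: relevance[i], reverse=True)[:2]
print(top_k)                            # [0, 1]
print(mmr_pick(similarity, relevance))  # [0, 2]
```

Plain top-k keeps both duplicates, while MMR swaps the second one for the sentence that adds new content. Raising `lambda_param` toward 1 moves the behaviour back toward plain relevance ranking.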

Position weighting: Boost sentences near the start of the document, where news-style text tends to put its key points:

def position_weighted_scores(sentences, base_scores, decay=0.95):
    """Apply position bias — earlier sentences get higher weight."""
    weights = np.array([decay ** i for i in range(len(sentences))])
    return base_scores * weights
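
If paragraph boundaries are available, one possible variant restarts the decay at each paragraph so that every topic sentence gets the full weight. This is a sketch under that assumption; the function name `paragraph_position_weights` and the nested-list input shape are hypothetical.

```python
def paragraph_position_weights(paragraphs, decay=0.95):
    """Return one weight per sentence, restarting the decay in each paragraph.

    `paragraphs` is a list of paragraphs, each a list of sentences.
    """
    weights = []
    for paragraph in paragraphs:
        for position, _sentence in enumerate(paragraph):
            weights.append(decay ** position)
    return weights

paragraphs = [
    ["Lead sentence.", "Supporting detail.", "More detail."],
    ["Second topic sentence.", "Follow-up."],
]
weights = paragraph_position_weights(paragraphs, decay=0.9)
print([round(w, 2) for w in weights])  # [1.0, 0.9, 0.81, 1.0, 0.9]
```

The resulting list lines up with the flattened sentence order, so it can be multiplied element-wise with the base scores.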

Abstractive Summarization with Transformers

Using Pre-trained Models

from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    device=0  # GPU; use -1 for CPU
)

article = """
The European Central Bank raised interest rates by 25 basis points on Thursday,
bringing the main refinancing rate to 4.5 percent, the highest level in over
two decades. ECB President Christine Lagarde signaled that further increases
are possible if inflation remains above the 2 percent target. The decision
was widely expected by financial markets, but the accompanying statement
suggested a more hawkish stance than many analysts had anticipated.
Core inflation, which excludes volatile food and energy prices, remained
at 5.3 percent in August, well above the ECB's target.
"""

result = summarizer(article, max_length=80, min_length=30, do_sample=False)
print(result[0]['summary_text'])

Model Comparison

models = {
    "bart-large-cnn": "facebook/bart-large-cnn",
    "t5-base": "t5-base",
    "pegasus-cnn": "google/pegasus-cnn_dailymail",
}

for name, model_id in models.items():
    pipe = pipeline("summarization", model=model_id, device=0)
    result = pipe(article, max_length=80, min_length=30)
    print(f"\n--- {name} ---")
    print(result[0]['summary_text'])

BART: strong all-rounder; the bart-large-cnn checkpoint is fine-tuned on CNN/DailyMail, so it suits news-style summarization.
T5: flexible (prefix-based: “summarize: …”), good for multi-task setups.
Pegasus: pre-trained with gap-sentence generation, often the best ROUGE scores on news benchmarks.

Handling Long Documents

The BART and Pegasus checkpoints above accept at most 1,024 input tokens, and T5 was pre-trained on 512-token inputs. For longer documents:

Chunking strategy:

from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=0)

def summarize_long_document(text, max_chunk_tokens=900, summary_max_length=150):
    tokens = tokenizer.encode(text)

    # Split into consecutive chunks (no overlap between them)
    chunks = []
    for i in range(0, len(tokens), max_chunk_tokens):
        chunk_tokens = tokens[i:i + max_chunk_tokens]
        chunk_text = tokenizer.decode(chunk_tokens, skip_special_tokens=True)
        chunks.append(chunk_text)

    # Summarize each chunk
    chunk_summaries = []
    for chunk in chunks:
        result = summarizer(chunk, max_length=summary_max_length // len(chunks) + 50,
                          min_length=20, do_sample=False)
        chunk_summaries.append(result[0]['summary_text'])

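    # Hierarchical step: if the combined summaries are still long, summarize them again.
    # truncation=True keeps an over-long input from exceeding the model's position limit.
    combined = " ".join(chunk_summaries)
    if len(tokenizer.encode(combined)) > max_chunk_tokens:
        final = summarizer(combined, max_length=summary_max_length, min_length=30,
                           do_sample=False, truncation=True)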
        return final[0]['summary_text']
    return combined
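
When a sentence is cut at a chunk boundary, neither chunk sees it whole. A sliding window with some overlap is one way to soften that. The sketch below works on any list of token IDs; the name `overlapping_chunks` and the `overlap` default are assumptions, not values from the pipeline above.

```python
def overlapping_chunks(token_ids, chunk_size=900, overlap=100):
    """Split a token ID list into windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # the last window already reaches the end
    return chunks

ids = list(range(25))
chunks = overlapping_chunks(ids, chunk_size=10, overlap=3)
print([(c[0], c[-1]) for c in chunks])  # [(0, 9), (7, 16), (14, 23), (21, 24)]
```

Each window could then be decoded and summarized as before. The trade-off is that overlapping text is summarized twice, so the combined summaries may repeat themselves and benefit from a deduplication pass.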

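Long-context models: LED (Longformer Encoder-Decoder) handles up to 16,384 tokens. Note that allenai/led-base-16384 is a base checkpoint that has not been fine-tuned for summarization, so fine-tune it or choose a summarization-tuned LED checkpoint before relying on its output: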

summarizer = pipeline("summarization", model="allenai/led-base-16384", device=0)
result = summarizer(long_text, max_length=200, min_length=50)

Fine-tuning for Your Domain

from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
)
from datasets import Dataset

model_name = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Prepare data: list of {"document": ..., "summary": ...}
dataset = Dataset.from_dict({"document": documents, "summary": summaries})

def preprocess(examples):
    # No padding here: DataCollatorForSeq2Seq pads each batch dynamically and
    # fills label padding with -100 so those positions are ignored by the loss.
    inputs = tokenizer(examples["document"], max_length=1024, truncation=True)
    targets = tokenizer(examples["summary"], max_length=128, truncation=True)
    inputs["labels"] = targets["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True)

training_args = Seq2SeqTrainingArguments(
    output_dir="./summary_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=3e-5,
    warmup_ratio=0.1,
    predict_with_generate=True,
    generation_max_length=128,
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
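
One detail worth checking in any seq2seq fine-tuning setup is that padded label positions do not contribute to the loss. PyTorch's cross-entropy skips the value -100 by default, and DataCollatorForSeq2Seq uses the same value when it pads labels. If labels are padded ahead of time instead, the masking could be done by hand along these lines (the helper name `mask_padding` and the token IDs are illustrative):

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(label_ids, pad_token_id):
    """Replace padding token IDs in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in sequence]
        for sequence in label_ids
    ]

# Hypothetical token IDs; 1 stands in for the tokenizer's pad_token_id.
labels = [[0, 713, 16, 2, 1, 1], [0, 250, 2, 1, 1, 1]]
masked = mask_padding(labels, pad_token_id=1)
print(masked)
# [[0, 713, 16, 2, -100, -100], [0, 250, 2, -100, -100, -100]]
```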

Evaluation with ROUGE

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

reference = "The ECB raised rates to 4.5 percent, the highest in two decades."
generated = "Interest rates were raised by the ECB to their highest level in over 20 years."

scores = scorer.score(reference, generated)
for metric, values in scores.items():
    print(f"{metric}: P={values.precision:.3f} R={values.recall:.3f} F1={values.fmeasure:.3f}")
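
To make the metric less of a black box, ROUGE-N can be approximated in a few lines as clipped n-gram overlap. This is a simplified sketch that assumes lowercase whitespace tokenisation and no stemming, so its numbers will not match the rouge_score package exactly; the function names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, candidate, n=1):
    """Return (precision, recall, f1) from clipped n-gram overlap."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # min count per shared n-gram
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

reference = "the ecb raised rates"
candidate = "the ecb raised interest rates again"
print(rouge_n(reference, candidate, n=1))  # recall is 1.0: every reference word appears
print(rouge_n(reference, candidate, n=2))  # lower: inserted words break up bigrams
```

Recall rewards covering the reference, precision penalises extra words, and neither checks whether the extra words are true.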

Batch Evaluation

from rouge_score import rouge_scorer
import numpy as np

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

rouge1_f, rouge2_f, rougeL_f = [], [], []
for ref, gen in zip(references, generated_summaries):
    scores = scorer.score(ref, gen)
    rouge1_f.append(scores['rouge1'].fmeasure)
    rouge2_f.append(scores['rouge2'].fmeasure)
    rougeL_f.append(scores['rougeL'].fmeasure)

print(f"ROUGE-1 F1: {np.mean(rouge1_f):.3f}")
print(f"ROUGE-2 F1: {np.mean(rouge2_f):.3f}")
print(f"ROUGE-L F1: {np.mean(rougeL_f):.3f}")

Detecting and Reducing Hallucinations

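Abstractive models sometimes generate facts not present in the source. One way to detect this is to run a natural language inference (NLI) model over each summary sentence and check whether the source entails it: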

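import nltk
from transformers import pipeline

def check_factual_consistency(source, summary, nli_model=None):
    """Use NLI to check if each summary sentence is entailed by the source."""
    if nli_model is None:
        nli_model = pipeline("text-classification",
                           model="facebook/bart-large-mnli", device=0)

    # Check each summary sentence against the source
    summary_sentences = nltk.sent_tokenize(summary)
    results = []
    for sent in summary_sentences:
        # Premise = source, hypothesis = summary sentence.
        # top_k=None returns scores for all labels, not just the top one.
        result = nli_model({"text": source, "text_pair": sent},
                           top_k=None, truncation=True)
        if result and isinstance(result[0], list):  # some versions nest the output
            result = result[0]
        results.append({
            "sentence": sent,
            # Label casing differs between checkpoints, so compare case-insensitively
            "entailment_score": next(
                r['score'] for r in result if r['label'].lower() == 'entailment'
            )
        })
    return results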

Sentences with entailment scores below 0.5 are candidates for hallucination and should be flagged for review. Treat 0.5 as a starting threshold and tune it on examples from your own domain.
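
Applying the threshold is a simple filter over the per-sentence results. A minimal sketch, assuming the list-of-dicts shape returned by check_factual_consistency (the helper name `flag_unsupported` and the example scores are made up):

```python
def flag_unsupported(results, threshold=0.5):
    """Return summary sentences whose entailment score is below `threshold`."""
    return [r["sentence"] for r in results if r["entailment_score"] < threshold]

# Hypothetical output in the shape returned by check_factual_consistency.
results = [
    {"sentence": "The ECB raised rates to 4.5 percent.", "entailment_score": 0.97},
    {"sentence": "The Fed followed with its own increase.", "entailment_score": 0.08},
]
flagged = flag_unsupported(results)
print(flagged)  # ['The Fed followed with its own increase.']
```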

Production Architecture

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

app = FastAPI()

# Load the abstractive model once at startup; sumy's TextRank needs no model weights
abstractive = pipeline("summarization", model="facebook/bart-large-cnn", device=0)

class SummarizeRequest(BaseModel):
    text: str
    method: str = "abstractive"
    max_length: int = 130

@app.post("/summarize")
def summarize(req: SummarizeRequest):
    # Fields arrive in the JSON body, so long documents are not squeezed into the URL
    if req.method == "extractive":
        parser = PlaintextParser.from_string(req.text, Tokenizer("english"))
        summarizer = TextRankSummarizer()
        sentences = summarizer(parser.document, sentences_count=3)
        return {"summary": " ".join(str(s) for s in sentences), "method": "extractive"}
    else:
        # truncation=True avoids a crash on over-long input; use chunking for long documents
        result = abstractive(req.text, max_length=req.max_length, min_length=30,
                             do_sample=False, truncation=True)
        return {"summary": result[0]["summary_text"], "method": "abstractive"}

Benchmarks

Model	ROUGE-2 (CNN/DM)	Latency (1 article, GPU unless noted)	Memory
TextRank (extractive)	0.16	5 ms (CPU)	50 MB
LexRank (extractive)	0.17	8 ms (CPU)	50 MB
BART-large-CNN	0.21	200 ms	1.6 GB
Pegasus-CNN	0.22	250 ms	2.2 GB
T5-base	0.19	150 ms	900 MB
LED-base-16384	0.18	400 ms	1.4 GB

Common Pitfalls

Evaluating only with ROUGE. ROUGE measures word overlap, not factual accuracy. A hallucinated summary can score well on ROUGE if it uses similar vocabulary. Supplement with human evaluation or NLI-based factual consistency checks.
Ignoring input length limits. Silently truncating long documents loses information from the end. Use chunking or long-context models instead.
Using abstractive summarization for legal/medical text. Hallucination risk is unacceptable in these domains. Extractive summarization is safer, or use abstractive with mandatory human review.
Not deduplicating extractive output. TextRank can select sentences that make the same point differently. Apply MMR or similarity-based filtering.
Fine-tuning on mismatched data. A model fine-tuned on news summaries will produce news-style output even when given technical documents. Domain match between training and deployment data matters more than model size.

The one thing to remember: Start with extractive summarization for reliability and speed, add abstractive models when fluency matters, and always verify that generated summaries do not invent information — especially in high-stakes applications.
