Biopython for Bioinformatics — Deep Dive
Architecture of Biopython
Biopython is organized into loosely coupled sub-packages. The design philosophy favors thin wrappers around external tools rather than reimplementing algorithms in pure Python. This keeps the library maintainable and lets researchers benefit from decades of optimized C and Fortran code in tools like BLAST, HMMER, and MUSCLE.
Key architectural decisions:
- SeqRecord as the lingua franca. Nearly every module produces or consumes
SeqRecordobjects. This consistency means you can chain parsing, alignment, and tree-building without data conversion boilerplate. - Lazy parsing where possible.
SeqIO.parse()returns a generator, so you can iterate through a 50 GB FASTQ file without loading it into memory. - External tool wrappers follow a pattern: write input to a temp file, invoke the binary via
subprocess, parse the output file. TheBio.Blast.Applicationsmodule andBio.Align.Applicationsmodule both follow this template.
Working with sequences in depth
Seq objects and the genetic code
from Bio.Seq import Seq
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
# Transcription: DNA → mRNA
mrna = dna.transcribe()
# Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')
# Translation: mRNA → protein
protein = mrna.translate()
# Seq('MAIVMGR*KGAR*')
# Direct DNA → protein (skips explicit transcription)
protein_direct = dna.translate()
# Using an alternative genetic code (e.g., mitochondrial)
mito_protein = dna.translate(table="Vertebrate Mitochondrial")
Biopython ships with all NCBI translation tables. The table parameter accepts either a name string or an integer ID.
SeqRecord metadata
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
record = SeqRecord(
Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"),
id="EXAMPLE_001",
name="example_gene",
description="Hypothetical gene for demonstration",
)
# Add a coding sequence feature
cds = SeqFeature(
FeatureLocation(start=0, end=39),
type="CDS",
qualifiers={"gene": ["exampleA"], "product": ["hypothetical protein"]},
)
record.features.append(cds)
Features use BioPython’s location model, which handles complements, joins (for introns), and fuzzy positions (e.g., “somewhere before position 100”).
Parsing real-world files
FASTA — the universal sequence format
from Bio import SeqIO
# Parse a multi-FASTA file
records = list(SeqIO.parse("sequences.fasta", "fasta"))
print(f"Loaded {len(records)} sequences")
# Filter by length
long_seqs = [r for r in records if len(r.seq) > 1000]
# Write filtered sequences to a new file
SeqIO.write(long_seqs, "long_sequences.fasta", "fasta")
GenBank — richly annotated entries
for record in SeqIO.parse("genome.gbk", "genbank"):
for feature in record.features:
if feature.type == "CDS":
gene_name = feature.qualifiers.get("gene", ["unknown"])[0]
location = feature.location
print(f"{gene_name}: {location}")
FASTQ — sequencing reads with quality scores
from Bio import SeqIO
# Quality filtering: keep reads where mean Phred score >= 30
def mean_quality(record):
scores = record.letter_annotations["phred_quality"]
return sum(scores) / len(scores)
good_reads = (r for r in SeqIO.parse("reads.fastq", "fastq")
if mean_quality(r) >= 30)
count = SeqIO.write(good_reads, "filtered.fastq", "fastq")
print(f"Kept {count} high-quality reads")
The generator approach means this script handles files larger than available RAM.
NCBI Entrez integration
Searching and downloading
from Bio import Entrez
Entrez.email = "researcher@university.edu" # NCBI requires this
# Search for human insulin gene
handle = Entrez.esearch(db="nucleotide", term="Homo sapiens[Orgn] AND insulin[Gene]", retmax=5)
results = Entrez.read(handle)
handle.close()
ids = results["IdList"]
print(f"Found {results['Count']} results, fetched top {len(ids)}")
# Download sequences
handle = Entrez.efetch(db="nucleotide", id=ids, rettype="gb", retmode="text")
records = list(SeqIO.parse(handle, "genbank"))
handle.close()
Rate limiting and best practices
NCBI enforces a limit of 3 requests per second (10 with an API key). Biopython does not throttle automatically. In production scripts, add explicit delays:
import time
for gene_id in gene_ids:
handle = Entrez.efetch(db="nucleotide", id=gene_id, rettype="fasta", retmode="text")
record = SeqIO.read(handle, "fasta")
handle.close()
process(record)
time.sleep(0.34) # Stay under 3 req/s
Register for an NCBI API key at ncbi.nlm.nih.gov/account to raise the limit to 10 req/s.
Running BLAST from Python
Local BLAST
from Bio.Blast.Applications import NcbiblastnCommandline
blastn = NcbiblastnCommandline(
query="query.fasta",
db="nt",
evalue=1e-10,
outfmt=5, # XML output
out="results.xml",
num_threads=4,
)
stdout, stderr = blastn()
Parsing BLAST results
from Bio.Blast import NCBIXML
with open("results.xml") as f:
blast_records = NCBIXML.parse(f)
for record in blast_records:
for alignment in record.alignments:
for hsp in alignment.hsps:
if hsp.expect < 1e-20:
print(f"Match: {alignment.title[:80]}")
print(f" Score: {hsp.score}, E-value: {hsp.expect}")
print(f" Identities: {hsp.identities}/{hsp.align_length}")
Remote BLAST (NCBI servers)
from Bio.Blast import NCBIWWW, NCBIXML
result_handle = NCBIWWW.qblast("blastn", "nt", "ATGGCCATTGTAATGGGCCGCTGA")
blast_record = NCBIXML.read(result_handle)
for alignment in blast_record.alignments[:5]:
print(f"{alignment.title[:100]}")
print(f" E-value: {alignment.hsps[0].expect}")
Remote BLAST can take minutes. For batch queries, prefer local BLAST with a downloaded database.
Multiple sequence alignment
from Bio.Align.Applications import ClustalOmegaCommandline
from Bio import AlignIO
# Run Clustal Omega
clustalo = ClustalOmegaCommandline(
infile="sequences.fasta",
outfile="aligned.fasta",
verbose=True,
auto=True,
)
stdout, stderr = clustalo()
# Parse the alignment
alignment = AlignIO.read("aligned.fasta", "fasta")
print(f"Alignment of {len(alignment)} sequences, length {alignment.get_alignment_length()}")
# Calculate conservation at each position
from collections import Counter
for i in range(alignment.get_alignment_length()):
column = alignment[:, i]
counts = Counter(column)
most_common, freq = counts.most_common(1)[0]
conservation = freq / len(alignment)
if conservation < 0.5:
print(f"Position {i}: variable (most common: {most_common}, {conservation:.0%})")
Phylogenetics with Bio.Phylo
from Bio import Phylo
import matplotlib.pyplot as plt
# Read a Newick tree
tree = Phylo.read("tree.nwk", "newick")
# Traverse the tree
for clade in tree.find_clades():
if clade.name:
print(f"Leaf: {clade.name}, branch length: {clade.branch_length}")
# Find the distance between two taxa
distance = tree.distance("Species_A", "Species_B")
print(f"Evolutionary distance: {distance:.4f}")
# Render the tree
fig, ax = plt.subplots(figsize=(12, 8))
Phylo.draw(tree, axes=ax, do_show=False)
plt.savefig("phylo_tree.png", dpi=150, bbox_inches="tight")
Real-world pipeline: variant calling prep
A practical example combining multiple Biopython modules — downloading a reference gene, aligning sample reads, and identifying mismatches:
from Bio import Entrez, SeqIO
from Bio.Seq import Seq
import subprocess
import time
Entrez.email = "researcher@university.edu"
# 1. Download reference sequence
handle = Entrez.efetch(db="nucleotide", id="NM_000546.6", rettype="fasta", retmode="text")
reference = SeqIO.read(handle, "fasta")
handle.close()
SeqIO.write(reference, "reference.fasta", "fasta")
# 2. Quality-filter sample reads
good_reads = (r for r in SeqIO.parse("sample_reads.fastq", "fastq")
if sum(r.letter_annotations["phred_quality"]) / len(r) >= 25)
SeqIO.write(good_reads, "filtered_reads.fastq", "fastq")
# 3. Align with minimap2 (external tool)
subprocess.run([
"minimap2", "-a", "reference.fasta", "filtered_reads.fastq",
"-o", "aligned.sam"
], check=True)
print("Pipeline complete: reference downloaded, reads filtered, alignment done.")
Performance considerations and tradeoffs
| Scenario | Biopython approach | Better alternative |
|---|---|---|
| Parse 10 GB FASTQ | SeqIO.parse() generator | pysam or HTSeq for indexed access |
| Align 100k sequences | Wrapper around Clustal/MUSCLE | Direct MUSCLE CLI with multiprocessing |
| Search 1M sequences | Iterate with SeqIO | Build a BLAST database, use local BLAST |
| Structural biology | Bio.PDB module | MDAnalysis or PyMOL for complex visualization |
Biopython is optimized for correctness and ease of use, not raw throughput. For performance-critical inner loops, consider pysam (C-backed BAM/SAM parsing), scikit-bio (NumPy-accelerated), or direct C tool invocations.
Testing bioinformatics code
import pytest
from Bio.Seq import Seq
def test_translation_standard_code():
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
protein = dna.translate()
assert protein.startswith("M") # Methionine start
assert "*" in str(protein) # Contains stop codon
def test_reverse_complement():
dna = Seq("AGTC")
assert str(dna.reverse_complement()) == "GACT"
def test_gc_content():
from Bio.SeqUtils import gc_fraction
seq = Seq("ATGCATGC")
gc = gc_fraction(seq)
assert 0.49 < gc < 0.51 # 50% GC content
The one thing to remember: Biopython’s real power is as pipeline glue — it connects databases, file formats, and external analysis tools through a consistent Python API, letting you focus on the science instead of data wrangling.
See Also
- Python Clinical Trial Analysis How Python helps scientists figure out whether a new medicine actually works by crunching the numbers from clinical trials.
- Python Drug Interaction Modeling How Python helps scientists figure out which medicines are safe to take together and which combinations could be dangerous.
- Python Genomics Sequencing How Python helps scientists read and understand the instruction manual written inside every cell of your body.
- Python Medical Image Analysis How Python helps doctors see inside your body more clearly by teaching computers to read X-rays, MRIs, and CT scans.
- Python Pandemic Modeling How Python helps scientists predict the spread of diseases like COVID-19 and plan the best ways to slow them down.