Python for Protein Structure Prediction — Core Concepts
Why protein structure prediction matters
A protein’s function is determined by its 3D structure. Understanding structure enables:
- Drug design — finding molecules that fit into protein binding sites like keys into locks
- Disease understanding — seeing how mutations change protein shape and cause disease
- Enzyme engineering — redesigning proteins for industrial applications (biofuels, detergents, food production)
The gap between known sequences (~250 million in UniProt) and experimentally solved structures (~200,000 in the PDB) makes computational prediction essential.
The protein folding problem
Proteins are chains of amino acids (20 types). A typical protein has 300-500 amino acids. The chain folds into a specific 3D shape determined by the interactions between amino acids — hydrogen bonds, hydrophobic effects, electrostatic attractions.
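The challenge: if each residue can take roughly three backbone conformations, a 300-residue protein could theoretically adopt about 3^300 ≈ 10^143 different conformations, far too many to search one by one. Yet real proteins fold in milliseconds to seconds (Levinthal’s paradox). Finding the correct shape computationally was long considered one of biology’s grand challenges.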
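The 10^143 figure can be reproduced with a short calculation. The sketch below assumes three conformations per residue, which is the count that yields this number. The sampling rate and the age of the universe are rough illustrative values, not figures from the text.

```python
import math


def log10_conformations(n_residues: int, states_per_residue: int = 3) -> float:
    """Return log10 of states_per_residue ** n_residues."""
    return n_residues * math.log10(states_per_residue)


# Assumed values, chosen for illustration only
N_RESIDUES = 300
SAMPLES_PER_SECOND = 1e13   # hypothetical rate of trying conformations
AGE_OF_UNIVERSE_S = 4.3e17  # roughly 13.8 billion years, in seconds

log10_count = log10_conformations(N_RESIDUES)
log10_seconds = log10_count - math.log10(SAMPLES_PER_SECOND)
log10_universe_ages = log10_seconds - math.log10(AGE_OF_UNIVERSE_S)

print(f"conformations: about 10^{log10_count:.0f}")
print(f"exhaustive search: more than 10^{int(log10_universe_ages)} ages of the universe")
```

Working in log10 keeps the numbers printable. The exact exponent matters less than the conclusion that exhaustive search is hopeless, so folding must follow a guided path.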
AlphaFold2 — the breakthrough
DeepMind’s AlphaFold2 (2020) solved protein structure prediction for most single-domain proteins. Its architecture uses:
- Multiple Sequence Alignment (MSA): finds evolutionary relatives of the target protein. Positions that are conserved, or that mutate together, across species reveal structural constraints.
- Evoformer: a transformer-based module that refines the MSA representation and a pairwise residue representation through 48 attention-based blocks.
- Structure Module: converts the processed representations into 3D atomic coordinates with per-residue confidence scores.
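To make the MSA idea concrete, here is a minimal sketch that scores how conserved each alignment column is. It is not what AlphaFold2 computes internally. The function name and the toy alignment are hypothetical, and gap characters are treated like any other symbol.

```python
from collections import Counter


def column_conservation(msa: list[str]) -> list[float]:
    """Fraction of sequences sharing the most common residue in each column."""
    if len({len(seq) for seq in msa}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*msa):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


# Hypothetical toy alignment (not real sequences)
toy_msa = ["MKVLA", "MKILA", "MRVLG", "MKVLA"]
conservation = column_conservation(toy_msa)
print(conservation)
```

Columns that score 1.0 are identical in every sequence, which hints that the position is under structural or functional constraint.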
AlphaFold2 achieved a median GDT_TS (Global Distance Test, total score) above 90 on CASP14, where 100 is a perfect prediction. Previous methods scored around 60.
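The GDT score can be illustrated with a simplified calculation: the average, over the cutoffs 1, 2, 4 and 8 Å, of the percentage of C-alpha atoms lying within that cutoff of their reference position. The sketch assumes the per-residue distances come from a single superposition that has already been done. The official score searches for the best superposition at each cutoff, so this version can underestimate it. The distances are made-up illustrative values.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS from per-residue C-alpha distances (in angstroms)."""
    if not distances:
        raise ValueError("need at least one distance")
    n = len(distances)
    fractions = [sum(d <= cutoff for d in distances) / n for cutoff in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical distances between predicted and reference C-alpha atoms
example_distances = [0.5, 1.5, 3.0, 6.0, 10.0]
score = gdt_ts(example_distances)
print(f"GDT_TS: {score:.1f}")
```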
Running AlphaFold predictions
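# AlphaFold requires significant setup (databases, GPU)
# ColabFold simplifies access via Google Colab or local installation
from pathlib import Path

from colabfold.batch import get_queries, run

# get_queries returns the parsed queries plus a flag marking multi-chain input
queries, is_complex = get_queries("input_sequences.fasta")

# Argument names can differ between ColabFold releases; check the installed version
run(
    queries=queries,
    result_dir=Path("predictions"),
    num_models=5,
    is_complex=is_complex,
    num_recycles=3,
    model_type="alphafold2_ptm",
    use_templates=True,
)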
The output includes:
- PDB files — 3D coordinates of every atom
- pLDDT scores — per-residue confidence (>90 = high confidence, <50 = likely disordered)
- PAE (Predicted Aligned Error) — confidence in relative positions of residue pairs
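Because pLDDT is written into the B-factor column of the output PDB file, it can be read back with plain string slicing. The sketch below uses the fixed-column PDB layout and the thresholds quoted above. The function names and the two example records are hypothetical, and the helper that builds records exists only to keep the example self-contained.

```python
def plddt_per_residue(pdb_text: str) -> dict[tuple[str, int], float]:
    """Map (chain ID, residue number) to the B-factor of its C-alpha atom."""
    scores = {}
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name, 22 the chain, 23-26 the residue
        # number and 61-66 the B-factor (pLDDT in predicted structures)
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[(line[21], int(line[22:26]))] = float(line[60:66])
    return scores


def confidence_band(plddt: float) -> str:
    """Classify a pLDDT value using the thresholds given in the text."""
    if plddt > 90:
        return "high"
    if plddt < 50:
        return "low (likely disordered)"
    return "intermediate"


def make_ca_line(serial: int, resname: str, chain: str, resseq: int, plddt: float) -> str:
    """Hypothetical helper: build a fixed-column ATOM record for a C-alpha atom."""
    return (f"ATOM  {serial:>5}  CA  {resname:>3} {chain}{resseq:>4}    "
            f"{0.0:>8.3f}{0.0:>8.3f}{0.0:>8.3f}{1.0:>6.2f}{plddt:>6.2f}")


# Two made-up records standing in for a real prediction file
example_pdb = "\n".join([
    make_ca_line(1, "MET", "A", 1, 95.2),
    make_ca_line(2, "LYS", "A", 2, 42.5),
])
for key, value in plddt_per_residue(example_pdb).items():
    print(key, value, confidence_band(value))
```

For a real file, pass the contents of the PDB output (for example `Path("predictions/model.pdb").read_text()`, where the path is a placeholder) to `plddt_per_residue`.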
ESMFold — faster alternative
Meta’s ESMFold uses a protein language model (ESM-2) instead of multiple sequence alignments, making predictions up to ~60x faster on short sequences at the cost of some accuracy:
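import torch
from esm import pretrained

# esmfold_v1() returns the model only; there is no separate alphabet object
model = pretrained.esmfold_v1()
model = model.eval().cuda()

# Sequence is truncated in the source; supply the complete sequence
sequence = "MKFLILLFNILCLFPVLAADNHGVSMNAS..."

with torch.no_grad():
    output = model.infer_pdb(sequence)  # PDB-formatted string

# Write PDB file
with open("prediction.pdb", "w") as f:
    f.write(output)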
ESMFold works well for rapid screening of large sequence databases where MSA computation would be prohibitive.
Analyzing structures with Python
Biopython for PDB parsing
from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", "prediction.pdb")
model = structure[0]  # first model in the file

for chain in model:
    for residue in chain:
        # A blank hetero flag marks a standard residue; skip heteroatoms and waters
        if residue.id[0] == " " and "CA" in residue:
            ca = residue["CA"]  # C-alpha atom
            print(f"{residue.get_resname()} {residue.id[1]}: {ca.get_vector()}")
PyMOL for visualization
import pymol
from pymol import cmd
cmd.load("prediction.pdb", "protein")
cmd.color("cyan", "protein")
cmd.show("cartoon", "protein")
# Color by confidence (pLDDT stored in B-factor column): red = low, blue = high
cmd.spectrum("b", "red_white_blue", "protein", minimum=0, maximum=100)
cmd.png("structure_confidence.png", width=1200, height=900, dpi=300)
Structural comparison (RMSD)
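from Bio.PDB import PDBParser, Superimposer

parser = PDBParser(QUIET=True)
# File names and chain ID "A" are placeholders; adjust them to your data
fixed_chain = parser.get_structure("reference", "experimental.pdb")[0]["A"]
moving_chain = parser.get_structure("predicted", "prediction.pdb")[0]["A"]

# Get C-alpha atoms from both structures
fixed_atoms = [residue["CA"] for residue in fixed_chain if "CA" in residue]
moving_atoms = [residue["CA"] for residue in moving_chain if "CA" in residue]

# set_atoms needs two equal-length lists whose atoms correspond pairwise
sup = Superimposer()
sup.set_atoms(fixed_atoms, moving_atoms)
print(f"RMSD: {sup.rms:.2f} Å")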
An RMSD below 2 Å generally indicates an accurate prediction for a protein domain.
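For reference, the RMSD itself is the square root of the mean squared distance between matched atoms. The sketch below computes it in plain Python for coordinates that are already superimposed. Biopython's Superimposer additionally finds the rotation and translation that minimise this value first. The coordinates are made-up illustrative values.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) tuples, without superposition."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared_sum = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared_sum / len(coords_a))


# Hypothetical C-alpha positions: the prediction is shifted 1 Å along z
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0)]
value = rmsd(reference, predicted)
print(f"RMSD: {value:.2f} Å")
```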
From structure to drug discovery
Predicted structures enable virtual screening — computationally docking millions of small molecules into protein binding sites:
- Predict target protein structure with AlphaFold
- Identify binding pockets with fpocket or SiteMap
- Dock candidate molecules with AutoDock Vina (Python bindings via vina)
- Score and rank candidates for experimental testing
This pipeline, fully accessible through Python, can shorten the computational phase of early-stage drug discovery from years to weeks.
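The hand-off between pocket detection and docking in the steps above usually comes down to defining a search box around the pocket. The sketch below derives a box center and size from pocket atom coordinates. The function name, the 4 Å padding, and the coordinates are assumptions for illustration. The resulting values are the kind of input that docking tools such as the vina package expect as a box center and box size.

```python
def docking_box(pocket_coords, padding=4.0):
    """Return (center, size) of an axis-aligned box enclosing the pocket atoms.

    padding is added on every side, in angstroms (an assumed default).
    """
    if not pocket_coords:
        raise ValueError("need at least one pocket coordinate")
    axes = list(zip(*pocket_coords))  # separate the x, y and z values
    center = tuple((max(axis) + min(axis)) / 2 for axis in axes)
    size = tuple((max(axis) - min(axis)) + 2 * padding for axis in axes)
    return center, size


# Hypothetical pocket atom coordinates, e.g. taken from a pocket-detection tool
pocket = [(0.0, 0.0, 0.0), (10.0, 4.0, 2.0), (5.0, 2.0, 1.0)]
center, size = docking_box(pocket)
print("center:", center)
print("size:", size)
```

A box that is too small can exclude valid poses and one that is too large slows the search, so the padding is worth tuning for each target.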
Common misconception
“AlphaFold solved all of structural biology.” AlphaFold excels at single-domain, well-evolved proteins but struggles with protein complexes (improving with AlphaFold-Multimer), intrinsically disordered regions (which genuinely lack a fixed shape), and proteins with few evolutionary relatives. Membrane proteins and large multi-subunit complexes remain challenging. Experimental methods (cryo-EM, X-ray crystallography) are still essential.
Real-world impact
- AlphaFold DB provides predicted structures for 200+ million proteins — nearly every known protein sequence. Researchers download and analyze these using Python.
- Insilico Medicine used an AlphaFold-predicted structure of CDK20 to identify a small-molecule hit for hepatocellular carcinoma within about 30 days of target selection.
- RCSB PDB now serves computed structure models from AlphaFold DB alongside experimentally determined entries, and Python pipelines are widely used for bulk download, filtering, and analysis.
The one thing to remember: Python bridges the gap between protein sequence databases and 3D structural understanding through AlphaFold and ESMFold, enabling drug discovery, disease research, and enzyme engineering at a scale that was impossible before 2020.