Python for Protein Structure Prediction — Deep Dive
Production structure prediction pipeline
A complete protein structure prediction system involves more than running AlphaFold. It chains sequence analysis, structure prediction, quality assessment, and downstream applications.
Sequence → MSA Generation (MMseqs2/HHblits) → Structure Prediction (AlphaFold/ESMFold) →
Quality Assessment (pLDDT/PAE) → Refinement → Docking/Design
AlphaFold2 architecture internals
Understanding the model helps interpret its outputs and failure modes.
Input processing
AlphaFold takes two inputs:
- MSA features — aligned sequences of evolutionary relatives, typically from UniRef90 and MGnify databases
- Template features — experimentally solved structures of homologs from PDB70
# Using ColabFold's MSA generation (faster than original AlphaFold pipeline)
from colabfold.mmseqs2 import get_msa
msa_results = get_msa(
query_sequence="MKFLILLFNILCLFPVLAADNHGVSMNAS...",
databases=["uniref30", "colabfold_envdb"],
use_pairing=False, # True for multimer predictions
)
# MSA depth affects prediction quality
print(f"MSA depth: {len(msa_results['msa'])}") # >100 preferred
Evoformer block
The Evoformer processes two representations simultaneously:
- MSA representation (N_seq × L × d) — information about evolutionary conservation
- Pair representation (L × L × d) — pairwise distance/angle information
These representations communicate through row attention, column attention, and outer product mean operations across 48 blocks. The pair representation converges toward the true distance map.
Structure module and recycling
The structure module converts pair representations into 3D coordinates using Invariant Point Attention (IPA) — a geometric attention mechanism that operates in the protein’s local reference frame:
# AlphaFold uses recycling: predictions feed back into the Evoformer
# Typical: 3 recycles, each refining the structure
# More recycles help difficult targets
config = {
"num_recycles": 12, # Up to 12 for challenging proteins
"recycle_early_stop_tolerance": 0.5, # Stop if change < 0.5Å
"num_models": 5, # Rank multiple model seeds
}
Confidence metrics — interpretation guide
pLDDT (predicted Local Distance Difference Test)
Per-residue confidence score (0-100) stored in the B-factor column:
from Bio.PDB import PDBParser
import numpy as np
parser = PDBParser(QUIET=True)
structure = parser.get_structure("pred", "ranked_0.pdb")
plddt_scores = []
for residue in structure[0]["A"]:
if "CA" in residue:
plddt_scores.append(residue["CA"].get_bfactor())
plddt = np.array(plddt_scores)
print(f"Mean pLDDT: {plddt.mean():.1f}")
print(f"Confident residues (>90): {(plddt > 90).sum()} / {len(plddt)}")
print(f"Disordered regions (<50): {(plddt < 50).sum()} / {len(plddt)}")
Interpretation:
- >90 — High confidence, backbone likely accurate
- 70-90 — Good confidence, overall fold correct
- 50-70 — Low confidence, use with caution
- <50 — Very low confidence, likely intrinsically disordered
PAE (Predicted Aligned Error)
A L×L matrix indicating confidence in the relative position of residue pairs:
import json
import matplotlib.pyplot as plt
with open("ranked_0_pae.json") as f:
pae_data = json.load(f)
pae_matrix = np.array(pae_data["predicted_aligned_error"])
plt.figure(figsize=(10, 8))
plt.imshow(pae_matrix, cmap="bwr_r", vmin=0, vmax=30)
plt.colorbar(label="Expected position error (Å)")
plt.xlabel("Scored residue")
plt.ylabel("Aligned residue")
plt.title("Predicted Aligned Error")
plt.savefig("pae_plot.png", dpi=150)
Low PAE (blue) between two residue ranges indicates confident relative positioning — critical for identifying rigid domains vs flexible linkers.
Domain identification from PAE
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
def identify_domains(pae_matrix: np.ndarray, threshold: float = 12.0) -> np.ndarray:
"""Cluster residues into domains based on PAE matrix."""
# Convert PAE to distance matrix
pae_symmetric = (pae_matrix + pae_matrix.T) / 2
condensed = squareform(pae_symmetric, checks=False)
Z = linkage(condensed, method="average")
domains = fcluster(Z, t=threshold, criterion="distance")
return domains
domains = identify_domains(pae_matrix)
print(f"Found {len(set(domains))} domains")
for d in sorted(set(domains)):
residues = np.where(domains == d)[0]
print(f" Domain {d}: residues {residues[0]+1}-{residues[-1]+1}")
Multimer prediction with AlphaFold-Multimer
Predicting protein complexes requires special handling:
from colabfold.batch import run
# Separate chains with ":" in the sequence
complex_sequence = "CHAIN_A_SEQUENCE:CHAIN_B_SEQUENCE"
run(
queries=[("complex", complex_sequence, None)],
result_dir=Path("complex_prediction"),
model_type="alphafold2_multimer_v3",
num_models=5,
num_recycles=20, # Complexes benefit from more recycles
recycle_early_stop_tolerance=0.5,
)
Evaluate complex predictions by checking inter-chain PAE — low PAE between chains indicates a confident interface prediction.
Molecular docking with predicted structures
Binding site detection
from biopandas.pdb import PandasPdb
def find_pockets_fpocket(pdb_path: str) -> list[dict]:
"""Run fpocket and parse results."""
import subprocess
subprocess.run(["fpocket", "-f", pdb_path], check=True)
# Parse fpocket output
ppdb = PandasPdb().read_pdb(f"{pdb_path.replace('.pdb', '_out/pockets')}/pocket1_atm.pdb")
pocket_center = ppdb.df["ATOM"][["x_coord", "y_coord", "z_coord"]].mean()
return [{"center": pocket_center.values, "pdb": ppdb}]
Virtual screening with AutoDock Vina
from vina import Vina
def dock_compound(receptor_pdb: str, ligand_sdf: str,
center: tuple, box_size: tuple = (20, 20, 20)) -> dict:
v = Vina(sf_name="vina")
v.set_receptor(receptor_pdb)
v.set_ligand_from_file(ligand_sdf)
v.compute_vina_maps(center=center, box_size=box_size)
v.dock(exhaustiveness=32, n_poses=10)
energies = v.energies()
return {
"best_affinity": energies[0][0], # kcal/mol
"poses": v.poses(),
}
# Screen a library
results = []
for ligand_file in Path("library/").glob("*.sdf"):
result = dock_compound("target.pdbqt", str(ligand_file), center=(10.5, 23.1, -5.2))
results.append({"ligand": ligand_file.stem, "affinity": result["best_affinity"]})
results_df = pd.DataFrame(results).sort_values("affinity")
print(results_df.head(20)) # Top 20 candidates
Protein design with RFdiffusion
Beyond prediction, Python tools now design novel proteins. RFdiffusion (Baker Lab) generates protein backbones using diffusion models:
# RFdiffusion generates novel protein structures
# that fold into desired shapes
# Design a protein that binds a target
# Specify target structure and desired binding site
config = {
"inference": {
"input_pdb": "target_protein.pdb",
"num_designs": 100,
"design_startnum": 0,
},
"contigmap": {
"contigs": ["A10-40/0 B1-100/0"], # Design 10-40 residues binding to chain B
},
"diffuser": {
"T": 50, # Diffusion steps
}
}
After backbone generation, ProteinMPNN (also Python) designs amino acid sequences that fold into the generated backbone:
# ProteinMPNN: sequence design for a given backbone
from protein_mpnn_utils import ProteinMPNN
model = ProteinMPNN()
sequences = model.design(
pdb_path="designed_backbone.pdb",
num_sequences=8,
temperature=0.1, # Lower = more conservative
)
for seq in sequences:
print(f"Score: {seq.score:.2f}, Sequence: {seq.sequence[:50]}...")
The design cycle: RFdiffusion (backbone) → ProteinMPNN (sequence) → AlphaFold (validate fold) → experimental testing.
Benchmarking and validation
CASP-style evaluation
from Bio.PDB import Superimposer
from Bio.PDB.DSSP import DSSP
def evaluate_prediction(predicted_pdb: str, experimental_pdb: str) -> dict:
parser = PDBParser(QUIET=True)
pred = parser.get_structure("pred", predicted_pdb)
expt = parser.get_structure("expt", experimental_pdb)
# C-alpha RMSD
pred_ca = [r["CA"] for r in pred[0]["A"] if "CA" in r]
expt_ca = [r["CA"] for r in expt[0]["A"] if "CA" in r]
min_len = min(len(pred_ca), len(expt_ca))
sup = Superimposer()
sup.set_atoms(expt_ca[:min_len], pred_ca[:min_len])
# GDT-TS (Global Distance Test - Total Score)
distances = np.array([
(pred_ca[i].get_vector() - expt_ca[i].get_vector()).norm()
for i in range(min_len)
])
gdt_ts = 25 * sum(
(distances < threshold).mean() for threshold in [1, 2, 4, 8]
)
return {
"rmsd": sup.rms,
"gdt_ts": gdt_ts,
"n_residues": min_len,
}
Tradeoffs
| Tool | Speed | Accuracy | Use case |
|---|---|---|---|
| AlphaFold2 | Hours (with MSA) | Best for most proteins | Research, drug discovery |
| ColabFold | Minutes (MMseqs2 MSA) | Near AlphaFold2 quality | Rapid screening |
| ESMFold | Seconds (no MSA) | Good for well-studied families | Proteome-scale scanning |
| RoseTTAFold | Minutes | Good, especially for complexes | When AlphaFold struggles |
| OpenFold | Variable | Matches AlphaFold2 | When you need to retrain |
The one thing to remember: Python’s protein structure ecosystem — from AlphaFold prediction and confidence analysis through molecular docking and de novo design with RFdiffusion — has compressed what used to be years of experimental work into computational pipelines that run in hours, fundamentally changing how biology and drug discovery are done.
See Also
- Python Biopython Bioinformatics How Python helps scientists read the instruction manual hidden inside every living thing's DNA.
- Python Clinical Trial Analysis How Python helps scientists figure out whether a new medicine actually works by crunching the numbers from clinical trials.
- Python Drug Interaction Modeling How Python helps scientists figure out which medicines are safe to take together and which combinations could be dangerous.
- Python Genomics Sequencing How Python helps scientists read and understand the instruction manual written inside every cell of your body.
- Python Medical Image Analysis How Python helps doctors see inside your body more clearly by teaching computers to read X-rays, MRIs, and CT scans.