Python for Protein Structure Prediction — Deep Dive

Production structure prediction pipeline

A complete protein structure prediction system involves more than running AlphaFold. It chains sequence analysis, structure prediction, quality assessment, and downstream applications.

Sequence → MSA Generation (MMseqs2/HHblits) → Structure Prediction (AlphaFold/ESMFold) → 
  Quality Assessment (pLDDT/PAE) → Refinement → Docking/Design

AlphaFold2 architecture internals

Understanding the model helps interpret its outputs and failure modes.

Input processing

AlphaFold takes two inputs:

  1. MSA features — aligned sequences of evolutionary relatives, typically from UniRef90 and MGnify databases
  2. Template features — experimentally solved structures of homologs from PDB70
# Using ColabFold's MSA generation (faster than original AlphaFold pipeline)
from colabfold.mmseqs2 import get_msa

msa_results = get_msa(
    query_sequence="MKFLILLFNILCLFPVLAADNHGVSMNAS...",
    databases=["uniref30", "colabfold_envdb"],
    use_pairing=False,  # True for multimer predictions
)

# MSA depth affects prediction quality
print(f"MSA depth: {len(msa_results['msa'])}")  # >100 preferred

Evoformer block

The Evoformer processes two representations simultaneously:

  • MSA representation (N_seq × L × d) — information about evolutionary conservation
  • Pair representation (L × L × d) — pairwise distance/angle information

These representations communicate through row attention, column attention, and outer product mean operations across 48 blocks. The pair representation converges toward the true distance map.

Structure module and recycling

The structure module converts pair representations into 3D coordinates using Invariant Point Attention (IPA) — a geometric attention mechanism that operates in the protein’s local reference frame:

# AlphaFold uses recycling: predictions feed back into the Evoformer
# Typical: 3 recycles, each refining the structure
# More recycles help difficult targets

config = {
    "num_recycles": 12,      # Up to 12 for challenging proteins
    "recycle_early_stop_tolerance": 0.5,  # Stop if change < 0.5Å
    "num_models": 5,          # Rank multiple model seeds
}

Confidence metrics — interpretation guide

pLDDT (predicted Local Distance Difference Test)

Per-residue confidence score (0-100) stored in the B-factor column:

from Bio.PDB import PDBParser
import numpy as np

parser = PDBParser(QUIET=True)
structure = parser.get_structure("pred", "ranked_0.pdb")

plddt_scores = []
for residue in structure[0]["A"]:
    if "CA" in residue:
        plddt_scores.append(residue["CA"].get_bfactor())

plddt = np.array(plddt_scores)
print(f"Mean pLDDT: {plddt.mean():.1f}")
print(f"Confident residues (>90): {(plddt > 90).sum()} / {len(plddt)}")
print(f"Disordered regions (<50): {(plddt < 50).sum()} / {len(plddt)}")

Interpretation:

  • >90 — High confidence, backbone likely accurate
  • 70-90 — Good confidence, overall fold correct
  • 50-70 — Low confidence, use with caution
  • <50 — Very low confidence, likely intrinsically disordered

PAE (Predicted Aligned Error)

A L×L matrix indicating confidence in the relative position of residue pairs:

import json
import matplotlib.pyplot as plt

with open("ranked_0_pae.json") as f:
    pae_data = json.load(f)

pae_matrix = np.array(pae_data["predicted_aligned_error"])

plt.figure(figsize=(10, 8))
plt.imshow(pae_matrix, cmap="bwr_r", vmin=0, vmax=30)
plt.colorbar(label="Expected position error (Å)")
plt.xlabel("Scored residue")
plt.ylabel("Aligned residue")
plt.title("Predicted Aligned Error")
plt.savefig("pae_plot.png", dpi=150)

Low PAE (blue) between two residue ranges indicates confident relative positioning — critical for identifying rigid domains vs flexible linkers.

Domain identification from PAE

from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def identify_domains(pae_matrix: np.ndarray, threshold: float = 12.0) -> np.ndarray:
    """Cluster residues into domains based on PAE matrix."""
    # Convert PAE to distance matrix
    pae_symmetric = (pae_matrix + pae_matrix.T) / 2
    condensed = squareform(pae_symmetric, checks=False)
    
    Z = linkage(condensed, method="average")
    domains = fcluster(Z, t=threshold, criterion="distance")
    return domains

domains = identify_domains(pae_matrix)
print(f"Found {len(set(domains))} domains")
for d in sorted(set(domains)):
    residues = np.where(domains == d)[0]
    print(f"  Domain {d}: residues {residues[0]+1}-{residues[-1]+1}")

Multimer prediction with AlphaFold-Multimer

Predicting protein complexes requires special handling:

from colabfold.batch import run

# Separate chains with ":" in the sequence
complex_sequence = "CHAIN_A_SEQUENCE:CHAIN_B_SEQUENCE"

run(
    queries=[("complex", complex_sequence, None)],
    result_dir=Path("complex_prediction"),
    model_type="alphafold2_multimer_v3",
    num_models=5,
    num_recycles=20,  # Complexes benefit from more recycles
    recycle_early_stop_tolerance=0.5,
)

Evaluate complex predictions by checking inter-chain PAE — low PAE between chains indicates a confident interface prediction.

Molecular docking with predicted structures

Binding site detection

from biopandas.pdb import PandasPdb

def find_pockets_fpocket(pdb_path: str) -> list[dict]:
    """Run fpocket and parse results."""
    import subprocess
    subprocess.run(["fpocket", "-f", pdb_path], check=True)
    
    # Parse fpocket output
    ppdb = PandasPdb().read_pdb(f"{pdb_path.replace('.pdb', '_out/pockets')}/pocket1_atm.pdb")
    pocket_center = ppdb.df["ATOM"][["x_coord", "y_coord", "z_coord"]].mean()
    return [{"center": pocket_center.values, "pdb": ppdb}]

Virtual screening with AutoDock Vina

from vina import Vina

def dock_compound(receptor_pdb: str, ligand_sdf: str, 
                   center: tuple, box_size: tuple = (20, 20, 20)) -> dict:
    v = Vina(sf_name="vina")
    v.set_receptor(receptor_pdb)
    v.set_ligand_from_file(ligand_sdf)
    
    v.compute_vina_maps(center=center, box_size=box_size)
    v.dock(exhaustiveness=32, n_poses=10)
    
    energies = v.energies()
    return {
        "best_affinity": energies[0][0],  # kcal/mol
        "poses": v.poses(),
    }

# Screen a library
results = []
for ligand_file in Path("library/").glob("*.sdf"):
    result = dock_compound("target.pdbqt", str(ligand_file), center=(10.5, 23.1, -5.2))
    results.append({"ligand": ligand_file.stem, "affinity": result["best_affinity"]})

results_df = pd.DataFrame(results).sort_values("affinity")
print(results_df.head(20))  # Top 20 candidates

Protein design with RFdiffusion

Beyond prediction, Python tools now design novel proteins. RFdiffusion (Baker Lab) generates protein backbones using diffusion models:

# RFdiffusion generates novel protein structures
# that fold into desired shapes

# Design a protein that binds a target
# Specify target structure and desired binding site
config = {
    "inference": {
        "input_pdb": "target_protein.pdb",
        "num_designs": 100,
        "design_startnum": 0,
    },
    "contigmap": {
        "contigs": ["A10-40/0 B1-100/0"],  # Design 10-40 residues binding to chain B
    },
    "diffuser": {
        "T": 50,  # Diffusion steps
    }
}

After backbone generation, ProteinMPNN (also Python) designs amino acid sequences that fold into the generated backbone:

# ProteinMPNN: sequence design for a given backbone
from protein_mpnn_utils import ProteinMPNN

model = ProteinMPNN()
sequences = model.design(
    pdb_path="designed_backbone.pdb",
    num_sequences=8,
    temperature=0.1,  # Lower = more conservative
)

for seq in sequences:
    print(f"Score: {seq.score:.2f}, Sequence: {seq.sequence[:50]}...")

The design cycle: RFdiffusion (backbone) → ProteinMPNN (sequence) → AlphaFold (validate fold) → experimental testing.

Benchmarking and validation

CASP-style evaluation

from Bio.PDB import Superimposer
from Bio.PDB.DSSP import DSSP

def evaluate_prediction(predicted_pdb: str, experimental_pdb: str) -> dict:
    parser = PDBParser(QUIET=True)
    pred = parser.get_structure("pred", predicted_pdb)
    expt = parser.get_structure("expt", experimental_pdb)
    
    # C-alpha RMSD
    pred_ca = [r["CA"] for r in pred[0]["A"] if "CA" in r]
    expt_ca = [r["CA"] for r in expt[0]["A"] if "CA" in r]
    
    min_len = min(len(pred_ca), len(expt_ca))
    sup = Superimposer()
    sup.set_atoms(expt_ca[:min_len], pred_ca[:min_len])
    
    # GDT-TS (Global Distance Test - Total Score)
    distances = np.array([
        (pred_ca[i].get_vector() - expt_ca[i].get_vector()).norm()
        for i in range(min_len)
    ])
    
    gdt_ts = 25 * sum(
        (distances < threshold).mean() for threshold in [1, 2, 4, 8]
    )
    
    return {
        "rmsd": sup.rms,
        "gdt_ts": gdt_ts,
        "n_residues": min_len,
    }

Tradeoffs

ToolSpeedAccuracyUse case
AlphaFold2Hours (with MSA)Best for most proteinsResearch, drug discovery
ColabFoldMinutes (MMseqs2 MSA)Near AlphaFold2 qualityRapid screening
ESMFoldSeconds (no MSA)Good for well-studied familiesProteome-scale scanning
RoseTTAFoldMinutesGood, especially for complexesWhen AlphaFold struggles
OpenFoldVariableMatches AlphaFold2When you need to retrain

The one thing to remember: Python’s protein structure ecosystem — from AlphaFold prediction and confidence analysis through molecular docking and de novo design with RFdiffusion — has compressed what used to be years of experimental work into computational pipelines that run in hours, fundamentally changing how biology and drug discovery are done.

pythonbioinformaticsstructural-biology

See Also