Python for Protein Structure Prediction — Core Concepts

Understand how Python-accessible tools such as AlphaFold and ESMFold predict 3D protein structures from amino acid sequences, and how Biopython helps analyze the results.

Why protein structure prediction matters

A protein’s function is largely determined by its 3D structure. Understanding structure enables:

Drug design — finding molecules that fit into protein binding sites like keys into locks
Disease understanding — seeing how mutations change protein shape and cause disease
Enzyme engineering — redesigning proteins for industrial applications (biofuels, detergents, food production)

The gap between known sequences (~250 million in UniProt) and experimentally solved structures (~200,000 in the PDB) makes computational prediction essential.

The protein folding problem

Proteins are chains of amino acids (20 types). A typical protein has 300-500 amino acids. The chain folds into a specific 3D shape determined by the interactions between amino acids — hydrogen bonds, hydrophobic effects, electrostatic attractions.

The challenge: even if each residue could adopt only three conformations, a 300-residue chain would have 3^300 ≈ 10^143 possible conformations, far too many to search exhaustively (Levinthal’s paradox). Yet real proteins fold in milliseconds to seconds. Finding the correct shape computationally was long regarded as one of biology’s grand challenges.
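
The 10^143 figure can be reproduced with a quick calculation. The sketch below assumes, as a simplification, three conformations per residue; the function name is illustrative, and the sampling rate is an assumption chosen only to convey scale.

```python
import math

def log10_conformations(n_residues: int, states_per_residue: int = 3) -> float:
    """Return log10 of states_per_residue ** n_residues without overflow."""
    return n_residues * math.log10(states_per_residue)

exponent = log10_conformations(300)
print(f"3^300 is about 10^{exponent:.0f}")

# Assumed sampling rate: 10^13 conformations per second (illustrative only)
log10_seconds = exponent - 13
log10_years = log10_seconds - math.log10(3600 * 24 * 365)
print(f"Exhaustive search would take about 10^{log10_years:.0f} years")
```

Working in log space keeps the numbers manageable and makes the point directly: no realistic sampling rate closes a gap of this size, so proteins cannot be folding by random search.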

AlphaFold2 — the breakthrough

DeepMind’s AlphaFold2, unveiled at CASP14 in 2020, reached near-experimental accuracy for many single-domain proteins. Its architecture uses:

Multiple Sequence Alignment (MSA): finds evolutionary relatives of the target protein. Positions that are conserved, or that mutate together, across species reveal structural constraints.
Evoformer: a stack of 48 attention-based blocks that jointly refines the MSA representation and a pairwise residue representation.
Structure Module: converts the processed representations into 3D atomic coordinates, alongside per-residue confidence scores.
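
To make the MSA idea concrete, here is a minimal sketch that scores per-column conservation in a toy alignment using Shannon entropy. The aligned sequences are invented for illustration and the function name is hypothetical. Real pipelines build alignments with tools such as jackhmmer, HHblits, or MMseqs2, and the network consumes the full alignment rather than a summary score like this one.

```python
import math
from collections import Counter

def column_entropy(column: str) -> float:
    """Shannon entropy (bits) of the residues in one alignment column, ignoring gaps."""
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

# Toy alignment, invented for illustration (rows = sequences, columns = positions)
msa = [
    "MKVLA",
    "MKILG",
    "MRVLS",
    "MKVLT",
]

entropies = [column_entropy("".join(col)) for col in zip(*msa)]
for position, h in enumerate(entropies, start=1):
    label = "conserved" if h == 0 else "variable"
    print(f"column {position}: entropy = {h:.2f} bits ({label})")
```

Columns with zero entropy are fully conserved, which hints that the position is structurally or functionally constrained.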

AlphaFold2 achieved a median GDT_TS (Global Distance Test) score of about 92 at CASP14, where 100 is a perfect match to the experimental structure. Previous methods scored around 60.
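
For intuition about the metric, GDT_TS averages the percentage of residues whose C-alpha atoms fall within 1, 2, 4, and 8 Å of their experimental positions. The sketch below is a simplification: it assumes the per-residue distances come from one fixed superposition, whereas the official score searches over many superpositions for each cutoff. The distances are invented for illustration.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Approximate GDT_TS from per-residue C-alpha distances (Å) after superposition."""
    if not distances:
        raise ValueError("need at least one distance")
    n = len(distances)
    # Fraction of residues within each cutoff, then averaged over the cutoffs
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

# Hypothetical per-residue deviations for a 10-residue model
example = [0.4, 0.6, 0.9, 1.3, 1.8, 2.5, 3.1, 3.9, 6.0, 9.5]
score = gdt_ts(example)
print(f"GDT_TS = {score:.1f}")
```

Because the loosest cutoff is 8 Å, a model can score reasonably well with the right overall fold even when local details are off. Scores above 90 require most residues to sit within 1 to 2 Å.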

Running AlphaFold predictions

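# AlphaFold requires significant setup (sequence databases, GPU).
# ColabFold simplifies access via Google Colab or a local installation.
# Note: argument names follow recent ColabFold releases and may differ in yours.

from pathlib import Path

from colabfold.batch import get_queries, run

# get_queries returns the parsed queries plus a flag for multi-chain inputs
queries, is_complex = get_queries("input_sequences.fasta")

run(
    queries=queries,
    result_dir=Path("predictions"),
    num_models=5,            # number of model weight sets to run
    is_complex=is_complex,
    num_recycles=3,          # refinement passes through the network
    model_type="alphafold2_ptm",
    use_templates=True,
)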

The output includes:

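PDB files: 3D coordinates of every atom
pLDDT scores: per-residue confidence on a 0-100 scale (>90 = high confidence, <50 = very low confidence, often a disordered region)
PAE (Predicted Aligned Error): expected position error, in Å, between pairs of residues; low values between two domains suggest their relative placement can be trusted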
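
AlphaFold and ColabFold write pLDDT into the B-factor column of each ATOM record, so a rough confidence summary needs only fixed-width PDB parsing. The sketch below builds a tiny invented PDB fragment with a hypothetical helper and reads the scores back; the band thresholds follow the convention quoted above. For real files, a parser such as Biopython is more robust.

```python
def plddt_per_residue(pdb_text: str) -> dict:
    """Map (chain, residue number) to pLDDT, read from the B-factor of each CA atom."""
    scores = {}
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name; columns 61-66 hold the B-factor
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain_id = line[21]
            res_num = int(line[22:26])
            scores[(chain_id, res_num)] = float(line[60:66])
    return scores

def confidence_band(plddt: float) -> str:
    """Coarse label using the >90 and <50 thresholds."""
    if plddt > 90:
        return "high"
    if plddt < 50:
        return "very low"
    return "intermediate"

def toy_ca_record(serial, resname, res_num, plddt):
    """Format one fixed-width ATOM record with placeholder coordinates (illustration only)."""
    return (f"ATOM  {serial:5d}  CA  {resname} A{res_num:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}           C")

pdb_text = "\n".join([
    toy_ca_record(1, "MET", 1, 95.2),
    toy_ca_record(2, "LYS", 2, 71.5),
    toy_ca_record(3, "PHE", 3, 42.8),
])

scores = plddt_per_residue(pdb_text)
for (chain_id, res_num), plddt in scores.items():
    print(f"{chain_id}{res_num}: pLDDT {plddt:.1f} ({confidence_band(plddt)})")
```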

ESMFold — faster alternative

Meta’s ESMFold uses a protein language model (ESM-2) instead of multiple sequence alignments, making predictions up to ~60x faster than AlphaFold2 on short sequences, at the cost of some accuracy:

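import torch
from esm import pretrained

# esmfold_v1() returns only the model (there is no separate alphabet object)
model = pretrained.esmfold_v1()
model = model.eval().cuda()  # requires a CUDA-capable GPU

# Sequence is truncated here; supply the full amino acid sequence
sequence = "MKFLILLFNILCLFPVLAADNHGVSMNAS..."

with torch.no_grad():
    output = model.infer_pdb(sequence)  # PDB-formatted string

# Write PDB file
with open("prediction.pdb", "w") as f:
    f.write(output)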

ESMFold works well for rapid screening of large sequence databases where MSA computation would be prohibitive.

Analyzing structures with Python

Biopython for PDB parsing

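from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", "prediction.pdb")

model = structure[0]  # first (usually only) model in the file
for chain in model:
    for residue in chain:
        # Standard residues have a blank hetero flag; this skips waters and ligands.
        # The "CA" check avoids a KeyError on incomplete residues.
        if residue.id[0] == " " and "CA" in residue:
            ca = residue["CA"]   # C-alpha atom
            print(f"{residue.get_resname()} {residue.id[1]}: {ca.get_vector()}")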
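
Once C-alpha coordinates are extracted, many analyses reduce to simple geometry. As one example, the sketch below lists residue pairs whose C-alpha atoms lie within 8 Å, a commonly used contact cutoff. The coordinates are invented and the function name is hypothetical; with Biopython, the same coordinates would come from each atom's `coord` attribute.

```python
import math
from itertools import combinations

def ca_contacts(coords, cutoff=8.0, min_separation=3):
    """Return (i, j, distance) for residue pairs closer than `cutoff` Å.

    `coords` maps residue number to an (x, y, z) tuple. Pairs fewer than
    `min_separation` positions apart in sequence are skipped as trivially close.
    """
    contacts = []
    for (i, a), (j, b) in combinations(sorted(coords.items()), 2):
        if j - i < min_separation:
            continue
        d = math.dist(a, b)
        if d < cutoff:
            contacts.append((i, j, d))
    return contacts

# Invented coordinates for five residues (Å)
toy_coords = {
    1: (0.0, 0.0, 0.0),
    2: (3.8, 0.0, 0.0),
    3: (7.6, 0.0, 0.0),
    4: (7.6, 3.8, 0.0),
    5: (3.8, 3.8, 0.0),
}

contacts = ca_contacts(toy_coords)
for i, j, d in contacts:
    print(f"residues {i} and {j}: {d:.1f} Å")
```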

PyMOL for visualization

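from pymol import cmd

cmd.load("prediction.pdb", "protein")
cmd.color("cyan", "protein")    # base color; overridden by the spectrum below
cmd.show("cartoon", "protein")

# Color by confidence (pLDDT stored in B-factor column):
# red = low confidence, blue = high confidence
cmd.spectrum("b", "red_white_blue", "protein", minimum=0, maximum=100)
# ray=1 ray-traces the image so the requested size is honored without a GUI
cmd.png("structure_confidence.png", width=1200, height=900, dpi=300, ray=1)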

Structural comparison (RMSD)

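from Bio.PDB import PDBParser, Superimposer

parser = PDBParser(QUIET=True)
# "experimental.pdb" and chain "A" are placeholders; substitute your own reference
fixed_chain = parser.get_structure("ref", "experimental.pdb")[0]["A"]
moving_chain = parser.get_structure("pred", "prediction.pdb")[0]["A"]

# Get C-alpha atoms from both structures
fixed_atoms = [residue["CA"] for residue in fixed_chain if "CA" in residue]
moving_atoms = [residue["CA"] for residue in moving_chain if "CA" in residue]

# set_atoms needs two equal-length lists of matched atoms; trim or align
# the residues first if the two chains differ in length
sup = Superimposer()
sup.set_atoms(fixed_atoms, moving_atoms)
print(f"RMSD: {sup.rms:.2f} Å")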

A C-alpha RMSD below about 2 Å generally indicates an accurate prediction for a single protein domain.
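
For reference, the value that `Superimposer` reports is the root-mean-square deviation over matched atom pairs after optimal superposition. The sketch below computes only the final formula on coordinates assumed to be already superimposed; it does not perform the rotation and translation fit. The coordinates are invented for illustration.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD (Å) between two equal-length lists of (x, y, z) points, no fitting applied."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
predicted = [(0.0, 1.0, 0.0), (3.8, 0.0, 1.0), (7.6, 0.0, -1.0)]
value = rmsd(reference, predicted)
print(f"RMSD: {value:.2f} Å")
```

Because deviations are squared before averaging, a few badly placed residues (a floppy loop or terminus, for instance) can inflate RMSD even when the core is well predicted.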

From structure to drug discovery

Predicted structures enable virtual screening — computationally docking millions of small molecules into protein binding sites:

Predict target protein structure with AlphaFold
Identify binding pockets with fpocket or SiteMap
Dock candidate molecules with AutoDock Vina (Python bindings via vina)
Score and rank candidates for experimental testing

This pipeline, fully accessible from Python, can substantially shorten the computational phase of early-stage drug discovery, in some reported cases from years to weeks.
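
The docking step can be sketched as follows. The helper that derives a search box from pocket coordinates is hypothetical, as are the pocket coordinates and any file names passed in. The `Vina` calls follow the AutoDock Vina Python bindings, whose interface may differ between versions, and the import is deferred so that the box helper runs without Vina installed. Receptor and ligand files are assumed to be in PDBQT format already.

```python
def docking_box(pocket_coords, padding=4.0):
    """Return (center, size) of an axis-aligned box around the pocket atoms plus padding (Å)."""
    xs, ys, zs = zip(*pocket_coords)
    center = [(min(v) + max(v)) / 2 for v in (xs, ys, zs)]
    size = [(max(v) - min(v)) + 2 * padding for v in (xs, ys, zs)]
    return center, size

def dock_ligand(receptor_pdbqt, ligand_pdbqt, center, size, out_pdbqt):
    """Dock one ligand with AutoDock Vina and return the best score (kcal/mol)."""
    from vina import Vina  # deferred import: requires the `vina` package

    v = Vina(sf_name="vina")
    v.set_receptor(receptor_pdbqt)
    v.set_ligand_from_file(ligand_pdbqt)
    v.compute_vina_maps(center=center, box_size=size)
    v.dock(exhaustiveness=8, n_poses=10)
    v.write_poses(out_pdbqt, n_poses=5, overwrite=True)
    # First row is the best pose; first column is the total score
    return float(v.energies(n_poses=1)[0][0])

# Invented pocket atom coordinates, e.g. taken from a pocket-detection result
pocket = [(10.0, 12.0, 8.0), (14.0, 15.0, 9.0), (12.0, 18.0, 13.0)]
center, size = docking_box(pocket)
print("center:", center, "size:", size)
```

Calling `dock_ligand` over a compound library and sorting by the returned score (more negative is better) gives a ranked list for experimental follow-up.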

Common misconception

“AlphaFold solved all of structural biology.” AlphaFold excels at single-domain proteins with many sequenced relatives, but it struggles with protein complexes (improving with AlphaFold-Multimer), intrinsically disordered regions (which genuinely lack a fixed shape), and proteins with few evolutionary relatives. Membrane proteins and large multi-subunit assemblies remain challenging, and experimental methods (cryo-EM, X-ray crystallography) are still essential.

Real-world impact

AlphaFold DB provides predicted structures for 200+ million proteins — nearly every known protein sequence. Researchers download and analyze these using Python.
Insilico Medicine reported using an AlphaFold-predicted structure of CDK20, a liver cancer target without an experimental structure, to identify an inhibitor hit within weeks of target selection.
The RCSB Protein Data Bank portal now serves computed structure models from AlphaFold DB alongside experimental entries, and Python pipelines are widely used for bulk download, filtering, and analysis.
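
Fetching a predicted structure from AlphaFold DB can be sketched as below. The URL pattern and version number reflect the file naming AlphaFold DB has used and should be treated as assumptions; check the database's download page for the current version. Both function names are hypothetical, and the download itself requires network access, so only the URL builder is exercised here.

```python
from urllib.request import urlretrieve

def alphafold_db_url(uniprot_id: str, version: int = 4, fmt: str = "pdb") -> str:
    """Build the download URL for the first fragment (F1) of an AlphaFold DB model."""
    return (
        "https://alphafold.ebi.ac.uk/files/"
        f"AF-{uniprot_id.upper()}-F1-model_v{version}.{fmt}"
    )

def download_model(uniprot_id: str, dest: str, version: int = 4) -> str:
    """Download one predicted structure to `dest` (requires network access)."""
    urlretrieve(alphafold_db_url(uniprot_id, version), dest)
    return dest

url = alphafold_db_url("p69905")  # P69905: human hemoglobin subunit alpha
print(url)
```

A file saved this way can be passed to the Biopython parsing code shown earlier, since pLDDT is stored in the B-factor column just as in ColabFold output.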

The one thing to remember: Python bridges the gap between protein sequence databases and 3D structural understanding through AlphaFold and ESMFold, enabling drug discovery, disease research, and enzyme engineering at a scale that was impossible before 2020.
