Python for Drug Interaction Modeling — Deep Dive
System architecture for DDI prediction
A production drug-drug interaction (DDI) pipeline chains data acquisition, featurization, model training, and deployment into a clinical decision support system.
DrugBank/ChEMBL → ETL (Python) → Feature Store → Model Training →
Prediction API → EHR Integration (CDS Hooks)
Molecular featurization with RDKit
RDKit is the dominant open-source cheminformatics toolkit in Python. It converts SMILES strings (text representations of molecular structures) into numerical features.
Morgan fingerprints
Morgan (circular) fingerprints encode the local chemical environment around each atom up to a specified radius:
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
def drug_fingerprint(smiles: str, radius: int = 2, n_bits: int = 2048) -> np.ndarray:
mol = Chem.MolFromSmiles(smiles)
if mol is None:
raise ValueError(f"Invalid SMILES: {smiles}")
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
return np.array(fp, dtype=np.float32)
# Example: Aspirin
aspirin_fp = drug_fingerprint("CC(=O)OC1=CC=CC=C1C(=O)O")
print(aspirin_fp.shape) # (2048,)
Molecular descriptors
Beyond fingerprints, RDKit computes physicochemical properties that influence drug interactions:
from rdkit.Chem import Descriptors
def compute_descriptors(smiles: str) -> dict:
mol = Chem.MolFromSmiles(smiles)
return {
"mol_weight": Descriptors.MolWt(mol),
"logp": Descriptors.MolLogP(mol),
"hbd": Descriptors.NumHDonors(mol),
"hba": Descriptors.NumHAcceptors(mol),
"tpsa": Descriptors.TPSA(mol),
"rotatable_bonds": Descriptors.NumRotatableBonds(mol),
"aromatic_rings": Descriptors.NumAromaticRings(mol),
}
Lipophilicity (LogP) and molecular weight influence which CYP450 enzymes metabolize a drug, directly affecting interaction potential.
Drug pair encoding
For pairwise prediction, combine two drug representations:
def encode_pair(smiles_a: str, smiles_b: str) -> np.ndarray:
fp_a = drug_fingerprint(smiles_a)
fp_b = drug_fingerprint(smiles_b)
# Concatenation + element-wise product captures both individual and joint features
return np.concatenate([fp_a, fp_b, fp_a * fp_b])
The element-wise product term captures substructure co-occurrence patterns that are relevant to interaction mechanisms.
Feature-based models with XGBoost
Data preparation from DrugBank
import pandas as pd
from sklearn.model_selection import train_test_split
# Assume drugbank_interactions.csv has columns: drug_a_smiles, drug_b_smiles, interacts
df = pd.read_csv("drugbank_interactions.csv")
# Generate negative samples (non-interacting pairs)
all_drugs = list(set(df["drug_a_smiles"].tolist() + df["drug_b_smiles"].tolist()))
positive_pairs = set(zip(df["drug_a_smiles"], df["drug_b_smiles"]))
negatives = []
rng = np.random.default_rng(42)
while len(negatives) < len(positive_pairs):
a, b = rng.choice(all_drugs, 2, replace=False)
if (a, b) not in positive_pairs and (b, a) not in positive_pairs:
negatives.append({"drug_a_smiles": a, "drug_b_smiles": b, "interacts": 0})
positive_pairs.add((a, b))
df_neg = pd.DataFrame(negatives)
df_full = pd.concat([df, df_neg], ignore_index=True)
Training and evaluation
import xgboost as xgb
from sklearn.metrics import roc_auc_score, average_precision_score
# Featurize all pairs
X = np.array([
encode_pair(row.drug_a_smiles, row.drug_b_smiles)
for row in df_full.itertuples()
])
y = df_full["interacts"].values
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
model = xgb.XGBClassifier(
n_estimators=500,
max_depth=8,
learning_rate=0.05,
subsample=0.8,
colsample_bytree=0.8,
scale_pos_weight=1.0,
eval_metric="aucpr",
early_stopping_rounds=50,
)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=50)
y_pred = model.predict_proba(X_test)[:, 1]
print(f"AUROC: {roc_auc_score(y_test, y_pred):.4f}")
print(f"AUPRC: {average_precision_score(y_test, y_pred):.4f}")
Typical results on DrugBank benchmark: AUROC 0.92-0.95, AUPRC 0.88-0.92.
Knowledge graph approach with PyKEEN
Knowledge graph embeddings model the drug interaction network as a multi-relational graph where entities (drugs, proteins, diseases) are embedded in continuous vector space.
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory
# Triples format: (head, relation, tail)
# e.g., ("aspirin", "inhibits", "COX-2"), ("aspirin", "interacts_with", "warfarin")
triples = TriplesFactory.from_path("drug_kg_triples.tsv")
training, testing = triples.split([0.8, 0.2], random_state=42)
result = pipeline(
training=training,
testing=testing,
model="RotatE",
model_kwargs={"embedding_dim": 256},
training_kwargs={"num_epochs": 500, "batch_size": 1024},
optimizer_kwargs={"lr": 1e-3},
evaluation_kwargs={"batch_size": 256},
)
print(f"Hits@10: {result.metric_results.get_metric('hits@10'):.4f}")
print(f"MRR: {result.metric_results.get_metric('mean_reciprocal_rank'):.4f}")
# Predict new interactions
result.save_to_directory("trained_ddi_model")
Predicting unseen interactions
from pykeen.predict import predict_target
# Which drugs might interact with metformin?
predictions = predict_target(
model=result.model,
head="metformin",
relation="interacts_with",
triples_factory=training,
)
top_10 = predictions.df.head(10)
print(top_10[["tail_label", "score"]])
Graph neural network model
For end-to-end learning on molecular graphs combined with interaction networks:
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool
from torch_geometric.data import Data, Batch
class MolecularGNN(nn.Module):
def __init__(self, atom_features: int, hidden: int, out: int):
super().__init__()
self.conv1 = GCNConv(atom_features, hidden)
self.conv2 = GCNConv(hidden, hidden)
self.conv3 = GCNConv(hidden, out)
def forward(self, x, edge_index, batch):
x = torch.relu(self.conv1(x, edge_index))
x = torch.relu(self.conv2(x, edge_index))
x = self.conv3(x, edge_index)
return global_mean_pool(x, batch)
class DDIPredictor(nn.Module):
def __init__(self, atom_features: int = 78, hidden: int = 128, n_interaction_types: int = 86):
super().__init__()
self.mol_encoder = MolecularGNN(atom_features, hidden, hidden)
self.classifier = nn.Sequential(
nn.Linear(hidden * 3, hidden),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(hidden, n_interaction_types),
)
def forward(self, graph_a, graph_b):
emb_a = self.mol_encoder(graph_a.x, graph_a.edge_index, graph_a.batch)
emb_b = self.mol_encoder(graph_b.x, graph_b.edge_index, graph_b.batch)
combined = torch.cat([emb_a, emb_b, emb_a * emb_b], dim=-1)
return self.classifier(combined)
This architecture jointly learns molecular representations and interaction predictions. The shared encoder ensures that structural similarity between drugs translates into similar embeddings.
Multi-drug interaction modeling
Real patients take more than two drugs. Extending pairwise models to polypharmacy requires:
Set-based approaches
class PolypharmacyPredictor(nn.Module):
def __init__(self, drug_dim: int, hidden: int, n_side_effects: int):
super().__init__()
self.drug_embedding = nn.Embedding(n_drugs, drug_dim)
self.attention = nn.MultiheadAttention(drug_dim, num_heads=4)
self.decoder = nn.Linear(drug_dim, n_side_effects)
def forward(self, drug_ids: torch.Tensor):
# drug_ids: (batch, num_drugs)
embeds = self.drug_embedding(drug_ids) # (batch, num_drugs, dim)
embeds = embeds.transpose(0, 1) # (num_drugs, batch, dim)
attended, _ = self.attention(embeds, embeds, embeds)
pooled = attended.mean(dim=0) # (batch, dim)
return torch.sigmoid(self.decoder(pooled))
Attention mechanisms learn which drug combinations are most relevant for each side effect prediction.
Evaluation pitfalls
Data leakage through drug identity
If Drug A appears in both training and test sets (in different pairs), the model can memorize drug-specific interaction patterns. For realistic evaluation, use drug-level splits: all pairs containing Drug A go to either training or test, never both.
from sklearn.model_selection import GroupKFold
# Group by drug_a to prevent leakage
groups = df_full["drug_a_smiles"].values
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups):
# Drug A in test pairs never appears in training pairs
model.fit(X[train_idx], y[train_idx])
auroc = roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1])
Drug-level split AUROC is typically 5-15 points lower than random split AUROC — the honest measure of generalization.
Class imbalance
Known interactions are a small fraction of all possible pairs. With 2,000 drugs, there are ~2 million possible pairs but only ~200,000 known interactions. Negative sampling ratio and calibration significantly affect model utility.
Deployment as CDS Hook
The trained model serves predictions through a CDS Hooks endpoint integrated with the electronic health record:
from fastapi import FastAPI
import joblib
app = FastAPI()
model = joblib.load("ddi_xgb_model.pkl")
drug_smiles_db = load_drug_smiles_lookup()
@app.post("/cds-services/ddi-check")
async def check_interactions(request: dict):
new_drug = request["context"]["newDrugCode"]
current_drugs = request["context"]["currentMedications"]
new_smiles = drug_smiles_db[new_drug]
alerts = []
for drug_code in current_drugs:
current_smiles = drug_smiles_db[drug_code]
features = encode_pair(new_smiles, current_smiles)
prob = model.predict_proba(features.reshape(1, -1))[0, 1]
if prob > 0.7:
alerts.append({
"summary": f"High interaction risk: {new_drug} + {drug_code} ({prob:.0%})",
"indicator": "critical" if prob > 0.9 else "warning",
"source": {"label": "DDI Prediction Model v2"}
})
return {"cards": alerts}
Tradeoffs
| Approach | Strength | Weakness |
|---|---|---|
| Fingerprint + XGBoost | Fast, interpretable feature importance | Misses network effects |
| Knowledge graph embeddings | Captures transitive relations | Requires curated KG |
| Graph neural networks | End-to-end learning on structures | Data-hungry, slower training |
| Rule-based (DrugBank lookup) | High precision for known interactions | Zero coverage for novel drugs |
The one thing to remember: Building a drug interaction prediction system in Python requires combining molecular featurization (RDKit), graph-based modeling (PyKEEN or PyTorch Geometric), and careful evaluation with drug-level splits — because random splits dramatically overstate real-world performance.
See Also
- Python Biopython Bioinformatics How Python helps scientists read the instruction manual hidden inside every living thing's DNA.
- Python Clinical Trial Analysis How Python helps scientists figure out whether a new medicine actually works by crunching the numbers from clinical trials.
- Python Genomics Sequencing How Python helps scientists read and understand the instruction manual written inside every cell of your body.
- Python Medical Image Analysis How Python helps doctors see inside your body more clearly by teaching computers to read X-rays, MRIs, and CT scans.
- Python Pandemic Modeling How Python helps scientists predict the spread of diseases like COVID-19 and plan the best ways to slow them down.