RDKit for Chemistry — Deep Dive
Molecular representation internals
RDKit stores molecules as annotated graphs using a C++ core exposed through Python bindings via Boost.Python. Each atom carries properties (atomic number, formal charge, chirality, isotope, number of implicit hydrogens) and each bond stores its type (single, double, triple, aromatic) and stereochemistry.
SMILES parsing and canonicalization
from rdkit import Chem
# Parse SMILES into a molecule object
mol = Chem.MolFromSmiles("c1ccccc1") # benzene
# Canonical SMILES — deterministic string for any input
canonical = Chem.MolToSmiles(mol)
# 'c1ccccc1'
# Different input, same molecule
mol2 = Chem.MolFromSmiles("C1=CC=CC=C1")
assert Chem.MolToSmiles(mol2) == canonical
Canonicalization uses Morgan’s algorithm to assign unique indices to atoms, producing a deterministic string. This is essential for deduplication in large databases.
SMARTS — pattern matching for molecules
SMARTS extends SMILES with query features: atom lists, bond queries, and recursive patterns.
from rdkit import Chem
# Find all molecules with a carboxylic acid group
pattern = Chem.MolFromSmarts("[CX3](=O)[OX2H1]")
aspirin = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")
matches = aspirin.GetSubstructMatches(pattern)
print(f"Carboxylic acid groups found: {len(matches)}")
# Output: 1
SMARTS queries are the foundation of reaction rules, pharmacophore definitions, and alert filters (e.g., flagging reactive functional groups).
Molecular descriptors in practice
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski
mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O") # Aspirin
properties = {
"MW": Descriptors.MolWt(mol),
"LogP": Descriptors.MolLogP(mol),
"HBD": Lipinski.NumHDonors(mol),
"HBA": Lipinski.NumHAcceptors(mol),
"TPSA": Descriptors.TPSA(mol),
"RotBonds": Lipinski.NumRotatableBonds(mol),
}
for name, value in properties.items():
print(f"{name}: {value:.2f}")
Lipinski filtering at scale
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski
from rdkit.Chem import PandasTools
def passes_lipinski(mol):
if mol is None:
return False
return (
Descriptors.MolWt(mol) < 500
and Descriptors.MolLogP(mol) < 5
and Lipinski.NumHDonors(mol) <= 5
and Lipinski.NumHAcceptors(mol) <= 10
)
# Load a library
df = pd.read_csv("compound_library.csv")
df["mol"] = df["smiles"].apply(Chem.MolFromSmiles)
df["lipinski_pass"] = df["mol"].apply(passes_lipinski)
passing = df[df["lipinski_pass"]]
print(f"{len(passing)} of {len(df)} compounds pass Lipinski's Rule of Five")
Fingerprints — molecular search at scale
Morgan fingerprints (ECFP)
Morgan fingerprints capture circular substructures around each atom up to a given radius. ECFP4 (radius=2) and ECFP6 (radius=3) are standard in medicinal chemistry.
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
mol1 = Chem.MolFromSmiles("c1ccc(CC(=O)O)cc1") # phenylacetic acid
mol2 = Chem.MolFromSmiles("c1ccc(CCC(=O)O)cc1") # hydrocinnamic acid
fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1, radius=2, nBits=2048)
fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2, radius=2, nBits=2048)
similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
print(f"Tanimoto similarity: {similarity:.3f}")
# These structurally similar molecules will show high similarity
Bulk similarity search
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
# Pre-compute fingerprints for the library
library_smiles = ["CCO", "c1ccccc1", "CC(=O)O", "c1ccc(O)cc1", "CC(=O)Oc1ccccc1C(=O)O"]
library_mols = [Chem.MolFromSmiles(s) for s in library_smiles]
library_fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in library_mols]
# Query
query = Chem.MolFromSmiles("c1ccc(O)cc1") # phenol
query_fp = AllChem.GetMorganFingerprintAsBitVect(query, 2, nBits=2048)
# Rank by similarity
results = []
for i, fp in enumerate(library_fps):
sim = DataStructs.TanimotoSimilarity(query_fp, fp)
results.append((library_smiles[i], sim))
results.sort(key=lambda x: x[1], reverse=True)
for smiles, sim in results[:3]:
print(f" {smiles}: {sim:.3f}")
For million-scale libraries, use DataStructs.BulkTanimotoSimilarity which is implemented in C++ and runs orders of magnitude faster than Python loops.
Chemical reactions
RDKit models reactions using SMIRKS (reaction SMARTS):
from rdkit import Chem
from rdkit.Chem import AllChem
# Amide bond formation: carboxylic acid + amine → amide + water
rxn = AllChem.ReactionFromSmarts("[C:1](=[O:2])[OH].[NH2:3][C:4]>>[C:1](=[O:2])[NH:3][C:4]")
acid = Chem.MolFromSmiles("CC(=O)O") # acetic acid
amine = Chem.MolFromSmiles("CCN") # ethylamine
products = rxn.RunReactants((acid, amine))
for product_set in products:
for product in product_set:
Chem.SanitizeMol(product)
print(Chem.MolToSmiles(product))
# Output: CCNC(C)=O (N-ethylacetamide)
Reaction handling powers retrosynthetic analysis tools like ASKCOS and forward enumeration in combinatorial library design.
3D conformer generation
from rdkit import Chem
from rdkit.Chem import AllChem
mol = Chem.MolFromSmiles("c1ccc(CC(=O)O)cc1")
mol = Chem.AddHs(mol) # Explicit hydrogens needed for 3D
# Generate multiple conformers
params = AllChem.ETKDGv3()
params.numThreads = 4
params.randomSeed = 42
conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=50, params=params)
# Optimize with MMFF94 force field
results = AllChem.MMFFOptimizeMoleculeConfs(mol, numThreads=4)
# Find lowest energy conformer
energies = [(conf_id, energy) for conf_id, (converged, energy) in zip(conf_ids, results)]
best_conf_id, best_energy = min(energies, key=lambda x: x[1])
print(f"Best conformer: {best_conf_id}, energy: {best_energy:.2f} kcal/mol")
# Write to SDF for visualization
writer = Chem.SDWriter("conformers.sdf")
for conf_id in conf_ids:
writer.write(mol, confId=conf_id)
writer.close()
Integration with machine learning
RDKit is the standard featurizer in chemical ML pipelines:
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Sample data: SMILES and activity labels
data = [
("c1ccc(NC(=O)c2ccccc2)cc1", 1),
("CCO", 0),
("c1ccc(CC(=O)O)cc1", 1),
("CCCC", 0),
("c1ccc(NC(=O)Cc2ccccc2)cc1", 1),
("CC(C)O", 0),
("c1ccc(C(=O)Nc2ccccc2)cc1", 1),
("CCCCC", 0),
]
# Convert to fingerprint features
X = []
y = []
for smiles, label in data:
mol = Chem.MolFromSmiles(smiles)
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
X.append(np.array(fp))
y.append(label)
X = np.array(X)
y = np.array(y)
# Train a simple classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")
Modern approaches use graph neural networks (GNNs) that operate directly on RDKit molecule graphs via libraries like PyTorch Geometric or DGL.
Performance patterns
| Task | Naive approach | Optimized approach |
|---|---|---|
| Parse 10M SMILES | Python loop with MolFromSmiles | Chem.SmilesMolSupplierFromText in batches |
| Similarity search | Python loop over fingerprints | DataStructs.BulkTanimotoSimilarity (C++) |
| Descriptor calculation | One descriptor at a time | Descriptors.CalcMolDescriptors() (all at once) |
| Conformer generation | Sequential ETKDG | ETKDGv3 with numThreads parameter |
| Substructure filtering | SMARTS match per molecule | Pre-screen with fingerprint first, then SMARTS |
Memory considerations
RDKit molecule objects are relatively heavy (kilobytes each). For libraries exceeding available RAM, process molecules in streaming fashion:
supplier = Chem.SmilesMolSupplier("library.smi", titleLine=False)
for mol in supplier:
if mol is not None:
process(mol)
Common pitfalls
- Ignoring sanitization failures.
MolFromSmilesreturnsNonefor invalid SMILES. Always check before calling methods on the result. - Forgetting AddHs for 3D. Conformer generation requires explicit hydrogens. Without them, geometries are distorted.
- Comparing fingerprints of different sizes. Tanimoto similarity between 1024-bit and 2048-bit vectors gives meaningless results.
- Assuming canonical SMILES across toolkits. RDKit and Open Babel may produce different canonical SMILES for the same molecule. Use InChI for cross-toolkit comparison.
The one thing to remember: RDKit’s C++ core combined with Pythonic bindings makes it the industry standard for molecular manipulation — from SMILES parsing through descriptor computation to ML featurization — handling the chemistry so you can focus on the science.