RDKit for Chemistry — Deep Dive

Molecular representation internals

RDKit stores molecules as annotated graphs using a C++ core exposed through Python bindings via Boost.Python. Each atom carries properties (atomic number, formal charge, chirality, isotope, number of implicit hydrogens) and each bond stores its type (single, double, triple, aromatic) and stereochemistry.

SMILES parsing and canonicalization

from rdkit import Chem

# Parse SMILES into a molecule object
mol = Chem.MolFromSmiles("c1ccccc1")  # benzene

# Canonical SMILES — deterministic string for any input
canonical = Chem.MolToSmiles(mol)
# 'c1ccccc1'

# Different input, same molecule
mol2 = Chem.MolFromSmiles("C1=CC=CC=C1")
assert Chem.MolToSmiles(mol2) == canonical

Canonicalization uses Morgan’s algorithm to assign unique indices to atoms, producing a deterministic string. This is essential for deduplication in large databases.

SMARTS — pattern matching for molecules

SMARTS extends SMILES with query features: atom lists, bond queries, and recursive patterns.

from rdkit import Chem

# Find all molecules with a carboxylic acid group
pattern = Chem.MolFromSmarts("[CX3](=O)[OX2H1]")

aspirin = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")
matches = aspirin.GetSubstructMatches(pattern)
print(f"Carboxylic acid groups found: {len(matches)}")
# Output: 1

SMARTS queries are the foundation of reaction rules, pharmacophore definitions, and alert filters (e.g., flagging reactive functional groups).

Molecular descriptors in practice

from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

mol = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")  # Aspirin

properties = {
    "MW": Descriptors.MolWt(mol),
    "LogP": Descriptors.MolLogP(mol),
    "HBD": Lipinski.NumHDonors(mol),
    "HBA": Lipinski.NumHAcceptors(mol),
    "TPSA": Descriptors.TPSA(mol),
    "RotBonds": Lipinski.NumRotatableBonds(mol),
}

for name, value in properties.items():
    print(f"{name}: {value:.2f}")

Lipinski filtering at scale

import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski
from rdkit.Chem import PandasTools

def passes_lipinski(mol):
    if mol is None:
        return False
    return (
        Descriptors.MolWt(mol) < 500
        and Descriptors.MolLogP(mol) < 5
        and Lipinski.NumHDonors(mol) <= 5
        and Lipinski.NumHAcceptors(mol) <= 10
    )

# Load a library
df = pd.read_csv("compound_library.csv")
df["mol"] = df["smiles"].apply(Chem.MolFromSmiles)
df["lipinski_pass"] = df["mol"].apply(passes_lipinski)

passing = df[df["lipinski_pass"]]
print(f"{len(passing)} of {len(df)} compounds pass Lipinski's Rule of Five")

Fingerprints — molecular search at scale

Morgan fingerprints (ECFP)

Morgan fingerprints capture circular substructures around each atom up to a given radius. ECFP4 (radius=2) and ECFP6 (radius=3) are standard in medicinal chemistry.

from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

mol1 = Chem.MolFromSmiles("c1ccc(CC(=O)O)cc1")  # phenylacetic acid
mol2 = Chem.MolFromSmiles("c1ccc(CCC(=O)O)cc1") # hydrocinnamic acid

fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1, radius=2, nBits=2048)
fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2, radius=2, nBits=2048)

similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
print(f"Tanimoto similarity: {similarity:.3f}")
# These structurally similar molecules will show high similarity
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

# Pre-compute fingerprints for the library
library_smiles = ["CCO", "c1ccccc1", "CC(=O)O", "c1ccc(O)cc1", "CC(=O)Oc1ccccc1C(=O)O"]
library_mols = [Chem.MolFromSmiles(s) for s in library_smiles]
library_fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in library_mols]

# Query
query = Chem.MolFromSmiles("c1ccc(O)cc1")  # phenol
query_fp = AllChem.GetMorganFingerprintAsBitVect(query, 2, nBits=2048)

# Rank by similarity
results = []
for i, fp in enumerate(library_fps):
    sim = DataStructs.TanimotoSimilarity(query_fp, fp)
    results.append((library_smiles[i], sim))

results.sort(key=lambda x: x[1], reverse=True)
for smiles, sim in results[:3]:
    print(f"  {smiles}: {sim:.3f}")

For million-scale libraries, use DataStructs.BulkTanimotoSimilarity which is implemented in C++ and runs orders of magnitude faster than Python loops.

Chemical reactions

RDKit models reactions using SMIRKS (reaction SMARTS):

from rdkit import Chem
from rdkit.Chem import AllChem

# Amide bond formation: carboxylic acid + amine → amide + water
rxn = AllChem.ReactionFromSmarts("[C:1](=[O:2])[OH].[NH2:3][C:4]>>[C:1](=[O:2])[NH:3][C:4]")

acid = Chem.MolFromSmiles("CC(=O)O")   # acetic acid
amine = Chem.MolFromSmiles("CCN")       # ethylamine

products = rxn.RunReactants((acid, amine))
for product_set in products:
    for product in product_set:
        Chem.SanitizeMol(product)
        print(Chem.MolToSmiles(product))
# Output: CCNC(C)=O  (N-ethylacetamide)

Reaction handling powers retrosynthetic analysis tools like ASKCOS and forward enumeration in combinatorial library design.

3D conformer generation

from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("c1ccc(CC(=O)O)cc1")
mol = Chem.AddHs(mol)  # Explicit hydrogens needed for 3D

# Generate multiple conformers
params = AllChem.ETKDGv3()
params.numThreads = 4
params.randomSeed = 42

conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=50, params=params)

# Optimize with MMFF94 force field
results = AllChem.MMFFOptimizeMoleculeConfs(mol, numThreads=4)

# Find lowest energy conformer
energies = [(conf_id, energy) for conf_id, (converged, energy) in zip(conf_ids, results)]
best_conf_id, best_energy = min(energies, key=lambda x: x[1])
print(f"Best conformer: {best_conf_id}, energy: {best_energy:.2f} kcal/mol")

# Write to SDF for visualization
writer = Chem.SDWriter("conformers.sdf")
for conf_id in conf_ids:
    writer.write(mol, confId=conf_id)
writer.close()

Integration with machine learning

RDKit is the standard featurizer in chemical ML pipelines:

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Sample data: SMILES and activity labels
data = [
    ("c1ccc(NC(=O)c2ccccc2)cc1", 1),
    ("CCO", 0),
    ("c1ccc(CC(=O)O)cc1", 1),
    ("CCCC", 0),
    ("c1ccc(NC(=O)Cc2ccccc2)cc1", 1),
    ("CC(C)O", 0),
    ("c1ccc(C(=O)Nc2ccccc2)cc1", 1),
    ("CCCCC", 0),
]

# Convert to fingerprint features
X = []
y = []
for smiles, label in data:
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
    X.append(np.array(fp))
    y.append(label)

X = np.array(X)
y = np.array(y)

# Train a simple classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")

Modern approaches use graph neural networks (GNNs) that operate directly on RDKit molecule graphs via libraries like PyTorch Geometric or DGL.

Performance patterns

TaskNaive approachOptimized approach
Parse 10M SMILESPython loop with MolFromSmilesChem.SmilesMolSupplierFromText in batches
Similarity searchPython loop over fingerprintsDataStructs.BulkTanimotoSimilarity (C++)
Descriptor calculationOne descriptor at a timeDescriptors.CalcMolDescriptors() (all at once)
Conformer generationSequential ETKDGETKDGv3 with numThreads parameter
Substructure filteringSMARTS match per moleculePre-screen with fingerprint first, then SMARTS

Memory considerations

RDKit molecule objects are relatively heavy (kilobytes each). For libraries exceeding available RAM, process molecules in streaming fashion:

supplier = Chem.SmilesMolSupplier("library.smi", titleLine=False)
for mol in supplier:
    if mol is not None:
        process(mol)

Common pitfalls

  1. Ignoring sanitization failures. MolFromSmiles returns None for invalid SMILES. Always check before calling methods on the result.
  2. Forgetting AddHs for 3D. Conformer generation requires explicit hydrogens. Without them, geometries are distorted.
  3. Comparing fingerprints of different sizes. Tanimoto similarity between 1024-bit and 2048-bit vectors gives meaningless results.
  4. Assuming canonical SMILES across toolkits. RDKit and Open Babel may produce different canonical SMILES for the same molecule. Use InChI for cross-toolkit comparison.

The one thing to remember: RDKit’s C++ core combined with Pythonic bindings makes it the industry standard for molecular manipulation — from SMILES parsing through descriptor computation to ML featurization — handling the chemistry so you can focus on the science.

pythonchemistryscience