Chemistry Engine — ChemLib¶
Overview¶
The chemistry engine is the scientific core of ChemLib. It wraps RDKit functionality into clean, modular Python functions organized by concern. All chemistry modules live in chemlib/chemistry/ and have zero database dependencies — they operate purely on RDKit Mol objects, SMILES strings, and Python data structures.
1. Chemical Representations (representations.py)¶
Core Concepts¶
1D Representations (connectivity only, no coordinates):
- SMILES: O=C(O)Cc1ccccc1 — Human-readable line notation
- Canonical SMILES: Unique, deterministic form via Chem.MolToSmiles(mol, canonical=True)
- InChI: InChI=1S/C8H8O2/c9-8(10)6-7-4-2-1-3-5-7/h1-5H,6H2,(H,9,10)
- InChIKey: WLJVXDMOQOGPHL-UHFFFAOYSA-N — 27-char hash for indexing
2D Representations (connectivity + planar coordinates):
- MOL Block (V2000): Connection table with x,y coordinates for 2D depiction
- Generated via AllChem.Compute2DCoords(mol) then Chem.MolToMolBlock(mol)
3D Representations (connectivity + 3D coordinates):
- MOL Block with 3D coords: Full x,y,z atom positions
- Generated via AllChem.EmbedMolecule(mol) (distance geometry)
- SDF: Multiple MOL blocks + properties, separated by $$$$
Key Functions¶
def smiles_to_mol(smiles: str) -> Chem.Mol:
"""Parse SMILES. Raises InvalidSMILESError if parsing fails."""
mol = Chem.MolFromSmiles(smiles)
if mol is None:
raise InvalidSMILESError(f"Cannot parse SMILES: '{smiles}'")
return mol
def mol_to_canonical_smiles(mol: Chem.Mol) -> str:
"""Always use this before storing SMILES in the database."""
return Chem.MolToSmiles(mol, canonical=True)
def mol_to_inchi(mol: Chem.Mol) -> str:
return Chem.inchi.MolToInchi(mol)
def mol_to_inchi_key(mol: Chem.Mol) -> str:
inchi = mol_to_inchi(mol)
return Chem.inchi.InchiToInchiKey(inchi)
def mol_to_mol_block_2d(mol: Chem.Mol) -> str:
"""Generate MOL block with 2D coordinates for depiction."""
mol_2d = Chem.RWMol(mol) # Copy
AllChem.Compute2DCoords(mol_2d)
return Chem.MolToMolBlock(mol_2d)
def mol_to_formula(mol: Chem.Mol) -> str:
return Chem.rdMolDescriptors.CalcMolFormula(mol)
SMILES Canonicalization Rule¶
CRITICAL: Before storing any SMILES in the database, always canonicalize:
This ensures the same molecule always maps to the same string, preventing duplicates.2. Molecular Descriptors (descriptors.py)¶
Standard Properties¶
from rdkit.Chem import Descriptors, QED
def compute_properties(mol: Chem.Mol) -> dict:
"""Compute all standard molecular properties."""
return {
"mw": round(Descriptors.MolWt(mol), 2),
"exact_mw": round(Descriptors.ExactMolWt(mol), 4),
"logp": round(Descriptors.MolLogP(mol), 2),
"tpsa": round(Descriptors.TPSA(mol), 2),
"hbd": Descriptors.NumHDonors(mol),
"hba": Descriptors.NumHAcceptors(mol),
"num_rotatable": Descriptors.NumRotatableBonds(mol),
"num_rings": Descriptors.RingCount(mol),
"num_aromatic_rings": Descriptors.NumAromaticRings(mol),
"fraction_csp3": round(Descriptors.FractionCSP3(mol), 2),
"heavy_atom_count": Descriptors.HeavyAtomCount(mol),
"qed_score": round(QED.qed(mol), 3),
"formula": Chem.rdMolDescriptors.CalcMolFormula(mol),
}
3. Fingerprints (fingerprints.py)¶
Morgan (Circular) Fingerprints¶
Morgan fingerprints (equivalent to ECFP) encode the chemical neighborhood around each atom. They are the standard for molecular similarity searching.
from rdkit.Chem import rdFingerprintGenerator
from rdkit import DataStructs
import pickle
def compute_morgan_fp(
mol: Chem.Mol,
radius: int = 2, # radius=2 → ECFP4
nbits: int = 2048,
) -> DataStructs.ExplicitBitVect:
fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=nbits)
return fpgen.GetFingerprint(mol)
def serialize_fp(fp: DataStructs.ExplicitBitVect) -> bytes:
"""Serialize fingerprint for database storage."""
return pickle.dumps(fp)
def deserialize_fp(data: bytes) -> DataStructs.ExplicitBitVect:
"""Deserialize fingerprint from database."""
return pickle.loads(data)
def tanimoto_similarity(fp1, fp2) -> float:
return DataStructs.TanimotoSimilarity(fp1, fp2)
def bulk_tanimoto(query_fp, fp_list: list) -> list[float]:
return DataStructs.BulkTanimotoSimilarity(query_fp, fp_list)
Similarity Search Strategy¶
For small databases (<100K compounds), in-memory Tanimoto computation is fast enough:
1. Load all fingerprints from DB
2. Deserialize query and all target fingerprints
3. Use BulkTanimotoSimilarity for vectorized comparison
4. Return top-N matches above threshold
For larger databases, use PostgreSQL RDKit cartridge with GiST indexes.
4. Fragment Decomposition (fragmentation.py)¶
BRICS (Breaking of Retrosynthetically Interesting Chemical Substructures)¶
BRICS decomposes molecules along bonds that correspond to real synthetic disconnections. It assigns numbered dummy atoms to broken bonds, encoding which types of bonds can reconnect.
from rdkit.Chem import BRICS
def brics_decompose(mol: Chem.Mol) -> list[str]:
"""
Decompose a molecule into BRICS fragments.
Returns list of SMILES strings with labeled dummy atoms,
e.g., ['[1*]C(=O)CC', '[3*]c1ccccc1']
"""
frags = BRICS.BRICSDecompose(mol)
return sorted(frags) # deterministic ordering
BRICS Dummy Atom Labels and Compatibility¶
BRICS defines 16 bond environment types. Fragments carry dummy atoms labeled [1*] through [16*]. Only specific pairs of labels are allowed to reconnect:
| Label | Chemical Environment | Compatible With |
|---|---|---|
| 1 | L-[C;!$(C=O)]-R (aliphatic C-C) | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 16 |
| 3 | L-[C;!$(C=O)]-[O,S] | 4, 13, 14, 15, 16 |
| 4 | L-[C]-[N] | 3, 5, 11 |
| 5 | L-[N]-[C] | 1, 4, 12 |
| 6 | L-C-[N] (amide C) | 13, 14, 15, 16 |
| 7 | L-[C]-[C]=O (alpha to carbonyl) | 1 |
| 8 | L-C-[O] (ester C) | 9, 10, 13, 14, 15, 16 |
| 9 | L-[O]-[C]=O (ester O) | 8 |
| 10 | L-[N]-[C]=O (amide N) | 1, 8 |
| 11 | L-[S]-[C] | 1, 4 |
| 12 | L-[N]-[N] | 5 |
| 13 | L-C-[X] | 3, 6, 8 |
| 14 | L-Cring-[C,N] | 1, 3, 6, 8, 16 |
| 15 | L-Cring=[C,N] | 3, 6, 8 |
| 16 | L-Carom-[C,N] | 1, 3, 6, 8, 14 |
# BRICS compatibility rules (from RDKit source)
BRICS_COMPATIBLE = {
1: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 16},
3: {4, 13, 14, 15, 16},
4: {3, 5, 11},
5: {1, 4, 12},
6: {13, 14, 15, 16},
7: {1},
8: {9, 10, 13, 14, 15, 16},
9: {8},
10: {1, 8},
11: {1, 4},
12: {5},
13: {3, 6, 8},
14: {1, 3, 6, 8, 16},
15: {3, 6, 8},
16: {1, 3, 6, 8, 14},
}
def get_compatible_labels(label: int) -> set[int]:
return BRICS_COMPATIBLE.get(label, set())
def parse_attachment_points(frag_smiles: str) -> list[int]:
"""Extract BRICS dummy atom labels from a fragment SMILES."""
mol = Chem.MolFromSmiles(frag_smiles)
labels = []
for atom in mol.GetAtoms():
if atom.GetAtomicNum() == 0: # dummy atom
isotope = atom.GetIsotope()
if isotope > 0:
labels.append(isotope)
return sorted(labels)
def are_compatible(label1: int, label2: int) -> bool:
"""Check if two BRICS labels can form a bond."""
return label2 in BRICS_COMPATIBLE.get(label1, set())
Fragment Quality Criteria¶
When building the fragment library, apply these filters: - Minimum size: At least 3 heavy atoms (exclude trivial fragments) - Maximum size: At most 20 heavy atoms (keep fragments "fragment-sized") - Rule of Three (for lead-like fragments): MW ≤ 300, LogP ≤ 3, HBD ≤ 3, HBA ≤ 3, RotBonds ≤ 3
5. Molecule Assembly (assembly.py)¶
Fragment Joining Strategy¶
The assembly process connects two molecular fragments by: 1. Identifying matching dummy atom pairs (BRICS compatibility) 2. Removing the dummy atoms 3. Forming a new bond between the atoms that were connected to the dummies 4. Sanitizing the result
def join_fragments(frag1_smiles: str, frag2_smiles: str) -> list[str]:
"""
Join two BRICS fragments at compatible attachment points.
Returns all valid product SMILES (may be multiple if fragments
have multiple compatible attachment point pairs).
"""
# Use BRICS.BRICSBuild for controlled joining
allFrags = [Chem.MolFromSmiles(frag1_smiles), Chem.MolFromSmiles(frag2_smiles)]
products = set()
for mol in BRICS.BRICSBuild(allFrags):
smi = Chem.MolToSmiles(mol)
products.add(smi)
return sorted(products)
def validate_molecule(smiles: str) -> tuple[bool, list[str]]:
"""
Validate that a SMILES represents a chemically sane molecule.
Returns (is_valid, list_of_warnings).
"""
warnings = []
mol = Chem.MolFromSmiles(smiles)
if mol is None:
return False, ["Cannot parse SMILES"]
# Check for sanitization issues
try:
Chem.SanitizeMol(mol)
except Exception as e:
return False, [f"Sanitization failed: {str(e)}"]
# Check valence
for atom in mol.GetAtoms():
if atom.GetAtomicNum() == 0:
warnings.append("Molecule contains dummy atoms (incomplete assembly)")
return len(warnings) == 0 or all("dummy" in w for w in warnings), warnings
def clean_assembled_mol(mol: Chem.Mol) -> Chem.Mol:
"""Remove leftover dummy atoms and sanitize."""
# Replace dummy atoms with hydrogens
rw_mol = Chem.RWMol(mol)
for atom in rw_mol.GetAtoms():
if atom.GetAtomicNum() == 0:
atom.SetAtomicNum(1) # Replace with H
mol = rw_mol.GetMol()
mol = Chem.RemoveHs(mol)
Chem.SanitizeMol(mol)
return mol
Assembly Workflow (Stepwise)¶
Step 1: User selects starting fragment → [3*]C(=O)O
Step 2: System finds compatible fragments (label 3 → compatible with 4,13,14,15,16)
Step 3: User selects fragment [4*]CCCN and attachment points (3↔4)
Step 4: System joins → O=C(O)CCCN, validates, shows result
Step 5: If more attachment points remain, repeat from Step 2
Step 6: Finalize → clean molecule, compute all properties
BRICS.BRICSBuild for Combinatorial Assembly¶
For automated exploration (not step-by-step):
def brics_build(
fragments: list[str],
max_results: int = 100,
max_depth: int = 3,
) -> list[str]:
"""
Combinatorially assemble fragments using BRICS rules.
Returns up to max_results unique product SMILES.
"""
frag_mols = [Chem.MolFromSmiles(s) for s in fragments]
results = []
for mol in BRICS.BRICSBuild(frag_mols):
smi = Chem.MolToSmiles(mol)
results.append(smi)
if len(results) >= max_results:
break
return results
6. Conformer Generation & Energy Minimization (conformers.py)¶
Overview¶
Finding the free energy minimum of a molecule involves: 1. Conformer generation: Create multiple 3D geometries via distance geometry (ETKDGv3) 2. Energy minimization: Optimize each conformer using a molecular mechanics force field 3. Selection: Identify the lowest-energy conformer as the most stable geometry
Force Fields¶
- MMFF94 (Merck Molecular Force Field): Recommended for organic/drug-like molecules. Well-parameterized for common functional groups.
- MMFF94s: Variant with modified out-of-plane bending term. Slightly better for planar systems.
- UFF (Universal Force Field): Broader element coverage. Use as fallback when MMFF94 lacks parameters.
Implementation¶
from rdkit.Chem import AllChem
def generate_conformers(
mol: Chem.Mol,
num_confs: int = 50,
seed: int = 42,
prune_rms_thresh: float = 0.5,
) -> Chem.Mol:
"""
Generate 3D conformers using ETKDGv3 (experimental torsion
knowledge-based distance geometry).
IMPORTANT: Input mol must have hydrogens added (Chem.AddHs).
"""
mol = Chem.AddHs(mol)
params = AllChem.ETKDGv3()
params.randomSeed = seed
params.pruneRmsThresh = prune_rms_thresh
params.numThreads = 0 # Use all available cores
conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=num_confs, params=params)
if len(conf_ids) == 0:
raise ConformerError(f"Failed to generate any conformers for molecule")
return mol
def minimize_all_conformers(
mol: Chem.Mol,
force_field: str = "MMFF94",
max_iters: int = 500,
) -> list[dict]:
"""
Minimize all conformers and return energy data.
Returns list of dicts:
[{"conf_id": 0, "energy": -42.3, "converged": True}, ...]
"""
results = []
if force_field.upper().startswith("MMFF"):
opt_results = AllChem.MMFFOptimizeMoleculeConfs(
mol, numThreads=0, maxIters=max_iters
)
for conf_id, (not_converged, energy) in enumerate(opt_results):
results.append({
"conf_id": conf_id,
"energy": round(energy, 4),
"converged": not not_converged,
"force_field": force_field,
})
elif force_field.upper() == "UFF":
opt_results = AllChem.UFFOptimizeMoleculeConfs(
mol, numThreads=0, maxIters=max_iters
)
for conf_id, (not_converged, energy) in enumerate(opt_results):
results.append({
"conf_id": conf_id,
"energy": round(energy, 4),
"converged": not not_converged,
"force_field": "UFF",
})
return sorted(results, key=lambda x: x["energy"])
def get_lowest_energy_conformer(mol: Chem.Mol, results: list[dict]) -> int:
"""Return the conformer ID with the lowest energy."""
return results[0]["conf_id"] # results already sorted by energy
def conformer_to_mol_block(mol: Chem.Mol, conf_id: int) -> str:
"""Extract a single conformer as a MOL block string."""
return Chem.MolToMolBlock(mol, confId=conf_id)
Full Workflow: SMILES → Energy-Minimized 3D Structure¶
def smiles_to_minimized_3d(smiles: str) -> tuple[str, float]:
"""
Complete workflow: SMILES → 3D conformers → minimize → lowest energy.
Returns (mol_block_3d, energy).
"""
mol = Chem.MolFromSmiles(smiles)
mol = Chem.AddHs(mol) # CRITICAL: must add H before embedding
# Generate conformers
params = AllChem.ETKDGv3()
params.randomSeed = 42
AllChem.EmbedMultipleConfs(mol, numConfs=50, params=params)
# Minimize all conformers
results = AllChem.MMFFOptimizeMoleculeConfs(mol, numThreads=0)
# Find lowest energy
energies = [(i, e) for i, (nc, e) in enumerate(results)]
best_id = min(energies, key=lambda x: x[1])[0]
best_energy = min(energies, key=lambda x: x[1])[1]
# Extract 3D MOL block
mol_block = Chem.MolToMolBlock(mol, confId=best_id)
return mol_block, best_energy
Important Notes¶
- Always add hydrogens before 3D embedding:
mol = Chem.AddHs(mol). Without this, conformer generation will be poor or fail entirely. - ETKDGv3 is preferred over older methods (ETKDG, ETKDGv2). It uses experimental torsion angle preferences for better geometries.
- 50 conformers is a good default for drug-sized molecules. For very flexible molecules (>10 rotatable bonds), increase to 200+.
- MMFF94 is the default force field. Fall back to UFF only if MMFF94 fails (returns -1 for certain element types it can't parameterize).
- Pruning threshold of 0.5 Å RMSD removes near-duplicate conformers.
7. Drug-Likeness Filters (filters.py)¶
Lipinski's Rule of Five¶
A compound is orally bioavailable if it satisfies ≥3 of 4 rules: - MW ≤ 500 Da - LogP ≤ 5 - H-bond donors ≤ 5 - H-bond acceptors ≤ 10
def check_lipinski(mol: Chem.Mol) -> dict:
mw = Descriptors.MolWt(mol)
logp = Descriptors.MolLogP(mol)
hbd = Descriptors.NumHDonors(mol)
hba = Descriptors.NumHAcceptors(mol)
violations = []
if mw > 500: violations.append("MW > 500")
if logp > 5: violations.append("LogP > 5")
if hbd > 5: violations.append("HBD > 5")
if hba > 10: violations.append("HBA > 10")
return {
"passes": len(violations) <= 1, # allows 1 violation
"violations": violations,
"num_violations": len(violations),
"properties": {"mw": mw, "logp": logp, "hbd": hbd, "hba": hba},
}
Veber Rules (Oral Bioavailability)¶
- Rotatable bonds ≤ 10
- TPSA ≤ 140 Ų
PAINS Filter (Pan-Assay Interference Compounds)¶
Flags molecules with substructures known to give false positives in biochemical assays:
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams
def check_pains(mol: Chem.Mol) -> dict:
params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
catalog = FilterCatalog(params)
matches = []
entry = catalog.GetFirstMatch(mol)
while entry:
matches.append(entry.GetDescription())
entry = catalog.GetFirstMatch(mol) # simplified; iterate properly
break # prevent infinite loop in this example
return {"passes": len(matches) == 0, "alerts": matches}
QED (Quantitative Estimate of Drug-likeness)¶
A continuous score from 0 to 1 integrating 8 desirability functions:
SA Score (Synthetic Accessibility)¶
Predicts ease of synthesis on a scale of 1 (easy) to 10 (hard):
from rdkit.Contrib.SA_Score import sascorer
def compute_sa_score(mol: Chem.Mol) -> float:
return round(sascorer.calculateScore(mol), 2)
Interpretation: - 1–3: Easy to synthesize - 3–5: Moderate difficulty - 5–7: Challenging - 7–10: Very difficult / likely not synthesizable
Comprehensive Report¶
def full_druglikeness_report(mol: Chem.Mol) -> dict:
return {
"lipinski": check_lipinski(mol),
"veber": check_veber(mol),
"pains": check_pains(mol),
"qed": compute_qed(mol),
"sa_score": compute_sa_score(mol),
}
8. SA Score Setup¶
The SA Score module (sascorer.py) requires a pre-computed fragment frequency file (fpscores.pkl.gz). This file contains fragment frequencies from ~1M PubChem molecules.
Setup¶
# In chemlib/chemistry/sa_score.py — wrapper
import os
from rdkit.Contrib.SA_Score import sascorer
# The fpscores file is auto-loaded by sascorer from its own directory.
# If needed, set the path explicitly:
# sascorer._fscores = pickle.load(gzip.open('path/to/fpscores.pkl.gz'))
Alternative: Copy sascorer.py Into Project¶
Since rdkit.Contrib is not always on the Python path (depends on installation method), it may be practical to copy sascorer.py and fpscores.pkl.gz into the project under chemlib/chemistry/vendor/.