Skip to content

Chemistry Engine — ChemLib

Overview

The chemistry engine is the scientific core of ChemLib. It wraps RDKit functionality into clean, modular Python functions organized by concern. All chemistry modules live in chemlib/chemistry/ and have zero database dependencies — they operate purely on RDKit Mol objects, SMILES strings, and Python data structures.


1. Chemical Representations (representations.py)

Core Concepts

1D Representations (connectivity only, no coordinates): - SMILES: O=C(O)Cc1ccccc1 — Human-readable line notation - Canonical SMILES: Unique, deterministic form via Chem.MolToSmiles(mol, canonical=True) - InChI: InChI=1S/C8H8O2/c9-8(10)6-7-4-2-1-3-5-7/h1-5H,6H2,(H,9,10) - InChIKey: WLJVXDMOQOGPHL-UHFFFAOYSA-N — 27-char hash for indexing

2D Representations (connectivity + planar coordinates): - MOL Block (V2000): Connection table with x,y coordinates for 2D depiction - Generated via AllChem.Compute2DCoords(mol) then Chem.MolToMolBlock(mol)

3D Representations (connectivity + 3D coordinates): - MOL Block with 3D coords: Full x,y,z atom positions - Generated via AllChem.EmbedMolecule(mol) (distance geometry) - SDF: Multiple MOL blocks + properties, separated by $$$$

Key Functions

def smiles_to_mol(smiles: str) -> Chem.Mol:
    """Parse SMILES. Raises InvalidSMILESError if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise InvalidSMILESError(f"Cannot parse SMILES: '{smiles}'")
    return mol

def mol_to_canonical_smiles(mol: Chem.Mol) -> str:
    """Always use this before storing SMILES in the database."""
    return Chem.MolToSmiles(mol, canonical=True)

def mol_to_inchi(mol: Chem.Mol) -> str:
    return Chem.inchi.MolToInchi(mol)

def mol_to_inchi_key(mol: Chem.Mol) -> str:
    inchi = mol_to_inchi(mol)
    return Chem.inchi.InchiToInchiKey(inchi)

def mol_to_mol_block_2d(mol: Chem.Mol) -> str:
    """Generate MOL block with 2D coordinates for depiction."""
    mol_2d = Chem.RWMol(mol)  # Copy
    AllChem.Compute2DCoords(mol_2d)
    return Chem.MolToMolBlock(mol_2d)

def mol_to_formula(mol: Chem.Mol) -> str:
    return Chem.rdMolDescriptors.CalcMolFormula(mol)

SMILES Canonicalization Rule

CRITICAL: Before storing any SMILES in the database, always canonicalize:

canonical = mol_to_canonical_smiles(smiles_to_mol(input_smiles))
This ensures the same molecule always maps to the same string, preventing duplicates.


2. Molecular Descriptors (descriptors.py)

Standard Properties

from rdkit.Chem import Descriptors, QED

def compute_properties(mol: Chem.Mol) -> dict:
    """Compute all standard molecular properties."""
    return {
        "mw": round(Descriptors.MolWt(mol), 2),
        "exact_mw": round(Descriptors.ExactMolWt(mol), 4),
        "logp": round(Descriptors.MolLogP(mol), 2),
        "tpsa": round(Descriptors.TPSA(mol), 2),
        "hbd": Descriptors.NumHDonors(mol),
        "hba": Descriptors.NumHAcceptors(mol),
        "num_rotatable": Descriptors.NumRotatableBonds(mol),
        "num_rings": Descriptors.RingCount(mol),
        "num_aromatic_rings": Descriptors.NumAromaticRings(mol),
        "fraction_csp3": round(Descriptors.FractionCSP3(mol), 2),
        "heavy_atom_count": Descriptors.HeavyAtomCount(mol),
        "qed_score": round(QED.qed(mol), 3),
        "formula": Chem.rdMolDescriptors.CalcMolFormula(mol),
    }

3. Fingerprints (fingerprints.py)

Morgan (Circular) Fingerprints

Morgan fingerprints (equivalent to ECFP) encode the chemical neighborhood around each atom. They are the standard for molecular similarity searching.

from rdkit.Chem import rdFingerprintGenerator
from rdkit import DataStructs
import pickle

def compute_morgan_fp(
    mol: Chem.Mol,
    radius: int = 2,       # radius=2 → ECFP4
    nbits: int = 2048,
) -> DataStructs.ExplicitBitVect:
    fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=nbits)
    return fpgen.GetFingerprint(mol)

def serialize_fp(fp: DataStructs.ExplicitBitVect) -> bytes:
    """Serialize fingerprint for database storage."""
    return pickle.dumps(fp)

def deserialize_fp(data: bytes) -> DataStructs.ExplicitBitVect:
    """Deserialize fingerprint from database."""
    return pickle.loads(data)

def tanimoto_similarity(fp1, fp2) -> float:
    return DataStructs.TanimotoSimilarity(fp1, fp2)

def bulk_tanimoto(query_fp, fp_list: list) -> list[float]:
    return DataStructs.BulkTanimotoSimilarity(query_fp, fp_list)

Similarity Search Strategy

For small databases (<100K compounds), in-memory Tanimoto computation is fast enough: 1. Load all fingerprints from DB 2. Deserialize query and all target fingerprints 3. Use BulkTanimotoSimilarity for vectorized comparison 4. Return top-N matches above threshold

For larger databases, use PostgreSQL RDKit cartridge with GiST indexes.


4. Fragment Decomposition (fragmentation.py)

BRICS (Breaking of Retrosynthetically Interesting Chemical Substructures)

BRICS decomposes molecules along bonds that correspond to real synthetic disconnections. It assigns numbered dummy atoms to broken bonds, encoding which types of bonds can reconnect.

from rdkit.Chem import BRICS

def brics_decompose(mol: Chem.Mol) -> list[str]:
    """
    Decompose a molecule into BRICS fragments.

    Returns list of SMILES strings with labeled dummy atoms,
    e.g., ['[1*]C(=O)CC', '[3*]c1ccccc1']
    """
    frags = BRICS.BRICSDecompose(mol)
    return sorted(frags)  # deterministic ordering

BRICS Dummy Atom Labels and Compatibility

BRICS defines 16 bond environment types. Fragments carry dummy atoms labeled [1*] through [16*]. Only specific pairs of labels are allowed to reconnect:

Label Chemical Environment Compatible With
1 L-[C;!$(C=O)]-R (aliphatic C-C) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 16
3 L-[C;!$(C=O)]-[O,S] 4, 13, 14, 15, 16
4 L-[C]-[N] 3, 5, 11
5 L-[N]-[C] 1, 4, 12
6 L-C-[N] (amide C) 13, 14, 15, 16
7 L-[C]-[C]=O (alpha to carbonyl) 1
8 L-C-[O] (ester C) 9, 10, 13, 14, 15, 16
9 L-[O]-[C]=O (ester O) 8
10 L-[N]-[C]=O (amide N) 1, 8
11 L-[S]-[C] 1, 4
12 L-[N]-[N] 5
13 L-C-[X] 3, 6, 8
14 L-Cring-[C,N] 1, 3, 6, 8, 16
15 L-Cring=[C,N] 3, 6, 8
16 L-Carom-[C,N] 1, 3, 6, 8, 14
# BRICS compatibility rules (from RDKit source)
BRICS_COMPATIBLE = {
    1: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 16},
    3: {4, 13, 14, 15, 16},
    4: {3, 5, 11},
    5: {1, 4, 12},
    6: {13, 14, 15, 16},
    7: {1},
    8: {9, 10, 13, 14, 15, 16},
    9: {8},
    10: {1, 8},
    11: {1, 4},
    12: {5},
    13: {3, 6, 8},
    14: {1, 3, 6, 8, 16},
    15: {3, 6, 8},
    16: {1, 3, 6, 8, 14},
}

def get_compatible_labels(label: int) -> set[int]:
    return BRICS_COMPATIBLE.get(label, set())

def parse_attachment_points(frag_smiles: str) -> list[int]:
    """Extract BRICS dummy atom labels from a fragment SMILES."""
    mol = Chem.MolFromSmiles(frag_smiles)
    labels = []
    for atom in mol.GetAtoms():
        if atom.GetAtomicNum() == 0:  # dummy atom
            isotope = atom.GetIsotope()
            if isotope > 0:
                labels.append(isotope)
    return sorted(labels)

def are_compatible(label1: int, label2: int) -> bool:
    """Check if two BRICS labels can form a bond."""
    return label2 in BRICS_COMPATIBLE.get(label1, set())

Fragment Quality Criteria

When building the fragment library, apply these filters: - Minimum size: At least 3 heavy atoms (exclude trivial fragments) - Maximum size: At most 20 heavy atoms (keep fragments "fragment-sized") - Rule of Three (for lead-like fragments): MW ≤ 300, LogP ≤ 3, HBD ≤ 3, HBA ≤ 3, RotBonds ≤ 3


5. Molecule Assembly (assembly.py)

Fragment Joining Strategy

The assembly process connects two molecular fragments by: 1. Identifying matching dummy atom pairs (BRICS compatibility) 2. Removing the dummy atoms 3. Forming a new bond between the atoms that were connected to the dummies 4. Sanitizing the result

def join_fragments(frag1_smiles: str, frag2_smiles: str) -> list[str]:
    """
    Join two BRICS fragments at compatible attachment points.

    Returns all valid product SMILES (may be multiple if fragments
    have multiple compatible attachment point pairs).
    """
    # Use BRICS.BRICSBuild for controlled joining
    allFrags = [Chem.MolFromSmiles(frag1_smiles), Chem.MolFromSmiles(frag2_smiles)]
    products = set()
    for mol in BRICS.BRICSBuild(allFrags):
        smi = Chem.MolToSmiles(mol)
        products.add(smi)
    return sorted(products)

def validate_molecule(smiles: str) -> tuple[bool, list[str]]:
    """
    Validate that a SMILES represents a chemically sane molecule.

    Returns (is_valid, list_of_warnings).
    """
    warnings = []
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False, ["Cannot parse SMILES"]

    # Check for sanitization issues
    try:
        Chem.SanitizeMol(mol)
    except Exception as e:
        return False, [f"Sanitization failed: {str(e)}"]

    # Check valence
    for atom in mol.GetAtoms():
        if atom.GetAtomicNum() == 0:
            warnings.append("Molecule contains dummy atoms (incomplete assembly)")

    return len(warnings) == 0 or all("dummy" in w for w in warnings), warnings

def clean_assembled_mol(mol: Chem.Mol) -> Chem.Mol:
    """Remove leftover dummy atoms and sanitize."""
    # Replace dummy atoms with hydrogens
    rw_mol = Chem.RWMol(mol)
    for atom in rw_mol.GetAtoms():
        if atom.GetAtomicNum() == 0:
            atom.SetAtomicNum(1)  # Replace with H
    mol = rw_mol.GetMol()
    mol = Chem.RemoveHs(mol)
    Chem.SanitizeMol(mol)
    return mol

Assembly Workflow (Stepwise)

Step 1: User selects starting fragment → [3*]C(=O)O
Step 2: System finds compatible fragments (label 3 → compatible with 4,13,14,15,16)
Step 3: User selects fragment [4*]CCCN and attachment points (3↔4)
Step 4: System joins → O=C(O)CCCN, validates, shows result
Step 5: If more attachment points remain, repeat from Step 2
Step 6: Finalize → clean molecule, compute all properties

BRICS.BRICSBuild for Combinatorial Assembly

For automated exploration (not step-by-step):

def brics_build(
    fragments: list[str],
    max_results: int = 100,
    max_depth: int = 3,
) -> list[str]:
    """
    Combinatorially assemble fragments using BRICS rules.
    Returns up to max_results unique product SMILES.
    """
    frag_mols = [Chem.MolFromSmiles(s) for s in fragments]
    results = []
    for mol in BRICS.BRICSBuild(frag_mols):
        smi = Chem.MolToSmiles(mol)
        results.append(smi)
        if len(results) >= max_results:
            break
    return results


6. Conformer Generation & Energy Minimization (conformers.py)

Overview

Finding the free energy minimum of a molecule involves: 1. Conformer generation: Create multiple 3D geometries via distance geometry (ETKDGv3) 2. Energy minimization: Optimize each conformer using a molecular mechanics force field 3. Selection: Identify the lowest-energy conformer as the most stable geometry

Force Fields

  • MMFF94 (Merck Molecular Force Field): Recommended for organic/drug-like molecules. Well-parameterized for common functional groups.
  • MMFF94s: Variant with modified out-of-plane bending term. Slightly better for planar systems.
  • UFF (Universal Force Field): Broader element coverage. Use as fallback when MMFF94 lacks parameters.

Implementation

from rdkit.Chem import AllChem

def generate_conformers(
    mol: Chem.Mol,
    num_confs: int = 50,
    seed: int = 42,
    prune_rms_thresh: float = 0.5,
) -> Chem.Mol:
    """
    Generate 3D conformers using ETKDGv3 (experimental torsion
    knowledge-based distance geometry).

    IMPORTANT: Input mol must have hydrogens added (Chem.AddHs).
    """
    mol = Chem.AddHs(mol)
    params = AllChem.ETKDGv3()
    params.randomSeed = seed
    params.pruneRmsThresh = prune_rms_thresh
    params.numThreads = 0  # Use all available cores

    conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=num_confs, params=params)

    if len(conf_ids) == 0:
        raise ConformerError(f"Failed to generate any conformers for molecule")

    return mol

def minimize_all_conformers(
    mol: Chem.Mol,
    force_field: str = "MMFF94",
    max_iters: int = 500,
) -> list[dict]:
    """
    Minimize all conformers and return energy data.

    Returns list of dicts:
    [{"conf_id": 0, "energy": -42.3, "converged": True}, ...]
    """
    results = []

    if force_field.upper().startswith("MMFF"):
        opt_results = AllChem.MMFFOptimizeMoleculeConfs(
            mol, numThreads=0, maxIters=max_iters
        )
        for conf_id, (not_converged, energy) in enumerate(opt_results):
            results.append({
                "conf_id": conf_id,
                "energy": round(energy, 4),
                "converged": not not_converged,
                "force_field": force_field,
            })
    elif force_field.upper() == "UFF":
        opt_results = AllChem.UFFOptimizeMoleculeConfs(
            mol, numThreads=0, maxIters=max_iters
        )
        for conf_id, (not_converged, energy) in enumerate(opt_results):
            results.append({
                "conf_id": conf_id,
                "energy": round(energy, 4),
                "converged": not not_converged,
                "force_field": "UFF",
            })

    return sorted(results, key=lambda x: x["energy"])

def get_lowest_energy_conformer(mol: Chem.Mol, results: list[dict]) -> int:
    """Return the conformer ID with the lowest energy."""
    return results[0]["conf_id"]  # results already sorted by energy

def conformer_to_mol_block(mol: Chem.Mol, conf_id: int) -> str:
    """Extract a single conformer as a MOL block string."""
    return Chem.MolToMolBlock(mol, confId=conf_id)

Full Workflow: SMILES → Energy-Minimized 3D Structure

def smiles_to_minimized_3d(smiles: str) -> tuple[str, float]:
    """
    Complete workflow: SMILES → 3D conformers → minimize → lowest energy.
    Returns (mol_block_3d, energy).
    """
    mol = Chem.MolFromSmiles(smiles)
    mol = Chem.AddHs(mol)  # CRITICAL: must add H before embedding

    # Generate conformers
    params = AllChem.ETKDGv3()
    params.randomSeed = 42
    AllChem.EmbedMultipleConfs(mol, numConfs=50, params=params)

    # Minimize all conformers
    results = AllChem.MMFFOptimizeMoleculeConfs(mol, numThreads=0)

    # Find lowest energy
    energies = [(i, e) for i, (nc, e) in enumerate(results)]
    best_id = min(energies, key=lambda x: x[1])[0]
    best_energy = min(energies, key=lambda x: x[1])[1]

    # Extract 3D MOL block
    mol_block = Chem.MolToMolBlock(mol, confId=best_id)

    return mol_block, best_energy

Important Notes

  1. Always add hydrogens before 3D embedding: mol = Chem.AddHs(mol). Without this, conformer generation will be poor or fail entirely.
  2. ETKDGv3 is preferred over older methods (ETKDG, ETKDGv2). It uses experimental torsion angle preferences for better geometries.
  3. 50 conformers is a good default for drug-sized molecules. For very flexible molecules (>10 rotatable bonds), increase to 200+.
  4. MMFF94 is the default force field. Fall back to UFF only if MMFF94 fails (returns -1 for certain element types it can't parameterize).
  5. Pruning threshold of 0.5 Å RMSD removes near-duplicate conformers.

7. Drug-Likeness Filters (filters.py)

Lipinski's Rule of Five

A compound is orally bioavailable if it satisfies ≥3 of 4 rules: - MW ≤ 500 Da - LogP ≤ 5 - H-bond donors ≤ 5 - H-bond acceptors ≤ 10

def check_lipinski(mol: Chem.Mol) -> dict:
    mw = Descriptors.MolWt(mol)
    logp = Descriptors.MolLogP(mol)
    hbd = Descriptors.NumHDonors(mol)
    hba = Descriptors.NumHAcceptors(mol)

    violations = []
    if mw > 500: violations.append("MW > 500")
    if logp > 5: violations.append("LogP > 5")
    if hbd > 5: violations.append("HBD > 5")
    if hba > 10: violations.append("HBA > 10")

    return {
        "passes": len(violations) <= 1,  # allows 1 violation
        "violations": violations,
        "num_violations": len(violations),
        "properties": {"mw": mw, "logp": logp, "hbd": hbd, "hba": hba},
    }

Veber Rules (Oral Bioavailability)

  • Rotatable bonds ≤ 10
  • TPSA ≤ 140 Ų

PAINS Filter (Pan-Assay Interference Compounds)

Flags molecules with substructures known to give false positives in biochemical assays:

from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

def check_pains(mol: Chem.Mol) -> dict:
    params = FilterCatalogParams()
    params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
    catalog = FilterCatalog(params)

    matches = []
    entry = catalog.GetFirstMatch(mol)
    while entry:
        matches.append(entry.GetDescription())
        entry = catalog.GetFirstMatch(mol)  # simplified; iterate properly
        break  # prevent infinite loop in this example

    return {"passes": len(matches) == 0, "alerts": matches}

QED (Quantitative Estimate of Drug-likeness)

A continuous score from 0 to 1 integrating 8 desirability functions:

from rdkit.Chem import QED

def compute_qed(mol: Chem.Mol) -> float:
    return round(QED.qed(mol), 3)

SA Score (Synthetic Accessibility)

Predicts ease of synthesis on a scale of 1 (easy) to 10 (hard):

from rdkit.Contrib.SA_Score import sascorer

def compute_sa_score(mol: Chem.Mol) -> float:
    return round(sascorer.calculateScore(mol), 2)

Interpretation: - 1–3: Easy to synthesize - 3–5: Moderate difficulty - 5–7: Challenging - 7–10: Very difficult / likely not synthesizable

Comprehensive Report

def full_druglikeness_report(mol: Chem.Mol) -> dict:
    return {
        "lipinski": check_lipinski(mol),
        "veber": check_veber(mol),
        "pains": check_pains(mol),
        "qed": compute_qed(mol),
        "sa_score": compute_sa_score(mol),
    }

8. SA Score Setup

The SA Score module (sascorer.py) requires a pre-computed fragment frequency file (fpscores.pkl.gz). This file contains fragment frequencies from ~1M PubChem molecules.

Setup

# In chemlib/chemistry/sa_score.py — wrapper
import os
from rdkit.Contrib.SA_Score import sascorer

# The fpscores file is auto-loaded by sascorer from its own directory.
# If needed, set the path explicitly:
# sascorer._fscores = pickle.load(gzip.open('path/to/fpscores.pkl.gz'))

Alternative: Copy sascorer.py Into Project

Since rdkit.Contrib is not always on the Python path (depends on installation method), it may be practical to copy sascorer.py and fpscores.pkl.gz into the project under chemlib/chemistry/vendor/.