Drug Metabolism Prediction with Molecular Transformers
Optimizing deep learning architectures for accurate and efficient prediction of drug metabolism using SMILES-based sequence-to-sequence models.
Drug Metabolism Prediction with Molecular Transformers introduces an optimized sequence-to-sequence architecture designed to predict drug metabolism with high accuracy and low computational overhead.
By framing xenobiotic biotransformations as a machine translation task using SMILES strings, this project integrates pre-training, transfer learning, and selective fine-tuning to overcome the traditional limitations of rule-based models and standard Transformers.
The Challenge of Metabolism Prediction
Metabolism dictates the efficacy and toxicity of potential drugs. Traditional computational methods rely on rigid rules to predict Sites of Metabolism (SoMs), and they often fail to generalize across diverse enzymatic classes (such as CYP450-mediated phase I oxidations versus phase II transferase conjugations).
Deep Learning models offer a data-driven alternative by “translating” a substrate’s SMILES string directly into its metabolite. However, standard Molecular Transformers face two major hurdles:
- Low Validity: A tendency to generate syntactically invalid SMILES strings that do not represent real chemical molecules.
- High Computational Cost: Training massive attention-based architectures from scratch is extremely resource-intensive.
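The validity hurdle can be made concrete with a lightweight syntactic check. The sketch below (pure Python, illustrative only) catches the two most common failure modes of generated SMILES — unbalanced branches/brackets and unpaired ring-closure digits. It does not verify chemistry; a production pipeline would instead parse candidates with a cheminformatics toolkit such as RDKit.

```python
import re
from collections import Counter

def looks_syntactically_valid(smiles: str) -> bool:
    """Crude syntactic sanity check for a generated SMILES string.

    Catches unbalanced branch parentheses / atom brackets and unpaired
    ring-closure digits. It does NOT verify chemical validity (valence,
    aromaticity) -- a real pipeline would parse with a toolkit like RDKit.
    """
    depth = {"(": 0, "[": 0}
    closers = {")": "(", "]": "["}
    for ch in smiles:
        if ch in depth:
            depth[ch] += 1
        elif ch in closers:
            depth[closers[ch]] -= 1
            if depth[closers[ch]] < 0:
                return False  # a closer appeared before its opener
    if any(v != 0 for v in depth.values()):
        return False  # something was left open
    # Count ring-closure labels outside bracket atoms; each ring bond is
    # opened and closed once, so every label must appear an even number of times.
    stripped = re.sub(r"\[[^\]]*\]", "A", smiles)
    labels = Counter(re.findall(r"%\d{2}|\d", stripped))
    return all(n % 2 == 0 for n in labels.values())
```

For example, `looks_syntactically_valid("c1ccccc1")` passes, while the truncated `"c1ccccc"` fails the ring-closure check.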
Key Innovations
This project solves these bottlenecks by separating the learning of SMILES syntax from the actual metabolism prediction task:
- SMILES Canonicalization Pre-training: The model is first pre-trained on a massive dataset (ZINC20) to translate randomized SMILES into their canonical form. This forces the encoder to learn the underlying molecular graph structure rather than merely memorizing character sequences.
- Frozen-Decoder Fine-Tuning: During the primary metabolic training phase, the decoder is frozen. Only 4.5 million of the model's 13.5 million parameters (about one third) are updated, drastically reducing training time and preventing catastrophic forgetting of SMILES syntax.
- Dual-Input Strategy: Combines the structural SMILES input with a specific Reaction Class (RC) token to provide vital chemical context, improving predictive accuracy.
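The dual-input strategy amounts to simple sequence pre-processing: tokenize the substrate SMILES and prepend a reaction-class token. The sketch below uses the regex-style tokenizer common in Molecular Transformer work; the token names (e.g. `<RC_OXIDATION>`) and the exact regex are illustrative assumptions, not the project's verbatim vocabulary.

```python
import re

# Regex tokenizer in the style used for Molecular Transformers: multi-character
# tokens (bracket atoms, Cl/Br, two-digit ring labels, @@) stay intact instead
# of being split into single characters.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|@@|%\d{2}|[A-Za-z]|\d|[=#\\/()+\-.:~@])"
)

def tokenize(smiles: str) -> list[str]:
    return SMILES_TOKEN.findall(smiles)

def encode_dual_input(smiles: str, reaction_class: str) -> list[str]:
    """Prepend an (illustrative) reaction-class token to the SMILES tokens."""
    return [f"<RC_{reaction_class}>"] + tokenize(smiles)
```

With this encoding the same substrate can be routed toward different biotransformations simply by swapping the leading token, e.g. `encode_dual_input("CCO", "OXIDATION")` versus `encode_dual_input("CCO", "GLUCURONIDATION")`.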
Drug Discovery
Accelerates early-stage drug development by rapidly profiling the metabolic viability and potential toxicity of candidate molecules.
Sequence-to-Sequence
Treats complex biochemical reactions as a machine translation problem, converting substrate SMILES into metabolite SMILES.
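Concretely, each reaction becomes one "translation" training example: the source sequence is the substrate's tokens and the target is the metabolite's tokens wrapped in start/end sentinels. The pairs and sentinel names below are illustrative (ethanol to acetaldehyde and benzene to phenol are well-known phase I oxidations) and are not drawn from the project's dataset.

```python
# Illustrative substrate -> metabolite pairs framed as translation examples.
# (Known biotransformations chosen for illustration; not the project's data.)
PAIRS = [
    ("CCO", "CC=O"),            # ethanol -> acetaldehyde (phase I oxidation)
    ("c1ccccc1", "Oc1ccccc1"),  # benzene -> phenol (CYP-mediated oxidation)
]

def as_seq2seq_example(substrate: str, metabolite: str):
    """Turn one reaction into (source, target) token sequences."""
    source = list(substrate)  # character-level tokenization, for brevity
    target = ["<bos>"] + list(metabolite) + ["<eos>"]
    return source, target
```

At inference time the decoder generates the target sequence token by token, exactly as in neural machine translation.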
Optimized Efficiency
Uses strategic Transfer Learning and learning rate decay to maximize chemical validity while minimizing GPU compute time.
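A step-wise exponential schedule is a minimal sketch of the kind of learning rate decay described above; the base rate, decay factor, and interval here are illustrative assumptions, not the project's actual hyperparameters.

```python
def decayed_lr(step: int, base_lr: float = 1e-3,
               decay: float = 0.5, every: int = 10_000) -> float:
    """Step-wise exponential learning-rate decay (illustrative values):
    the rate is multiplied by `decay` once per `every` optimizer steps."""
    return base_lr * decay ** (step // every)
```

With these values the rate starts at 1e-3 and halves every 10,000 steps, letting fine-tuning take large steps early and settle gently later.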
Core Publication
2025
- Improving the Efficiency and the Validity of Molecular Transformers. In 2025 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2025.
2024
- Predicting metabolic reactions with a Molecular Transformer for drug design optimization. In 2024 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2024.