ChemBERTa-druglike: Two-phase MLM Pretraining for Drug-like SMILES
Model Description
This is a ChemBERTa model designed for downstream molecular property prediction and embedding-based similarity tasks on drug-like molecules.
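A minimal usage sketch, assuming the standard transformers AutoModel/AutoTokenizer interface; the mean pooling over non-padding tokens is an illustrative choice, not a pooling strategy documented for this checkpoint:

```python
# Sketch: load the checkpoint and turn SMILES into molecule embeddings.
# Mean pooling over non-padding tokens is an assumption, not part of the model card.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Derify/ChemBERTa-druglike")
model = AutoModel.from_pretrained("Derify/ChemBERTa-druglike")
model.eval()

smiles = ["CC(=O)Oc1ccccc1C(=O)O", "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"]  # aspirin, caffeine
inputs = tokenizer(smiles, padding=True, truncation=True, max_length=128, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state            # (batch, seq_len, hidden)
mask = inputs["attention_mask"].unsqueeze(-1).float()      # mask out padding tokens
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean-pooled molecule embeddings
```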
Training Procedure
The model was pretrained with a two-phase curriculum learning strategy that progressively increases the difficulty of the masked-language-modeling objective: the first phase uses a simpler dataset with a lower masking probability, while the second phase uses a more complex dataset with a higher masking probability. This allows the model to learn robust representations of drug-like molecules before adapting to the harder task (a sketch of how the two phases translate into MLM settings follows the phase descriptions below).
Phase 1 – “easy” pretraining
- Dataset: augmented_canonical_druglike_QED_43M
- Masking probability: 15%
- Training duration: 9 epochs (training was stopped once the loss plateaued)
- Training procedure: Following established ChemBERTa and ChemBERTa-2 methodologies
Phase 2 – “advanced” pretraining
- Dataset: druglike dataset
- Masking probability: 40%
- Training duration: Until the early-stopping callback triggered (best validation loss at ~18,000 steps); training beyond this point degraded the downstream Chem-MRL evaluation score
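A minimal sketch of how the two phases map onto standard MLM tooling; using Hugging Face's DataCollatorForLanguageModeling is an assumption here, and only the two masking probabilities come from the phase descriptions above.

```python
# Sketch: the curriculum is expressed through the MLM masking probability.
# DataCollatorForLanguageModeling is an assumption; 0.15 and 0.40 come from the phases above.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("Derify/ChemBERTa-druglike")

phase1_collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)  # "easy" phase
phase2_collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.40)  # "advanced" phase

# Each phase continues from the same model weights; only the dataset and collator change.
```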
Training Configuration
- Optimizer: NVIDIA Apex's FusedAdam optimizer
- Scheduler: Constant with warmup (10% of steps)
- Batch size: 144 sequences
- Precision: mixed-precision (fp16) and tf32 enabled
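The configuration above, expressed as transformers TrainingArguments (a sketch: whether the original run used the Hugging Face Trainer is an assumption, output_dir is a placeholder, the learning rate is not specified in the card, and the batch size is interpreted as per-device).

```python
# Sketch of the reported configuration as transformers TrainingArguments.
# output_dir is a placeholder; the learning rate is not given in the card, so the default is kept.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="chemberta-druglike-mlm",       # placeholder path
    per_device_train_batch_size=144,           # 144 sequences (interpreted as per-device)
    lr_scheduler_type="constant_with_warmup",  # constant schedule after warmup
    warmup_ratio=0.10,                         # warmup over 10% of steps
    optim="adamw_apex_fused",                  # NVIDIA Apex fused Adam backend
    fp16=True,                                 # mixed-precision training
    tf32=True,                                 # TF32 matmuls on Ampere+ GPUs
)
```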
Model Objective
This model serves as a specialized backbone for drug-like molecular representation learning, specifically optimized for:
- Molecular similarity tasks
- Drug-like compound analysis
- Chemical space exploration in pharmaceutical contexts
Evaluation
The model's effectiveness was validated through downstream Chem-MRL training on the pubchem_10m_genmol_similarity dataset, measuring Spearman correlation coefficients between transformer embedding similarities and 2048-bit Morgan fingerprint Tanimoto similarities.
A W&B report on the ChemBERTa-druglike evaluation is available.
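A sketch of the evaluation metric under stated assumptions: the Morgan radius (2) and the pair construction are not specified in the card, and embed() stands in for any function returning one embedding vector per SMILES.

```python
# Sketch: Spearman correlation between embedding cosine similarity and
# 2048-bit Morgan fingerprint Tanimoto similarity over SMILES pairs.
# Morgan radius=2 and the embed() helper are illustrative assumptions.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from scipy.stats import spearmanr

def tanimoto(smiles_a: str, smiles_b: str) -> float:
    fp_a, fp_b = (
        AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
        for s in (smiles_a, smiles_b)
    )
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman_vs_tanimoto(pairs, embed) -> float:
    """pairs: iterable of (smiles_a, smiles_b); embed: SMILES -> 1D numpy vector."""
    emb_sims = [cosine(embed(a), embed(b)) for a, b in pairs]
    fp_sims = [tanimoto(a, b) for a, b in pairs]
    return spearmanr(emb_sims, fp_sims).correlation
```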
Benchmarks
Classification Datasets (ROC AUC - Higher is better)
| Model | BACE↑ | BBBP↑ | TOX21↑ | HIV↑ | SIDER↑ | CLINTOX↑ |
|---|---|---|---|---|---|---|
| Tasks | 1 | 1 | 12 | 1 | 27 | 2 |
| Derify/ChemBERTa-druglike | 0.8114 | 0.7399 | 0.7522 | 0.7527 | 0.6577 | 0.9660 |
Regression Datasets (RMSE - Lower is better)
| Model | ESOL↓ | FREESOLV↓ | LIPO↓ | BACE↓ | CLEARANCE↓ |
|---|---|---|---|---|---|
| Tasks | 1 | 1 | 1 | 1 | 1 |
| Derify/ChemBERTa-druglike | 0.8241 | 0.5350 | 0.6663 | 1.0105 | 43.4499 |
Benchmarks were conducted using the chemberta3 framework. Datasets were split with DeepChem’s scaffold splits and filtered to include only molecules with SMILES length ≤128, matching the model’s maximum input length. The ChemBERTa-druglike model was fine-tuned for 100 epochs with a learning rate of 3e-5 and batch size of 32. Each task was run with 3 different random seeds, and the mean performance is reported.
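One way to reproduce the dataset preparation (a sketch assuming DeepChem's MoleculeNet loaders; the chemberta3 framework may wire this differently, and the BACE loader is just one example):

```python
# Sketch: scaffold-split a MoleculeNet dataset with DeepChem, then keep only
# molecules whose SMILES length is <= 128 (the model's maximum input length).
import deepchem as dc

tasks, (train, valid, test), transformers = dc.molnet.load_bace_classification(splitter="scaffold")

def filter_by_smiles_length(dataset, max_len=128):
    keep = [i for i, smi in enumerate(dataset.ids) if len(smi) <= max_len]
    return dataset.select(keep)

train, valid, test = (filter_by_smiles_length(ds) for ds in (train, valid, test))
```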
Use Cases
- Molecular property prediction
- Drug discovery and development
- Chemical similarity analysis
Limitations
- Optimized specifically for drug-like molecules
- Performance may vary on non-drug-like chemical compounds
References
ChemBERTa Series
@misc{chithrananda2020chembertalargescaleselfsupervisedpretraining,
title={ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction},
author={Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
year={2020},
eprint={2010.09885},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2010.09885},
}
@misc{ahmad2022chemberta2chemicalfoundationmodels,
title={ChemBERTa-2: Towards Chemical Foundation Models},
author={Walid Ahmad and Elana Simon and Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
year={2022},
eprint={2209.01712},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2209.01712},
}
@misc{singh2025chemberta3opensource,
title={ChemBERTa-3: An Open Source Training Framework for Chemical Foundation Models},
author={Singh, R. and Barsainyan, A. A. and Irfan, R. and Amorin, C. J. and He, S. and Davis, T. and others},
year={2025},
howpublished={ChemRxiv},
doi={10.26434/chemrxiv-2025-4glrl-v2},
note={This content is a preprint and has not been peer-reviewed},
url={https://doi.org/10.26434/chemrxiv-2025-4glrl-v2}
}