How to use fabikru/MolEncoder with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("fill-mask", model="fabikru/MolEncoder")

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("fabikru/MolEncoder")
model = AutoModelForMaskedLM.from_pretrained("fabikru/MolEncoder")

MolEncoder is a BERT-based chemical language model pretrained on SMILES strings using masked language modeling (MLM). It was designed to investigate optimal pretraining strategies for molecular representation learning, with a particular focus on masking ratio, dataset size, and model size. It is described in detail in the paper "MolEncoder: Towards Optimal Masked Language Modeling for Molecules".
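To make the role of the masking ratio concrete, here is a minimal, self-contained sketch of how a SMILES string could be tokenized and masked for MLM. The regular expression, the `tokenize` and `mask_tokens` helpers, and the `[MASK]` string are illustrative assumptions, not the tokenizer or masking procedure shipped with MolEncoder. The sketch also simplifies BERT-style masking by always substituting the mask token.

```python
import random
import re

# Simplified SMILES token pattern (an assumption for illustration only;
# not necessarily how the MolEncoder tokenizer splits SMILES).
# Order matters: bracket atoms, two-letter halogens, two-digit ring
# closures, then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


def mask_tokens(tokens, masking_ratio, mask_token="[MASK]", seed=None):
    """Replace a fraction of tokens with mask_token.

    Returns the masked token list and a dict mapping each masked
    position to the original token (the prediction targets).
    """
    rng = random.Random(seed)
    # Mask at least one token so there is always something to predict.
    n_mask = max(1, round(masking_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for i in positions:
        labels[i] = masked[i]
        masked[i] = mask_token
    return masked, labels


tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
masked, labels = mask_tokens(tokens, masking_ratio=0.3, seed=0)
print("".join(masked))
print(labels)
```

During pretraining, the model sees the masked sequence and is trained to recover the tokens stored in `labels`. A higher `masking_ratio` hides more of the molecule and makes the reconstruction task harder.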
Please refer to the MolEncoder GitHub repository for detailed instructions and ready-to-use examples on fine-tuning the model on custom data and running predictions.
If you use this model, please cite:
@Article{D5DD00369E,
author ="Krüger, Fabian P. and Österbacka, Nicklas and Kabeshov, Mikhail and Engkvist, Ola and Tetko, Igor",
title ="MolEncoder: towards optimal masked language modeling for molecules",
journal ="Digital Discovery",
year ="2025",
pages ="-",
publisher ="RSC",
doi ="10.1039/D5DD00369E",
url ="http://dx.doi.org/10.1039/D5DD00369E"}