---
license: mit
datasets:
- ZINC-22
language:
- en
tags:
- molecular-generation
- drug-discovery
- llama
- flash-attention
pipeline_tag: text-generation
---
# NovoMolGen
NovoMolGen is a family of molecular foundation models trained on 1.5 billion molecules from ZINC‑22, using Llama architectures and FlashAttention. It achieves state‑of‑the‑art performance on both unconstrained and goal‑directed molecule generation tasks.
## How to load
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("chandar-lab/NovoMolGen_300M_SMILES_BPE", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("chandar-lab/NovoMolGen_300M_SMILES_BPE", trust_remote_code=True)
```
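For faster sampling, the model can be moved to a GPU in the usual Hugging Face way. This is a standard, optional step rather than anything specific to NovoMolGen:

```python
import torch

# Optional: place the model on a GPU if one is available (standard transformers usage).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
```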
## Quickstart
```python
outputs = model.sample(tokenizer=tokenizer, batch_size=4)
print(outputs['SMILES'])
```
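Generated strings are not guaranteed to be chemically valid, so a common post-processing step is to filter them with RDKit. The snippet below is a minimal sketch that assumes `outputs['SMILES']` is a list of SMILES strings; RDKit is a separate dependency, not shipped with this model:

```python
from rdkit import Chem

# Keep only the SMILES that RDKit can parse into a valid molecule.
smiles_list = outputs['SMILES']
valid = [smi for smi in smiles_list if Chem.MolFromSmiles(smi) is not None]
print(f"{len(valid)}/{len(smiles_list)} generated SMILES are valid")
```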
## Citation
```bibtex
@article{chitsaz2025novomolgen,
  title={NovoMolGen: Rethinking Molecular Language Model Pretraining},
  author={Chitsaz, Kamran and Balaji, Roshan and Fournier, Quentin and Bhatt, Nirav Pravinbhai and Chandar, Sarath},
  journal={arXiv preprint},
  year={2025},
}
```