---
license: mit
datasets:
- ZINC-22
language:
- en
tags:
- molecular-generation
- drug-discovery
- llama
- flash-attention
pipeline_tag: text-generation
---
# NovoMolGen
NovoMolGen is a family of molecular foundation models trained on 1.5 billion molecules from ZINC‑22, using Llama architectures and FlashAttention. It achieves state‑of‑the‑art performance on both unconstrained and goal‑directed molecule generation tasks.
## How to load
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("chandar-lab/NovoMolGen_300M_SMILES_BPE", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("chandar-lab/NovoMolGen_300M_SMILES_BPE", trust_remote_code=True)
```
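For faster sampling, the model can be moved to a GPU in the usual Hugging Face way. This is a standard, optional step rather than anything specific to NovoMolGen:

```python
import torch

# Optional: place the model on a GPU if one is available (standard transformers usage).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
```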
## Quickstart
```python
outputs = model.sample(tokenizer=tokenizer, batch_size=4)
print(outputs['SMILES'])
```
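Generated strings are not guaranteed to be chemically valid, so a common post-processing step is to filter them with RDKit. The snippet below is a minimal sketch that assumes `outputs['SMILES']` is a list of SMILES strings; RDKit is a separate dependency, not shipped with this model:

```python
from rdkit import Chem

# Keep only the SMILES that RDKit can parse into a valid molecule.
smiles_list = outputs['SMILES']
valid = [smi for smi in smiles_list if Chem.MolFromSmiles(smi) is not None]
print(f"{len(valid)}/{len(smiles_list)} generated SMILES are valid")
```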
## Citation
```bibtex
@article{chitsaz2025novomolgen,
  title={NovoMolGen: Rethinking Molecular Language Model Pretraining},
  author={Chitsaz, Kamran and Balaji, Roshan and Fournier, Quentin and Bhatt, Nirav Pravinbhai and Chandar, Sarath},
  journal={arXiv preprint},
  year={2025},
}
```