---
license: mit
datasets:
- ZINC-22
language:
- en
tags:
- molecular-generation
- drug-discovery
- llama
- flash-attention
pipeline_tag: text-generation
---
# NovoMolGen

NovoMolGen is a family of molecular foundation models trained on 1.5 billion ZINC-22 molecules using Llama architectures and FlashAttention. It achieves state-of-the-art performance on both unconstrained and goal-directed molecule generation tasks.
## How to load

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("chandar-lab/NovoMolGen_300M_SMILES_BPE", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("chandar-lab/NovoMolGen_300M_SMILES_BPE", trust_remote_code=True)
```
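
If a GPU is available, the model can be moved to it before sampling. The snippet below is an illustrative sketch rather than an official recipe; the half-precision dtype is an assumption and is not a documented requirement.

```python
import torch

# Optional: run on GPU, in half precision where supported (illustrative, not required).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device=device, dtype=torch.bfloat16 if device == "cuda" else torch.float32)
model.eval()
```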
## Quickstart

```python
# Sample a small batch of molecules; the output dict contains the generated SMILES strings.
outputs = model.sample(tokenizer=tokenizer, batch_size=4)
print(outputs['SMILES'])
```
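
As a quick sanity check (not part of NovoMolGen's own tooling), the generated SMILES can be validated and deduplicated with RDKit, assuming RDKit is installed:

```python
from rdkit import Chem

smiles_list = outputs['SMILES']
# MolFromSmiles returns None for strings that are not valid molecules.
valid = [s for s in smiles_list if Chem.MolFromSmiles(s) is not None]
print(f"valid: {len(valid)}/{len(smiles_list)}, unique: {len(set(valid))}")
```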
## Citation

```bibtex
@article{chitsaz2024novomolgen,
  title={NovoMolGen: Rethinking Molecular Language Model Pretraining},
  author={Chitsaz, Kamran and Balaji, Roshan and Fournier, Quentin and Bhatt, Nirav Pravinbhai and Chandar, Sarath},
  journal={arXiv preprint},
  year={2025},
}
```