---
license: mit
datasets:
  - ZINC-22
language:
  - en
tags:
  - molecular-generation
  - drug-discovery
  - llama
  - flash-attention
pipeline_tag: text-generation
---

# NovoMolGen

NovoMolGen is a family of molecular foundation models trained on 1.5 billion ZINC‑22 molecules using Llama architectures and FlashAttention. It achieves state‑of‑the‑art performance on both unconstrained and goal‑directed molecule generation tasks.

## How to load

```python
# Load the tokenizer and model; trust_remote_code is required because NovoMolGen ships custom modeling code
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("chandar-lab/NovoMolGen_300M_SMILES_BPE", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("chandar-lab/NovoMolGen_300M_SMILES_BPE", trust_remote_code=True)
```
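
For faster sampling, the model can optionally be moved to a GPU first. A minimal sketch, assuming a CUDA-capable device is available:

```python
import torch

# Use a GPU if one is available; otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()
```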

## Quickstart

```python
# Draw a small batch of molecules; sample() returns a dict containing the generated SMILES strings
outputs = model.sample(tokenizer=tokenizer, batch_size=4)
print(outputs['SMILES'])
```
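
Generated strings are not guaranteed to correspond to valid molecules. An illustrative sanity check with RDKit, assuming RDKit is installed and that `sample` returns a list of SMILES under the `'SMILES'` key as shown above:

```python
from rdkit import Chem

# Count how many sampled strings parse into valid RDKit molecules
smiles_list = outputs['SMILES']
valid = [s for s in smiles_list if Chem.MolFromSmiles(s) is not None]
print(f"{len(valid)}/{len(smiles_list)} valid SMILES")
```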

## Citation

```bibtex
@article{chitsaz2024novomolgen,
  title={NovoMolGen: Rethinking Molecular Language Model Pretraining},
  author={Chitsaz, Kamran and Balaji, Roshan and Fournier, Quentin and Bhatt, Nirav Pravinbhai and Chandar, Sarath},
  journal={arXiv preprint},
  year={2025},
}
```