---
license: mit
datasets:
- ZINC-22
language:
- en
tags:
- molecular-generation
- drug-discovery
- llama
- flash-attention
pipeline_tag: text-generation
---
# NovoMolGen

NovoMolGen is a family of molecular foundation models trained on 1.5 billion ZINC-22 molecules using Llama architectures and FlashAttention. It achieves state-of-the-art performance on both unconstrained and goal-directed molecule generation tasks.
## How to load

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("chandar-lab/NovoMolGen_300M_SMILES_BPE", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("chandar-lab/NovoMolGen_300M_SMILES_BPE", trust_remote_code=True)
```
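
If a GPU is available, the model can be moved to it before sampling. The snippet below is an illustrative sketch rather than an official recipe; the half-precision dtype is an assumption and is not a documented requirement.

```python
import torch

# Optional: run on GPU, in half precision where supported (illustrative, not required).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device=device, dtype=torch.bfloat16 if device == "cuda" else torch.float32)
model.eval()
```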
## Quickstart

```python
# Sample a small batch of molecules; the output dict contains the generated SMILES strings.
outputs = model.sample(tokenizer=tokenizer, batch_size=4)
print(outputs['SMILES'])
```
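
As a quick sanity check (not part of NovoMolGen's own tooling), the generated SMILES can be validated and deduplicated with RDKit, assuming RDKit is installed:

```python
from rdkit import Chem

smiles_list = outputs['SMILES']
# MolFromSmiles returns None for strings that are not valid molecules.
valid = [s for s in smiles_list if Chem.MolFromSmiles(s) is not None]
print(f"valid: {len(valid)}/{len(smiles_list)}, unique: {len(set(valid))}")
```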
## Citation

```bibtex
@article{chitsaz2024novomolgen,
  title={NovoMolGen: Rethinking Molecular Language Model Pretraining},
  author={Chitsaz, Kamran and Balaji, Roshan and Fournier, Quentin and Bhatt, Nirav Pravinbhai and Chandar, Sarath},
  journal={arXiv preprint},
  year={2025},
}
```