---
license: apache-2.0
datasets:
- katielink/moleculenet-benchmark
tags:
- biology
- chemistry
---
**This model was uploaded by Hugging Face staff. The model card has been copied from the [GitHub repository](https://github.com/IBM/molformer).**
# MoLFormer
**MoLFormer** is a large-scale chemical language model trained on small molecules represented as SMILES strings. MoLFormer leverages masked language modeling and employs a linear attention Transformer combined with rotary embeddings.
![MoLFormer](https://media.github.ibm.com/user/4935/files/594363e6-497b-4b91-9493-36ed46f623a2)
The image above gives an overview of the MoLFormer pipeline. A transformer-based neural network is trained in a self-supervised fashion on a large collection of chemical molecules, represented as SMILES sequences, from two public chemical datasets, PubChem and ZINC. The MoLFormer architecture was designed with an efficient linear attention mechanism and relative positional embeddings, with the goal of learning a meaningful and compressed representation of chemical molecules. After training, the MoLFormer foundation model was adapted to different downstream molecular property prediction tasks via fine-tuning on task-specific data. To further test the representational power of MoLFormer, its encodings were used to recover molecular similarity, and the correspondence between interatomic spatial distance and attention value was analyzed for a given molecule.
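
To illustrate why rotary embeddings count as relative positional embeddings, here is a minimal sketch of the general rotary technique in plain Python. It is not taken from the MoLFormer code base; the function names, the toy vectors, and the frequency base of 10000 are assumptions chosen for illustration.

```python
import math

def rotate(vec, pos, base=10000.0):
    """Apply a rotary position embedding to an even-length vector.

    Each consecutive pair (vec[i], vec[i + 1]) is rotated by an angle
    proportional to the position `pos`. `base` is an assumed default.
    """
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)  # rotation angle for this pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (illustrative values only).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Both pairs of positions are 3 apart, so the attention scores should agree.
score_near_start = dot(rotate(q, 5), rotate(k, 2))
score_shifted = dot(rotate(q, 105), rotate(k, 102))
```

Because each pair of coordinates is rotated by an angle proportional to the position, the dot product between a rotated query and a rotated key depends only on the offset between the two positions, which is the property the two scores above demonstrate.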
1. [Getting Started](#getting-started)
    1. [Pretrained Models and training logs](#pretrained-models-and-training-logs)
    2. [Replicating Conda Environment](#replicating-conda-environment)
2. [Data](#data)
    1. [Pretraining Datasets](#pretraining-datasets)
    2. [Finetuning Datasets](#finetuning-datasets)
3. [Pretraining](#pretraining)
4. [Finetuning](#finetuning)
5. [Feature Extraction](#feature-extraction)
6. [Attention Visualization Analysis](#attention-visualization-analysis)
7. [Citations](#citations)
## Getting Started
**This code and environment have been tested on NVIDIA V100 GPUs.**
#### Pretrained Models and training logs
When training from scratch, the resulting pretrained models and associated training logs will be located in the `data/` directory in the following hierarchy.
```
data/
├── Pretrained MoLFormer
│   ├── checkpoints
│   │   ├── N-Step-Checkpoint_0_0.ckpt
│   │   ├── N-Step-Checkpoint_0_5000.ckpt
│   │   ├── N-Step-Checkpoint_1_10000.ckpt
│   │   ├── N-Step-Checkpoint_1_15000.ckpt
│   │   ├── N-Step-Checkpoint_2_20000.ckpt
│   │   ├── N-Step-Checkpoint_3_25000.ckpt
│   │   └── N-Step-Checkpoint_3_30000.ckpt
│   ├── events.out.tfevents.1643396916.cccxc543.3427421.0
│   └── hparams.yaml
├── checkpoints
│   ├── linear_model.ckpt
│   └── full_model.ckpt
├── Full_Attention_Rotary_Training_Logs
│   ├── events.out.tfevents.1628698179.cccxc544.604661.0
│   └── hparams.yaml
└── Linear_Rotary_Training_Logs
    ├── events.out.tfevents.1620915522.cccxc406.63025.0
    └── hparams.yaml
```
We provide checkpoints of a MoLFormer model pre-trained on a dataset of ~100M molecules. This dataset combines 10% of the ZINC and 10% of the PubChem molecules used for MoLFormer-XL training. The accompanying pre-trained model shows competitive performance on classification and regression benchmarks from MoleculeNet (see Extended Data Tables 1-2 in [https://arxiv.org/abs/2106.09553](https://arxiv.org/abs/2106.09553)). These checkpoints are available at [https://ibm.box.com/v/MoLFormer-data](https://ibm.box.com/v/MoLFormer-data).
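
When several intermediate checkpoints are present, a small helper can pick the most recent one. The sketch below assumes that the file names in the listing above follow the pattern `N-Step-Checkpoint_<epoch>_<step>.ckpt`. Reading the two numbers as epoch and step is a guess based on the names, and `latest_checkpoint` is a hypothetical helper that is not part of the repository.

```python
import re
from pathlib import Path

# Assumed naming pattern: N-Step-Checkpoint_<epoch>_<step>.ckpt
CKPT_RE = re.compile(r"N-Step-Checkpoint_(\d+)_(\d+)\.ckpt$")

def latest_checkpoint(ckpt_dir):
    """Return the checkpoint path with the highest step, or None if none match."""
    best_key, best_path = None, None
    for path in Path(ckpt_dir).glob("*.ckpt"):
        match = CKPT_RE.search(path.name)
        if match is None:
            continue  # ignore files that do not follow the assumed pattern
        epoch, step = int(match.group(1)), int(match.group(2))
        key = (step, epoch)  # order by step first, then epoch
        if best_key is None or key > best_key:
            best_key, best_path = key, path
    return best_path
```

For the listing above, `latest_checkpoint("data/Pretrained MoLFormer/checkpoints")` would select `N-Step-Checkpoint_3_30000.ckpt`.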
#### Replicating Conda Environment
Due to the use of `apex.optimizers` in our code, Apex must be compiled from source. Step-by-step directions are provided in [environment.md](environment.md).
## Data
Datasets are available at [https://ibm.box.com/v/MoLFormer-data](https://ibm.box.com/v/MoLFormer-data)
### Pretraining Datasets
Because the combined PubChem and ZINC datasets are large (over 1.1 billion molecules in total), the code expects the data to be in a specific location and format. The processing for each individual dataset is documented below.
The code expects both the ZINC15 (ZINC) and PubChem datasets to be located in the ```./data/``` directory of the training directory.
* ZINC15 should be located in ```data/ZINC/``` and is expected to be split into multiple `.smi` files, each containing one SMILES string per line.
* PubChem should be located in ```data/pubchem/``` and is expected to be a single "CID-SMILES" text file with two columns (index and SMILES string). We took the raw PubChem dataset, converted every SMILES string into its canonical form using RDKit, and trimmed down the file itself. Our dataloader expects PubChem to be in our converted form and will not run on the raw PubChem file.
```
data/
├── pubchem
│   └── CID-SMILES-CANONICAL.smi
└── ZINC
    ├── AAAA.smi
    ├── AAAB.smi
    ├── AAAC.smi
    ├── AAAD.smi
    ├── AABA.smi
    ├── AABB.smi
    ├── AABD.smi
    ├── AACA.smi
    ├── AACB.smi
    ├── AAEA.smi
    ├── AAEB.smi
    ├── AAED.smi
    ├── ABAA.smi
    ├── ABAB.smi
    ├── ABAC.smi
    ├── ABAD.smi
    ├── ABBA.smi
    ├── ABBB.smi
    ├── ABBD.smi
    ├── ABCA.smi
    ├── ABCB.smi
    ├── ABCD.smi
    ├── ABEA.smi
    ├── ABEB.smi
    ├── ABEC.smi
    ├── ABED.smi
    ├── ACAA.smi
    └── ACAB.smi
```
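
Before launching a long pre-training run, it can help to confirm that the files are readable in the layout described above. The following sketch is not the repository's dataloader. It assumes the ZINC files hold one SMILES string per line and that the two columns of the PubChem file are separated by whitespace (the actual delimiter is not stated here). Both function names are hypothetical.

```python
from pathlib import Path

def iter_zinc_smiles(zinc_dir):
    """Yield one SMILES string per non-empty line of every .smi file."""
    for smi_file in sorted(Path(zinc_dir).glob("*.smi")):
        with open(smi_file) as fh:
            for line in fh:
                line = line.strip()
                if line:
                    yield line

def iter_pubchem_smiles(path):
    """Yield (index, smiles) pairs from a two-column text file.

    Assumes whitespace-separated columns; malformed lines are skipped.
    """
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) == 2:
                yield parts[0], parts[1]
```

A quick check such as `next(iter_pubchem_smiles("data/pubchem/CID-SMILES-CANONICAL.smi"))` reads only the first record, so it stays cheap even on the full file.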
### Finetuning Datasets
As with the pretraining data, the code expects the finetuning datasets to be in the following hierarchy. These datasets are provided in `finetune_datasets.zip`.
```
data/
├── bace
│   ├── test.csv
│   ├── train.csv
│   └── valid.csv
├── bbbp
│   ├── test.csv
│   ├── train.csv
│   └── valid.csv
├── clintox
│   ├── test.csv
│   ├── train.csv
│   └── valid.csv
├── esol
│   ├── test.csv
│   ├── train.csv
│   └── valid.csv
├── freesolv
│   ├── test.csv
│   ├── train.csv
│   └── valid.csv
├── hiv
│   ├── test.csv
│   ├── train.csv
│   └── valid.csv
├── lipo
│   ├── lipo_test.csv
│   ├── lipo_train.csv
│   └── lipo_valid.csv
├── qm9
│   ├── qm9.csv
│   ├── qm9_test.csv
│   ├── qm9_train.csv
│   └── qm9_valid.csv
├── sider
│   ├── test.csv
│   ├── train.csv
│   └── valid.csv
└── tox21
    ├── test.csv
    ├── tox21.csv
    ├── train.csv
    └── valid.csv
```
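
In the hierarchy above, most datasets name their splits `train.csv`, `valid.csv`, and `test.csv`, while `lipo` and `qm9` prefix the file names with the dataset name. The sketch below resolves either naming style and fails early if a split is missing. `split_paths` is a hypothetical helper written for this illustration, not a function from the repository.

```python
from pathlib import Path

SPLITS = ("train", "valid", "test")

def split_paths(data_root, dataset):
    """Map each split name to its CSV file, accepting both naming styles."""
    folder = Path(data_root) / dataset
    found = {}
    for split in SPLITS:
        # Try "train.csv" first, then "<dataset>_train.csv" (as in lipo, qm9).
        for candidate in (f"{split}.csv", f"{dataset}_{split}.csv"):
            if (folder / candidate).is_file():
                found[split] = folder / candidate
                break
        else:
            raise FileNotFoundError(
                f"no CSV found for split '{split}' of '{dataset}' in {folder}"
            )
    return found
```

For example, `split_paths("data", "lipo")["train"]` would resolve to `data/lipo/lipo_train.csv` under the layout shown above.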
## Pretraining
For pre-training, we use masked language modeling to train the model from scratch.
MoLFormer is pre-trained on canonicalized SMILES of >1 B molecules from ZINC and PubChem with the following constraints:
During pre-processing, compounds are filtered to a maximum length of 211 characters. A 100/0/0 split was used for training, validation, and test, i.e. all the data was used for training the model. As a confidence test, we evaluated the model at the end of each epoch on the following data (find the data we used for eval). Data canonicalization was performed using RDKit.
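
The length filter and the masking objective described above can be sketched in a few lines. This is a simplified illustration, not the repository's pre-processing or training code. The character-level tokenization, the `<mask>` symbol, and the 15% masking probability are assumptions that are not stated in this README.

```python
import random

MAX_SMILES_LEN = 211   # maximum length stated above
MASK_TOKEN = "<mask>"  # hypothetical mask symbol

def filter_by_length(smiles_list, max_len=MAX_SMILES_LEN):
    """Keep only SMILES strings with at most max_len characters."""
    return [s for s in smiles_list if len(s) <= max_len]

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling.

    Masked positions carry the original token as label; all other
    positions get a label of None and would be ignored by the loss.
    """
    if rng is None:
        rng = random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels

# Toy corpus: the third entry exceeds the length limit and is dropped.
corpus = filter_by_length(["CCO", "c1ccccc1", "C" * 300])
# Character-level tokens are a simplification of a real SMILES tokenizer.
inputs, labels = mask_tokens(list(corpus[1]), rng=random.Random(0))
```

The model is then trained to predict the original token at each masked position. BERT-style variants sometimes replace a selected token with a random one or leave it unchanged, which this sketch omits.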
The pre-training code provides an example of data processing and training for a model trained on a smaller pre-training dataset, which requires 16 V100 GPUs. The remainder of this README contains an installation guide for this repo, descriptions of and links to the pre-training and fine-tuning datasets, configuration files and Python code for model pre-training and fine-tuning, and Jupyter notebooks for attention map visualization and analysis for a given molecule. A MoLFormer instance pre-trained on xxx data is also provided.
To train a model, run:
> bash run_pubchem_light.sh
## Finetuning
The finetuning-related datasets and environment setup can be found in [finetuning datasets](finetuning_datasets) and [environment.md](environment.md), respectively. Once the environment is set up, you can run a fine-tuning task with:
> bash run_finetune_mu.sh
Finetuning training and checkpointing resources will be available in the directory named ```checkpoint_<measure_name>```. The path to the results CSV has the form ```./checkpoint_<measure_name>/<measure_name>/results/results_.csv```. The ```results_.csv``` file contains four columns of data. Column 1 contains the validation score for each epoch, and column 2 contains the test score for each epoch. Column 3 contains the best validation score observed up to that point of fine-tuning, and column 4 is the test score of the epoch that had the best validation score.
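
Given the column layout described above, the last row of the results file already summarizes a run: column 3 holds the best validation score seen so far and column 4 the matching test score. The sketch below reads those two values. It assumes the file has exactly four comma-separated numeric columns in that order and skips any non-numeric row such as a header (whether one exists is not stated here). `final_best_scores` is a hypothetical helper.

```python
import csv

def final_best_scores(results_csv):
    """Return (best_valid, test_at_best_valid) from the last numeric row."""
    last = None
    with open(results_csv, newline="") as fh:
        for row in csv.reader(fh):
            if len(row) < 4:
                continue  # blank or incomplete line
            try:
                values = [float(x) for x in row[:4]]
            except ValueError:
                continue  # non-numeric row, e.g. a header
            last = values
    if last is None:
        raise ValueError(f"no numeric rows found in {results_csv}")
    # Columns 3 and 4 (1-based) are indices 2 and 3.
    return last[2], last[3]
```

This avoids having to know whether a higher or lower score is better for a given task, since the fine-tuning script has already tracked the best validation epoch.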
## Feature Extraction
The notebook [frozen_embeddings_classification.ipynb](notebooks/pretrained_molformer/frozen_embeddings_classification.ipynb) contains code needed to load the [checkpoint files](https://ibm.box.com/v/MoLFormer-data) and use the pre-trained model as a feature extractor for a simple classification task.
Download the `Pretrained MoLFormer.zip` and `finetune_datasets.zip` and extract them to the `data/` folder. Follow the instructions in [environment.md](environment.md) to install all dependencies and then run the notebook.
## Attention Visualization Analysis
The `notebooks` directory provides attention visualization for two setups with rotary embeddings:
- **Linear attention** (./notebooks/linear_attention_rotary/attention_analysis_rotary_linear.ipynb)
- **Full attention** (./notebooks/full_attention_rotary/attention_analysis_rotary_full.ipynb)
The checkpoints required for the above models are to be placed in `./data/checkpoints`.
## Citations
```
@article{10.1038/s42256-022-00580-7,
year = {2022},
title = {{Large-scale chemical language representations capture molecular structure and properties}},
author = {Ross, Jerret and Belgodere, Brian and Chenthamarakshan, Vijil and Padhi, Inkit and Mroueh, Youssef and Das, Payel},
journal = {Nature Machine Intelligence},
doi = {10.1038/s42256-022-00580-7},
abstract = {{Models based on machine learning can enable accurate and fast molecular property predictions, which is of interest in drug discovery and material design. Various supervised machine learning models have demonstrated promising performance, but the vast chemical space and the limited availability of property labels make supervised learning challenging. Recently, unsupervised transformer-based language models pretrained on a large unlabelled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. This model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabelled molecules from the PubChem and ZINC datasets. We show that the learned molecular representation outperforms existing baselines, including supervised and self-supervised graph neural networks and language models, on several downstream tasks from ten benchmark datasets. They perform competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict various distinct molecular properties, including quantum-chemical properties. Large language models have recently emerged with extraordinary capabilities, and these methods can be applied to model other kinds of sequence, such as string representations of molecules. Ross and colleagues have created a transformer-based model, trained on a large dataset of molecules, which provides good results on property prediction tasks.}},
pages = {1256--1264},
number = {12},
volume = {4}
}
```
```
@misc{https://doi.org/10.48550/arxiv.2106.09553,
doi = {10.48550/ARXIV.2106.09553},
url = {https://arxiv.org/abs/2106.09553},
author = {Ross, Jerret and Belgodere, Brian and Chenthamarakshan, Vijil and Padhi, Inkit and Mroueh, Youssef and Das, Payel},
keywords = {Machine Learning (cs.LG), Computation and Language (cs.CL), Biomolecules (q-bio.BM), FOS: Computer and information sciences, FOS: Computer and information sciences, FOS: Biological sciences, FOS: Biological sciences},
title = {Large-Scale Chemical Language Representations Capture Molecular Structure and Properties},
publisher = {arXiv},
year = {2021},
copyright = {arXiv.org perpetual, non-exclusive license}
}
```