smahbub committed
Commit d6d05ed · verified · 1 Parent(s): 6598e0b

Update README.md

  1. README.md +131 -77
README.md CHANGED
@@ -4,98 +4,152 @@ base_model:
4
  license: other
5
  ---
6
 
 
7
  # Protein Inverse Folding
8
  Protein inverse folding is a computational technique for generating protein sequences that fold into a specific three-dimensional structure. The central challenge is identifying sequences that reliably adopt the intended structure. In our research, we concentrate on designing sequences based on the known backbone structure of a protein, represented by the 3D coordinates of its backbone atoms (without any information about the identities of the individual amino acids). Specifically, we fine-tune the [AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B) model with LoRA on the [CATH 4.2](https://pubmed.ncbi.nlm.nih.gov/9309224/) benchmark dataset. We use the same train, validation, and test splits as previous studies such as [LM-Design](https://arxiv.org/abs/2302.01649) and [DPLM](https://arxiv.org/abs/2402.18567). The current version of ModelGenerator contains the inference pipeline for protein inverse folding. Experimental pipelines for other datasets (both training and testing) will be included in the future.
9
 
10
- #### Setup:
11
  Install [ModelGenerator](https://github.com/genbio-ai/modelgenerator).
12
  - It is **required** to use [docker](https://www.docker.com/101-tutorial/) to run our inverse folding pipeline.
13
  - Please set up a docker image using our provided [Dockerfile](https://github.com/genbio-ai/ModelGenerator/blob/main/Dockerfile) and run the inverse folding inference from within the docker container.
14
- - Here is an example bash script to set up and access a docker container:
15
- ```
16
- # clone the ModelGenerator repository
17
- git clone https://github.com/genbio-ai/ModelGenerator.git
18
- # cd to "ModelGenerator" folder where you should find the "Dockerfile"
19
- cd ModelGenerator
20
- # create a docker image
21
- docker build -t aido .
22
- # create a local folder as ModelGenerator's data directory
23
- mkdir -p $HOME/mgen_data
24
- # run a container
25
- docker run -d --runtime=nvidia -it -v "$(pwd):/workspace" -v "$HOME/mgen_data:/mgen_data" aido /bin/bash
26
- # find the container ID
27
- docker ps # this will print the running containers and their IDs
28
- # execute the container with ID=<container_id>
29
- docker exec -it <container_id> /bin/bash # now you should be inside the docker container
30
- # test if you can access the nvidia GPUs
31
- nvidia-smi # this should print the GPUs' details
32
- ```
 
33
  - Execute the following steps from **within** the docker container you just created.
34
  - **Note:** Multi-GPU inference for inverse folding is not currently supported and will be included in the future.
35
 
36
- #### Download and merge model checkpoint chunks:
37
 
38
- - Download all the 15 model checkpoint chunks (named as `chunk_<chunk_ID>.bin`) from [here](https://huggingface.co/genbio-ai/AIDO.ProteinIF-16B/tree/main). Place them inside the directory `${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks`.
39
 
40
- **Alternatively**, you can do this by simply running the following script:
41
- ```
42
- mkdir -p ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/
43
- huggingface-cli download genbio-ai/AIDO.ProteinIF-16B \
44
  --repo-type model \
45
  --local-dir ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/
46
- # Merge chunks
47
- python merge_ckpt.py ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model.ckpt
48
- ```
 
 
49
 
50
- #### Download data:
51
- - Download the preprocessed CATH 4.2 dataset from [here](https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/tree/main/cath-4.2). You should find two files named [chain_set_map.pkl](https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/blob/main/cath-4.2/chain_set_map.pkl) and [chain_set_splits.json](https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/blob/main/cath-4.2/chain_set_splits.json). Place them inside the directory `${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/`. (Note that it was originally preprocessed by [Generative Models for Graph-Based Protein Design (Ingraham et al, NeurIPS'19)](https://papers.nips.cc/paper_files/paper/2019/file/f3a4ff4839c56a5f460c88cce3666a2b-Paper.pdf), and we further preprocessed it to suit our pipeline.)
 
52
 
53
- **Alternatively**, you can do it by simply running the following script:
54
- ```
55
- mkdir -p ${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/
56
- huggingface-cli download genbio-ai/protein-inverse-folding \
 
57
  --repo-type dataset \
58
  --local-dir ${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold
59
- ```
60
-
61
- #### Run inference:
62
- - From your terminal, change directory to `experiments/AIDO.Protein/protein_inverse_folding` folder and run the following script:
63
- ```
64
- cd experiments/AIDO.Protein/protein_inverse_folding
65
- # Run inference
66
- mgen test --config protein_inv_fold_test.yaml \
67
- --trainer.default_root_dir ${MGEN_DATA_DIR}/modelgenerator/logs/protein_inv_fold/ \
68
- --ckpt_path ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model.ckpt \
69
- --trainer.devices 0, \
70
- --data.path ${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/
71
- ```
72
-
73
- #### Outputs:
74
  - The evaluation score will be printed on the console.
75
- - The generated sequences will be stored the folder `proteinIF_outputs/`. There will be two output files:
76
- - **`./proteinIF_outputs/designed_sequences.pkl`**: This file will contain the raw token (amino-acid) IDs of the ground truth sequences (`"true_seq"`) and predicted sequences by our method (`"pred_seq"`), stored as numpy arrays. An example:
77
- ```
78
- {
79
- 'true_seq': [
80
- array([[ 4, 8, 4, 3, 12, 5, 2, 11, 16, 15, 5, 1, 11, ...]]), ...
81
- ],
82
- 'pred_seq': [
83
- array([[ 8, 2, 4, 3, 10, 6, 2, 11, 16, 15, 6, 1, 11, ...]]), ...
84
- ]
85
- }
86
- ```
87
- - **`./proteinIF_outputs/results_acc_<median_accuracy>.txt`** (where median accuracy is the median accuracy calculated over all the test samples):
88
- - Here, for each protein in the test set, we have three lines of information:
89
- - Line1: Identity of the protein (as '`name=<PDB_ID>.<CHAIN_ID>`'), length of the squence (as '`L=<length_of_sequence>`'), and the recovery rate/accuracy for that protein sequence (as '`Recovery=<recovery_rate_of_sequence>`')
90
- - Line2: *Single-letter representation* of amino-acids of the ground truth sequences (as `true:<sequence_of_amino_acids>`)
91
- - Line3: *Single-letter representation* of amino-acids of the predicted sequences by our method (as `pred:<sequence_of_amino_acids>`)
92
- - An example file content:
93
- ```
94
- >name=3fkf.A | L=141 | Recovery=0.5957446694374084
95
- true:VTVGKSAPYFSLPNEKGEKLSRSAERFRNRYLLLNFWASWCDPQPEANAELKRLNKEYKKNKNFAMLGISLDIDREAWETAIKKDTLSWDQVCDFTGLSSETAKQYAILTLPTNILLSPTGKILARDIQGEALTGKLKELL
96
- pred:TAVGDEAPYFELPDLEGKKLSLDSEEFKNKYLLLDFWASWCLPCREEIAELKELYRRFAKNKKFAILGVSADTDKEAWLKAVKEDNLRWTQVSDFKGWDSEVFKNYNVQSLPENILLSPEGKILARGIRGEALRNKLKELL
97
-
98
- >name=2d9e.A | L=121 | Recovery=0.7685950398445129
99
- true:GSSGSSGFLILLRKTLEQLQEKDTGNIFSEPVPLSEVPDYLDHIKKPMDFFTMKQNLEAYRYLNFDDFEEDFNLIVSNCLKYNAKDTIFYRAAVRLREQGGAVLRQARRQAEKMGSGPSSG
100
- pred:GSSGSSGRLTLLRETLEQLQERDTGWVFSEPVPLSEVPDYLDVIDHPMDFSTMRRKLEAHRYLSFDEFERDFNLIVENCRKYNAKDTVFYRAAVRLQAQGGAILRKARRDVESLGSGPSSG
101
- ```
4
  license: other
5
  ---
6
 
7
+
8
  # Protein Inverse Folding
9
  Protein inverse folding is a computational technique for generating protein sequences that fold into a specific three-dimensional structure. The central challenge is identifying sequences that reliably adopt the intended structure. In our research, we concentrate on designing sequences based on the known backbone structure of a protein, represented by the 3D coordinates of its backbone atoms (without any information about the identities of the individual amino acids). Specifically, we fine-tune the [AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B) model with LoRA on the [CATH 4.2](https://pubmed.ncbi.nlm.nih.gov/9309224/) benchmark dataset. We use the same train, validation, and test splits as previous studies such as [LM-Design](https://arxiv.org/abs/2302.01649) and [DPLM](https://arxiv.org/abs/2402.18567). The current version of ModelGenerator contains the inference pipeline for protein inverse folding. Experimental pipelines for other datasets (both training and testing) will be included in the future.
10
 
11
+ ##### Experimental Results
12
+ We present an evaluation of various inverse folding techniques using the [CATH-4.2 dataset](https://pubmed.ncbi.nlm.nih.gov/9309224/).
13
+ AIDO.ProteinIF is compared against current state-of-the-art models across three experimental settings inspired by LM-Design:
14
+ (a) sequences with fewer than 100 residues (short chains),
15
+ (b) single-chain proteins (those represented by a single entry in CATH 4.2), and
16
+ (c) the full CATH 4.2 dataset.
17
+ DPLM's scores are taken from [its paper](https://arxiv.org/abs/2402.18567), and all other scores (except those of AIDO.ProteinIF) are taken from [LM-Design's paper](https://arxiv.org/abs/2302.01649).
18
+ The best and second-best scores in each column are highlighted in bold and italic, respectively.
19
+ A dash (-) indicates that the corresponding score was not available in the original source.
20
+
21
+ | **Models** | **Short-chains - PPL ↓** | **Short-chains - MSRR ↑** | **Single-chains - PPL ↓** | **Single-chains - MSRR ↑** | **Full - PPL ↓** | **Full - MSRR ↑** |
22
+ |-----------------------|------------------|------------------|-------------------|------------------|----------------|------------------|
23
+ | GVP | 7.23 | 30.60 | 7.84 | 28.95 | 5.36 | 39.47 |
24
+ | ProteinMPNN | 6.21 | 36.35 | 6.68 | 34.43 | 4.61 | 45.96 |
25
+ | ProteinMPNN-CMLM | 7.16 | 35.42 | 7.25 | 35.71 | 5.03 | 48.62 |
26
+ | PiFold | *6.04* | **39.84** | *6.31* | 38.53 | 4.55 | 51.66 |
27
+ | LM-Design | 7.01 | 35.19 | 6.58 | *40.00* | *4.41* | 54.41 |
28
+ | DPLM | - | - | - | - | - | *54.54* |
29
+ | **AIDO.ProteinIF** | **4.29** | *38.46* | **3.18** | **58.87** | **3.20** | **58.60** |
30
+
31
+
32
+
33
+ #
34
+
35
+ In the following sections, we discuss how to use AIDO.ProteinIF for inference and testing on CATH 4.2 using ModelGenerator.
36
+
37
+ #### Setup
38
  Install [ModelGenerator](https://github.com/genbio-ai/modelgenerator).
39
  - It is **required** to use [docker](https://www.docker.com/101-tutorial/) to run our inverse folding pipeline.
40
  - Please set up a docker image using our provided [Dockerfile](https://github.com/genbio-ai/ModelGenerator/blob/main/Dockerfile) and run the inverse folding inference from within the docker container.
41
+
42
+ Here is an example bash script to set up and access a docker container:
43
+ ```
44
+ # clone the ModelGenerator repository
45
+ git clone https://github.com/genbio-ai/ModelGenerator.git
46
+ # cd to "ModelGenerator" folder where you should find the "Dockerfile"
47
+ cd ModelGenerator
48
+ # create a docker image
49
+ docker build -t aido .
50
+ # create a local folder as ModelGenerator's data directory
51
+ mkdir -p $HOME/mgen_data
52
+ # run a container
53
+ docker run -d --runtime=nvidia -it -v "$(pwd):/workspace" -v "$HOME/mgen_data:/mgen_data" aido /bin/bash
54
+ # find the container ID
55
+ docker ps # this will print the running containers and their IDs
56
+ # execute the container with ID=<container_id>
57
+ docker exec -it <container_id> /bin/bash # now you should be inside the docker container
58
+ # test if you can access the nvidia GPUs
59
+ nvidia-smi # this should print the GPUs' details
60
+ ```
61
  - Execute the following steps from **within** the docker container you just created.
62
  - **Note:** Multi-GPU inference for inverse folding is not currently supported and will be included in the future.
63
 
64
+ #### Download and merge model checkpoint chunks
65
 
66
+ Download all 15 model checkpoint chunks (named `chunk_<chunk_ID>.bin`) from [here](https://huggingface.co/genbio-ai/AIDO.ProteinIF-16B/tree/main). Place them inside the directory `${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks` and merge them.
67
 
68
+ You can do this by simply running the following script:
69
+ ```
70
+ mkdir -p ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/
71
+ huggingface-cli download genbio-ai/AIDO.ProteinIF-16B \
72
  --repo-type model \
73
  --local-dir ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/
74
+ # change directory to the folder: /workspace/experiments/AIDO.Protein/protein_inverse_folding/
75
+ cd /workspace/experiments/AIDO.Protein/protein_inverse_folding/
76
+ # Merge chunks
77
+ python merge_ckpt.py ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model.ckpt
78
+ ```
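
Before merging, it can help to confirm that every chunk was downloaded. The following is a minimal sketch, not part of ModelGenerator: the helper name `missing_chunk_count` is hypothetical, and it relies only on the two facts stated above (15 chunks, named `chunk_<chunk_ID>.bin`).

```python
from pathlib import Path

# The README states that the checkpoint is split into 15 chunks.
EXPECTED_CHUNKS = 15


def missing_chunk_count(chunk_dir, expected=EXPECTED_CHUNKS):
    """Return how many chunk_*.bin files are still missing from chunk_dir.

    Hypothetical helper: it only counts files matching the naming pattern
    and does not validate their contents.
    """
    found = list(Path(chunk_dir).glob("chunk_*.bin"))
    return max(expected - len(found), 0)
```

Calling `missing_chunk_count(...)` on the `model_chunks` directory should return `0` once the download has completed; any other value suggests the download should be repeated before running `merge_ckpt.py`.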
79
 
80
+ #### Download data
81
+ ##### For inference on CATH 4.2:
82
+ Download the preprocessed CATH 4.2 dataset from [here](https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/tree/main/cath-4.2). You should find two files named [chain_set_map.pkl](https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/blob/main/cath-4.2/chain_set_map.pkl) and [chain_set_splits.json](https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/blob/main/cath-4.2/chain_set_splits.json). Place them inside the directory `${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/`. (Note that it was originally preprocessed by [Generative Models for Graph-Based Protein Design (Ingraham et al, NeurIPS'19)](https://papers.nips.cc/paper_files/paper/2019/file/f3a4ff4839c56a5f460c88cce3666a2b-Paper.pdf), and we further preprocessed it to suit our pipeline.)
83
 
84
+ **Alternatively**, you can do it by simply running the following script:
85
+ ```
86
+ DATA_DIR=${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2
87
+ mkdir -p ${DATA_DIR}/
88
+ huggingface-cli download genbio-ai/protein-inverse-folding \
89
  --repo-type dataset \
90
  --local-dir ${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold
91
+ ```
92
+
93
+ ##### For inference on a protein from [PDB](https://www.rcsb.org/):
94
+ First download a single 3D structure (PDB/CIF file) from PDB:
95
+ ```
96
+ DATA_DIR=${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/custom_data ## The directory where you want to download the PDB/CIF file. Feel free to change.
97
+ PDB_ID=5YH2 ## example protein's PDB ID
98
+ CHAIN_ID=A ## example protein's CHAIN ID
99
+ mkdir -p ${DATA_DIR}/
100
+ wget https://files.rcsb.org/download/${PDB_ID}.cif -P ${DATA_DIR}/
101
+ ```
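
If `wget` is not available inside the container, the same download can be done from Python. This is a sketch using only the standard library; the function names `rcsb_cif_url` and `download_cif` are hypothetical, and the URL pattern is the one used in the `wget` command above.

```python
from pathlib import Path
from urllib.request import urlretrieve


def rcsb_cif_url(pdb_id):
    """Build the same download URL that the wget command above uses."""
    return f"https://files.rcsb.org/download/{pdb_id}.cif"


def download_cif(pdb_id, data_dir):
    """Download <pdb_id>.cif into data_dir and return the local path."""
    target_dir = Path(data_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
    target = target_dir / f"{pdb_id}.cif"
    urlretrieve(rcsb_cif_url(pdb_id), str(target))
    return target
```

For the example above, `download_cif("5YH2", data_dir)` would place `5YH2.cif` in the chosen data directory.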
102
+
103
+ Then convert it into our format:
104
+ ```
105
+ python preprocess_PDB.py ${DATA_DIR}/${PDB_ID}.cif ${CHAIN_ID} ${DATA_DIR}/
106
+ ```
107
+
108
+ #### Run inference
109
+ From your terminal, change directory to the `/workspace/experiments/AIDO.Protein/protein_inverse_folding` folder and run the following script:
110
+ ```
111
+ cd /workspace/experiments/AIDO.Protein/protein_inverse_folding/
112
+ # Run inference
113
+ mgen test --config protein_inv_fold_test.yaml \
114
+ --trainer.default_root_dir ${MGEN_DATA_DIR}/modelgenerator/logs/protein_inv_fold/ \
115
+ --ckpt_path ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model.ckpt \
116
+ --trainer.devices 0, \
117
+ --data.path ${DATA_DIR}/
118
+ ```
119
+
120
+ #### Outputs
121
  - The evaluation score will be printed on the console.
122
+ - The generated sequences will be stored in the folder `./proteinIF_outputs/`. There will be two output files:
123
+ 1. **`designed_sequences.pkl`**,
124
+ 2. **`results_acc_<median_accuracy>.txt`** (where median accuracy is the median accuracy calculated over all the test samples)
125
+
126
+ The contents of the files are described below:
127
+
128
+ ##### Output file 1: **`designed_sequences.pkl`**
129
+ This file will contain the raw token (amino-acid) IDs of the ground truth sequences (`"true_seq"`) and the sequences predicted by our method (`"pred_seq"`), stored as numpy arrays. An example:
130
+ ```
131
+ {
132
+ 'true_seq': [
133
+ array([[ 4, 8, 4, 3, 12, 5, 2, 11, 16, 15, 5, 1, 11, ...]]), ...
134
+ ],
135
+ 'pred_seq': [
136
+ array([[ 8, 2, 4, 3, 10, 6, 2, 11, 16, 15, 6, 1, 11, ...]]), ...
137
+ ]
138
+ }
139
+ ```
140
+ ##### Output file 2: **`results_acc_<median_accuracy>.txt`**
141
+ Here, for each protein in the test set, we have three lines of information:
142
+ - **Line1**: Identity of the protein (as '`name=<PDB_ID>.<CHAIN_ID>`'), length of the sequence (as '`L=<length_of_sequence>`'), and the recovery rate/accuracy for that protein sequence (as '`Recovery=<recovery_rate_of_sequence>`')
143
+ - **Line2**: *Single-letter representation* of the amino acids of the ground truth sequence (as `true:<sequence_of_amino_acids>`)
144
+ - **Line3**: *Single-letter representation* of the amino acids of the sequence predicted by our method (as `pred:<sequence_of_amino_acids>`)
145
+
146
+ An example file content:
147
+ ```
148
+ >name=3fkf.A | L=141 | Recovery=0.5957446694374084
149
+ true:VTVGKSAPYFSLPNEKGEKLSRSAERFRNRYLLLNFWASWCDPQPEANAELKRLNKEYKKNKNFAMLGISLDIDREAWETAIKKDTLSWDQVCDFTGLSSETAKQYAILTLPTNILLSPTGKILARDIQGEALTGKLKELL
150
+ pred:TAVGDEAPYFELPDLEGKKLSLDSEEFKNKYLLLDFWASWCLPCREEIAELKELYRRFAKNKKFAILGVSADTDKEAWLKAVKEDNLRWTQVSDFKGWDSEVFKNYNVQSLPENILLSPEGKILARGIRGEALRNKLKELL
151
+
152
+ >name=2d9e.A | L=121 | Recovery=0.7685950398445129
153
+ true:GSSGSSGFLILLRKTLEQLQEKDTGNIFSEPVPLSEVPDYLDHIKKPMDFFTMKQNLEAYRYLNFDDFEEDFNLIVSNCLKYNAKDTIFYRAAVRLREQGGAVLRQARRQAEKMGSGPSSG
154
+ pred:GSSGSSGRLTLLRETLEQLQERDTGWVFSEPVPLSEVPDYLDVIDHPMDFSTMRRKLEAHRYLSFDEFERDFNLIVENCRKYNAKDTVFYRAAVRLQAQGGAILRKARRDVESLGSGPSSG
155
+ ```
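
The three-line record format described above is straightforward to parse. The following sketch assumes the file follows the layout of the example exactly (a `>name=... | L=... | Recovery=...` header followed by `true:` and `pred:` lines); the function names are hypothetical and not part of ModelGenerator.

```python
import re
import statistics

# Header layout taken from the example file content above.
HEADER = re.compile(
    r"^>name=(?P<name>\S+) \| L=(?P<length>\d+) \| Recovery=(?P<recovery>[0-9.eE+-]+)$"
)


def parse_results(text):
    """Parse the results text into a list of per-protein dicts."""
    records = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank lines separate records
        match = HEADER.match(line)
        if match:
            current = {
                "name": match["name"],
                "length": int(match["length"]),
                "recovery": float(match["recovery"]),
            }
            records.append(current)
        elif current is not None and line.startswith("true:"):
            current["true"] = line[len("true:"):]
        elif current is not None and line.startswith("pred:"):
            current["pred"] = line[len("pred:"):]
    return records


def median_recovery(records):
    """Median of the per-protein recovery rates."""
    return statistics.median(record["recovery"] for record in records)
```

Reading the file with `open(path).read()` and passing the text to `parse_results` yields one dict per protein; `median_recovery` should then be comparable to the value embedded in the file name, assuming that value is computed as the median of the same per-protein rates.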