Commit 57310f0 by smahbub · verified · 1 Parent(s): a60f9ed

Update README.md

Files changed (1)
  1. README.md +41 -10
README.md CHANGED
@@ -8,21 +8,52 @@ license: other
  We finetune the [AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B) model with LoRA on the [CATH 4.2](https://pubmed.ncbi.nlm.nih.gov/9309224/) benchmark dataset. We use the same train, validation, and test splits as previous studies, such as [LM-Design](https://arxiv.org/abs/2302.01649) and [DPLM](https://arxiv.org/abs/2402.18567). The current version of ModelGenerator contains the inference pipeline for protein inverse folding. Experimental pipelines for other datasets (both training and testing) will be included in the future.

  #### Setup:
- Install [Model Generator](https://github.com/genbio-ai/modelgenerator).
  - It is **required** to use [docker](https://www.docker.com/101-tutorial/) to run our inverse folding pipeline.
- - Please set up a docker image using our provided [Dockerfile](https://github.com/genbio-ai/ModelGenerator/blob/main/Dockerfile) and run the inverse folding inference from within the docker container.
-
- #### Running inference:
-
- - Set the environment variable for ModelGenerator's data directory (**Note:** the docker image with our provided [Dockerfile](https://github.com/genbio-ai/ModelGenerator/blob/main/Dockerfile) will already have it set):
  ```
- export MGEN_DATA_DIR=~/mgen_data # or any other local directory of your choice, if you would like to change it inside [Dockerfile](https://github.com/genbio-ai/ModelGenerator/blob/main/Dockerfile)
  ```
- - Download all the 15 model checkpoint chunks (named as `chunk_<chunk_ID>.bin`) from [here](https://huggingface.co/genbio-ai/AIDO.ProteinIF-16B/tree/main). Place them inside the directory `${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks`.

- - Download the CATH 4.2 dataset preprocessed by [Generative Models for Graph-Based Protein Design (Ingraham et al, NeurIPS'19)](https://papers.nips.cc/paper_files/paper/2019/file/f3a4ff4839c56a5f460c88cce3666a2b-Paper.pdf) from [here](http://people.csail.mit.edu/ingraham/graph-protein-design/data/cath/). You should find two files named `chain_set.jsonl` and `chain_set_splits.json`. Place them inside the directory `${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/`.

- - Then run the bash script:
  ```
  bash prot_inverse_folding.sh
  ```
 
  We finetune the [AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B) model with LoRA on the [CATH 4.2](https://pubmed.ncbi.nlm.nih.gov/9309224/) benchmark dataset. We use the same train, validation, and test splits as previous studies, such as [LM-Design](https://arxiv.org/abs/2302.01649) and [DPLM](https://arxiv.org/abs/2402.18567). The current version of ModelGenerator contains the inference pipeline for protein inverse folding. Experimental pipelines for other datasets (both training and testing) will be included in the future.

  #### Setup:
+ Install [ModelGenerator](https://github.com/genbio-ai/modelgenerator).
  - It is **required** to use [docker](https://www.docker.com/101-tutorial/) to run our inverse folding pipeline.
+ - Please set up a docker image using our provided [Dockerfile](https://github.com/genbio-ai/ModelGenerator/blob/main/Dockerfile) and run the inverse folding inference from within the docker container.
+ - Here is an example bash script to set up and access a docker container:
  ```
+ # clone the ModelGenerator repository
+ git clone https://github.com/genbio-ai/ModelGenerator.git
+ # change into the "ModelGenerator" folder, which contains the "Dockerfile"
+ cd ModelGenerator
+ # create a docker image
+ docker build -t aido .
+ # create a local folder as ModelGenerator's data directory
+ mkdir -p "$HOME/mgen_data"
+ # run a container
+ docker run -d --runtime=nvidia -it -v "$(pwd):/workspace" -v "$HOME/mgen_data:/mgen_data" aido /bin/bash
+ # find the container ID
+ docker ps # this will print the running containers and their IDs
+ # open a shell in the container with ID=<container_id>
+ docker exec -it <container_id> /bin/bash # now you should be inside the docker container
+ # test if you can access the nvidia GPUs
+ nvidia-smi # this should print the GPUs' details
  ```
+ - Execute the following steps from **within** the docker container you just created.
+
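The download and inference steps all build paths from `${MGEN_DATA_DIR}`, so it can help to confirm that the variable is set before going further. The following is a minimal sketch of such a check, assuming the image built from the provided Dockerfile sets `MGEN_DATA_DIR` inside the container; `check_mgen_data_dir` is a hypothetical helper and not part of ModelGenerator.

```bash
# Hypothetical helper (not part of ModelGenerator): confirms that
# MGEN_DATA_DIR is set and points to an existing directory.
check_mgen_data_dir() {
  if [ -z "${MGEN_DATA_DIR:-}" ]; then
    echo "MGEN_DATA_DIR is not set" >&2
    return 1
  fi
  if [ ! -d "$MGEN_DATA_DIR" ]; then
    echo "MGEN_DATA_DIR=$MGEN_DATA_DIR is not a directory" >&2
    return 1
  fi
  echo "MGEN_DATA_DIR=$MGEN_DATA_DIR"
}
```

Calling `check_mgen_data_dir` inside the container prints the resolved directory, or returns a non-zero status with a message on stderr if the variable is missing or invalid.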
+ #### Download model checkpoints:
+
+ - Download all 15 model checkpoint chunks (named `chunk_<chunk_ID>.bin`) from [here](https://huggingface.co/genbio-ai/AIDO.ProteinIF-16B/tree/main). Place them inside the directory `${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks`.
+
+ **Alternatively**, you can run the following script to do this (Note: this script uses the [wget](https://www.gnu.org/software/wget/) tool):
+ ```
+ mkdir -p "${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks"
+ bash download_model_chunks.sh "${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks"
+ ```
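An interrupted download can leave the chunk directory incomplete, so a quick count before running inference may save a failed run. This is a minimal sketch that only counts files matching `chunk_*.bin` and compares the total with the 15 chunks stated above; `check_chunks` is a hypothetical helper, and it does not verify file contents or the chunk ID numbering.

```bash
# Hypothetical helper (not part of ModelGenerator): counts files named
# chunk_*.bin in a directory and compares the count with an expected total.
check_chunks() {
  chunk_dir="$1"
  expected="${2:-15}"   # this README states there are 15 chunks
  found=$(find "$chunk_dir" -maxdepth 1 -type f -name 'chunk_*.bin' 2>/dev/null | wc -l | tr -d '[:space:]')
  if [ "$found" -eq "$expected" ]; then
    echo "OK: $found checkpoint chunks in $chunk_dir"
  else
    echo "Found $found checkpoint chunks in $chunk_dir, expected $expected" >&2
    return 1
  fi
}
```

Inside the container it could be called as `check_chunks "${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks"`.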

+ #### Download data:
+ - Download the preprocessed CATH 4.2 dataset from [here](https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/tree/main/cath-4.2). You should find two files named [chain_set_map.pkl](https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/blob/main/cath-4.2/chain_set_map.pkl) and [chain_set_splits.json](https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/blob/main/cath-4.2/chain_set_splits.json). Place them inside the directory `${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/`. (Note that the dataset was originally preprocessed by [Generative Models for Graph-Based Protein Design (Ingraham et al., NeurIPS'19)](https://papers.nips.cc/paper_files/paper/2019/file/f3a4ff4839c56a5f460c88cce3666a2b-Paper.pdf), and we further preprocessed it to suit our pipeline.)
+
+ **Alternatively**, you can do this by running the following script:
+ ```
+ mkdir -p "${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/"
+ wget -P "${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/" https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/resolve/main/cath-4.2/chain_set_map.pkl
+ wget -P "${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/" https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/resolve/main/cath-4.2/chain_set_splits.json
+ ```
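Before starting inference, it may be worth confirming that both dataset files landed in the expected directory. The sketch below only checks that `chain_set_map.pkl` and `chain_set_splits.json` exist and are non-empty; `check_cath_files` is a hypothetical helper, and it makes no attempt to validate the file contents.

```bash
# Hypothetical helper (not part of ModelGenerator): checks that the two
# CATH 4.2 files named in this README exist and are non-empty.
check_cath_files() {
  cath_dir="$1"
  missing=0
  for name in chain_set_map.pkl chain_set_splits.json; do
    if [ -s "$cath_dir/$name" ]; then
      echo "OK: $cath_dir/$name"
    else
      echo "Missing or empty: $cath_dir/$name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Inside the container it could be called as `check_cath_files "${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2"`.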

+ #### Run inference:
+ - Then run the bash script for inference:
  ```
  bash prot_inverse_folding.sh
  ```