Update README.md
We finetune the [AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B) model with LoRA on the [CATH 4.2](https://pubmed.ncbi.nlm.nih.gov/9309224/) benchmark dataset. We use the same train, validation, and test splits used by previous studies such as [LM-Design](https://arxiv.org/abs/2302.01649) and [DPLM](https://arxiv.org/abs/2402.18567). The current version of ModelGenerator contains the inference pipeline for protein inverse folding. Experimental pipelines for other datasets (both training and testing) will be included in the future.

#### Setup:

- Install [ModelGenerator](https://github.com/genbio-ai/modelgenerator).
- It is **required** to use [docker](https://www.docker.com/101-tutorial/) to run our inverse folding pipeline.
- Please set up a docker image using our provided [Dockerfile](https://github.com/genbio-ai/ModelGenerator/blob/main/Dockerfile) and run the inverse folding inference from within the docker container.
- Here is an example bash script to set up and access a docker container:
```
# clone the ModelGenerator repository
git clone https://github.com/genbio-ai/ModelGenerator.git
# cd into the "ModelGenerator" folder, where you should find the "Dockerfile"
cd ModelGenerator
# build a docker image
docker build -t aido .
# create a local folder as ModelGenerator's data directory
mkdir -p $HOME/mgen_data
# run a container
docker run -d --runtime=nvidia -it -v "$(pwd):/workspace" -v "$HOME/mgen_data:/mgen_data" aido /bin/bash
# find the container ID
docker ps # this will print the running containers and their IDs
# open a shell inside the container with ID=<container_id>
docker exec -it <container_id> /bin/bash # now you should be inside the docker container
# test if you can access the nvidia GPUs
nvidia-smi # this should print the GPUs' details
```
- Execute the following steps from **within** the docker container you just created.
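The steps below reference the `${MGEN_DATA_DIR}` environment variable. If your container does not already define it, a minimal sketch is the following; the `/mgen_data` default is an assumption taken from the mount point in the example `docker run` command above:

```shell
# point MGEN_DATA_DIR at the data directory mounted into the container;
# /mgen_data is an assumption based on the example `docker run` command
# above -- adjust it if you mounted the host folder elsewhere
export MGEN_DATA_DIR="${MGEN_DATA_DIR:-/mgen_data}"
echo "MGEN_DATA_DIR=${MGEN_DATA_DIR}"
```

Adding the `export` line to the container's `~/.bashrc` keeps the variable set across shells.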
#### Download model checkpoints:

- Download all 15 model checkpoint chunks (named `chunk_<chunk_ID>.bin`) from [here](https://huggingface.co/genbio-ai/AIDO.ProteinIF-16B/tree/main). Place them inside the directory `${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks`.
**Alternatively**, you can simply run the following script to do this (Note: this script uses the [wget](https://www.gnu.org/software/wget/) tool):
```
mkdir -p ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks
bash download_model_chunks.sh ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks
```
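Once the download finishes, a quick sanity check can confirm that all 15 chunks are in place. This is a sketch that assumes only the chunk count and `chunk_<chunk_ID>.bin` naming described above:

```shell
# count the downloaded checkpoint chunks; there should be 15 files
# named chunk_<chunk_ID>.bin in the model_chunks directory
CKPT_DIR="${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks"
n=$(ls "${CKPT_DIR}"/chunk_*.bin 2>/dev/null | wc -l)
echo "found ${n} of 15 expected checkpoint chunks"
```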
#### Download data:

- Download the preprocessed CATH 4.2 dataset from [here](https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/tree/main/cath-4.2). You should find two files named [chain_set_map.pkl](https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/blob/main/cath-4.2/chain_set_map.pkl) and [chain_set_splits.json](https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/blob/main/cath-4.2/chain_set_splits.json). Place them inside the directory `${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/`. (Note that this dataset was originally preprocessed by [Generative Models for Graph-Based Protein Design (Ingraham et al., NeurIPS'19)](https://papers.nips.cc/paper_files/paper/2019/file/f3a4ff4839c56a5f460c88cce3666a2b-Paper.pdf), and we further preprocessed it to suit our pipeline.)
**Alternatively**, you can do this by simply running the following script:
```
mkdir -p ${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/
wget -P ${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/ https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/resolve/main/cath-4.2/chain_set_map.pkl
wget -P ${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/ https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/resolve/main/cath-4.2/chain_set_splits.json
```
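To verify that the split file downloaded intact, you can parse it and print the split sizes. This is a hedged sketch: it assumes only that `chain_set_splits.json` maps split names to lists of chain identifiers, following the Ingraham et al. preprocessing referenced above:

```shell
# print the number of chains in each split of chain_set_splits.json
SPLITS="${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/chain_set_splits.json"
python3 -c "
import json, sys
with open(sys.argv[1]) as f:
    splits = json.load(f)
for name, chains in splits.items():
    print(f'{name}: {len(chains)} chains')
" "$SPLITS"
```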
#### Run inference:

- Then run the bash script for inference:
```
bash prot_inverse_folding.sh
```