genbio-ai/AIDO.ProteinIF-16B

smahbub committed on Dec 20, 2024

Commit 57310f0 (verified) · 1 parent: a60f9ed

Update README.md

Files changed (1): README.md (+41 -10)

@@ -8,21 +8,52 @@ license: other
 We fine-tune the [AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B) model with LoRA on the [CATH 4.2](https://pubmed.ncbi.nlm.nih.gov/9309224/) benchmark dataset. We use the same train, validation, and test splits as previous studies, such as [LM-Design](https://arxiv.org/abs/2302.01649) and [DPLM](https://arxiv.org/abs/2402.18567). The current version of ModelGenerator contains the inference pipeline for protein inverse folding. Experimental pipelines for other datasets (both training and testing) will be included in the future.
 #### Setup:
-Install [Model Generator](https://github.com/genbio-ai/modelgenerator).
 - It is **required** to use [docker](https://www.docker.com/101-tutorial/) to run our inverse folding pipeline.
-- Please set up a docker image using our provided [Dockerfile](https://github.com/genbio-ai/ModelGenerator/blob/main/Dockerfile) and run the inverse folding inference from within the docker container.
-#### Running inference:
-- Set the environment variable for ModelGenerator's data directory (**Note:** the docker image with our provided [Dockerfile](https://github.com/genbio-ai/ModelGenerator/blob/main/Dockerfile) will already have it set):
     ```
-    export MGEN_DATA_DIR=~/mgen_data # or any other local directory of your choice, if you would like to change it inside [Dockerfile](https://github.com/genbio-ai/ModelGenerator/blob/main/Dockerfile)
     ```
-- Download all the 15 model checkpoint chunks (named as `chunk_<chunk_ID>.bin`) from [here](https://huggingface.co/genbio-ai/AIDO.ProteinIF-16B/tree/main). Place them inside the directory `${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks`.
-- Download the CATH 4.2 dataset preprocessed by [Generative Models for Graph-Based Protein Design (Ingraham et al, NeurIPS'19)](https://papers.nips.cc/paper_files/paper/2019/file/f3a4ff4839c56a5f460c88cce3666a2b-Paper.pdf) from [here](http://people.csail.mit.edu/ingraham/graph-protein-design/data/cath/). You should find two files named `chain_set.jsonl` and `chain_set_splits.json`. Place them inside the directory `${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/`.
-- Then run the bash script:
     ```
     bash prot_inverse_folding.sh
     ```

 We fine-tune the [AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B) model with LoRA on the [CATH 4.2](https://pubmed.ncbi.nlm.nih.gov/9309224/) benchmark dataset. We use the same train, validation, and test splits as previous studies, such as [LM-Design](https://arxiv.org/abs/2302.01649) and [DPLM](https://arxiv.org/abs/2402.18567). The current version of ModelGenerator contains the inference pipeline for protein inverse folding. Experimental pipelines for other datasets (both training and testing) will be included in the future.
 #### Setup:
+Install [ModelGenerator](https://github.com/genbio-ai/modelgenerator).
 - It is **required** to use [docker](https://www.docker.com/101-tutorial/) to run our inverse folding pipeline.
+- Please set up a docker image using our provided [Dockerfile](https://github.com/genbio-ai/ModelGenerator/blob/main/Dockerfile) and run the inverse folding inference from within the docker container.
+  - Here is an example bash script to set up and access a docker container:
     ```
+    # clone the ModelGenerator repository
+    git clone https://github.com/genbio-ai/ModelGenerator.git
+    # cd to "ModelGenerator" folder where you should find the "Dockerfile"
+    cd ModelGenerator
+    # create a docker image
+    docker build -t aido .
+    # create a local folder as ModelGenerator's data directory
+    mkdir -p $HOME/mgen_data
+    # run a container
+    docker run -d --runtime=nvidia -it -v "$(pwd):/workspace" -v "$HOME/mgen_data:/mgen_data" aido /bin/bash
+    # find the container ID
+    docker ps # this will print the running containers and their IDs
+    # open a shell in the container with ID=<container_id>
+    docker exec -it <container_id> /bin/bash  # now you should be inside the docker container
+    # test if you can access the nvidia GPUs
+    nvidia-smi # this should print the GPUs' details
     ```
+- Execute the following steps from **within** the docker container you just created.
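On the host, the container ID lookup in the setup script could also be done without copying the ID by hand. The following is a minimal sketch, assuming the image was tagged `aido` as in the script above; `container_for_image` is a hypothetical helper name and is not part of ModelGenerator.

```shell
# Hypothetical helper (not part of ModelGenerator): print the ID of the
# first running container that was started from the given image.
container_for_image() {
    local image="${1:-aido}"   # "aido" is the tag used in `docker build -t aido .`
    # -q prints only container IDs; the ancestor filter matches by image
    docker ps -q --filter "ancestor=${image}" | head -n 1
}
```

It could then be used as `docker exec -it "$(container_for_image aido)" /bin/bash`. If several containers run from the same image, only the first one listed is returned.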
+#### Download model checkpoints:
+- Download all 15 model checkpoint chunks (named `chunk_<chunk_ID>.bin`) from [here](https://huggingface.co/genbio-ai/AIDO.ProteinIF-16B/tree/main). Place them inside the directory `${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks`.
+  **Alternatively**, you can simply run the following script to do this (Note: this script uses the [wget](https://www.gnu.org/software/wget/) tool):
+  ```
+  mkdir -p ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks
+  bash download_model_chunks.sh ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks
+  ```
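Because the checkpoint is split into 15 files, an interrupted download can leave the directory incomplete. The sketch below counts the `chunk_<chunk_ID>.bin` files and compares the count against the 15 stated above. `check_chunks` is a hypothetical helper, not something shipped with ModelGenerator, and it only checks the file count, not file integrity.

```shell
# Hypothetical helper: verify that the expected number of checkpoint
# chunks is present in a directory. Checks the count only, not contents.
check_chunks() {
    local dir="$1"
    local expected="${2:-15}"   # the README states 15 chunks
    local found
    if [ ! -d "$dir" ]; then
        echo "Not a directory: $dir" >&2
        return 1
    fi
    # count regular files named chunk_<chunk_ID>.bin directly inside $dir
    found=$(find "$dir" -maxdepth 1 -type f -name 'chunk_*.bin' | wc -l)
    found=$((found))            # normalize whitespace padding from wc
    if [ "$found" -eq "$expected" ]; then
        echo "OK: found $found chunk files in $dir"
        return 0
    fi
    echo "Incomplete: found $found of $expected chunk files in $dir" >&2
    return 1
}
```

For example: `check_chunks "${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks"`.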
+#### Download data:
+- Download the preprocessed CATH 4.2 dataset from [here](https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/tree/main/cath-4.2). You should find two files named [chain_set_map.pkl](https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/blob/main/cath-4.2/chain_set_map.pkl) and [chain_set_splits.json](https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/blob/main/cath-4.2/chain_set_splits.json). Place them inside the directory `${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/`. (Note that the dataset was originally preprocessed by [Generative Models for Graph-Based Protein Design (Ingraham et al., NeurIPS'19)](https://papers.nips.cc/paper_files/paper/2019/file/f3a4ff4839c56a5f460c88cce3666a2b-Paper.pdf), and we further preprocessed it to suit our pipeline.)
+  **Alternatively**, you can do it by simply running the following script:
+  ```
+  mkdir -p ${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/
+  wget -P ${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/ https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/resolve/main/cath-4.2/chain_set_map.pkl
+  wget -P ${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/ https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/resolve/main/cath-4.2/chain_set_splits.json
+  ```
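A quick way to confirm the data step worked is to check that both files named above exist and are non-empty. This is a minimal sketch; `check_cath_files` is a hypothetical helper, and it does not validate the contents of either file.

```shell
# Hypothetical helper: confirm both CATH 4.2 files are present and non-empty.
check_cath_files() {
    local dir="$1"
    local missing=0
    local f
    for f in chain_set_map.pkl chain_set_splits.json; do
        # -s is true only if the file exists and has a size greater than zero
        if [ ! -s "$dir/$f" ]; then
            echo "Missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example: `check_cath_files "${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2"`.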
+#### Run inference:
+- Then run the bash script for inference:
     ```
     bash prot_inverse_folding.sh
     ```
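Every path in the download steps depends on `MGEN_DATA_DIR`, so it may help to confirm that the variable is set before starting inference. The guard below is a sketch under that assumption; `require_data_dir` is a hypothetical name and not part of the inference script.

```shell
# Hypothetical guard: fail early if MGEN_DATA_DIR is unset or not a directory.
require_data_dir() {
    if [ -z "${MGEN_DATA_DIR:-}" ]; then
        echo "MGEN_DATA_DIR is not set" >&2
        return 1
    fi
    if [ ! -d "$MGEN_DATA_DIR" ]; then
        echo "MGEN_DATA_DIR is not a directory: $MGEN_DATA_DIR" >&2
        return 1
    fi
    echo "Using data directory: $MGEN_DATA_DIR"
}
```

It could be chained in front of the inference script, for example `require_data_dir && bash prot_inverse_folding.sh`.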