smahbub committed on
Commit
89f0b84
·
verified ·
1 Parent(s): 63b37db

Update README.md

  1. README.md +98 -82
README.md CHANGED
@@ -6,97 +6,113 @@ base_model:
6
  # RNA Inverse Folding
7
  RNA inverse folding is a computational method for designing RNA sequences that fold into predetermined three-dimensional structures. Our study focuses on generating sequences from the known backbone structure of an RNA, defined by the 3D coordinates of its backbone atoms, without any information about the individual bases. Specifically, we fully fine-tune the [AIDO.RNA-1.6B](https://huggingface.co/genbio-ai/AIDO.RNA-1.6B) model on the single-state split from [Das _et al._](https://www.nature.com/articles/nmeth.1433), already processed by [Joshi _et al._](https://arxiv.org/abs/2305.14749). We use the same train, validation, and test splits as their method, [gRNAde](https://arxiv.org/abs/2305.14749). The current version of ModelGenerator contains the inference pipeline for RNA inverse folding. Experimental pipelines for other datasets (both training and testing) will be included in the future.
8
 
9
- #### Setup:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
10
  Install [ModelGenerator](https://github.com/genbio-ai/modelgenerator).
11
  - It is **required** to use [docker](https://www.docker.com/101-tutorial/) to run our inverse folding pipeline.
12
  - Please set up a docker image using our provided [Dockerfile](https://github.com/genbio-ai/ModelGenerator/blob/main/Dockerfile) and run the inverse folding inference from within the docker container.
13
- - Here is an example bash script to set up and access a docker container:
14
- ```
15
- # clone the ModelGenerator repository
16
- git clone https://github.com/genbio-ai/ModelGenerator.git
17
- # cd to "ModelGenerator" folder where you should find the "Dockerfile"
18
- cd ModelGenerator
19
- # create a docker image
20
- docker build -t aido .
21
- # create a local folder as ModelGenerator's data directory
22
- mkdir -p $HOME/mgen_data
23
- # run a container
24
- docker run -d --runtime=nvidia -it -v "$(pwd):/workspace" -v "$HOME/mgen_data:/mgen_data" aido /bin/bash
25
- # find the container ID
26
- docker ps # this will print the running containers and their IDs
27
- # execute the container with ID=<container_id>
28
- docker exec -it <container_id> /bin/bash # now you should be inside the docker container
29
- # test if you can access the nvidia GPUs
30
- nvidia-smi # this should print the GPUs' details
31
- ```
 
32
  - Execute the following steps from **within** the docker container you just created.
33
  - **Note:** Multi-GPU inference for inverse folding is not currently supported and will be included in the future.
34
 
35
- #### Download model checkpoints:
36
 
37
  - Download the `model.ckpt` checkpoint from [here](https://huggingface.co/genbio-ai/AIDO.RNAIF-1.6B/blob/main/model.ckpt). Place it inside the local directory `${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B`.
38
 
39
  - Download the gRNAde checkpoint named `gRNAde_ARv1_1state_das.h5` from the [huggingface-hub](https://huggingface.co/genbio-ai/AIDO.RNAIF-1.6B/blob/main/other_models/gRNAde_ARv1_1state_all.h5) ***or*** the [original source](https://github.com/chaitjo/geometric-rna-design/blob/main/checkpoints/gRNAde_ARv1_1state_all.h5). Place it inside the directory `${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B/other_models`. Set the environment variable `gRNAde_CKPT_PATH=${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B/other_models/gRNAde_ARv1_1state_das.h5`
40
 
41
- **Alternatively**, you can simply run the following script to do both of these steps:
42
- ```
43
- mkdir -p ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B
44
- huggingface-cli download genbio-ai/AIDO.RNAIF-1.6B \
45
- --repo-type model \
46
- --local-dir ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B
47
- # Set the environment variable gRNAde_CKPT_PATH
48
- export gRNAde_CKPT_PATH=${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B/other_models/gRNAde_ARv1_1state_das.h5
49
- ```
50
-
51
- #### Download data:
52
- - Download the data preprocessed by [Joshi _et al._](https://arxiv.org/abs/2305.14749). Mainly download these two files: [processed.pt.zip](https://huggingface.co/datasets/genbio-ai/rna-inverse-folding/blob/main/processed.pt.zip) and [processed_df.csv](https://huggingface.co/datasets/genbio-ai/rna-inverse-folding/blob/main/processed_df.csv). Place them inside the directory `${MGEN_DATA_DIR}/modelgenerator/datasets/rna_inv_fold/raw_data/`. Please refer to [this link](https://github.com/chaitjo/geometric-rna-design/tree/main?tab=readme-ov-file#downloading-and-preparing-data) for details about the dataset and its preprocessing.
53
-
54
- **Alternatively**, you can run the following script:
55
- ```
56
- mkdir -p ${MGEN_DATA_DIR}/modelgenerator/datasets/rna_inv_fold/raw_data/
57
- huggingface-cli download genbio-ai/rna-inverse-folding \
58
- --repo-type dataset \
59
- --local-dir ${MGEN_DATA_DIR}/modelgenerator/datasets/rna_inv_fold/raw_data/
60
- ```
61
-
62
- #### Run inference:
63
- - From your terminal, change directory to `experiments/AIDO.RNA/rna_inverse_folding` folder and run the following script:
64
- ```
65
- cd modelgenerator/rna_inv_fold/gRNAde_structure_encoder
66
- echo "Running inference.."
67
- python main.py
68
- echo "Extracting structure encoding.."
69
- python main_encoder_only.py
70
- cd ../../../experiments/AIDO.RNA/rna_inverse_folding/
71
- # run inference
72
- mgen test --config rna_inv_fold_test.yaml \
73
- --trainer.default_root_dir ${MGEN_DATA_DIR}/modelgenerator/logs/rna_inv_fold/ \
74
- --ckpt_path ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B/model.ckpt \
75
- --trainer.devices 0, \
76
- --data.path ${MGEN_DATA_DIR}/modelgenerator/datasets/rna_inv_fold/structure_encoding/
77
- ```
78
-
79
- #### Outputs:
80
  - The evaluation score will be printed on the console.
81
- - The generated sequences will be stored in `./rnaIF_outputs/designed_sequences.json`.
82
- - In this file, we will have:
83
- 1. **`"true_seq"`**: the ground truth sequences,
84
- 2. **`"pred_seq"`**: predicted sequences by our method,
85
- 3. **`"baseline_seq"`**: predicted sequences by the baseline method [gRNAde](https://arxiv.org/abs/2305.14749).
86
- - An example file content with two test samples is shown below:
87
- ```
88
- {
89
- "true_seq": [
90
- "CCCAGUCCACCGGGUGAGAAGGGGGCAGAGAAACACACGACGUGGUGCAUUACCUGCC",
91
- "UCCCGUCCACCGCGGUGAGAAGGGGGCAGAGAAACACACGAUCGUGGUACAUUACCUGCC",
92
- ],
93
- "pred_seq": [
94
- "UGGGGAGCCCCCGGGGUGAACCAGCCGGUGAAAGGCACCCGGUGAUCGGUCAGCCCAC",
95
- "GCGGAUGCCCCGCCCGGUCAACCGCAUGGUGAAAUCCACGCGCCUGGUGGGUUAGCCAUG",
96
- ],
97
- "baseline_seq": [
98
- "UGGUGAGCCCCCGGGGUGAACCAGUAGGUGAAAGGCACCCGGUGAUCGGUCAGCCCAC",
99
- "GCGGAUGCCGGGCCCGGUCCACCGCAUGGUGAAAUUCAGGCGCCUGGAGGGUUAGCCAUG",
100
- ]
101
- }
102
- ```
 
6
  # RNA Inverse Folding
7
  RNA inverse folding is a computational method for designing RNA sequences that fold into predetermined three-dimensional structures. Our study focuses on generating sequences from the known backbone structure of an RNA, defined by the 3D coordinates of its backbone atoms, without any information about the individual bases. Specifically, we fully fine-tune the [AIDO.RNA-1.6B](https://huggingface.co/genbio-ai/AIDO.RNA-1.6B) model on the single-state split from [Das _et al._](https://www.nature.com/articles/nmeth.1433), already processed by [Joshi _et al._](https://arxiv.org/abs/2305.14749). We use the same train, validation, and test splits as their method, [gRNAde](https://arxiv.org/abs/2305.14749). The current version of ModelGenerator contains the inference pipeline for RNA inverse folding. Experimental pipelines for other datasets (both training and testing) will be included in the future.
8
 
9
+ ##### Experimental Results
10
+ We evaluate our model in two settings: (1) adaptation with conditional diffusion where AIDO.RNA is fine-tuned for the inverse folding task; and (2) zero-shot generation where AIDO.RNA is frozen.
11
+ | Model | Mean Sequence Recovery Rate |
12
+ | ------ | ------ |
13
+ | gRNAde | 52.78 |
14
+ | gRNAde+AIDO.RNA-Zeroshot | 53.16 |
15
+ | gRNAde+AIDO.RNA-Finetuned | 54.41 |
16
+
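For readers unfamiliar with the metric, the sketch below shows one common way to compute sequence recovery: the percentage of positions at which a designed sequence matches the native sequence, averaged over the test samples. This is an assumption about how the metric is defined, not the evaluation code that produced the table; the function names `sequence_recovery` and `mean_sequence_recovery` are hypothetical.

```python
from typing import Sequence


def sequence_recovery(true_seq: str, pred_seq: str) -> float:
    """Percentage of positions where pred_seq matches true_seq.

    Assumed definition: position-wise identity between two
    equal-length sequences, expressed as a percentage.
    """
    if len(true_seq) != len(pred_seq):
        raise ValueError("sequences must have the same length")
    if not true_seq:
        raise ValueError("sequences must be non-empty")
    matches = sum(t == p for t, p in zip(true_seq, pred_seq))
    return 100.0 * matches / len(true_seq)


def mean_sequence_recovery(
    true_seqs: Sequence[str], pred_seqs: Sequence[str]
) -> float:
    """Average the per-sample recovery over all test samples."""
    if len(true_seqs) != len(pred_seqs):
        raise ValueError("need one prediction per ground-truth sequence")
    if not true_seqs:
        raise ValueError("need at least one sample")
    scores = [sequence_recovery(t, p) for t, p in zip(true_seqs, pred_seqs)]
    return sum(scores) / len(scores)
```

Averaging per-sample percentages weights every RNA equally regardless of its length; pooling matches over all positions would instead favor long sequences.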
17
+ ##### Acknowledgement
18
+ We thank the authors of [gRNAde](https://arxiv.org/abs/2305.14749) for providing their models' checkpoints and configuration files.
19
+
20
+ #
21
+
22
+ In the following sections, we discuss how to use AIDO.RNA for RNA inverse folding using ModelGenerator.
23
+
24
+ #### Setup
25
  Install [ModelGenerator](https://github.com/genbio-ai/modelgenerator).
26
  - It is **required** to use [docker](https://www.docker.com/101-tutorial/) to run our inverse folding pipeline.
27
  - Please set up a docker image using our provided [Dockerfile](https://github.com/genbio-ai/ModelGenerator/blob/main/Dockerfile) and run the inverse folding inference from within the docker container.
28
+
29
+ Here is an example bash script to set up and access a docker container:
30
+ ```
31
+ # clone the ModelGenerator repository
32
+ git clone https://github.com/genbio-ai/ModelGenerator.git
33
+ # cd to "ModelGenerator" folder where you should find the "Dockerfile"
34
+ cd ModelGenerator
35
+ # create a docker image
36
+ docker build -t aido .
37
+ # create a local folder as ModelGenerator's data directory
38
+ mkdir -p $HOME/mgen_data
39
+ # run a container
40
+ docker run -d --runtime=nvidia -it -v "$(pwd):/workspace" -v "$HOME/mgen_data:/mgen_data" aido /bin/bash
41
+ # find the container ID
42
+ docker ps # this will print the running containers and their IDs
43
+ # execute the container with ID=<container_id>
44
+ docker exec -it <container_id> /bin/bash # now you should be inside the docker container
45
+ # test if you can access the nvidia GPUs
46
+ nvidia-smi # this should print the GPUs' details
47
+ ```
48
  - Execute the following steps from **within** the docker container you just created.
49
  - **Note:** Multi-GPU inference for inverse folding is not currently supported and will be included in the future.
50
 
51
+ #### Download model checkpoints
52
 
53
  - Download the `model.ckpt` checkpoint from [here](https://huggingface.co/genbio-ai/AIDO.RNAIF-1.6B/blob/main/model.ckpt). Place it inside the local directory `${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B`.
54
 
55
  - Download the gRNAde checkpoint named `gRNAde_ARv1_1state_das.h5` from the [huggingface-hub](https://huggingface.co/genbio-ai/AIDO.RNAIF-1.6B/blob/main/other_models/gRNAde_ARv1_1state_all.h5) ***or*** the [original source](https://github.com/chaitjo/geometric-rna-design/blob/main/checkpoints/gRNAde_ARv1_1state_all.h5). Place it inside the directory `${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B/other_models`. Set the environment variable `gRNAde_CKPT_PATH=${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B/other_models/gRNAde_ARv1_1state_das.h5`
56
 
57
+ **Alternatively**, you can simply run the following script to do both of these steps:
58
+ ```
59
+ mkdir -p ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B
60
+ huggingface-cli download genbio-ai/AIDO.RNAIF-1.6B \
61
+ --repo-type model \
62
+ --local-dir ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B
63
+ # Set the environment variable gRNAde_CKPT_PATH
64
+ export gRNAde_CKPT_PATH=${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B/other_models/gRNAde_ARv1_1state_das.h5
65
+ ```
66
+
67
+ #### Download data
68
+ Download the data preprocessed by [Joshi _et al._](https://arxiv.org/abs/2305.14749). Mainly download these two files: [processed.pt.zip](https://huggingface.co/datasets/genbio-ai/rna-inverse-folding/blob/main/processed.pt.zip) and [processed_df.csv](https://huggingface.co/datasets/genbio-ai/rna-inverse-folding/blob/main/processed_df.csv). Place them inside the directory `${MGEN_DATA_DIR}/modelgenerator/datasets/rna_inv_fold/raw_data/`. Please refer to [this link](https://github.com/chaitjo/geometric-rna-design/tree/main?tab=readme-ov-file#downloading-and-preparing-data) for details about the dataset and its preprocessing.
69
+
70
+ **Alternatively**, you can run the following script:
71
+ ```
72
+ mkdir -p ${MGEN_DATA_DIR}/modelgenerator/datasets/rna_inv_fold/raw_data/
73
+ huggingface-cli download genbio-ai/rna-inverse-folding \
74
+ --repo-type dataset \
75
+ --local-dir ${MGEN_DATA_DIR}/modelgenerator/datasets/rna_inv_fold/raw_data/
76
+ ```
77
+
78
+ #### Run inference
79
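+ From the root of the ModelGenerator repository (inside the docker container), run the following script. It runs the gRNAde structure encoder first, then changes to the `experiments/AIDO.RNA/rna_inverse_folding` folder for inference: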
80
+ ```
81
+ cd modelgenerator/rna_inv_fold/gRNAde_structure_encoder
82
+ echo "Running inference.."
83
+ python main.py
84
+ echo "Extracting structure encoding.."
85
+ python main_encoder_only.py
86
+ cd ../../../experiments/AIDO.RNA/rna_inverse_folding/
87
+ # run inference
88
+ mgen test --config rna_inv_fold_test.yaml \
89
+ --trainer.default_root_dir ${MGEN_DATA_DIR}/modelgenerator/logs/rna_inv_fold/ \
90
+ --ckpt_path ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/rna_inv_fold/AIDO.RNAIF-1.6B/model.ckpt \
91
+ --trainer.devices 0, \
92
+ --data.path ${MGEN_DATA_DIR}/modelgenerator/datasets/rna_inv_fold/structure_encoding/
93
+ ```
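If you prefer to launch the final step from Python, the `mgen test` call above can be assembled as an argument list. This is only a sketch: it reuses the flags and paths shown in the script and invents nothing beyond the hypothetical helper name `build_mgen_test_command`. As we understand Lightning's `devices` argument, the trailing comma in `0,` selects GPU index 0 rather than requesting zero devices, so the helper keeps it.

```python
from pathlib import Path
from typing import List


def build_mgen_test_command(mgen_data_dir: Path, device: int = 0) -> List[str]:
    """Assemble the `mgen test` command from the README as an argv list."""
    base = mgen_data_dir / "modelgenerator"
    ckpt = base / "huggingface_models" / "rna_inv_fold" / "AIDO.RNAIF-1.6B" / "model.ckpt"
    return [
        "mgen", "test",
        "--config", "rna_inv_fold_test.yaml",
        "--trainer.default_root_dir", str(base / "logs" / "rna_inv_fold"),
        "--ckpt_path", str(ckpt),
        # Trailing comma kept from the README: a single device index, not a count.
        "--trainer.devices", f"{device},",
        "--data.path", str(base / "datasets" / "rna_inv_fold" / "structure_encoding"),
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True, cwd="experiments/AIDO.RNA/rna_inverse_folding")` from the repository root, after the two structure-encoder scripts have finished.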
94
+
95
+ #### Outputs
96
  - The evaluation score will be printed on the console.
97
+ - The generated sequences are stored in `./rnaIF_outputs/designed_sequences.json`. This file contains:
98
+ 1. **`"true_seq"`**: the ground truth sequences,
99
+ 2. **`"pred_seq"`**: the sequences predicted by our method,
100
+ 3. **`"baseline_seq"`**: the sequences predicted by the baseline method [gRNAde](https://arxiv.org/abs/2305.14749).
101
+
102
+ An example file content with two test samples is shown below:
103
+ ```
104
+ {
105
+ "true_seq": [
106
+ "CCCAGUCCACCGGGUGAGAAGGGGGCAGAGAAACACACGACGUGGUGCAUUACCUGCC",
107
+ "UCCCGUCCACCGCGGUGAGAAGGGGGCAGAGAAACACACGAUCGUGGUACAUUACCUGCC"
108
+ ],
109
+ "pred_seq": [
110
+ "UGGGGAGCCCCCGGGGUGAACCAGCCGGUGAAAGGCACCCGGUGAUCGGUCAGCCCAC",
111
+ "GCGGAUGCCCCGCCCGGUCAACCGCAUGGUGAAAUCCACGCGCCUGGUGGGUUAGCCAUG"
112
+ ],
113
+ "baseline_seq": [
114
+ "UGGUGAGCCCCCGGGGUGAACCAGUAGGUGAAAGGCACCCGGUGAUCGGUCAGCCCAC",
115
+ "GCGGAUGCCGGGCCCGGUCCACCGCAUGGUGAAAUUCAGGCGCCUGGAGGGUUAGCCAUG"
116
+ ]
117
+ }
118
+ ```