Update README.md
README.md
@@ -10,56 +10,86 @@ As with proteins, structure determines RNA function. RNA secondary structure, fo
We preprocessed and split the datasets (into train, test, and validation splits) in the same way as the previous study [RiNALMo](https://doi.org/10.48550/arXiv.2403.00043).

##### Experimental Results

The following table reports RNA secondary structure prediction results on the bpRNA test set (bpRNA-TS0).

| **Model**           | **Precision** | **Recall** | **F1-score** |
|---------------------|---------------|------------|--------------|
| SPOT-RNA            | 0.594         | 0.693      | 0.619        |
| UFold               | 0.607         | 0.741      | 0.654        |
| RNA-FM              | 0.709         | 0.664      | 0.676        |
| RNAErnie            | 0.575         | 0.678      | 0.622        |
| RiNALMo             | 0.784         | 0.730      | 0.747        |
| **AIDO.RNA (ours)** | **0.815**     | **0.769**  | **0.783**    |

The following table reports inter-family generalization for secondary structure prediction on the filtered Archive-II dataset. Each entry is the average F1 score; bold denotes the best performance within a family.

| **RNA family**     | **AIDO.RNA (ours)** | **RNAstructure** | **CONTRAfold** | **RiNALMo** | **RNA-FM** | **MXfold2** | **UFold** |
|--------------------|---------------------|------------------|----------------|-------------|------------|-------------|-----------|
| 5S rRNA            | 0.853               | 0.63             | 0.63           | **0.88**    | 0.57       | 0.54        | 0.53      |
| SRP RNA            | **0.739**           | 0.63             | 0.55           | 0.70        | 0.25       | 0.50        | 0.26      |
| tRNA               | **0.945**           | 0.70             | 0.77           | 0.93        | 0.79       | 0.64        | 0.26      |
| tmRNA              | **0.838**           | 0.43             | 0.49           | 0.80        | 0.28       | 0.46        | 0.41      |
| RNase P RNA        | **0.804**           | 0.55             | 0.63           | 0.80        | 0.31       | 0.51        | 0.41      |
| Group I Intron     | 0.644               | 0.54             | 0.60           | **0.66**    | 0.16       | 0.45        | 0.45      |
| 16S rRNA           | **0.795**           | 0.57             | 0.58           | 0.74        | 0.14       | 0.55        | 0.41      |
| Telomerase RNA     | 0.085               | 0.50             | 0.54           | 0.12        | 0.07       | 0.34        | **0.80**  |
| 23S rRNA           | **0.896**           | 0.73             | 0.71           | 0.85        | 0.19       | 0.64        | 0.45      |
| Average            | **0.733**           | 0.59             | 0.61           | 0.72        | 0.31       | 0.51        | 0.44      |

#### To finetune [AIDO.RNA-1.6B](https://huggingface.co/genbio-ai/AIDO.RNA-1.6B) on RNA SS:

- Set the environment variable for ModelGenerator's data directory:
```
export MGEN_DATA_DIR=~/mgen_data # or any other local directory of your choice
```

- Download the preprocessed data (provided as a zip file named `rna_ss_data.zip`) from [here](https://huggingface.co/datasets/genbio-ai/rna-secondary-structure-prediction/blob/main/rna_ss_data.zip). Unzip `rna_ss_data.zip` inside the directory `${MGEN_DATA_DIR}/modelgenerator/datasets/`.

**Alternatively**, you can run the following script to download and unzip the data in one go:
```
export MGEN_DATA_DIR=~/mgen_data # or any other local directory of your choice
mkdir -p "${MGEN_DATA_DIR}/modelgenerator/datasets/"
wget -P "${MGEN_DATA_DIR}/modelgenerator/datasets/" https://huggingface.co/datasets/genbio-ai/rna-secondary-structure-prediction/resolve/main/rna_ss_data.zip
unzip "${MGEN_DATA_DIR}/modelgenerator/datasets/rna_ss_data.zip" -d "${MGEN_DATA_DIR}/modelgenerator/datasets/"
```

You should find two sub-folders containing the preprocessed datasets:
1. bpRNA: `${MGEN_DATA_DIR}/modelgenerator/datasets/rna_ss_data/bpRNA`
2. Archive-II: `${MGEN_DATA_DIR}/modelgenerator/datasets/rna_ss_data/archiveII`

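As a quick sanity check after unzipping, a small helper along the following lines can confirm that both sub-folders are in place. This is only a sketch: the function name `check_rna_ss_data` is hypothetical (it is not part of ModelGenerator), and it assumes `MGEN_DATA_DIR` is set as in the step above, falling back to `~/mgen_data` otherwise.

```shell
# Hypothetical helper (not part of ModelGenerator): report whether the two
# expected dataset sub-folders exist under the data directory.
check_rna_ss_data() {
    _root="${MGEN_DATA_DIR:-$HOME/mgen_data}/modelgenerator/datasets/rna_ss_data"
    _status=0
    for _name in bpRNA archiveII; do
        if [ -d "${_root}/${_name}" ]; then
            echo "found:   ${_root}/${_name}"
        else
            echo "missing: ${_root}/${_name}"
            _status=1
        fi
    done
    # Non-zero exit status if any folder is missing.
    return "${_status}"
}
```

Calling `check_rna_ss_data` prints one line per folder and returns a non-zero status if either one is missing, so it can also be used as a guard before launching a job.
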
- Then run a finetuning job on either dataset as follows (note that here we use the finetuning scheduler; see [this tutorial](https://github.com/genbio-ai/ModelGenerator/blob/main/docs/docs/tutorials/finetuning_scheduler.md) for details):
1. To train on the bpRNA dataset, run the following command:
```
bash rna_secondary_structure_prediction.sh train bpRNA
```
2. Alternatively, to finetune on the Archive-II dataset (for the inter-family generalization experiment discussed in the paper [AIDO.RNA](https://doi.org/10.1101/2024.11.28.625345)), run the following command:
```
bash rna_secondary_structure_prediction.sh train archiveII_<FamilyName>
```
Here, `<FamilyName>` is one of the following nine strings, each representing an RNA family in the Archive-II dataset: `5s`, `16s`, `23s`, `grp1`, `srp`, `telomerase`, `RNaseP`, `tmRNA`, `tRNA`. Note that, following the convention used by [RiNALMo's code repository](https://github.com/lbcb-sci/RiNALMo/tree/main), the chosen `<FamilyName>` is used only as the **test set**, and the remaining families are used for training and validation. For example, a finetuning run with the `5s` family:
```
bash rna_secondary_structure_prediction.sh train archiveII_5s
```
Here, the [AIDO.RNA-1.6B](https://huggingface.co/genbio-ai/AIDO.RNA-1.6B) model will be finetuned using **all other splits except `archiveII_5s`**.

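To reproduce the full inter-family experiment, each of the nine families has to be held out in turn. The loop below is a minimal sketch of how that could be scripted; the function name `run_all_archiveii` and the `DRY_RUN` switch are illustrative assumptions, not part of the provided script. By default it only prints the commands, so nothing is launched by accident.

```shell
# The nine Archive-II family names listed above.
ARCHIVEII_FAMILIES="5s 16s 23s grp1 srp telomerase RNaseP tmRNA tRNA"

# Hypothetical wrapper: one finetuning job per held-out family, run sequentially.
# With DRY_RUN=1 (the default here) the commands are printed instead of executed.
run_all_archiveii() {
    for family in ${ARCHIVEII_FAMILIES}; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "bash rna_secondary_structure_prediction.sh train archiveII_${family}"
        else
            # Stop at the first failing job rather than continuing silently.
            bash rna_secondary_structure_prediction.sh train "archiveII_${family}" || return 1
        fi
    done
}

run_all_archiveii
```

Setting `DRY_RUN=0` before calling `run_all_archiveii` launches the jobs one after another from the directory that contains `rna_secondary_structure_prediction.sh`.
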
#### To test a finetuned checkpoint on RNA SS:

- Finetune [AIDO.RNA-1.6B](https://huggingface.co/genbio-ai/AIDO.RNA-1.6B) as discussed above, **or** download the `model.ckpt` checkpoint from [here](https://huggingface.co/genbio-ai/AIDO.RNA-1.6B-inv-fold).
- Test the checkpoint on the _corresponding dataset_ as follows (replace `/path/to/checkpoint` with the actual path to the finetuned checkpoint):
1. To test on the bpRNA dataset, run the following command:
```
bash rna_secondary_structure_prediction.sh test bpRNA /path/to/checkpoint
```
2. Alternatively, to test on the Archive-II dataset, run the following command:
```
bash rna_secondary_structure_prediction.sh test archiveII_<FamilyName> /path/to/checkpoint
```
See the previous section for details on `<FamilyName>`.

#### Outputs:

The evaluation scores are printed to the console.

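Because the scores only appear on the console, it can be convenient to keep a copy on disk. The wrapper below is a minimal sketch using the standard `tee` utility; the function name `run_and_log` and the log file name in the usage note are hypothetical.

```shell
# Hypothetical helper: run any command, show its output on the console,
# and save a copy (stdout and stderr) to the given log file.
# Note: the pipeline's exit status is that of tee, not of the wrapped command.
run_and_log() {
    _log_file="$1"
    shift
    "$@" 2>&1 | tee "${_log_file}"
}
```

For example, `run_and_log bpRNA_test.log bash rna_secondary_structure_prediction.sh test bpRNA /path/to/checkpoint` runs the test command shown earlier and writes everything it prints to `bpRNA_test.log`.
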