smahbub committed on
Commit
5d5f5d5
·
verified ·
1 Parent(s): e05d634

Update README.md

  1. README.md +65 -35
README.md CHANGED
@@ -10,56 +10,86 @@ As with proteins, structure determines RNA function. RNA secondary structure, fo
10
 
11
  We preprocessed and split the datasets (into train, test, and validation splits) in the same way as done in a previous study [RiNALMo](https://doi.org/10.48550/arXiv.2403.00043).
12
 
13
 
14
  #### To finetune [AIDO.RNA-1.6B](https://huggingface.co/genbio-ai/AIDO.RNA-1.6B) on RNA SS:
15
 
16
  - Set the environment variable for ModelGenerator's data directory:
17
- ```
18
- export MGEN_DATA_DIR=~/mgen_data # or any other local directory of your choice
19
- ```
20
 
21
  - Download the preprocessed data (provided as a zip file named `rna_ss_data.zip`) from [here](https://huggingface.co/datasets/genbio-ai/rna-secondary-structure-prediction/blob/main/rna_ss_data.zip). Unzip `rna_ss_data.zip` inside the directory `${MGEN_DATA_DIR}/modelgenerator/datasets/`.
22
 
23
- **Alternatively**, you can simply run the following script to do this:
24
- ```
25
- mkdir -p ${MGEN_DATA_DIR}/modelgenerator/datasets/
26
- wget -P ${MGEN_DATA_DIR}/modelgenerator/datasets/ https://huggingface.co/datasets/genbio-ai/rna-secondary-structure-prediction/resolve/main/rna_ss_data.zip
27
- unzip ${MGEN_DATA_DIR}/modelgenerator/datasets/rna_ss_data.zip -d ${MGEN_DATA_DIR}/modelgenerator/datasets/
28
- ```
 
29
 
30
- You should find two sub-folders containing the preprocessed datasets:
31
- 1. bpRNA: `${MGEN_DATA_DIR}/modelgenerator/datasets/rna_ss_data/bpRNA`
32
- 2. Archive-II: `${MGEN_DATA_DIR}/modelgenerator/datasets/rna_ss_data/archiveII`
33
 
34
  - Then run a finetuning job on either dataset as follows (note that this uses the finetuning scheduler; see [this tutorial](https://github.com/genbio-ai/ModelGenerator/blob/main/docs/docs/tutorials/finetuning_scheduler.md) for details):
35
- 1. To train on bpRNA dataset, run the following command:
36
- ```
37
- bash rna_secondary_structure_prediction.sh train bpRNA
38
- ```
39
- 2. Alternatively, to finetune on Archive-II datasets (for the inter-family generalization experiment discussed in the paper [AIDO.RNA](https://doi.org/10.1101/2024.11.28.625345)), run the following command:
40
- ```
41
- bash rna_secondary_structure_prediction.sh train archiveII_<FamilyName>
42
- ```
43
- Here, `<FamilyName>` is one of the following nine strings (each naming an RNA family in the Archive-II dataset): `5s, 16s, 23s, grp1, srp, telomerase, RNaseP, tmRNA, tRNA`. Following the convention used by [RiNALMo's code repository](https://github.com/lbcb-sci/RiNALMo/tree/main), the chosen `<FamilyName>` is used only as the **test set**, and the remaining families are used for training and validation. An example finetuning run with the `5s` family:
44
- ```
45
- bash rna_secondary_structure_prediction.sh train archiveII_5s
46
- ```
47
- Here, the [AIDO.RNA-1.6B](https://huggingface.co/genbio-ai/AIDO.RNA-1.6B) model will be finetuned using **all other splits except archiveII_5s**.
48
 
49
  #### To test a finetuned checkpoint on RNA SS:
50
  - Finetune [AIDO.RNA-1.6B](https://huggingface.co/genbio-ai/AIDO.RNA-1.6B) as discussed above, **or** download the `model.ckpt` checkpoint from [here](https://huggingface.co/genbio-ai/AIDO.RNA-1.6B-inv-fold).
51
  - Test the checkpoint on the _corresponding dataset_ as follows (replace `/path/to/checkpoint` with the actual path to the finetuned checkpoint):
52
- 1. To test on bpRNA dataset, run the following command:
53
- ```
54
- bash rna_secondary_structure_prediction.sh test bpRNA /path/to/checkpoint
55
- ```
56
- 2. Alternatively, to test on Archive-II datasets, run the following command:
57
- ```
58
- bash rna_secondary_structure_prediction.sh test archiveII_<FamilyName> /path/to/checkpoint
59
- ```
60
- See the previous section for details on `<FamilyName>`.
61
 
62
  #### Outputs:
63
- - The evaluation scores will be printed on the console.
64
 
65
 
 
10
 
11
  We preprocessed and split the datasets (into train, test, and validation splits) in the same way as done in a previous study [RiNALMo](https://doi.org/10.48550/arXiv.2403.00043).
12
 
13
+ ##### Experimental Results
14
+
15
+ The following table reports RNA secondary structure prediction results on the bpRNA test set (bpRNA-TS0).
16
+ | **Model** | **Precision** | **Recall** | **F1-score** |
17
+ |---------------------|---------------|------------|--------------|
18
+ | SPOT-RNA | 0.594 | 0.693 | 0.619 |
19
+ | UFold | 0.607 | 0.741 | 0.654 |
20
+ | RNA-FM | 0.709 | 0.664 | 0.676 |
21
+ | RNAErnie | 0.575 | 0.678 | 0.622 |
22
+ | RiNALMo | 0.784 | 0.730 | 0.747 |
23
+ | **AIDO.RNA (ours)** | **0.815** | **0.769** | **0.783** |
24
+
25
+ The following table reports inter-family generalization for secondary structure prediction on the filtered Archive-II dataset. Each entry is the average F1 score; bold marks the best score within each family.
26
+ | **RNA family** | **AIDO.RNA (ours)** | **RNAstructure** | **CONTRAfold** | **RiNALMo** | **RNA-FM** | **MXfold2** | **UFold** |
27
+ |--------------------|---------------------|------------------|----------------|-------------|------------|-------------|-----------|
28
+ | 5S rRNA | 0.853 | 0.63 | 0.63 | **0.88** | 0.57 | 0.54 | 0.53 |
29
+ | SRP RNA | **0.739** | 0.63 | 0.55 | 0.70 | 0.25 | 0.50 | 0.26 |
30
+ | tRNA | **0.945** | 0.70 | 0.77 | 0.93 | 0.79 | 0.64 | 0.26 |
31
+ | tmRNA | **0.838** | 0.43 | 0.49 | 0.80 | 0.28 | 0.46 | 0.41 |
32
+ | RNase P RNA | **0.804** | 0.55 | 0.63 | 0.80 | 0.31 | 0.51 | 0.41 |
33
+ | Group I Intron | 0.644 | 0.54 | 0.60 | **0.66** | 0.16 | 0.45 | 0.45 |
34
+ | 16S rRNA | **0.795** | 0.57 | 0.58 | 0.74 | 0.14 | 0.55 | 0.41 |
35
+ | Telomerase RNA | 0.085 | 0.50 | 0.54 | 0.12 | 0.07 | 0.34 | **0.80** |
36
+ | 23S rRNA | **0.896** | 0.73 | 0.71 | 0.85 | 0.19 | 0.64 | 0.45 |
37
+ | Average | **0.733** | 0.59 | 0.61 | 0.72 | 0.31 | 0.51 | 0.44 |
38
+
39
+
40
+ #
41
+
42
 
43
  #### To finetune [AIDO.RNA-1.6B](https://huggingface.co/genbio-ai/AIDO.RNA-1.6B) on RNA SS:
44
 
45
  - Set the environment variable for ModelGenerator's data directory:
46
+ `
47
+ export MGEN_DATA_DIR=~/mgen_data # or any other local directory of your choice
48
+ `
49
 
50
  - Download the preprocessed data (provided as a zip file named `rna_ss_data.zip`) from [here](https://huggingface.co/datasets/genbio-ai/rna-secondary-structure-prediction/blob/main/rna_ss_data.zip). Unzip `rna_ss_data.zip` inside the directory `${MGEN_DATA_DIR}/modelgenerator/datasets/`.
51
 
52
+ **Alternatively**, you can simply run the following script to do this:
53
+ ```
54
+ export MGEN_DATA_DIR=~/mgen_data # or any other local directory of your choice
55
+ mkdir -p ${MGEN_DATA_DIR}/modelgenerator/datasets/
56
+ wget -P ${MGEN_DATA_DIR}/modelgenerator/datasets/ https://huggingface.co/datasets/genbio-ai/rna-secondary-structure-prediction/resolve/main/rna_ss_data.zip
57
+ unzip ${MGEN_DATA_DIR}/modelgenerator/datasets/rna_ss_data.zip -d ${MGEN_DATA_DIR}/modelgenerator/datasets/
58
+ ```
59
 
60
+ You should find two sub-folders containing the preprocessed datasets:
61
+ 1. bpRNA: `${MGEN_DATA_DIR}/modelgenerator/datasets/rna_ss_data/bpRNA`
62
+ 2. Archive-II: `${MGEN_DATA_DIR}/modelgenerator/datasets/rna_ss_data/archiveII`
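
As a quick sanity check after unzipping, the two sub-folders listed above can be verified with a small shell function. This is a minimal sketch: `check_rna_ss_data` is a hypothetical helper (not part of ModelGenerator), and it assumes only the directory layout described above.

```shell
# Hypothetical helper: report whether the two expected dataset folders exist.
# Usage: check_rna_ss_data [DATA_DIR]   (defaults to $MGEN_DATA_DIR, then ~/mgen_data)
check_rna_ss_data() {
  local root="${1:-${MGEN_DATA_DIR:-$HOME/mgen_data}}/modelgenerator/datasets/rna_ss_data"
  local sub status=0
  for sub in bpRNA archiveII; do
    if [ -d "${root}/${sub}" ]; then
      echo "found: ${root}/${sub}"
    else
      echo "missing: ${root}/${sub}" >&2
      status=1
    fi
  done
  # Non-zero exit status if either folder is missing.
  return "${status}"
}
```

Calling `check_rna_ss_data` with no argument checks the directory that `MGEN_DATA_DIR` points to.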
63
 
64
  - Then run a finetuning job on either dataset as follows (note that this uses the finetuning scheduler; see [this tutorial](https://github.com/genbio-ai/ModelGenerator/blob/main/docs/docs/tutorials/finetuning_scheduler.md) for details):
65
+ 1. To train on bpRNA dataset, run the following command:
66
+ ```
67
+ bash rna_secondary_structure_prediction.sh train bpRNA
68
+ ```
69
+ 2. Alternatively, to finetune on Archive-II datasets (for the inter-family generalization experiment discussed in the paper [AIDO.RNA](https://doi.org/10.1101/2024.11.28.625345)), run the following command:
70
+ ```
71
+ bash rna_secondary_structure_prediction.sh train archiveII_<FamilyName>
72
+ ```
73
+ Here, `<FamilyName>` is one of the following nine strings (each naming an RNA family in the Archive-II dataset): `5s, 16s, 23s, grp1, srp, telomerase, RNaseP, tmRNA, tRNA`. Following the convention used by [RiNALMo's code repository](https://github.com/lbcb-sci/RiNALMo/tree/main), the chosen `<FamilyName>` is used only as the **test set**, and the remaining families are used for training and validation. An example finetuning run with the `5s` family:
74
+ ```
75
+ bash rna_secondary_structure_prediction.sh train archiveII_5s
76
+ ```
77
+ Here, the [AIDO.RNA-1.6B](https://huggingface.co/genbio-ai/AIDO.RNA-1.6B) model will be finetuned using **all other splits except archiveII_5s**.
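
To cover the whole inter-family experiment, the same command can be generated once per held-out family. The sketch below only prints the nine commands (a dry run); the function name `archiveii_train_commands` is hypothetical, while the script name and family strings are taken from the instructions above.

```shell
# The nine Archive-II family names accepted as <FamilyName>.
families="5s 16s 23s grp1 srp telomerase RNaseP tmRNA tRNA"

# Hypothetical helper: print one finetuning command per held-out family.
# Nothing is executed here; the commands are only written to stdout.
archiveii_train_commands() {
  local fam
  for fam in ${families}; do
    echo "bash rna_secondary_structure_prediction.sh train archiveII_${fam}"
  done
}

archiveii_train_commands
```

Each printed line holds out one family as the test set. Piping the output to `bash` would run the jobs one after another, which may take a long time for a model of this size.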
78
 
79
  #### To test a finetuned checkpoint on RNA SS:
80
  - Finetune [AIDO.RNA-1.6B](https://huggingface.co/genbio-ai/AIDO.RNA-1.6B) as discussed above, **or** download the `model.ckpt` checkpoint from [here](https://huggingface.co/genbio-ai/AIDO.RNA-1.6B-inv-fold).
81
  - Test the checkpoint on the _corresponding dataset_ as follows (replace `/path/to/checkpoint` with the actual path to the finetuned checkpoint):
82
+ 1. To test on bpRNA dataset, run the following command:
83
+ ```
84
+ bash rna_secondary_structure_prediction.sh test bpRNA /path/to/checkpoint
85
+ ```
86
+ 2. Alternatively, to test on Archive-II datasets, run the following command:
87
+ ```
88
+ bash rna_secondary_structure_prediction.sh test archiveII_<FamilyName> /path/to/checkpoint
89
+ ```
90
+ See the previous section for details on `<FamilyName>`.
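
Because the test command depends on a valid checkpoint path, it can help to check the path before launching a job. The following is a minimal sketch, assuming the checkpoint is any existing file or directory; `rna_ss_test_command` is a hypothetical helper that prints the command rather than running it.

```shell
# Hypothetical helper: print the test command for a dataset name,
# but only if the checkpoint path exists (file or directory).
# Usage: rna_ss_test_command <dataset> <checkpoint>
rna_ss_test_command() {
  local dataset="$1" ckpt="$2"
  if [ ! -e "${ckpt}" ]; then
    echo "checkpoint not found: ${ckpt}" >&2
    return 1
  fi
  # Paths containing spaces would need extra quoting before execution.
  echo "bash rna_secondary_structure_prediction.sh test ${dataset} ${ckpt}"
}
```

For example, `rna_ss_test_command archiveII_5s /path/to/checkpoint` prints the Archive-II test command when the path exists and returns a non-zero status otherwise.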
91
 
92
  #### Outputs:
93
+ The evaluation scores will be printed on the console.
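
Since the scores go only to the console, one option is to save a copy of the output while still watching it. This is a sketch using the standard `tee` utility; `run_and_log` and the log file name are hypothetical and not part of the provided script.

```shell
# Hypothetical helper: run a command, show its output, and save a copy.
# Usage: run_and_log <logfile> <command> [args...]
run_and_log() {
  local log="$1"
  shift
  # Merge stderr into stdout so both end up in the log file.
  # Without `set -o pipefail`, the exit status is that of tee,
  # not of the wrapped command.
  "$@" 2>&1 | tee "${log}"
}
```

For example, `run_and_log bpRNA_test.log bash rna_secondary_structure_prediction.sh test bpRNA /path/to/checkpoint` keeps the printed evaluation scores in `bpRNA_test.log`.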
94
 
95