probablybots committed · Commit 70e5ae2 · verified · 1 Parent(s): 5f58132

Update README.md

Files changed (1)
  1. README.md +36 -31
README.md CHANGED
@@ -7,68 +7,73 @@ base_model:
  AIDO.RNA-1.6B-CDS is a domain adaptation model on the coding sequences. It was pre-trained on 9 million coding sequences released by Carlos et al. (2024) [1] based on our [AIDO.RNA-1.6B](https://huggingface.co/genbio-ai/AIDO.RNA-1.6B) model.

  ## How to Use
- ### Build any downstream models from this backbone
- #### Get RNA sequence embedding
  ```python
- from genbio_finetune.tasks import Embed
  model = Embed.from_config({"model.backbone": "aido_rna_1b600m_cds"}).eval()
  collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
  embedding = model(collated_batch)
  print(embedding.shape)
  print(embedding)
  ```
-
- #### Sequence-level regression
  ```python
- from genbio_finetune.tasks import SequenceRegression
- model = SequenceRegression.from_config({"model.backbone": "aido_rna_1b600m_cds"}).eval()
  collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
  logits = model(collated_batch)
  print(logits)
  ```
-
- #### Sequence-level classification
  ```python
  import torch
- from genbio_finetune.tasks import SequenceClassification
- model = SequenceClassification.from_config({"model.backbone": "aido_rna_1b600m_cds", "model.n_classes": 2}).eval()
  collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
  logits = model(collated_batch)
  print(logits)
  print(torch.argmax(logits, dim=-1))
  ```
-
- #### Token-level classification
  ```python
- import torch
- from genbio_finetune.tasks import TokenClassification
- model = TokenClassification.from_config({"model.backbone": "aido_rna_1b600m_cds", "model.n_classes": 3}).eval()
  collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
  logits = model(collated_batch)
  print(logits)
- print(torch.argmax(logits, dim=-1))
  ```

- #### Or use our one-liner CLI to finetune or evaluate any of the above!
- ```
- mgen fit --model SequenceClassification --model.backbone aido_rna_1b600m_cds --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
- mgen test --model SequenceClassification --model.backbone aido_rna_1b600m_cds --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
  ```
- For more information, visit: [ModelGenerator](https://github.com/genbio-ai/modelgenerator)
-

  ## Citation
  Please cite AIDO.RNA using the following BibTeX code:
  ```
- @inproceedings{
- zou2024a,
- title={A Large-Scale Foundation Model for {RNA} Function and Structure Prediction},
- author={Shuxian Zou and Tianhua Tao and Sazan Mahbub and Caleb Ellington and Robin Jonathan Algayres and Dian Li and Yonghao Zhuang and Hongyi Wang and Le Song and Eric P. Xing},
- booktitle={NeurIPS 2024 Workshop on AI for New Drug Modalities},
- year={2024},
- url={https://openreview.net/forum?id=Gzo3JMPY8w}
  }
- ```

  ## Reference
  1. Carlos Outeiral and Charlotte M Deane. Codon language embeddings provide strong signals for use in protein engineering. Nature Machine Intelligence, 6(2):170–179, 2024.
 
  AIDO.RNA-1.6B-CDS is a domain adaptation model on the coding sequences. It was pre-trained on 9 million coding sequences released by Carlos et al. (2024) [1] based on our [AIDO.RNA-1.6B](https://huggingface.co/genbio-ai/AIDO.RNA-1.6B) model.

  ## How to Use
+ ### Build any downstream models from this backbone with ModelGenerator
+ For more information, visit: [Model Generator](https://github.com/genbio-ai/modelgenerator)
+ ```bash
+ mgen fit --model SequenceClassification --model.backbone aido_rna_1b600m_cds --data SequenceClassificationDataModule --data.path <hf_or_local_path_to_your_dataset>
+ mgen test --model SequenceClassification --model.backbone aido_rna_1b600m_cds --data SequenceClassificationDataModule --data.path <hf_or_local_path_to_your_dataset>
+ ```
+
+ ### Or use directly in Python
+ #### Embedding
  ```python
+ from modelgenerator.tasks import Embed
  model = Embed.from_config({"model.backbone": "aido_rna_1b600m_cds"}).eval()
  collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
  embedding = model(collated_batch)
  print(embedding.shape)
  print(embedding)
  ```
+ #### Sequence-level Classification
  ```python
+ import torch
+ from modelgenerator.tasks import SequenceClassification
+ model = SequenceClassification.from_config({"model.backbone": "aido_rna_1b600m_cds", "model.n_classes": 2}).eval()
  collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
  logits = model(collated_batch)
  print(logits)
+ print(torch.argmax(logits, dim=-1))
  ```
+ #### Token-level Classification
  ```python
  import torch
+ from modelgenerator.tasks import TokenClassification
+ model = TokenClassification.from_config({"model.backbone": "aido_rna_1b600m_cds", "model.n_classes": 3}).eval()
  collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
  logits = model(collated_batch)
  print(logits)
  print(torch.argmax(logits, dim=-1))
  ```
+ #### Sequence-level Regression
  ```python
+ from modelgenerator.tasks import SequenceRegression
+ model = SequenceRegression.from_config({"model.backbone": "aido_rna_1b600m_cds"}).eval()
  collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
  logits = model(collated_batch)
  print(logits)
  ```

+ ### Get RNA sequence embedding
+ ```python
+ from genbio_finetune.tasks import Embed
+ model = Embed.from_config({"model.backbone": "aido_rna_1b600m_cds"}).eval()
+ collated_batch = model.collate({"sequences": ["ACGT", "ACGT"]})
+ embedding = model(collated_batch)
+ print(embedding.shape)
+ print(embedding)
  ```

  ## Citation
  Please cite AIDO.RNA using the following BibTeX code:
  ```
+ @misc{zou_large-scale_2024,
+ title = {A Large-Scale Foundation Model for RNA Function and Structure Prediction},
+ url = {https://www.biorxiv.org/content/10.1101/2024.11.28.625345v1},
+ doi = {10.1101/2024.11.28.625345},
+ publisher = {bioRxiv},
+ author = {Zou, Shuxian and Tao, Tianhua and Mahbub, Sazan and Ellington, Caleb N. and Algayres, Robin and Li, Dian and Zhuang, Yonghao and Wang, Hongyi and Song, Le and Xing, Eric P.},
+ year = {2024},
  }

  ## Reference
  1. Carlos Outeiral and Charlotte M Deane. Codon language embeddings provide strong signals for use in protein engineering. Nature Machine Intelligence, 6(2):170–179, 2024.
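The Python snippets in this README pass `model.collate` a dict of the form `{"sequences": [...]}` holding two toy sequences. For a longer list of sequences, one option is to cut the list into fixed-size chunks first. The following is only a minimal sketch of that idea in plain Python: `iter_sequence_batches` and `batch_size` are hypothetical names, not part of ModelGenerator, and stripping whitespace and skipping empty entries are assumptions about what input cleaning is wanted.

```python
from typing import Dict, Iterable, Iterator, List


def iter_sequence_batches(
    sequences: Iterable[str], batch_size: int = 8
) -> Iterator[Dict[str, List[str]]]:
    """Yield {"sequences": [...]} dicts holding at most `batch_size` entries.

    Hypothetical helper for illustration; it only mirrors the dict shape
    used in the README snippets and does not call any model code.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    batch: List[str] = []
    for seq in sequences:
        cleaned = seq.strip()  # assumption: surrounding whitespace is noise
        if not cleaned:
            continue  # assumption: empty entries should be skipped
        batch.append(cleaned)
        if len(batch) == batch_size:
            yield {"sequences": batch}
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield {"sequences": batch}


if __name__ == "__main__":
    toy_sequences = ["ACGT", " AGCT ", "", "ACGTAC"]
    for payload in iter_sequence_batches(toy_sequences, batch_size=2):
        print(payload)
```

Each yielded dict has the same shape as the argument given to `model.collate` in the snippets, so it could be passed there one batch at a time. Whether the backbone accepts batches of a given size depends on available memory, which this sketch does not address.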