File size: 3,345 Bytes
6669ac5 56462eb 88efac5 6669ac5 56462eb 598945b 56462eb f15ce14 70e5ae2 f15ce14 70e5ae2 56462eb 7fb6582 f15ce14 70e5ae2 56462eb 70e5ae2 7fb6582 56462eb 70e5ae2 56462eb 70e5ae2 f15ce14 70e5ae2 7fb6582 f15ce14 70e5ae2 f15ce14 70e5ae2 7fb6582 f15ce14 56462eb 70e5ae2 7fb6582 70e5ae2 f15ce14 56462eb 829782c 9e50aa5 829782c 4591825 829782c 56462eb |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 |
---
base_model:
- genbio-ai/AIDO.RNA-1.6B
license: other
---
# AIDO.RNA-1.6B-CDS
AIDO.RNA-1.6B-CDS is a domain adaptation model on the coding sequences. It was pre-trained on 9 million coding sequences released by Carlos et al. (2024) [1] based on our [AIDO.RNA-1.6B](https://huggingface.co/genbio-ai/AIDO.RNA-1.6B) model.
## How to Use
### Build any downstream models from this backbone with ModelGenerator
For more information, visit: [Model Generator](https://github.com/genbio-ai/modelgenerator)
```bash
mgen fit --model SequenceClassification --model.backbone aido_rna_1b600m_cds --data SequenceClassificationDataModule --data.path <hf_or_local_path_to_your_dataset>
mgen test --model SequenceClassification --model.backbone aido_rna_1b600m_cds --data SequenceClassificationDataModule --data.path <hf_or_local_path_to_your_dataset>
```
### Or use directly in Python
#### Embedding
```python
from modelgenerator.tasks import Embed
model = Embed.from_config({"model.backbone": "aido_rna_1b600m_cds"}).eval()
transformed_batch = model.transform({"sequences": ["ACGT", "AGCT"]})
embedding = model(transformed_batch)
print(embedding.shape)
print(embedding)
```
#### Sequence-level Classification
```python
import torch
from modelgenerator.tasks import SequenceClassification
model = SequenceClassification.from_config({"model.backbone": "aido_rna_1b600m_cds", "model.n_classes": 2}).eval()
transformed_batch = model.transform({"sequences": ["ACGT", "AGCT"]})
logits = model(transformed_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
```
#### Token-level Classification
```python
import torch
from modelgenerator.tasks import TokenClassification
model = TokenClassification.from_config({"model.backbone": "aido_rna_1b600m_cds", "model.n_classes": 3}).eval()
transformed_batch = model.transform({"sequences": ["ACGT", "AGCT"]})
logits = model(transformed_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
```
#### Sequence-level Regression
```python
from modelgenerator.tasks import SequenceRegression
model = SequenceRegression.from_config({"model.backbone": "aido_rna_1b600m_cds"}).eval()
transformed_batch = model.transform({"sequences": ["ACGT", "AGCT"]})
logits = model(transformed_batch)
print(logits)
```
### Get RNA sequence embedding
```python
from genbio_finetune.tasks import Embed
model = Embed.from_config({"model.backbone": "aido_rna_1b600m_cds"}).eval()
transformed_batch = model.transform({"sequences": ["ACGT", "ACGT"]})
embedding = model(transformed_batch)
print(embedding.shape)
print(embedding)
```
## Citation
Please cite AIDO.RNA using the following BibTeX code:
```
@inproceedings{zou_large-scale_2024,
title = {A Large-Scale Foundation Model for RNA Function and Structure Prediction},
url = {https://www.biorxiv.org/content/10.1101/2024.11.28.625345v1},
doi = {10.1101/2024.11.28.625345},
publisher = {bioRxiv},
author = {Zou, Shuxian and Tao, Tianhua and Mahbub, Sazan and Ellington, Caleb N. and Algayres, Robin and Li, Dian and Zhuang, Yonghao and Wang, Hongyi and Song, Le and Xing, Eric P.},
year = {2024},
booktitle = {NeurIPS 2024 Workshop on AI for New Drug Modalities},
}
```
## Reference
1. Carlos Outeiral and Charlotte M Deane. Codon language embeddings provide strong signals for use in protein engineering. Nature Machine Intelligence, 6(2):170–179, 2024. |