---
base_model:
- genbio-ai/AIDO.RNA-1.6B
---
# AIDO.RNA-1.6B-CDS
AIDO.RNA-1.6B-CDS is a domain-adapted version of our [AIDO.RNA-1.6B](https://huggingface.co/genbio-ai/AIDO.RNA-1.6B) model for coding sequences (CDS). Starting from AIDO.RNA-1.6B, it was further pre-trained on 9 million coding sequences released by Outeiral and Deane (2024) [1].
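Because the model was adapted on coding sequences, it may be worth checking that real inputs are in-frame before passing them to the model. The following is a minimal sketch: `split_codons` is a hypothetical helper, not part of ModelGenerator, and it assumes DNA-alphabet input (`A`, `C`, `G`, `T`) as in the examples in this README. It does not reflect how the model tokenizes sequences internally.

```python
VALID_BASES = set("ACGT")

def split_codons(cds: str) -> list[str]:
    """Split a coding sequence into codons, raising ValueError if it is malformed."""
    # Assumption: RNA-alphabet input is accepted and mapped to the DNA alphabet.
    seq = cds.strip().upper().replace("U", "T")
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - VALID_BASES
    if unknown:
        raise ValueError(f"unexpected characters: {sorted(unknown)}")
    if len(seq) % 3 != 0:
        raise ValueError(f"length {len(seq)} is not a multiple of 3")
    # Walk the sequence in steps of three bases.
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]

codons = split_codons("ATGGCCTAA")
print(codons)  # ['ATG', 'GCC', 'TAA']
```

The four-base toy inputs used in the usage examples in this README would not pass this check; they only demonstrate the API.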
## How to Use
### Build any downstream model from this backbone
#### Get RNA sequence embedding
```python
from genbio_finetune.tasks import Embed

# Load the CDS-adapted backbone and switch to inference mode
model = Embed.from_config({"model.backbone": "aido_rna_1b600m_cds"}).eval()
# Convert the raw sequences into a model-ready batch
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
embedding = model(collated_batch)
print(embedding.shape)
print(embedding)
```
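If a single vector per sequence is needed, one common option is masked mean pooling over the token dimension. The sketch below assumes the embedding has shape `(batch, seq_len, hidden)` and that a boolean padding mask is available; both are assumptions to verify against `embedding.shape` and the collated batch. `mean_pool` is a hypothetical helper, and a random tensor stands in for the model output so the snippet runs without downloading weights.

```python
import torch

def mean_pool(embedding: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions.

    embedding: (batch, seq_len, hidden); mask: (batch, seq_len), True for real tokens.
    """
    weights = mask.unsqueeze(-1).to(embedding.dtype)  # (batch, seq_len, 1)
    summed = (embedding * weights).sum(dim=1)         # (batch, hidden)
    counts = weights.sum(dim=1).clamp(min=1.0)        # avoid division by zero
    return summed / counts

# Stand-in for `model(collated_batch)`: 2 sequences, 5 positions, 8 features.
embedding = torch.randn(2, 5, 8)
# Assumed padding mask: the second sequence is shorter than the first.
mask = torch.tensor([[True, True, True, True, False],
                     [True, True, True, False, False]])
pooled = mean_pool(embedding, mask)
print(pooled.shape)  # torch.Size([2, 8])
```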
#### Sequence-level regression
```python
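from genbio_finetune.tasks import SequenceRegression

# Backbone plus a regression head; eval() switches to inference mode
model = SequenceRegression.from_config({"model.backbone": "aido_rna_1b600m_cds"}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
# Regression outputs are continuous values, one prediction per sequence
predictions = model(collated_batch)
print(predictions)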
```
#### Sequence-level classification
```python
import torch
from genbio_finetune.tasks import SequenceClassification

# Backbone plus a 2-class classification head
model = SequenceClassification.from_config({"model.backbone": "aido_rna_1b600m_cds", "model.n_classes": 2}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
# Predicted class index for each sequence
print(torch.argmax(logits, dim=-1))
```
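The `argmax` above gives a hard label per sequence. To inspect class probabilities instead, the logits can be passed through a softmax. This sketch assumes logits of shape `(batch, n_classes)`; the values are made up for illustration and do not come from the model.

```python
import torch

# Made-up logits standing in for `model(collated_batch)`; shape (batch, n_classes).
logits = torch.tensor([[2.0, 0.0],
                       [-1.0, 1.0]])

probs = torch.softmax(logits, dim=-1)    # each row sums to 1
predicted = torch.argmax(probs, dim=-1)  # same result as argmax over the logits
confidence = probs.max(dim=-1).values    # probability of the predicted class

print(probs)
print(predicted)
print(confidence)
```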
#### Token-level classification
```python
import torch
from genbio_finetune.tasks import TokenClassification

# Backbone plus a 3-class head applied at every token position
model = TokenClassification.from_config({"model.backbone": "aido_rna_1b600m_cds", "model.n_classes": 3}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
# Predicted class index for each token
print(torch.argmax(logits, dim=-1))
```
#### Or use our one-liner CLI to finetune or evaluate any of the above!
```
mgen fit --model SequenceClassification --model.backbone aido_rna_1b600m_cds --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
mgen test --model SequenceClassification --model.backbone aido_rna_1b600m_cds --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
```
For more information, visit: [ModelGenerator](https://github.com/genbio-ai/modelgenerator)
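When several runs need to be launched from a script, the two CLI commands above can be assembled programmatically. The sketch below only builds and prints the command strings using the flags shown in this README; `mgen_command` is a hypothetical helper and `my_dataset` is a placeholder path. Passing the returned list to `subprocess.run(cmd, check=True)` would execute it.

```python
import shlex

def mgen_command(stage: str, model: str, data: str, backbone: str, data_path: str) -> list[str]:
    """Build an `mgen` argument list using only the flags shown in this README."""
    if stage not in {"fit", "test"}:
        raise ValueError(f"unsupported stage: {stage}")
    return [
        "mgen", stage,
        "--model", model,
        "--model.backbone", backbone,
        "--data", data,
        "--data.path", data_path,
    ]

for stage in ("fit", "test"):
    cmd = mgen_command(
        stage,
        model="SequenceClassification",
        data="SequenceClassification",
        backbone="aido_rna_1b600m_cds",
        data_path="my_dataset",  # placeholder, replace with a real dataset path
    )
    print(shlex.join(cmd))
```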
## Citation
Please cite AIDO.RNA using the following BibTeX entry:
```
@inproceedings{zou2024a,
title={A Large-Scale Foundation Model for {RNA} Function and Structure Prediction},
author={Shuxian Zou and Tianhua Tao and Sazan Mahbub and Caleb Ellington and Robin Jonathan Algayres and Dian Li and Yonghao Zhuang and Hongyi Wang and Le Song and Eric P. Xing},
booktitle={NeurIPS 2024 Workshop on AI for New Drug Modalities},
year={2024},
url={https://openreview.net/forum?id=Gzo3JMPY8w}
}
```
## Reference
1. Carlos Outeiral and Charlotte M Deane. Codon language embeddings provide strong signals for use in protein engineering. Nature Machine Intelligence, 6(2):170–179, 2024.