---
license: other
---
# AIDO.Protein-RAG-3B

AIDO.Protein-RAG-3B (AIDO.RAGPLM) is a pretrained Retrieval-Augmented protein language model within an [AI-driven Digital Organism](https://arxiv.org/abs/2412.06993) framework. This model, along with [AIDO.RAGFold](https://www.biorxiv.org/content/10.1101/2024.12.02.626519v1), integrates pretrained protein language models with retrieved Multiple Sequence Alignments (MSA), enabling the incorporation of co-evolutionary information for structure prediction while compensating for limited MSA data through large-scale pretraining.

AIDO.Protein-RAG-3B outperforms single-sequence protein language models in perplexity, contact prediction, and fitness prediction. When used as a feature extractor for structure prediction in [AIDO.RAGFold](https://www.biorxiv.org/content/10.1101/2024.12.02.626519v1), it achieves TM-scores comparable to AlphaFold2 when sufficient MSA data is available (at an 8x faster runtime), and significantly surpasses AlphaFold2 in MSA-limited scenarios (∆TM-score = 0.379, 0.116, and 0.059 for 0, 5, and 10 input sequences, respectively).

## Model Architecture

AIDO.Protein-RAG-3B employs a transformer encoder-only architecture with dense MLP layers in each block (Panel **c** below). The model uses single amino acid tokenization and is optimized via masked language modeling (MLM).

<center><img src="architecture.png" alt="An Overview of AIDO.Protein" style="width:90%; height:auto;" /></center>

More architecture details are shown below:

| Hyperparameter | Value |
| --------------- | ----- |
| FFN Hidden Size | 6832 |
| Context Length | 12.8K |

## Pre-training

### Data Preparation

**UniRef50/UniClust30 MSA dataset**: We utilized sequences from UniRef50 as queries to search for homologous sequences in UniClust30, subsequently constructing multiple sequence alignments (MSAs). UniRef50 comprises a total of 53.6 million sequences. Using HHblits, we searched all sequences, identifying over 25 homologous sequences for 23.7 million of them. This dataset was directly used as the training set, referred to as `HHblits_MSA`. The remaining 29.9 million sequences were input into MSA Retriever, resulting in 7.7 million sequences with more than 25 homologous sequences. This dataset was designated as `Retriever_MSA`. During training, RAGPLM randomly sampled from the two datasets with probabilities of 0.75 and 0.25, respectively.
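
As an illustration of this sampling scheme, the sketch below draws each training example from `HHblits_MSA` with probability 0.75 and from `Retriever_MSA` with probability 0.25. The dataset objects are hypothetical stand-ins; the real datasets hold millions of MSAs.

```python
import random

# Hypothetical stand-ins for the two MSA training sets described above.
hhblits_msa = ['msa_a', 'msa_b']    # ~23.7M entries in practice
retriever_msa = ['msa_c', 'msa_d']  # ~7.7M entries in practice

def sample_training_msa():
    """Pick HHblits_MSA with p=0.75, otherwise Retriever_MSA, then draw one MSA."""
    source = hhblits_msa if random.random() < 0.75 else retriever_msa
    return random.choice(source)

batch = [sample_training_msa() for _ in range(8)]
```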

### Unsupervised contact prediction

<center><img src="unsupervised_contact_prediction.png" alt="unsupervised_contact_prediction" style="width:90%; height:auto;" /></center>

### Supervised downstream tasks

<center><img src="supervised_tasks.png" alt="supervised_tasks" style="width:90%; height:auto;" /></center>

## How to Use

### Build Downstream Models Using ModelGenerator

For more information, visit: [Model Generator](https://github.com/genbio-ai/modelgenerator)

```bash
mgen fit --model SequenceClassification --model.backbone aido_protein_rag_3b --data SequenceClassificationDataModule --data.path <hf_or_local_path_to_your_dataset>
mgen test --model SequenceClassification --model.backbone aido_protein_rag_3b --data SequenceClassificationDataModule --data.path <hf_or_local_path_to_your_dataset>
```
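
Here, `mgen fit` trains a SequenceClassification model on the `aido_protein_rag_3b` backbone with your dataset, and `mgen test` evaluates the trained model; `<hf_or_local_path_to_your_dataset>` may be either a Hugging Face dataset identifier or a local path.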

### Use Directly in Python

#### Embedding

```python
import random

import numpy as np
import torch

from modelgenerator.tasks import Embed

model = Embed.from_config({"model.backbone": "aido_protein_rag_3b"}).eval()
model.backbone.max_length = 12800

# Toy input: one random 50-residue query, a 25-sequence MSA ('-' marks gaps),
# and per-residue structure embeddings of dimension 384.
restypes = 'ARNDCQEGHILKMFPSTWYV'
data = {
    'sequences': [''.join(random.choice(restypes) for _ in range(50))],
    'msa': [[''.join(random.choice(restypes + '-') for _ in range(50)) for _ in range(25)]],
    'str_emb': np.random.normal(size=(1, 50, 384)),
}

transformed_batch = model.transform(data)
with torch.no_grad():
    embedding = model(transformed_batch)
```
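
In this input format, `sequences` holds the query protein(s), `msa` holds one list of aligned homologous sequences per query (gaps as `-`), and `str_emb` appears to supply per-residue structure embeddings of shape `(batch, length, 384)`. The random values above are placeholders, so substitute real MSAs (e.g., from MSA Retriever) to obtain meaningful embeddings.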

#### Sequence Level Classification

```python
import random

import numpy as np
import torch

from modelgenerator.tasks import SequenceClassification

model = SequenceClassification.from_config({"model.backbone": "aido_protein_rag_3b", "model.n_classes": 2}).eval()
model.backbone.max_length = 12800

# Toy input in the same format as the embedding example above.
restypes = 'ARNDCQEGHILKMFPSTWYV'
data = {
    'sequences': [''.join(random.choice(restypes) for _ in range(50))],
    'msa': [[''.join(random.choice(restypes + '-') for _ in range(50)) for _ in range(25)]],
    'str_emb': np.random.normal(size=(1, 50, 384)),
}

transformed_batch = model.transform(data)
with torch.no_grad():
    logits = model(transformed_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
```

#### Token Level Classification

```python
import random

import numpy as np
import torch

from modelgenerator.tasks import TokenClassification

model = TokenClassification.from_config({"model.backbone": "aido_protein_rag_3b", "model.n_classes": 3}).eval()
model.backbone.max_length = 12800

# Toy input in the same format as the embedding example above.
restypes = 'ARNDCQEGHILKMFPSTWYV'
data = {
    'sequences': [''.join(random.choice(restypes) for _ in range(50))],
    'msa': [[''.join(random.choice(restypes + '-') for _ in range(50)) for _ in range(25)]],
    'str_emb': np.random.normal(size=(1, 50, 384)),
}

transformed_batch = model.transform(data)
with torch.no_grad():
    logits = model(transformed_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
```

#### Sequence Level Regression

```python
import random

import numpy as np
import torch

from modelgenerator.tasks import SequenceRegression

model = SequenceRegression.from_config({"model.backbone": "aido_protein_rag_3b"}).eval()
model.backbone.max_length = 12800

# Toy input in the same format as the embedding example above.
restypes = 'ARNDCQEGHILKMFPSTWYV'
data = {
    'sequences': [''.join(random.choice(restypes) for _ in range(50))],
    'msa': [[''.join(random.choice(restypes + '-') for _ in range(50)) for _ in range(25)]],
    'str_emb': np.random.normal(size=(1, 50, 384)),
}

transformed_batch = model.transform(data)
with torch.no_grad():
    logits = model(transformed_batch)
print(logits)
```