Update README.md
README.md

---
library_name: transformers
tags:
- biology
- bacteria
- prokaryotes
- genomics
- protein
- plm
- cplm
license: apache-2.0
metrics:
- accuracy
---

# bacformer-causal-MAG

Bacformer is a foundational model for bacterial genomics, modeling the whole bacterial genome as a sequence of proteins.
Bacformer takes as input the set of proteins present in a genome, **ordered** by their positions on the chromosome and plasmid(s), and
computes contextual protein representations with a transformer that conditions each protein on every other protein in the genome.
Each protein is treated as a token; thus, Bacformer may be viewed as a **contextualized protein language model** that captures the protein–protein interactions
underlying an organism’s phenotype.

Bacformer is pretrained to predict a masked (or next) protein *family* given the remaining proteins in the genome.
The base model was trained on ~1.3M diverse bacterial genomes comprising ~3B protein sequences and can be adapted to a wide range of downstream tasks.
See the [Bacformer models](https://huggingface.co/collections/macwiatrak/bacformer-681a17d6a77a928a1531def2) collection for all available checkpoints.

A key member of this collection is **bacformer-causal-MAG**, a 27M-parameter base transformer model trained on ~1.3M
diverse metagenome-assembled genomes across more than 20k bacterial species and multiple environments.

All Bacformer variants embed protein sequences with a base protein language model by averaging the amino-acid token embeddings across each sequence.
Bacformer uses [ESM-2 t12 35M](https://huggingface.co/facebook/esm2_t12_35M_UR50D) as its base model.

- **Developed by:** University of Cambridge (Floto Lab) & EPFL (Brbić Lab), led by Maciej Wiatrak
- **License:** Apache 2.0

### Model Sources

- **Repository:** [https://github.com/macwiatrak/Bacformer](https://github.com/macwiatrak/Bacformer)
- **Paper:** TBA

## Usage

Install the `bacformer` package (see [https://github.com/macwiatrak/Bacformer](https://github.com/macwiatrak/Bacformer)). An end-to-end Python example demonstrating how to embed a genome with Bacformer is provided in the *tutorials* folder.

The snippet below shows how to embed multiple protein sequences with `Bacformer`.

```python
import torch
from transformers import AutoModel
from bacformer.pp import protein_seqs_to_bacformer_inputs

device = "cuda:0"
model = AutoModel.from_pretrained("macwiatrak/bacformer-causal-MAG", trust_remote_code=True).to(device).eval().to(torch.bfloat16)

# Example input: a sequence of protein sequences
# in this case: 4 toy protein sequences
# Bacformer was trained with a maximum number of proteins of 6,000.
protein_sequences = [
    "MGYDLVAGFQKNVRTI",
    "MKAILVVLLG",
    "MQLIESRFYKDPWGNVHATC",
    "MSTNPKPQRFAWL",
]
# embed the proteins with ESM-2 to get average protein embeddings
inputs = protein_seqs_to_bacformer_inputs(
    protein_sequences,
    device=device,
    batch_size=128,  # the batch size for computing the protein embeddings
    max_n_proteins=6000,  # the maximum number of proteins Bacformer was trained with
)

# move the inputs to the device
inputs = {k: v.to(device) for k, v in inputs.items()}

# compute contextualized protein embeddings with Bacformer
with torch.no_grad():
    outputs = model(**inputs, return_dict=True)

print('last hidden state shape:', outputs["last_hidden_state"].shape)  # (batch_size, max_length, hidden_size)
print('genome embedding:', outputs.last_hidden_state.mean(dim=1).shape)  # (batch_size, hidden_size)
```

### Tutorials

We include a number of tutorials; see the [tutorials](https://github.com/macwiatrak/Bacformer) folder in the GitHub repository.

## Training Details

### Training Data

**bacformer-causal-MAG** was pretrained on the **MAG corpus** (≈1.3M metagenome-assembled genomes).
The MAG set maximises environmental and taxonomic diversity and contains roughly 3B protein sequences.

### Training Procedure

#### Preprocessing

Each genome is represented as an ordered list of proteins.
Translated protein sequences were obtained from annotations or translated de novo when necessary.
Proteins were ordered by genomic coordinates; for MAGs, proteins within each contig were ordered, and the contig order was randomised in every epoch.

The set of protein sequences in the genome is embedded with the base protein language model ([ESM-2 t12 35M](https://huggingface.co/facebook/esm2_t12_35M_UR50D)).
The protein embeddings, computed by averaging the amino-acid tokens in a protein sequence, are used as protein tokens.
Finally, the protein tokens are fed into a transformer model.

During pretraining, we limit the maximum number of proteins in a genome to `6,000`, which covers the whole bacterial genome for `>98%` of genomes present in our training corpus.
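
For intuition, the mean-pooling step can be sketched as follows. This is a simplified illustration rather than the `bacformer` package implementation (the actual preprocessing is handled by `protein_seqs_to_bacformer_inputs` shown in the Usage section), and the helper name `embed_proteins` is hypothetical.

```python
# Minimal sketch (not the bacformer package code): mean-pool ESM-2 amino-acid
# embeddings into one 480-dim vector per protein.
import torch
from transformers import AutoModel, AutoTokenizer

esm_name = "facebook/esm2_t12_35M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(esm_name)
esm = AutoModel.from_pretrained(esm_name).eval()

@torch.no_grad()
def embed_proteins(protein_seqs: list[str]) -> torch.Tensor:
    batch = tokenizer(protein_seqs, padding=True, return_tensors="pt")
    hidden = esm(**batch).last_hidden_state  # (n_proteins, seq_len, 480)
    mask = batch["attention_mask"].unsqueeze(-1)  # ignore padding positions
    # average the token embeddings of each sequence -> one protein token per protein
    # (special tokens are included here for simplicity)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (n_proteins, 480)

protein_tokens = embed_proteins(["MGYDLVAGFQKNVRTI", "MKAILVVLLG"])
print(protein_tokens.shape)  # torch.Size([2, 480])
```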

### Pretraining

#### Training objective

The model was optimised to predict the next protein given the previous proteins present in the genome. As the number of possible proteins is effectively unbounded, we assigned each protein a discrete protein family index by performing
unsupervised clustering on a set of proteins, resulting in `50k` distinct protein family clusters.
Importantly, the input to Bacformer consists of the exact protein sequences; the discrete protein family labels are only used in the final classification layer when predicting
the protein family of masked (or next) proteins. This allows the model to work on amino-acid-level tasks where even single mutations can change the phenotype of a genome.

The causal pretraining objective effectively predicts the protein family ID of the next protein given the embedding of the previous protein in the genome sequence.
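
As a rough sketch of this objective (assumed details, not the released training code), the causal loss can be written as a shifted cross-entropy over protein-family logits; the tensor shapes and the `-100` ignore index below are illustrative.

```python
# Illustrative sketch of the causal protein-family objective.
import torch
import torch.nn.functional as F

num_protein_families = 50_000  # from unsupervised clustering
hidden_dim = 480

# contextual protein embeddings from the transformer: (batch, n_proteins, hidden_dim)
hidden_states = torch.randn(2, 10, hidden_dim)
# protein family label per position; -100 would mark positions excluded from the loss
family_labels = torch.randint(0, num_protein_families, (2, 10))

# classification head mapping each contextual embedding to family logits
head = torch.nn.Linear(hidden_dim, num_protein_families)
logits = head(hidden_states)

# shift: position i predicts the family of protein i + 1
shift_logits = logits[:, :-1, :].reshape(-1, num_protein_families)
shift_labels = family_labels[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
print(loss.item())
```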

#### Pretraining details

The initial pretraining on MAGs was performed on 4 A100 80GB NVIDIA GPUs, with an effective batch size of 32. The maximum sequence length of each protein
was set to `1,024` and the maximum number of proteins in a genome was set to `6,000`, which covers the whole bacterial genome for `>98%` of genomes present in our training corpus.
The Adam optimizer [1] was used with a linear warmup learning rate schedule, with the number of warmup steps equal to `7,500`. A base learning rate of `0.00015` was used,
scaled by the square root of the number of GPUs (`lr = args.lr * np.sqrt(max(n_gpus, 1))`). We monitor the loss on the validation set as the measure of performance during training.
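
A minimal sketch of this optimizer setup under the stated hyperparameters (the scheduler construction below is illustrative, not the exact training script):

```python
# Illustrative optimizer + linear-warmup setup (assumed details).
import numpy as np
import torch

base_lr = 0.00015
n_gpus = 4
warmup_steps = 7_500

# scale the base learning rate by the square root of the number of GPUs
lr = base_lr * np.sqrt(max(n_gpus, 1))

model = torch.nn.Linear(480, 50_000)  # stand-in for the Bacformer model
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

# linear warmup from 0 to the target learning rate over `warmup_steps` steps
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)
```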

### Architecture

#### Input embeddings

The input embeddings are created by adding together 1) protein embeddings from a base protein language model (pLM), [ESM-2 t12 35M](https://huggingface.co/facebook/esm2_t12_35M_UR50D), and
2) contig (token type) embeddings. The protein embeddings (1) are created by embedding a protein sequence with the pretrained base pLM and taking the average of all amino-acid
tokens in the sequence, resulting in a `D`-dimensional vector. We embed all of the `N` proteins present in the whole bacterial genome, resulting in an `N x D` matrix,
where `D` is the dimension of the base pLM, here `480`. The protein embeddings are added together with the contig embeddings (2). The contig embeddings
are learnable embeddings which represent the unique contigs present in the genome. As an example, if a genome is made up of `K` contigs, each containing a number of proteins,
each protein within the same contig will have the same contig embedding, which is different from the embeddings of the other contigs.
Contig embeddings were introduced to account for the fact that bacterial genomes are often made up of a chromosome and plasmid(s) and are frequently collated by combining
multiple contigs together (metagenome-assembled genomes).
The contig embeddings are initialised and trained from scratch at the beginning of pretraining.
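
A minimal sketch of how such input embeddings are assembled (illustrative only; the variable names and the maximum contig count below are assumptions, not the Bacformer code):

```python
# Illustrative sketch of the input-embedding construction.
import torch
import torch.nn as nn

D = 480            # dimension of the base pLM (ESM-2 t12 35M)
max_contigs = 64   # assumed upper bound on contigs per genome

contig_embedding = nn.Embedding(max_contigs, D)  # learnable, trained from scratch

# N mean-pooled protein embeddings from the base pLM, in genomic order: (N, D)
protein_embeddings = torch.randn(7, D)
# contig index of each protein, e.g. proteins 0-3 on contig 0, proteins 4-6 on contig 1
contig_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1])

# input embeddings = protein embeddings + contig (token type) embeddings
input_embeddings = protein_embeddings + contig_embedding(contig_ids)
print(input_embeddings.shape)  # torch.Size([7, 480])
```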

Genomic organisation is highly important for bacteria; to model it, we employ **rotary positional embeddings** [2].

The input embeddings are ordered by their position on the contig/chromosome/plasmid. This [...]. Additionally, we include special tokens: 1) a `[CLS]` token
at the start of the sequence, 2) a `[SEP]` token between the contigs or chromosome(s)/plasmid(s), and 3) an `[END]` token at the end of the genome. The examples below show what
the genome representation looks like for complete genomes and MAGs.

**Complete genomes**
```
[CLS] [chromosome1_gene1] [chromosome1_gene2] ... [chromosome1_geneN] [SEP] [plasmid1_gene1] ... [plasmid1_geneM] [END]
```

**Metagenome-assembled genome (MAG)**
```
[CLS] [contig1_gene1] ... [contig1_geneN] [SEP] [contig2_gene1] ... [contig2_geneM] [SEP] ... [contigZ_geneV] [END]
```

#### Transformer backbone

The input embeddings are fed into a transformer, which computes protein representations conditional on the other proteins present in the genome
by computing self-attention between them, resulting in **contextual protein embeddings**.

The transformer is a 12-layer transformer with `hidden_dim=480` and is trained from scratch. Bacformer leverages the flash attention available in `pytorch>=2.2`.
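
For intuition, a comparable backbone could be configured as below. This is an illustrative stand-in, not the Bacformer architecture definition: the attention-head count and feed-forward width are assumptions, and PyTorch's built-in encoder layers dispatch to flash/SDPA attention kernels where available.

```python
# Illustrative 12-layer backbone with hidden size 480 (assumed head count and
# feed-forward size; not the actual Bacformer architecture definition).
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=480,           # hidden_dim used by Bacformer
    nhead=8,               # assumption: head count must divide 480
    dim_feedforward=1920,  # assumption: 4 x hidden_dim
    batch_first=True,
)
backbone = nn.TransformerEncoder(encoder_layer, num_layers=12)

# contextualize a genome of 7 protein tokens; the causal variant masks future proteins
input_embeddings = torch.randn(1, 7, 480)
causal_mask = nn.Transformer.generate_square_subsequent_mask(7)
contextual_protein_embeddings = backbone(input_embeddings, mask=causal_mask)
print(contextual_protein_embeddings.shape)  # torch.Size([1, 7, 480])
```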

#### Pretraining classification head

Bacformer is pretrained to predict the masked or next protein family based on the other proteins present in the genome. Given the last-hidden-state embedding of
the masked protein (or of the previous protein, for the causal objective), the classification head predicts the protein family. We predict a protein family, rather than a protein, because the space of
possible proteins is effectively unbounded. To get a discrete vocabulary of proteins, we assigned each protein a discrete protein family index by performing
unsupervised clustering on a set of proteins, resulting in `50k` distinct protein family clusters. Importantly, the input to Bacformer consists of the
exact protein sequences present in the whole bacterial genome, rather than protein family tokens. This allows the model to work on amino-acid-level tasks where even single mutations
can change the phenotype of a genome, while still allowing for pretraining.

## Citation

**BibTeX:**
```
TBA
```

## Contact

In case of questions, issues, or feature requests, please raise an issue on GitHub: https://github.com/macwiatrak/Bacformer.