---
datasets:
- EvaKlimentova/knots_AF
metrics:
- accuracy
---

# Knots ProtBert-BFD AlphaFold

A [ProtBert-BFD](https://huggingface.co/Rostlab/prot_bert_bfd) model fine-tuned to classify proteins as knotted or unknotted.

## Model Details

- **Model type:** BERT
- **Language:** proteins (amino acid sequences)
- **Finetuned from model:** [Rostlab/prot_bert_bfd](https://huggingface.co/Rostlab/prot_bert_bfd)

Model Sources:

- **Repository:** [CEITEC](https://github.com/ML-Bioinfo-CEITEC/pknots_experiments)
- **Paper:** TBD

## Usage

Dataset format:

```
id,sequence,label
A0A2W5F4Z7,MGGIFRVNTYYTDLEPYLQSTKLPIYGALLDGENIYELVDKSKGILVIGNESKGIRSTIQNFIQKPITIPRIGQAESLNAAVATGIIVGQLTL,1
...
```

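
A quick structural check before building the `Dataset` can catch header or label problems early. The sketch below uses only the standard library. The function name `read_knot_csv`, the example row, and the rule that `label` is 0 or 1 are assumptions for illustration; the card does not state which value marks a knotted protein.

```python
import csv
import io

EXPECTED_COLUMNS = ["id", "sequence", "label"]


def read_knot_csv(handle):
    """Read rows in the id,sequence,label layout and apply basic sanity checks."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"expected header {EXPECTED_COLUMNS}, got {reader.fieldnames}")
    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sequence = row["sequence"] or ""
        if not sequence.isalpha():
            raise ValueError(f"line {line_no}: sequence must contain only letters")
        # Assumption: labels are binary and stored as 0 or 1.
        if row["label"] not in {"0", "1"}:
            raise ValueError(f"line {line_no}: label must be 0 or 1")
        rows.append({"id": row["id"], "sequence": sequence.upper(), "label": int(row["label"])})
    return rows


# Hypothetical in-memory file with one made-up row.
example = io.StringIO("id,sequence,label\nexample_1,MGGIFRVNTYY,1\n")
rows = read_knot_csv(example)
print(rows)
```
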
Load the dataset:

```
import pandas as pd
from datasets import Dataset

# INPUT is the path to a CSV file in the format shown above
df = pd.read_csv(INPUT, sep=',')
dss = Dataset.from_pandas(df)
```

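
The prediction code in this card joins the residues of each sequence with spaces before tokenizing. A minimal sketch of that preparation step as a standalone helper follows. The name `to_protbert_input` is hypothetical, and the optional mapping of the rare residues U, Z, O and B to X follows the preprocessing suggested for the base ProtBert models; whether this fine-tuned model expects it is an assumption, so it is off by default.

```python
import re


def to_protbert_input(sequence, map_rare_to_x=False):
    """Return the sequence as space-separated uppercase residues."""
    seq = sequence.strip().upper()
    if map_rare_to_x:
        # Assumption: mirrors the base ProtBert preprocessing, not confirmed for this model.
        seq = re.sub(r"[UZOB]", "X", seq)
    # Joining a string with ' ' puts one space between consecutive characters.
    return " ".join(seq)


print(to_protbert_input("MGGIF"))  # M G G I F
```
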
Predict:

```
import numpy as np
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained('roa7n/knots_protbertBFD_alphafold')
model = AutoModelForSequenceClassification.from_pretrained('roa7n/knots_protbertBFD_alphafold')

def tokenize_function(s):
    # Separate residues with spaces; the column name matches the CSV header
    seq_split = ' '.join(s['sequence'])
    return tokenizer(seq_split)

tokenized_dataset = dss.map(tokenize_function, num_proc=4)
tokenized_dataset.set_format('pt')

# <PATH> is the output directory; fp16=True assumes a CUDA GPU
training_args = TrainingArguments(<PATH>, fp16=True, per_device_eval_batch_size=50, report_to='none')

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,
    tokenizer=tokenizer
)

# predictions holds one pair of logits per sequence; softmax gives the probability of class 1
predictions, _, _ = trainer.predict(tokenized_dataset)
predictions = [np.exp(p[1]) / np.sum(np.exp(p), axis=0) for p in predictions]
df['preds'] = predictions
```

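
The last step of the prediction code turns each pair of logits into a class-1 probability with a softmax. A plain `exp` can overflow for very large logits, so the sketch below subtracts the maximum logit first, which leaves the result unchanged. The helper names and the 0.5 decision threshold are assumptions for illustration, and the card does not state which class index corresponds to knotted proteins.

```python
import math


def class1_probability(logits):
    """Softmax probability of class 1 for one [logit_0, logit_1] pair."""
    shift = max(logits)  # subtracting the max keeps exp() in a safe range
    exps = [math.exp(x - shift) for x in logits]
    return exps[1] / sum(exps)


def to_label(probability, threshold=0.5):
    """Turn a probability into a 0/1 label (threshold is an assumed default)."""
    return int(probability >= threshold)


probability = class1_probability([1000.0, 1001.0])  # no overflow despite large logits
print(probability, to_label(probability))
```

Only the difference between the two logits matters, so `[1000.0, 1001.0]` and `[0.0, 1.0]` give the same probability.
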
## Evaluation

Metrics per protein family for M1 ProtBert-BFD (TPR: true positive rate, TNR: true negative rate):

| Protein family               | Dataset size | Unknotted set size | Accuracy | TPR    | TNR    |
|:----------------------------:|:------------:|:------------------:|:--------:|:------:|:------:|
| All                          | 39412        | 19718              | **0.9845** | 0.9865 | 0.9825 |
| SPOUT                        | 7371         | 550                | 0.9887   | 0.9951 | 0.9090 |
| TDD                          | 612          | 24                 | 0.9901   | 0.9965 | 0.8333 |
| DUF                          | 716          | 429                | 0.9748   | 0.9721 | 0.9766 |
| AdoMet synthase              | 1794         | 240                | 0.9899   | 0.9929 | 0.9708 |
| Carbonic anhydrase           | 1531         | 539                | 0.9588   | 0.9737 | 0.9313 |
| UCH                          | 477          | 125                | 0.9056   | 0.9602 | 0.7520 |
| ATCase/OTCase                | 3799         | 3352               | 0.9994   | 0.9977 | 0.9997 |
| ribosomal-mitochondrial      | 147          | 41                 | 0.8571   | 1.0000 | 0.4878 |
| membrane                     | 8225         | 1493               | 0.9811   | 0.9904 | 0.9390 |
| VIT                          | 14262        | 12555              | 0.9872   | 0.9420 | 0.9933 |
| biosynthesis of lantibiotics | 392          | 286                | 0.9642   | 0.9528 | 0.9685 |

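
The three metric columns can be reproduced from 0/1 labels and 0/1 predictions as sketched below. This is an illustration, not the authors' evaluation script: the function names are hypothetical, the toy labels are made up, and treating knotted proteins as the positive class (so that knotted count = dataset size minus unknotted set size) is an assumption.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, TPR and TNR for non-empty lists of 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    n_pos, n_neg = tp + fn, tn + fp
    return {
        "accuracy": (tp + tn) / (n_pos + n_neg),
        "tpr": tp / n_pos if n_pos else float("nan"),
        "tnr": tn / n_neg if n_neg else float("nan"),
    }


def accuracy_from_rates(tpr, tnr, n_pos, n_neg):
    """Accuracy is the class-size-weighted average of TPR and TNR."""
    return (tpr * n_pos + tnr * n_neg) / (n_pos + n_neg)


# Made-up toy labels, only to exercise the function.
metrics = binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])

# "All" row: 39412 sequences, 19718 unknotted, so 19694 assumed knotted (positive).
all_accuracy = accuracy_from_rates(0.9865, 0.9825, 39412 - 19718, 19718)
print(metrics, round(all_accuracy, 4))
```

Under that assumption, the weighted average for the "All" row rounds to the reported accuracy of 0.9845. Families with few unknotted examples, such as TDD with 24, have a TNR that rests on a small sample.
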
## Citation

**BibTeX:** TODO

## Model Authors

Simecek: [email protected]

Klimentova: [email protected]

Sramkova: [email protected]