---
datasets:
- EvaKlimentova/knots_AF
metrics:
- accuracy
---
# Knots ProtBert-BFD AlphaFold
[ProtBert-BFD](https://huggingface.co/Rostlab/prot_bert_bfd) fine-tuned to classify protein sequences as knotted or unknotted.
## Model Details
- **Model type:** BERT
- **Language:** proteins (amino acid sequences)
- **Finetuned from model:** [Rostlab/prot_bert_bfd](https://huggingface.co/Rostlab/prot_bert_bfd)
Model Sources:
- **Repository:** [CEITEC](https://github.com/ML-Bioinfo-CEITEC/pknots_experiments)
- **Paper:** TBD
## Usage
Dataset format:
```
id,sequence,label
A0A2W5F4Z7,MGGIFRVNTYYTDLEPYLQSTKLPIYGALLDGENIYELVDKSKGILVIGNESKGIRSTIQNFIQKPITIPRIGQAESLNAAVATGIIVGQLTL,1
...
```
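Before loading, it can help to check that a file matches this layout. The sketch below uses only the standard library; the helper name `check_rows` and the assumption that labels are encoded as `0` or `1` are illustrative and not part of the original pipeline.

```python
import csv
import io

REQUIRED_COLUMNS = ("id", "sequence", "label")

def check_rows(csv_text):
    """Parse CSV text and return its rows, raising ValueError on layout problems."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    # Data starts on line 2, after the header line.
    for line_no, row in enumerate(reader, start=2):
        if not row["sequence"]:
            raise ValueError(f"line {line_no}: empty sequence")
        # Assumption: binary labels are encoded as 0/1.
        if row["label"] not in ("0", "1"):
            raise ValueError(f"line {line_no}: unexpected label {row['label']!r}")
        rows.append(row)
    return rows

# The example row from the format description above.
example = (
    "id,sequence,label\n"
    "A0A2W5F4Z7,MGGIFRVNTYYTDLEPYLQSTKLPIYGALLDGENIYELVDKSKGILVIGNESKGIRSTIQNFIQKPITIPRIGQAESLNAAVATGIIVGQLTL,1\n"
)
rows = check_rows(example)
print(len(rows), rows[0]["id"])
```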
Load the dataset:
```
import pandas as pd
from datasets import Dataset

# INPUT is the path to a CSV file in the format described above
df = pd.read_csv(INPUT, sep=',')
dss = Dataset.from_pandas(df)
```
Predict:
```
import numpy as np
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained('roa7n/knots_protbertBFD_alphafold')
model = AutoModelForSequenceClassification.from_pretrained('roa7n/knots_protbertBFD_alphafold')

def tokenize_function(s):
    # ProtBert expects residues separated by spaces, e.g. 'M G G I ...'
    seq_split = ' '.join(s['sequence'])
    return tokenizer(seq_split)

tokenized_dataset = dss.map(tokenize_function, num_proc=4)
tokenized_dataset.set_format('pt')

# Replace '<PATH>' with an output directory; fp16=True requires a GPU
training_args = TrainingArguments('<PATH>', fp16=True, per_device_eval_batch_size=50, report_to='none')
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,
    tokenizer=tokenizer  # newer transformers releases name this argument processing_class
)
predictions, _, _ = trainer.predict(tokenized_dataset)
# Softmax over the two logits; keep the probability of class 1
predictions = [np.exp(p[1]) / np.sum(np.exp(p), axis=0) for p in predictions]
df['preds'] = predictions
```
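The softmax step in the prediction code exponentiates raw logits, which can overflow when logits are large. A minimal sketch of a vectorized, numerically stable alternative follows. The function name `positive_class_probability`, the toy logits, and the 0.5 decision threshold are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np

def positive_class_probability(logits):
    """Return softmax probability of class 1 for an (n, 2) array of logits."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtracting the row maximum leaves the softmax unchanged but avoids overflow.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs[:, 1]

# Toy logits (made up for illustration), including a row that would overflow naively.
logits = np.array([[0.0, 0.0], [2.0, -2.0], [1000.0, 1002.0]])
probs = positive_class_probability(logits)

# Assumption: a 0.5 threshold turns probabilities into hard 0/1 predictions.
labels = (probs >= 0.5).astype(int)
print(probs, labels)
```

With the output of `trainer.predict`, the same function could be applied as `df['preds'] = positive_class_probability(predictions)`.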
## Evaluation
Per protein family metrics (TPR: true positive rate, TNR: true negative rate):
| M1 ProtBert-BFD | Dataset size | Unknotted set size | Accuracy | TPR | TNR |
|:----------------------------:|:------------:|:------------------:|:--------:|:------:|:------:|
| All | 39412 | 19718 | **0.9845** | 0.9865 | 0.9825 |
| SPOUT | 7371 | 550 | 0.9887 | 0.9951 | 0.9090 |
| TDD | 612 | 24 | 0.9901 | 0.9965 | 0.8333 |
| DUF | 716 | 429 | 0.9748 | 0.9721 | 0.9766 |
| AdoMet synthase | 1794 | 240 | 0.9899 | 0.9929 | 0.9708 |
| Carbonic anhydrase | 1531 | 539 | 0.9588 | 0.9737 | 0.9313 |
| UCH | 477 | 125 | 0.9056 | 0.9602 | 0.7520 |
| ATCase/OTCase | 3799 | 3352 | 0.9994 | 0.9977 | 0.9997 |
| ribosomal-mitochondrial | 147 | 41 | 0.8571 | 1.0000 | 0.4878 |
| membrane | 8225 | 1493 | 0.9811 | 0.9904 | 0.9390 |
| VIT | 14262 | 12555 | 0.9872 | 0.9420 | 0.9933 |
| biosynthesis of lantibiotics | 392 | 286 | 0.9642 | 0.9528 | 0.9685 |
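
For reference, all three metrics follow from confusion-matrix counts, and accuracy is the size-weighted mean of TPR and TNR. The sketch below assumes that knotted proteins are the positive class and that every protein outside the unknotted set is knotted. Under those assumptions the reported rows are internally consistent, as the ribosomal-mitochondrial row illustrates. Function names and the toy counts are illustrative.

```python
def rates(tp, fn, tn, fp):
    """Return (accuracy, TPR, TNR) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    tpr = tp / (tp + fn)  # fraction of positives predicted correctly
    tnr = tn / (tn + fp)  # fraction of negatives predicted correctly
    return accuracy, tpr, tnr

def accuracy_from_rates(total, unknotted, tpr, tnr):
    """Recover accuracy from set sizes and per-class rates."""
    # Assumption: everything that is not in the unknotted set is knotted.
    knotted = total - unknotted
    return (tpr * knotted + tnr * unknotted) / total

# Toy counts, made up purely to show the definitions.
toy_accuracy, toy_tpr, toy_tnr = rates(tp=8, fn=2, tn=9, fp=1)

# ribosomal-mitochondrial row: 147 proteins, 41 unknotted, TPR 1.0000, TNR 0.4878
acc = accuracy_from_rates(147, 41, 1.0000, 0.4878)
print(round(acc, 4))  # 0.8571, matching the table
```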
## Citation
**BibTeX:** TODO
## Model Authors
- Simecek
- Klimentova
- Sramkova