roa7n
/

knots_protbertBFD_alphafold

Text Classification

Transformers

PyTorch

bert


roa7n committed on Jun 5, 2023

Commit

cc901d3

1 Parent(s): 434da54

Create README.md


Files changed (1)

README.md +102 -0

README.md ADDED

	@@ -0,0 +1,102 @@

+---
+datasets:
+- EvaKlimentova/knots_AF
+metrics:
+- accuracy
+---
+# Knots ProtBert-BFD AlphaFold
+A [ProtBert-BFD](https://huggingface.co/Rostlab/prot_bert_bfd) model fine-tuned to classify protein sequences as knotted or unknotted.
+## Model Details
+- **Model type:** BERT
+- **Language:** proteins (amino acid sequences)
+- **Finetuned from model:** [Rostlab/prot_bert_bfd](https://huggingface.co/Rostlab/prot_bert_bfd)
+Model Sources:
+- **Repository:** [CEITEC](https://github.com/ML-Bioinfo-CEITEC/pknots_experiments)
+- **Paper:** TBD
+## Usage
+Dataset format:
+```
+id,sequence,label
+A0A2W5F4Z7,MGGIFRVNTYYTDLEPYLQSTKLPIYGALLDGENIYELVDKSKGILVIGNESKGIRSTIQNFIQKPITIPRIGQAESLNAAVATGIIVGQLTL,1
+...
+```
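Before loading a file, it can help to check that it really follows the format above. The following is a minimal sketch using only the standard library; the function name `validate_rows`, the allowed residue alphabet, and the assumption that labels are `0` or `1` are illustrative and not taken from the model repository.

```python
import csv
import io

# Assumed alphabet: the 20 standard amino acids plus the rare or ambiguous
# codes X, U, O, B, Z. Adjust this to match your data.
ALLOWED_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYXUOBZ")
EXPECTED_COLUMNS = ["id", "sequence", "label"]


def validate_rows(csv_text):
    """Return a list of problems found in the CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    if reader.fieldnames != EXPECTED_COLUMNS:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        seq = row["sequence"] or ""
        if not seq or not set(seq.upper()) <= ALLOWED_RESIDUES:
            problems.append(f"line {line_no}: invalid sequence")
        if row["label"] not in ("0", "1"):  # assumed binary labels
            problems.append(f"line {line_no}: invalid label {row['label']!r}")
    return problems


sample = "id,sequence,label\nP1,MGGIFRVNTYY,1\nP2,MG1IF,2\n"
for problem in validate_rows(sample):
    print(problem)
```

For a file on disk, the same function could be called as `validate_rows(open(path).read())`, where `path` is whatever CSV you intend to pass to the loading step.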
+Load the dataset:
+```
+import pandas as pd
+from datasets import Dataset
+
+# INPUT is a placeholder for the path to a CSV file in the format above.
+df = pd.read_csv(INPUT, sep=',')
+dss = Dataset.from_pandas(df)
+```
+Predict:
+```
+import numpy as np
+from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
+
+tokenizer = AutoTokenizer.from_pretrained('roa7n/knots_protbertBFD_alphafold')
+model = AutoModelForSequenceClassification.from_pretrained('roa7n/knots_protbertBFD_alphafold')
+
+def tokenize_function(s):
+    # The tokenizer expects residues separated by spaces, e.g. "M G G I".
+    seq_split = ' '.join(s['sequence'])
+    return tokenizer(seq_split)
+
+tokenized_dataset = dss.map(tokenize_function, num_proc=4)
+tokenized_dataset.set_format('pt')
+
+# Replace <PATH> with an output directory; fp16=True requires a GPU.
+training_args = TrainingArguments(<PATH>, fp16=True, per_device_eval_batch_size=50, report_to='none')
+trainer = Trainer(
+    model,
+    training_args,
+    train_dataset=tokenized_dataset,
+    eval_dataset=tokenized_dataset,
+    tokenizer=tokenizer
+)
+predictions, _, _ = trainer.predict(tokenized_dataset)
+# Softmax over the two logits; keep the probability of class 1.
+predictions = [np.exp(p[1]) / np.sum(np.exp(p), axis=0) for p in predictions]
+df['preds'] = predictions
+```
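The last step of the snippet above turns each pair of logits into a softmax probability for class index 1. As a hedged illustration, the sketch below does the same computation in plain Python, subtracting the largest logit first so that very large values do not overflow, and then thresholds the result. The helper names `knotted_probability` and `to_label`, the 0.5 threshold, and the reading of index 1 as the knotted class are assumptions made for this example.

```python
import math


def knotted_probability(logits):
    """Softmax probability of class index 1 for one pair of logits."""
    shift = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - shift) for x in logits]
    return exps[1] / sum(exps)


def to_label(probability, threshold=0.5):
    """Map a probability to 0 or 1 (threshold is an assumed default)."""
    return int(probability >= threshold)


# Made-up logits for illustration; the second pair would overflow a naive exp().
example_logits = [[-2.0, 3.0], [1000.0, 998.0]]
probs = [knotted_probability(pair) for pair in example_logits]
labels = [to_label(p) for p in probs]
print(probs, labels)
```

The shifted form gives the same probabilities as the unshifted softmax whenever the latter is finite, so it can replace the list comprehension without changing results.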
+## Evaluation
+Per protein family metrics for the M1 ProtBert-BFD model (TPR: true positive rate, TNR: true negative rate):
+|        Protein family        | Dataset size | Unknotted set size | Accuracy |   TPR  |   TNR  |
+|:----------------------------:|:------------:|:------------------:|:--------:|:------:|:------:|
+| All                          | 39412        | 19718              | **0.9845**   | 0.9865 | 0.9825 |
+| SPOUT                        | 7371         | 550                | 0.9887   | 0.9951 | 0.9090 |
+| TDD                          | 612          | 24                 | 0.9901   | 0.9965 | 0.8333 |
+| DUF                          | 716          | 429                | 0.9748   | 0.9721 | 0.9766 |
+| AdoMet synthase              | 1794         | 240                | 0.9899   | 0.9929 | 0.9708 |
+| Carbonic anhydrase           | 1531         | 539                | 0.9588   | 0.9737 | 0.9313 |
+| UCH                          | 477          | 125                | 0.9056   | 0.9602 | 0.7520 |
+| ATCase/OTCase                | 3799         | 3352               | 0.9994   | 0.9977 | 0.9997 |
+| ribosomal-mitochondrial      | 147          | 41                 | 0.8571   | 1.0000 | 0.4878 |
+| membrane                     | 8225         | 1493               | 0.9811   | 0.9904 | 0.9390 |
+| VIT                          | 14262        | 12555              | 0.9872   | 0.9420 | 0.9933 |
+| biosynthesis of lantibiotics | 392          | 286                | 0.9642   | 0.9528 | 0.9685 |
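The three metrics in the table are related: accuracy is the average of TPR and TNR weighted by class size. The sketch below shows how the metrics could be computed from labels and predictions, and then checks the "All" row for consistency. It assumes that knotted proteins are the positive class (label 1) and that the knotted count is the dataset size minus the unknotted set size; the function name `confusion_metrics` and the toy labels are illustrative only.

```python
def confusion_metrics(y_true, y_pred):
    """Return (accuracy, TPR, TNR) for binary labels, with 1 as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    tnr = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, tpr, tnr


# Toy labels, not real model output.
toy_accuracy, toy_tpr, toy_tnr = confusion_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])

# Consistency check on the "All" row of the table above.
total, unknotted = 39412, 19718
knotted = total - unknotted  # assumed: everything not unknotted is knotted
tpr, tnr = 0.9865, 0.9825
all_accuracy = (tpr * knotted + tnr * unknotted) / total
print(round(all_accuracy, 4))
```

Under these assumptions the weighted average reproduces the reported accuracy for the "All" row to four decimal places.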
+## Citation
+**BibTeX:** TODO
+## Model Authors
+- Simecek
+- Klimentova
+- Sramkova