Commit cc901d3 by roa7n · 1 parent: 434da54

Create README.md

Files changed (1)
  1. README.md +102 -0
README.md ADDED
@@ -0,0 +1,102 @@
1
+ ---
2
+ datasets:
3
+ - EvaKlimentova/knots_AF
4
+ metrics:
5
+ - accuracy
6
+ ---
7
+
8
+ # Knots ProtBert-BFD AlphaFold
9
+
10
+ A [ProtBert-BFD](https://huggingface.co/Rostlab/prot_bert_bfd) model fine-tuned to classify protein sequences as knotted or unknotted.
11
+
12
+ ## Model Details
13
+
14
+ - **Model type:** BERT
15
+ - **Language:** proteins (amino acid sequences)
16
+ - **Fine-tuned from model:** [Rostlab/prot_bert_bfd](https://huggingface.co/Rostlab/prot_bert_bfd)
17
+
18
+ Model Sources:
19
+
20
+ - **Repository:** [CEITEC](https://github.com/ML-Bioinfo-CEITEC/pknots_experiments)
21
+ - **Paper:** TBD
22
+
23
+ ## Usage
24
+
25
+ Dataset format:
26
+ ```
27
+ id,sequence,label
28
+ A0A2W5F4Z7,MGGIFRVNTYYTDLEPYLQSTKLPIYGALLDGENIYELVDKSKGILVIGNESKGIRSTIQNFIQKPITIPRIGQAESLNAAVATGIIVGQLTL,1
29
+ ...
30
+ ```
31
+
32
+ Load the dataset:
33
+ ```python
34
+ import pandas as pd
35
+ from datasets import Dataset
36
+
37
+ df = pd.read_csv(INPUT, sep=',')  # INPUT: path to a CSV file in the format shown above
38
+ dss = Dataset.from_pandas(df)
39
+ ```
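Each `sequence` value is a contiguous string of one-letter residue codes, whereas the ProtBert-BFD tokenizer works on residues separated by single spaces. The following is a minimal sketch of that re-spacing step using only the standard library. The helper name `space_residues` and the two sample rows are hypothetical and only illustrate the `id,sequence,label` layout.

```python
import csv
import io


def space_residues(sequence):
    """Insert one space between residues: 'MGG' -> 'M G G'."""
    return " ".join(sequence.strip())


# Hypothetical two-row sample in the id,sequence,label layout.
SAMPLE_CSV = "id,sequence,label\nP1,MGGIFRV,1\nP2,MKTAYIA,0\n"

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
spaced = [space_residues(row["sequence"]) for row in rows]
print(spaced)
```

The same join can be applied per row inside a `Dataset.map` call, which is what a tokenization function for this model would typically do before calling the tokenizer.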
40
+
41
+ Predict:
42
+ ```python
43
+ import torch
44
+ import numpy as np
45
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
46
+ from math import exp
47
+
48
+ def tokenize_function(s):
49
+     seq_split = ' '.join(s['sequence'])  # space-separate residues; column name matches the CSV header
50
+     return tokenizer(seq_split)
51
+
52
+ tokenizer = AutoTokenizer.from_pretrained('roa7n/knots_protbertBFD_alphafold')
53
+ model = AutoModelForSequenceClassification.from_pretrained('roa7n/knots_protbertBFD_alphafold')
54
+
55
+ tokenized_dataset = dss.map(tokenize_function, num_proc=4)
56
+ tokenized_dataset.set_format('pt')
57
+ tokenized_dataset
58
+
59
+ training_args = TrainingArguments(OUTPUT_DIR, fp16=True, per_device_eval_batch_size=50, report_to='none')  # OUTPUT_DIR: path for Trainer output; fp16=True needs GPU support
60
+
61
+ trainer = Trainer(
62
+     model,
63
+     training_args,
64
+     train_dataset=tokenized_dataset,
65
+     eval_dataset=tokenized_dataset,
66
+     tokenizer=tokenizer,
67
+ )
68
+
69
+ predictions, _, _ = trainer.predict(tokenized_dataset)
70
+ predictions = [np.exp(p[1]) / np.sum(np.exp(p), axis=0) for p in predictions]  # softmax probability of class 1
71
+ df['preds'] = predictions
72
+ ```
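The last step above turns each pair of logits into a softmax probability for class index 1. Exponentiating raw logits can overflow when they are large, so a minimal sketch of a numerically stable variant is shown here, together with a thresholding step. The helper names, the example logits and the 0.5 threshold are assumptions for illustration, as is the reading that class index 1 means "knotted"; none of them come from the model repository.

```python
import math


def knot_probability(logit_pair):
    """Softmax probability of class index 1 for one pair of logits."""
    shift = max(logit_pair)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - shift) for x in logit_pair]
    return exps[1] / sum(exps)


def to_labels(logits, threshold=0.5):
    """Hard 0/1 labels from logits; the threshold is an assumed default."""
    return [int(knot_probability(pair) >= threshold) for pair in logits]


# Made-up logits standing in for the first output of trainer.predict(...).
example_logits = [[-2.0, 3.0], [1.5, -0.5], [0.0, 0.0]]
probs = [knot_probability(pair) for pair in example_logits]
labels = to_labels(example_logits)
print(probs, labels)
```

Because the functions only index and iterate over each pair, they also accept the rows of a NumPy array of shape `(n, 2)`.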
73
+
74
+ ## Evaluation
75
+
76
+ Metrics per protein family (TPR: true positive rate on knotted proteins; TNR: true negative rate on unknotted proteins):
77
+
78
+ | Family (M1 ProtBert-BFD)     | Dataset size | Unknotted set size | Accuracy | TPR | TNR |
79
+ |:----------------------------:|:------------:|:------------------:|:--------:|:------:|:------:|
80
+ | All | 39412 | 19718 | **0.9845** | 0.9865 | 0.9825 |
81
+ | SPOUT | 7371 | 550 | 0.9887 | 0.9951 | 0.9090 |
82
+ | TDD | 612 | 24 | 0.9901 | 0.9965 | 0.8333 |
83
+ | DUF | 716 | 429 | 0.9748 | 0.9721 | 0.9766 |
84
+ | AdoMet synthase | 1794 | 240 | 0.9899 | 0.9929 | 0.9708 |
85
+ | Carbonic anhydrase | 1531 | 539 | 0.9588 | 0.9737 | 0.9313 |
86
+ | UCH | 477 | 125 | 0.9056 | 0.9602 | 0.7520 |
87
+ | ATCase/OTCase | 3799 | 3352 | 0.9994 | 0.9977 | 0.9997 |
88
+ | ribosomal-mitochondrial | 147 | 41 | 0.8571 | 1.0000 | 0.4878 |
89
+ | membrane | 8225 | 1493 | 0.9811 | 0.9904 | 0.9390 |
90
+ | VIT | 14262 | 12555 | 0.9872 | 0.9420 | 0.9933 |
91
+ | biosynthesis of lantibiotics | 392 | 286 | 0.9642 | 0.9528 | 0.9685 |
92
+
93
+
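The columns of the table are related: accuracy is the average of TPR and TNR weighted by the knotted and unknotted set sizes. The following is a minimal sketch of how such metrics can be computed from labels and thresholded predictions, and of how a table row can be cross-checked. It assumes that label 1 (knotted) is the positive class and that the unknotted set is the negative class; the function names and the toy labels are hypothetical.

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, TPR and TNR for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "tpr": tp / (tp + fn) if tp + fn else float("nan"),
        "tnr": tn / (tn + fp) if tn + fp else float("nan"),
    }


def accuracy_from_rates(total, n_negative, tpr, tnr):
    """Recombine per-class rates into overall accuracy."""
    n_positive = total - n_negative
    return (tpr * n_positive + tnr * n_negative) / total


# Toy labels and predictions, not taken from the dataset.
toy = confusion_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])

# Cross-check of the "All" row: 39412 proteins, 19718 of them unknotted.
all_row = accuracy_from_rates(39412, 19718, 0.9865, 0.9825)
print(toy, round(all_row, 4))
```

With the values from the "All" row, the recombined accuracy rounds to the reported 0.9845, which supports reading the unknotted set as the negative class.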
94
+ ## Citation
95
+
96
+ **BibTeX:** TODO
97
+
98
+ ## Model Authors
99
+
100
+ - Simecek
101
+ - Klimentova
102
+ - Sramkova