genbio-ai/AIDO.ProteinIF-16B

smahbub committed on Dec 15, 2024

Commit 577347b · verified · 1 Parent(s): bddf0f4

Update README.md

Files changed (1)

README.md +27 -11

@@ -29,17 +29,33 @@ Install [Model Generator](https://github.com/genbio-ai/modelgenerator).
 #### Outputs:
 - The evaluation score will be printed on the console.
-- The generated sequences will be stored in `./proteinIF_outputs/designed_sequences.pkl`. The content of this file looks as follows, where we have the token (amino-acid) ids of the ground truth sequences (`"true_seq"`) and predicted sequences by our method (`"pred_seq"`), stored as numpy arrays.
-  ```
-  {
-    'true_seq': [
-        array([[ 4,  8,  4,  3, 12,  5,  2, 11, 16, 15,  5,  1, 11, ...]]), ...
-    ],
-    'pred_seq': [
-        array([[ 8,  2,  4,  3, 10,  6,  2, 11, 16, 15,  6,  1, 11, ...]]), ...
-    ]
-  }
-  ```
 #### Note:

 #### Outputs:
 - The evaluation score will be printed on the console.
+- The generated sequences will be stored in the folder `proteinIF_outputs/`. There will be two output files:
+  - **`./proteinIF_outputs/designed_sequences.pkl`**: This file contains the raw token (amino-acid) IDs of the ground truth sequences (`"true_seq"`) and of the sequences predicted by our method (`"pred_seq"`), stored as numpy arrays. An example:
+    ```
+    {
+      'true_seq': [
+          array([[ 4,  8,  4,  3, 12,  5,  2, 11, 16, 15,  5,  1, 11, ...]]), ...
+      ],
+      'pred_seq': [
+          array([[ 8,  2,  4,  3, 10,  6,  2, 11, 16, 15,  6,  1, 11, ...]]), ...
+      ]
+    }
+    ```
+  - **`./proteinIF_outputs/results_acc_<median_accuracy>.txt`** (where `<median_accuracy>` is the median accuracy computed over all test samples):
+    - For each protein in the test set, this file contains three lines:
+      - Line 1: Identity of the protein (as '`name=<PDB_ID>.<CHAIN_ID>`'), length of the sequence (as '`L=<length_of_sequence>`'), and the recovery rate/accuracy for that protein sequence (as '`Recovery=<recovery_rate_of_sequence>`')
+      - Line 2: *Single-letter representation* of the amino acids of the ground truth sequence (as `true:VTVGKSAPYFSL...`)
+      - Line 3: *Single-letter representation* of the amino acids of the sequence predicted by our method (as `pred:TAVGDEAPYFEL...`)
+    - An example file content:
+      ```
+      >name=3fkf.A | L=141 | Recovery=0.5957446694374084
+      true:VTVGKSAPYFSLPNEKGEKLSRSAERFRNRYLLLNFWASWCDPQPEANAELKRLNKEYKKNKNFAMLGISLDIDREAWETAIKKDTLSWDQVCDFTGLSSETAKQYAILTLPTNILLSPTGKILARDIQGEALTGKLKELL
+      pred:TAVGDEAPYFELPDLEGKKLSLDSEEFKNKYLLLDFWASWCLPCREEIAELKELYRRFAKNKKFAILGVSADTDKEAWLKAVKEDNLRWTQVSDFKGWDSEVFKNYNVQSLPENILLSPEGKILARGIRGEALRNKLKELL
+      >name=2d9e.A | L=121 | Recovery=0.7685950398445129
+      true:GSSGSSGFLILLRKTLEQLQEKDTGNIFSEPVPLSEVPDYLDHIKKPMDFFTMKQNLEAYRYLNFDDFEEDFNLIVSNCLKYNAKDTIFYRAAVRLREQGGAVLRQARRQAEKMGSGPSSG
+      pred:GSSGSSGRLTLLRETLEQLQERDTGWVFSEPVPLSEVPDYLDVIDHPMDFSTMRRKLEAHRYLSFDEFERDFNLIVENCRKYNAKDTVFYRAAVRLQAQGGAILRKARRDVESLGSGPSSG
+      ```
 #### Note:
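
The two output files described in this change can be read back with a few lines of Python. The following is a minimal sketch, not part of the repository: the function names `load_designed_sequences`, `parse_results`, and `recovery` are hypothetical. It assumes the text file follows exactly the three-line layout shown in the example above, and that the recovery rate is the fraction of positions where the predicted residue matches the ground truth, which the README does not state explicitly.

```python
import pickle


def load_designed_sequences(path):
    """Load the pickle of token IDs (hypothetical helper).

    Unpickling requires numpy to be installed, since the values are
    numpy arrays. Only unpickle files you produced yourself.
    """
    with open(path, "rb") as f:
        return pickle.load(f)  # dict with 'true_seq' and 'pred_seq' lists


def parse_results(text):
    """Parse three-line records (header, true:, pred:) into dicts."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    records = []
    for i in range(0, len(lines) - 2, 3):
        header, true_line, pred_line = lines[i:i + 3]
        # Header layout: >name=<PDB_ID>.<CHAIN_ID> | L=<length> | Recovery=<rate>
        fields = dict(
            part.strip().split("=", 1)
            for part in header.lstrip(">").split("|")
        )
        records.append({
            "name": fields["name"],
            "length": int(fields["L"]),
            "recovery": float(fields["Recovery"]),
            "true": true_line.split(":", 1)[1],
            "pred": pred_line.split(":", 1)[1],
        })
    return records


def recovery(true_seq, pred_seq):
    """Assumed definition: fraction of positions where residues match."""
    matches = sum(t == p for t, p in zip(true_seq, pred_seq))
    return matches / len(true_seq)
```

To use it, read the contents of the `results_acc_<median_accuracy>.txt` file into a string and pass it to `parse_results`. Comparing `recovery(r["true"], r["pred"])` against `r["recovery"]` for each record is a quick way to check whether the assumed definition matches the reported values.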