---
base_model:
- genbio-ai/AIDO.Protein-16B
license: other
---

# Protein Inverse Folding
We fine-tune the [AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B) model with LoRA on the [CATH 4.2](https://pubmed.ncbi.nlm.nih.gov/9309224/) benchmark dataset. We use the same train, validation, and test splits as previous studies such as [LM-Design](https://arxiv.org/abs/2302.01649) and [DPLM](https://arxiv.org/abs/2402.18567). The current version of ModelGenerator contains the inference pipeline for protein inverse folding. Experimental pipelines for other datasets (both training and testing) will be included in the future.

#### Setup:
Install [ModelGenerator](https://github.com/genbio-ai/modelgenerator).
- It is **required** to use [Docker](https://www.docker.com/101-tutorial/) to run our inverse folding pipeline.
- Please build a Docker image from our provided [Dockerfile](https://github.com/genbio-ai/ModelGenerator/blob/main/Dockerfile) and run the inverse folding inference from within the Docker container.

#### Running inference:

- Set the environment variable for ModelGenerator's data directory:
    ```
    export MGEN_DATA_DIR=~/mgen_data # or any other local directory of your choice
    ```
- Download all 15 model checkpoint chunks (named `chunk_<chunk_ID>.bin`) from [here](https://huggingface.co/genbio-ai/AIDO.Protein-16B-inverse_folding). Place them inside the directory `${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.Protein-16B-inverse_folding/model_chunks`.

- Download the CATH 4.2 dataset preprocessed by [Generative Models for Graph-Based Protein Design (Ingraham et al., NeurIPS'19)](https://papers.nips.cc/paper_files/paper/2019/file/f3a4ff4839c56a5f460c88cce3666a2b-Paper.pdf) from [here](http://people.csail.mit.edu/ingraham/graph-protein-design/data/cath/). You should find two files named `chain_set.jsonl` and `chain_set_splits.json`. Place them inside the directory `${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/`.

- Then run the bash script:
    ```
    bash prot_inverse_folding.sh
    ```
- **Note:** Multi-GPU inference for inverse folding is not yet supported; it will be included in the future.
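
Before launching the script, it can help to confirm that the downloaded files are where the steps above place them. The following is a minimal sketch and not part of ModelGenerator: the helper name `missing_inputs` is hypothetical, and it assumes only the directory layout, the `chunk_<chunk_ID>.bin` naming, and the count of 15 chunks stated above.

```python
import os
from pathlib import Path


def missing_inputs(data_dir=None, n_chunks=15):
    """Return a list of problems with the expected layout (empty if none found).

    Hypothetical helper: it only checks file presence, not file contents.
    """
    if data_dir is None:
        # Fall back to the example value used in the export command above.
        data_dir = os.environ.get("MGEN_DATA_DIR", "~/mgen_data")
    root = Path(data_dir).expanduser() / "modelgenerator"
    chunk_dir = (root / "huggingface_models" / "protein_inv_fold"
                 / "AIDO.Protein-16B-inverse_folding" / "model_chunks")
    cath_dir = root / "datasets" / "protein_inv_fold" / "cath_4.2"

    problems = []
    # Count checkpoint chunks by name pattern; a missing directory counts as 0.
    n_found = len(list(chunk_dir.glob("chunk_*.bin")))
    if n_found != n_chunks:
        problems.append(
            f"expected {n_chunks} chunk_*.bin files in {chunk_dir}, found {n_found}"
        )
    # The two CATH 4.2 files named in the download step.
    for name in ("chain_set.jsonl", "chain_set_splits.json"):
        if not (cath_dir / name).is_file():
            problems.append(f"missing {cath_dir / name}")
    return problems
```

Calling `missing_inputs()` with no argument reads `MGEN_DATA_DIR` from the environment; an empty list means none of the checked files were found missing.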

#### Outputs:
- The evaluation score will be printed on the console. 
- The generated sequences will be stored in the folder `proteinIF_outputs/`. There will be two output files:
  - **`./proteinIF_outputs/designed_sequences.pkl`**: This file contains the raw token (amino-acid) IDs of the ground-truth sequences (`"true_seq"`) and of the sequences predicted by our method (`"pred_seq"`), stored as NumPy arrays. An example:
    ```
    {
      'true_seq': [
          array([[ 4,  8,  4,  3, 12,  5,  2, 11, 16, 15,  5,  1, 11, ...]]), ...
      ],
      'pred_seq': [
          array([[ 8,  2,  4,  3, 10,  6,  2, 11, 16, 15,  6,  1, 11, ...]]), ...
      ]
    }
    ```
  - **`./proteinIF_outputs/results_acc_<median_accuracy>.txt`** (where median accuracy is the median accuracy calculated over all the test samples):
    - Here, for each protein in the test set, we have three lines of information:
      - Line 1: Identity of the protein (as '`name=<PDB_ID>.<CHAIN_ID>`'), length of the sequence (as '`L=<length_of_sequence>`'), and the recovery rate/accuracy for that protein sequence (as '`Recovery=<recovery_rate_of_sequence>`')
      - Line 2: *Single-letter representation* of the amino acids of the ground-truth sequence (as `true:<sequence_of_amino_acids>`)
      - Line 3: *Single-letter representation* of the amino acids of the sequence predicted by our method (as `pred:<sequence_of_amino_acids>`)
    - An example file content:
      ```
      >name=3fkf.A | L=141 | Recovery=0.5957446694374084
      true:VTVGKSAPYFSLPNEKGEKLSRSAERFRNRYLLLNFWASWCDPQPEANAELKRLNKEYKKNKNFAMLGISLDIDREAWETAIKKDTLSWDQVCDFTGLSSETAKQYAILTLPTNILLSPTGKILARDIQGEALTGKLKELL
      pred:TAVGDEAPYFELPDLEGKKLSLDSEEFKNKYLLLDFWASWCLPCREEIAELKELYRRFAKNKKFAILGVSADTDKEAWLKAVKEDNLRWTQVSDFKGWDSEVFKNYNVQSLPENILLSPEGKILARGIRGEALRNKLKELL
      
      >name=2d9e.A | L=121 | Recovery=0.7685950398445129
      true:GSSGSSGFLILLRKTLEQLQEKDTGNIFSEPVPLSEVPDYLDHIKKPMDFFTMKQNLEAYRYLNFDDFEEDFNLIVSNCLKYNAKDTIFYRAAVRLREQGGAVLRQARRQAEKMGSGPSSG
      pred:GSSGSSGRLTLLRETLEQLQERDTGWVFSEPVPLSEVPDYLDVIDHPMDFSTMRRKLEAHRYLSFDEFERDFNLIVENCRKYNAKDTVFYRAAVRLQAQGGAILRKARRDVESLGSGPSSG
      ```
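
For downstream analysis, the three-line records in the results file can be read back with a few lines of Python. The sketch below is illustrative rather than part of the pipeline: `parse_results` and `identity` are hypothetical helper names, the header pattern is inferred from the example above, and it assumes that recovery means the fraction of positions where the predicted residue equals the ground-truth residue. The `Recovery` values in the example are numerically consistent with that reading (for instance, 84/141 ≈ 0.5957), but the pipeline's exact definition is not stated here. The input is a made-up toy record, not real model output.

```python
import re
import statistics

# Header layout inferred from the example: ">name=<id> | L=<len> | Recovery=<float>"
HEADER = re.compile(r"^>name=(\S+) \| L=(\d+) \| Recovery=(\S+)$")


def identity(true_seq, pred_seq):
    """Fraction of positions where two equal-length sequences agree."""
    if not true_seq or len(true_seq) != len(pred_seq):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(a == b for a, b in zip(true_seq, pred_seq)) / len(true_seq)


def parse_results(text):
    """Parse header/true/pred triplets into a list of dicts."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    records = []
    for i, line in enumerate(lines):
        match = HEADER.match(line)
        if match is None or i + 2 >= len(lines):
            continue
        true_line, pred_line = lines[i + 1], lines[i + 2]
        if not (true_line.startswith("true:") and pred_line.startswith("pred:")):
            continue  # skip malformed records instead of guessing
        records.append({
            "name": match.group(1),
            "length": int(match.group(2)),
            "recovery": float(match.group(3)),
            "true": true_line[len("true:"):],
            "pred": pred_line[len("pred:"):],
        })
    return records


# Toy input in the same layout (names and sequences are invented).
toy = """\
>name=xxxx.A | L=4 | Recovery=0.75
true:ACDE
pred:ACDF

>name=yyyy.B | L=2 | Recovery=0.5
true:GH
pred:GK
"""

records = parse_results(toy)
for rec in records:
    # Compare the reported value with the recomputed per-position identity.
    print(rec["name"], rec["recovery"], identity(rec["true"], rec["pred"]))
print("median recovery:", statistics.median(r["recovery"] for r in records))
```

To use it on a real run, pass the contents of the `results_acc_<median_accuracy>.txt` file to `parse_results`. The same `identity` function could also be applied to the token-ID arrays in `designed_sequences.pkl` after flattening each array to a list, for example with `arr.ravel().tolist()`.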