base_model:
  - genbio-ai/AIDO.Protein-16B
license: other

Protein Inverse Folding

We fine-tune the AIDO.Protein-16B model with LoRA on the CATH 4.2 benchmark dataset, using the same train, validation, and test splits as previous studies such as LM-Design and DPLM. The current version of ModelGenerator contains the inference pipeline for protein inverse folding. Experimental pipelines for other datasets (both training and testing) will be included in the future.

Setup:

Install ModelGenerator.

  • Docker is required to run our inverse folding pipeline.
  • Set up a Docker image using the provided Dockerfile and run the inverse folding inference from within the Docker container.
    • Here is an example bash script to set up and access a Docker container:
      # clone the ModelGenerator repository
      git clone https://github.com/genbio-ai/ModelGenerator.git
      # cd to "ModelGenerator" folder where you should find the "Dockerfile"
      cd ModelGenerator
      # create a docker image
      docker build -t aido .
      # create a local folder as ModelGenerator's data directory
      mkdir -p $HOME/mgen_data
      # run a container
      docker run -d --runtime=nvidia -it -v "$(pwd):/workspace" -v "$HOME/mgen_data:/mgen_data" aido /bin/bash
      # find the container ID
      docker ps # this will print the running containers and their IDs
      # open a shell inside the container with ID=<container_id>
      docker exec -it <container_id> /bin/bash  # now you should be inside the docker container
      # test if you can access the nvidia GPUs
      nvidia-smi # this should print the GPUs' details
      
  • Execute the following steps from within the docker container you just created.

Download model checkpoints:

  • Download all 15 model checkpoint chunks (named chunk_<chunk_ID>.bin) from here. Place them inside the directory ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks.

    Alternatively, you can simply run the following script to do this (Note: this script uses the wget tool):

    mkdir -p ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks
    bash download_model_chunks.sh ${MGEN_DATA_DIR}/modelgenerator/huggingface_models/protein_inv_fold/AIDO.ProteinIF-16B/model_chunks
    

Download data:

  • Download the preprocessed CATH 4.2 dataset from here. You should find two files named chain_set_map.pkl and chain_set_splits.json. Place them inside the directory ${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/. (Note that the dataset was originally preprocessed for Generative Models for Graph-Based Protein Design (Ingraham et al., NeurIPS'19), and we further preprocessed it to suit our pipeline.)

    Alternatively, you can download both files by running the following commands:

    mkdir -p ${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/
    wget -P ${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/ https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/resolve/main/cath-4.2/chain_set_map.pkl
    wget -P ${MGEN_DATA_DIR}/modelgenerator/datasets/protein_inv_fold/cath_4.2/ https://huggingface.co/datasets/genbio-ai/protein-inverse-folding/resolve/main/cath-4.2/chain_set_splits.json
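
Before running inference, it can help to confirm that the checkpoint chunks and dataset files ended up in the directories described above. The following is a minimal sketch, not part of ModelGenerator: the function name `check_inputs` is hypothetical, and the fallback value `/mgen_data` for `MGEN_DATA_DIR` is an assumption based on the mount point used in the example `docker run` command. It only counts files matching `chunk_*.bin` and checks that the two dataset files exist; it does not validate their contents.

```python
import os
from pathlib import Path
from typing import List


def check_inputs(mgen_data_dir: str, expected_chunks: int = 15) -> List[str]:
    """Return a list of human-readable problems; an empty list means all files were found."""
    root = Path(mgen_data_dir) / "modelgenerator"
    problems: List[str] = []

    # Directory layout taken from the download instructions in this README.
    chunk_dir = (root / "huggingface_models" / "protein_inv_fold"
                 / "AIDO.ProteinIF-16B" / "model_chunks")
    chunks = sorted(chunk_dir.glob("chunk_*.bin"))
    if len(chunks) != expected_chunks:
        problems.append(
            f"expected {expected_chunks} chunk files in {chunk_dir}, found {len(chunks)}"
        )

    data_dir = root / "datasets" / "protein_inv_fold" / "cath_4.2"
    for name in ("chain_set_map.pkl", "chain_set_splits.json"):
        if not (data_dir / name).is_file():
            problems.append(f"missing dataset file: {data_dir / name}")

    return problems


if __name__ == "__main__":
    # Assumption: MGEN_DATA_DIR is set inside the container; fall back to /mgen_data.
    base = os.environ.get("MGEN_DATA_DIR", "/mgen_data")
    found = check_inputs(base)
    if found:
        for problem in found:
            print("PROBLEM:", problem)
    else:
        print("All expected model chunks and dataset files are present.")
```

If the script reports a chunk count other than 15, re-run the download step before starting inference.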
    

Run inference:

  • Then run the bash script for inference:
    bash prot_inverse_folding.sh
    
  • Note: Multi-GPU inference for inverse folding is not yet supported; support will be added in the future.

Outputs:

  • The evaluation score will be printed on the console.
  • The generated sequences will be stored in the folder proteinIF_outputs/. There will be two output files:
    • ./proteinIF_outputs/designed_sequences.pkl: This file will contain the raw token (amino-acid) IDs of the ground truth sequences ("true_seq") and the sequences predicted by our method ("pred_seq"), stored as numpy arrays. An example:
      {
        'true_seq': [
            array([[ 4,  8,  4,  3, 12,  5,  2, 11, 16, 15,  5,  1, 11, ...]]), ...
        ],
        'pred_seq': [
            array([[ 8,  2,  4,  3, 10,  6,  2, 11, 16, 15,  6,  1, 11, ...]]), ...
        ]
      }
      
    • ./proteinIF_outputs/results_acc_<median_accuracy>.txt (where <median_accuracy> is the median accuracy calculated over all the test samples):
      • Here, for each protein in the test set, we have three lines of information:
        • Line 1: Identity of the protein (as 'name=<PDB_ID>.<CHAIN_ID>'), length of the sequence (as 'L=<length_of_sequence>'), and the recovery rate/accuracy for that protein sequence (as 'Recovery=<recovery_rate_of_sequence>')
        • Line 2: Single-letter amino-acid representation of the ground truth sequence (as true:<sequence_of_amino_acids>)
        • Line 3: Single-letter amino-acid representation of the sequence predicted by our method (as pred:<sequence_of_amino_acids>)
      • An example file content:
        >name=3fkf.A | L=141 | Recovery=0.5957446694374084
        true:VTVGKSAPYFSLPNEKGEKLSRSAERFRNRYLLLNFWASWCDPQPEANAELKRLNKEYKKNKNFAMLGISLDIDREAWETAIKKDTLSWDQVCDFTGLSSETAKQYAILTLPTNILLSPTGKILARDIQGEALTGKLKELL
        pred:TAVGDEAPYFELPDLEGKKLSLDSEEFKNKYLLLDFWASWCLPCREEIAELKELYRRFAKNKKFAILGVSADTDKEAWLKAVKEDNLRWTQVSDFKGWDSEVFKNYNVQSLPENILLSPEGKILARGIRGEALRNKLKELL
        
        >name=2d9e.A | L=121 | Recovery=0.7685950398445129
        true:GSSGSSGFLILLRKTLEQLQEKDTGNIFSEPVPLSEVPDYLDHIKKPMDFFTMKQNLEAYRYLNFDDFEEDFNLIVSNCLKYNAKDTIFYRAAVRLREQGGAVLRQARRQAEKMGSGPSSG
        pred:GSSGSSGRLTLLRETLEQLQERDTGWVFSEPVPLSEVPDYLDVIDHPMDFSTMRRKLEAHRYLSFDEFERDFNLIVENCRKYNAKDTVFYRAAVRLQAQGGAILRKARRDVESLGSGPSSG
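
For downstream analysis, the three-line records in the results file can be read back into Python. The sketch below is not part of ModelGenerator; the names `parse_results` and `recovery` are hypothetical. It assumes the file follows exactly the layout shown above, and that the recovery rate is the fraction of positions where the predicted residue equals the true residue (the example values are numerically consistent with this, e.g. 84/141 ≈ 0.5957 and 93/121 ≈ 0.7686).

```python
import glob
import re
import statistics
from typing import Dict, List

# Matches header lines such as: >name=3fkf.A | L=141 | Recovery=0.5957446694374084
HEADER_RE = re.compile(
    r"^>name=(?P<name>\S+)\s*\|\s*L=(?P<length>\d+)\s*\|\s*Recovery=(?P<rec>[0-9.eE+-]+)\s*$"
)


def parse_results(text: str) -> List[Dict]:
    """Parse header/true/pred records into a list of dictionaries."""
    records: List[Dict] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines separate records
        match = HEADER_RE.match(line)
        if match:
            current = {
                "name": match.group("name"),
                "length": int(match.group("length")),
                "reported_recovery": float(match.group("rec")),
            }
            records.append(current)
        elif current is not None and line.startswith("true:"):
            current["true"] = line[len("true:"):]
        elif current is not None and line.startswith("pred:"):
            current["pred"] = line[len("pred:"):]
    return records


def recovery(true_seq: str, pred_seq: str) -> float:
    """Fraction of positions where pred matches true (assumed definition of recovery)."""
    if not true_seq or len(true_seq) != len(pred_seq):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(t == p for t, p in zip(true_seq, pred_seq))
    return matches / len(true_seq)


if __name__ == "__main__":
    # The file name embeds the median accuracy, so locate it with a wildcard.
    for path in glob.glob("proteinIF_outputs/results_acc_*.txt"):
        with open(path) as handle:
            parsed = parse_results(handle.read())
        if parsed:
            median = statistics.median(r["reported_recovery"] for r in parsed)
            print(f"{path}: {len(parsed)} proteins, median reported recovery {median:.4f}")
```

Comparing `recovery(r["true"], r["pred"])` against `r["reported_recovery"]` for each record is a quick way to test whether the assumed definition matches what the pipeline reports; small differences in the last digits are expected if the pipeline computes the value in single precision.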