File size: 5,572 Bytes
846a8e5 df0c3d0 fe66366 df0c3d0 e968e59 df0c3d0 fe66366 fc323c7 5fe62a3 df0c3d0 2e12f20 df0c3d0 2e12f20 df0c3d0 75172e7 5fe62a3 75172e7 5fe62a3 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 |
---
license: apache-2.0
---
# ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing
[Paper on arxiv](https://arxiv.org/abs/2402.16445) for more information
[Github](https://github.com/Lyu6PosHao/ProLLaMA) for more information
ProLLaMA is based on Llama-2-7b, so please follow the license of Llama2.
# Input Format:
The instructions which you input to the model should follow the following format:
```text
[Generate by superfamily] Superfamily=<xxx>
or
[Determine superfamily] Seq=<yyy>
```
Here are some examples of the input:
```text
[Generate by superfamily] Superfamily=<Ankyrin repeat-containing domain superfamily>
```
```
#You can also specify the first few amino acids of the protein sequence:
[Generate by superfamily] Superfamily=<Ankyrin repeat-containing domain superfamily> Seq=<MKRVL
```
```
[Determine superfamily] Seq=<MAPGGMPREFPSFVRTLPEADLGYPALRGWVLQGERGCVLYWEAVTEVALPEHCHAECWGVVVDGRMELMVDGYTRVYTRGDLYVVPPQARHRARVFPGFRGVEHLSDPDLLPVRKR>
```
**See [this](https://github.com/Lyu6PosHao/ProLLaMA/blob/main/superfamilies.txt) on all the optional superfamilies.**
# Quick usage:
```bash
# you can replace the model_path with your local path
CUDA_VISIBLE_DEVICES=0 python main.py --model "GreatCaptainNemo/ProLLaMA" --interactive
# main.py is as follows 👇:
```
```python
import argparse
import json, os
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from transformers import GenerationConfig
from tqdm import tqdm
generation_config = GenerationConfig(
temperature=0.2,
top_k=40,
top_p=0.9,
do_sample=True,
num_beams=1,
repetition_penalty=1.2,
max_new_tokens=400
)
parser = argparse.ArgumentParser()
parser.add_argument('--model', default=None, type=str,help="The local path of the model. If None, the model will be downloaded from HuggingFace")
parser.add_argument('--interactive', action='store_true',help="If True, you can input instructions interactively. If False, the input instructions should be in the input_file.")
parser.add_argument('--input_file', default=None, help="You can put all your input instructions in this file (one instruction per line).")
parser.add_argument('--output_file', default=None, help="All the outputs will be saved in this file.")
args = parser.parse_args()
if __name__ == '__main__':
if args.interactive and args.input_file:
raise ValueError("interactive is True, but input_file is not None.")
if (not args.interactive) and (args.input_file is None):
raise ValueError("interactive is False, but input_file is None.")
if args.input_file and (args.output_file is None):
raise ValueError("input_file is not None, but output_file is None.")
load_type = torch.bfloat16
if torch.cuda.is_available():
device = torch.device(0)
else:
raise ValueError("No GPU available.")
model = LlamaForCausalLM.from_pretrained(
args.model,
torch_dtype=load_type,
low_cpu_mem_usage=True,
device_map='auto',
quantization_config=None
)
tokenizer = LlamaTokenizer.from_pretrained(args.model)
model.eval()
with torch.no_grad():
if args.interactive:
while True:
raw_input_text = input("Input:")
if len(raw_input_text.strip())==0:
break
input_text = raw_input_text
input_text = tokenizer(input_text,return_tensors="pt")
generation_output = model.generate(
input_ids = input_text["input_ids"].to(device),
attention_mask = input_text['attention_mask'].to(device),
eos_token_id=tokenizer.eos_token_id,
pad_token_id=tokenizer.pad_token_id,
generation_config = generation_config,
output_attentions=False
)
s = generation_output[0]
output = tokenizer.decode(s,skip_special_tokens=True)
print("Output:",output)
print("\n")
else:
outputs=[]
with open(args.input_file, 'r') as f:
examples =f.read().splitlines()
print("Start generating...")
for index, example in tqdm(enumerate(examples),total=len(examples)):
input_text = tokenizer(example,return_tensors="pt") #add_special_tokens=False ?
generation_output = model.generate(
input_ids = input_text["input_ids"].to(device),
attention_mask = input_text['attention_mask'].to(device),
eos_token_id=tokenizer.eos_token_id,
pad_token_id=tokenizer.pad_token_id,
generation_config = generation_config
)
s = generation_output[0]
output = tokenizer.decode(s,skip_special_tokens=True)
outputs.append(output)
with open(args.output_file,'w') as f:
f.write("\n".join(outputs))
print("All the outputs have been saved in",args.output_file)
```
# Citation:
```
@article{lv2024prollama,
title={ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing},
author={Lv, Liuzhenghao and Lin, Zongying and Li, Hao and Liu, Yuyang and Cui, Jiaxi and Chen, Calvin Yu-Chian and Yuan, Li and Tian, Yonghong},
journal={arXiv preprint arXiv:2402.16445},
year={2024}
}
``` |