---
license: apache-2.0
---
# ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing

See the [paper on arXiv](https://arxiv.org/abs/2402.16445) and the [GitHub repository](https://github.com/Lyu6PosHao/ProLLaMA) for more information.

ProLLaMA is based on Llama-2-7B, so please also follow the Llama 2 license.

# Input Format:
The instructions you input to the model should follow this format:
```text
[Generate by superfamily] Superfamily=<xxx>
or
[Determine superfamily] Seq=<yyy>
```
Here are some examples of the input:
```text
[Generate by superfamily] Superfamily=<Ankyrin repeat-containing domain superfamily>
```
```text
# You can also specify the first few amino acids of the protein sequence:
[Generate by superfamily] Superfamily=<Ankyrin repeat-containing domain superfamily> Seq=<MKRVL
```
```text
[Determine superfamily] Seq=<MAPGGMPREFPSFVRTLPEADLGYPALRGWVLQGERGCVLYWEAVTEVALPEHCHAECWGVVVDGRMELMVDGYTRVYTRGDLYVVPPQARHRARVFPGFRGVEHLSDPDLLPVRKR>
```
**See [this list](https://github.com/Lyu6PosHao/ProLLaMA/blob/main/superfamilies.txt) for all available superfamilies.**
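
If you want to build generation instructions for many superfamilies programmatically, a minimal sketch along these lines may help. It assumes you have downloaded `superfamilies.txt` locally and that it lists one superfamily name per line (the exact file layout is an assumption here):

```python
# Sketch: turn superfamily names into ProLLaMA generation instructions.
# Assumption: superfamilies.txt lists one superfamily name per line.
with open("superfamilies.txt") as f:
    superfamilies = [line.strip() for line in f if line.strip()]

instructions = [
    f"[Generate by superfamily] Superfamily=<{name}>" for name in superfamilies
]
# Each instruction can be used interactively or written to the --input_file
# consumed by main.py in the Quick usage section below.
print(instructions[0])
```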

# Quick usage:
```bash
# You can replace the --model value with your local model path
CUDA_VISIBLE_DEVICES=0 python main.py --model "GreatCaptainNemo/ProLLaMA" --interactive
# main.py is as follows 👇:
```

```python
import argparse
import json, os
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from transformers import GenerationConfig
from tqdm import tqdm

generation_config = GenerationConfig(
    temperature=0.2,
    top_k=40,
    top_p=0.9,
    do_sample=True,
    num_beams=1,
    repetition_penalty=1.2,
    max_new_tokens=400
)

parser = argparse.ArgumentParser()
parser.add_argument('--model', default=None, type=str,help="The local path of the model. If None, the model will be downloaded from HuggingFace")
parser.add_argument('--interactive', action='store_true',help="If True, you can input instructions interactively. If False, the input instructions should be in the input_file.")
parser.add_argument('--input_file', default=None, help="You can put all your input instructions in this file (one instruction per line).")
parser.add_argument('--output_file', default=None, help="All the outputs will be saved in this file.")
args = parser.parse_args()

if __name__ == '__main__':
    if args.interactive and args.input_file:
        raise ValueError("interactive is True, but input_file is not None.")
    if (not args.interactive) and (args.input_file is None):
        raise ValueError("interactive is False, but input_file is None.")
    if args.input_file and (args.output_file is None):
        raise ValueError("input_file is not None, but output_file is None.")

    load_type = torch.bfloat16
    if torch.cuda.is_available():
        device = torch.device(0)
    else:
        raise ValueError("No GPU available.")


    # Load the model in bfloat16; device_map='auto' places it on the available GPU.
    model = LlamaForCausalLM.from_pretrained(
        args.model,
        torch_dtype=load_type,
        low_cpu_mem_usage=True,
        device_map='auto',
        quantization_config=None
    )
    tokenizer = LlamaTokenizer.from_pretrained(args.model)

    model.eval()
    with torch.no_grad():
        if args.interactive:
            while True:
                raw_input_text = input("Input:")
                if len(raw_input_text.strip()) == 0:
                    break
                input_text = tokenizer(raw_input_text, return_tensors="pt")

                generation_output = model.generate(
                                input_ids = input_text["input_ids"].to(device),
                                attention_mask = input_text['attention_mask'].to(device),
                                eos_token_id=tokenizer.eos_token_id,
                                pad_token_id=tokenizer.pad_token_id,
                                generation_config = generation_config,
                                output_attentions=False
                            )
                s = generation_output[0]
                output = tokenizer.decode(s,skip_special_tokens=True)
                print("Output:",output)
                print("\n")
        else:
            outputs=[]
            with open(args.input_file, 'r') as f:
                examples = f.read().splitlines()
            print("Start generating...")
            for example in tqdm(examples):
                input_text = tokenizer(example, return_tensors="pt")  # add_special_tokens=False ?

                generation_output = model.generate(
                    input_ids = input_text["input_ids"].to(device),
                    attention_mask = input_text['attention_mask'].to(device),
                    eos_token_id=tokenizer.eos_token_id,
                    pad_token_id=tokenizer.pad_token_id,
                    generation_config = generation_config
                )
                s = generation_output[0]
                output = tokenizer.decode(s,skip_special_tokens=True)
                outputs.append(output)
            with open(args.output_file,'w') as f:
                f.write("\n".join(outputs))
            print("All the outputs have been saved in",args.output_file)
```
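
If you only need a quick one-off generation rather than the full script, a condensed sketch of the same loading and generation steps looks roughly like this (same model class, tokenizer, and sampling settings as `main.py` above; the instruction string is one of the example formats shown earlier):

```python
import torch
from transformers import GenerationConfig, LlamaForCausalLM, LlamaTokenizer

# Condensed sketch of the loading/generation steps from main.py above.
model = LlamaForCausalLM.from_pretrained(
    "GreatCaptainNemo/ProLLaMA",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    device_map="auto",
)
tokenizer = LlamaTokenizer.from_pretrained("GreatCaptainNemo/ProLLaMA")
model.eval()

generation_config = GenerationConfig(
    temperature=0.2, top_k=40, top_p=0.9, do_sample=True,
    num_beams=1, repetition_penalty=1.2, max_new_tokens=400,
)

instruction = "[Generate by superfamily] Superfamily=<Ankyrin repeat-containing domain superfamily>"
inputs = tokenizer(instruction, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        generation_config=generation_config,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

For batch processing, `main.py` also accepts an input file with one instruction per line (the file names below are placeholders):

```bash
CUDA_VISIBLE_DEVICES=0 python main.py --model "GreatCaptainNemo/ProLLaMA" --input_file instructions.txt --output_file results.txt
```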

# Citation:
```bibtex
@article{lv2024prollama,
  title={ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing},
  author={Lv, Liuzhenghao and Lin, Zongying and Li, Hao and Liu, Yuyang and Cui, Jiaxi and Chen, Calvin Yu-Chian and Yuan, Li and Tian, Yonghong},
  journal={arXiv preprint arXiv:2402.16445},
  year={2024}
}
```