---
license: mit
language:
- en
tags:
- llm
- safety
- jailbreak
- knowledge
---
# Introduction
This model generates jailbreak prompts from knowledge point texts. It is based on Llama-2-7b and fine-tuned on the Knowledge-to-Jailbreak dataset. The model is intended to bridge the gap between theoretical vulnerabilities and real-world application scenarios by simulating sophisticated adversarial attacks that incorporate specialized knowledge.
Our proposed method and dataset serve as a critical starting point for both offensive and defensive research, enabling the development of new techniques to enhance the security and robustness of language models in practical settings.
# How to load the model and tokenizer
We provide two helper functions for loading the model and tokenizer.
```python
import os
import json

import torch
from peft import PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    AutoTokenizer,
)
def load_tokenizer(dir_or_model):
    """
    Load the tokenizer for a specific pre-trained model.

    Args:
        dir_or_model: Either a directory containing the pre-trained model
            configuration (for example a LoRA adapter directory) or a model
            name/path.

    Returns:
        A tokenizer that converts text to tokens for the corresponding model.
    """
    is_lora_dir = os.path.isfile(os.path.join(dir_or_model, "adapter_config.json"))

    if is_lora_dir:
        # For a LoRA adapter directory, load the tokenizer of the base model.
        with open(os.path.join(dir_or_model, "adapter_config.json"), "r") as f:
            model_name = json.load(f)["base_model_name_or_path"]
    else:
        model_name = dir_or_model

    if os.path.isfile(os.path.join(dir_or_model, "config.json")):
        with open(os.path.join(dir_or_model, "config.json"), "r") as f:
            loaded_json = json.load(f)
        if "_name_or_path" in loaded_json:
            model_name = loaded_json["_name_or_path"]

    # The original script pointed at a local Llama-2-7b checkout here; load from
    # the resolved model name instead. If the repository does not ship tokenizer
    # files, substitute the path of your local Llama-2-7b base model.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token_id = tokenizer.eos_token_id

    return tokenizer
def load_model(dir_or_model, classification=False, token_classification=False, return_tokenizer=False,
               dtype=torch.bfloat16, load_dtype=True, rl=False, peft_config=None,
               device_map="auto", revision="main"):
    """
    Load a model based on several parameters, including the type of task it should perform.

    Args:
        dir_or_model: Either a directory containing the pre-trained model
            configuration (for example a LoRA adapter directory) or a model
            name/path.
        classification (bool): If True, loads the model for sequence classification.
        token_classification (bool): If True, loads the model for token classification.
        return_tokenizer (bool): If True, returns the tokenizer along with the model.
        dtype: The data type PyTorch should use to store the model's parameters
            and run the computation.
        load_dtype (bool): If False, uses torch.float32 regardless of the passed dtype.
        rl (bool): If True, loads a model intended for reinforcement-learning training.
        peft_config: Configuration details for PEFT (e.g. LoRA) models.

    Returns:
        The model for the requested task, together with its tokenizer if requested.
    """
    is_lora_dir = os.path.isfile(os.path.join(dir_or_model, "adapter_config.json"))

    if not load_dtype:
        dtype = torch.float32

    if is_lora_dir:
        with open(os.path.join(dir_or_model, "adapter_config.json"), "r") as f:
            model_name = json.load(f)["base_model_name_or_path"]
    else:
        model_name = dir_or_model

    original_model_name = model_name

    if classification:
        # to investigate: passing a non-float32 torch_dtype here fails.
        model = AutoModelForSequenceClassification.from_pretrained(
            model_name, trust_remote_code=True, torch_dtype=torch.float32,
            use_auth_token=True, device_map=device_map, revision=revision)
    elif token_classification:
        model = AutoModelForTokenClassification.from_pretrained(
            model_name, trust_remote_code=True, torch_dtype=torch.float32,
            use_auth_token=True, device_map=device_map, revision=revision)
    elif model_name.endswith("GPTQ") or model_name.endswith("GGML"):
        # Quantized checkpoints require the auto-gptq package; import it lazily
        # so the common (non-quantized) path does not depend on it.
        from auto_gptq import AutoGPTQForCausalLM
        model = AutoGPTQForCausalLM.from_quantized(
            model_name,
            use_safetensors=True,
            trust_remote_code=True,
            # use_triton=True,  # breaks currently; GPTQ generation is quite slow without it
            quantize_config=None,
            device_map=device_map)
    else:
        model = AutoModelForCausalLM.from_pretrained(
            model_name, trust_remote_code=True, torch_dtype=dtype,
            use_auth_token=True, device_map=device_map, revision=revision)

    if is_lora_dir:
        # Attach the LoRA adapter weights on top of the base model.
        model = PeftModel.from_pretrained(model, dir_or_model)

    try:
        tokenizer = load_tokenizer(original_model_name)
        model.config.pad_token_id = tokenizer.pad_token_id
    except Exception:
        pass

    if return_tokenizer:
        return model, load_tokenizer(original_model_name)
    return model
model_name = 'tsq2000/Jailbreak-generator'
model = load_model(model_name)
tokenizer = load_tokenizer(model_name)
```
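The helpers above cover LoRA adapters, quantized checkpoints, and classification heads; for this card's checkpoint a plain `transformers` load may be all you need. A minimal sketch, assuming the `tsq2000/Jailbreak-generator` repository ships full (non-adapter) weights and that bfloat16 inference fits your hardware:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tsq2000/Jailbreak-generator"

# Assumption: the repository contains full model weights and tokenizer files;
# if it only ships a LoRA adapter, use the load_model/load_tokenizer helpers above.
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # adjust to your hardware (e.g. torch.float16)
    device_map="auto",
)
```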
# How to generate jailbreak prompts
Here is an example of how to generate jailbreak prompts based on knowledge point texts.
```python
model_name = 'tsq2000/Jailbreak-generator'
model = load_model(model_name)
tokenizer = load_tokenizer(model_name)
max_length = 2048
max_tokens = 64
knowledge_points = ["Kettling Kettling (also known as containment or corralling) is a police tactic for controlling large crowds during demonstrations or protests. It involves the formation of large cordons of police officers who then move to contain a crowd within a limited area. Protesters are left only one choice of exit controlled by the police – or are completely prevented from leaving, with the effect of denying the protesters access to food, water and toilet facilities for a time period determined by the police forces. The tactic has proved controversial, in part because it has resulted in the detention of ordinary bystanders."]
batch_texts = [f'### Input:\n{input_}\n\n### Response:\n' for input_ in knowledge_points]
inputs = tokenizer(batch_texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length - max_tokens).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=max_tokens, num_return_sequences=1, do_sample=False, temperature=1, top_p=1, eos_token_id=tokenizer.eos_token_id)
generated_texts = []
for output, input_text in zip(outputs, batch_texts):
    text = tokenizer.decode(output, skip_special_tokens=True)
    generated_texts.append(text[len(input_text):])
print(generated_texts)
```
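The example above decodes the full sequence and slices off the prompt by character offset, which works for a single input. To process several knowledge points per call, the sketch below wraps the same `### Input:` / `### Response:` template into a hypothetical `generate_jailbreaks` helper (not part of the original card) that uses left padding and slices at the token level, reusing the `model`, `tokenizer`, and `knowledge_points` from the example above:
```python
def generate_jailbreaks(model, tokenizer, knowledge_points, max_length=2048, max_tokens=64):
    """Batched generation sketch; the helper name and left-padding choice are assumptions."""
    # Decoder-only models are typically batched with left padding so every
    # generated continuation starts immediately after its prompt.
    tokenizer.padding_side = "left"
    prompts = [f"### Input:\n{kp}\n\n### Response:\n" for kp in knowledge_points]
    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True,
                       max_length=max_length - max_tokens).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_tokens, do_sample=False,
                             eos_token_id=tokenizer.eos_token_id)
    # Slice off the prompt tokens instead of using character offsets, which is
    # more robust once padding is involved.
    generated = outputs[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(generate_jailbreaks(model, tokenizer, knowledge_points))
```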
# Citation
If you find this model useful, please cite the following paper:
```
@misc{tu2024knowledgetojailbreak,
      title={Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack},
      author={Shangqing Tu and Zhuoran Pan and Wenxuan Wang and Zhexin Zhang and Yuliang Sun and Jifan Yu and Hongning Wang and Lei Hou and Juanzi Li},
      year={2024},
      eprint={2406.11682},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.11682},
}
```