FBogaerts committed on
Commit f204b06 · verified · 1 Parent(s): da6d8d8

Update README.md
---
license: mit
base_model: prithivMLmods/Codepy-Deepthink-3B
language:
- en
library_name: transformers
tags:
- text-generation
- code-generation
- vulnerability-injection
- security
- vaitp
- finetuned
pretty_name: "FBogaerts/Codepy-Deepthink-3B Finetuned for Vulnerability Injection"
---

# FBogaerts/Codepy-Deepthink-3B Finetuned for Vulnerability Injection (VAITP)

This model is a fine-tuned version of **prithivMLmods/Codepy-Deepthink-3B** specialized for the task of security vulnerability injection in Python code. It has been trained to follow a specific instruction format to precisely modify code snippets and introduce vulnerabilities.

This model was developed as part of the research for our paper: *(coming soon)*.

The VAITP CLI Framework and related resources can be found at our [GitHub repository](coming soon).

## Model Description

This model was fine-tuned to act as a "Coder" LLM. It takes an instruction and a piece of original Python code, and its objective is to return the modified code with the requested vulnerability injected.

The model performs best when prompted with the specific format it was trained on, described below.

## Intended Uses & Limitations

**Intended Use**

This model is intended for research purposes in automated security testing, SAST/DAST tool evaluation, and the generation of training data for security-aware models. It should be used within a sandboxed environment to inject vulnerabilities into non-production code for analysis.

**Out-of-Scope Uses**

This model should **NOT** be used for:
- Generating malicious code for use in real-world attacks.
- Directly modifying production codebases.
- Any application outside of controlled, ethical security research.

The generated code should always be manually reviewed before use.

## How to Use

This model expects a very specific prompt format, which we call the `FINETUNED_STYLE` in our paper. The format is:

`{instruction} _BREAK_ {original_code}`

Here is an example using `transformers`:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "FBogaerts/Codepy-Deepthink-3B-Finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

instruction = "Modify the function to introduce an OS Command Injection vulnerability. The vulnerable code must contain the pattern: 'User-controlled input is used in a subprocess call with shell=True'."
original_code = "import subprocess\ndef execute(cmd):\n    subprocess.run(cmd, shell=False)"

# Build the prompt in the FINETUNED_STYLE format: {instruction} _BREAK_ {original_code}
prompt = f"{instruction} _BREAK_ {original_code}"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)

vulnerable_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
# The model will output the full modified code block.
# Further cleaning may be needed to extract only the code.
print(vulnerable_code)
```
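
Because the decoded output contains the echoed prompt followed by the model's completion, a small post-processing step can isolate just the generated code. The snippet below is a minimal sketch of one way to do this; the `extract_code` helper and the assumption that the completion follows the echoed prompt are ours, not part of the model's output contract, so adapt it to what you actually observe.

```python
def extract_code(full_output: str, prompt: str) -> str:
    """Strip the echoed prompt and any markdown fences from the model output.

    Hypothetical helper: adjust to the exact output format the model produces.
    """
    # Drop the echoed prompt if the model repeats it verbatim.
    completion = full_output[len(prompt):] if full_output.startswith(prompt) else full_output
    completion = completion.strip()
    # Remove surrounding markdown code fences, if present.
    if completion.startswith("```"):
        completion = completion.split("\n", 1)[-1]
    if completion.endswith("```"):
        completion = completion.rsplit("```", 1)[0]
    return completion.strip()

print(extract_code(vulnerable_code, prompt))
```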

## Training Procedure

### Training Data

The model was fine-tuned on a dataset of 1,406 examples derived from the DeVAITP Vulnerability Corpus. Each example consists of a triplet: (instruction, original_code, vulnerable_code). The instructions were generated using the meta-prompting technique described in our paper, with meta-llama/Meta-Llama-3.1-8B-Instruct serving as the Planner model.
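
As an illustration of how one corpus triplet maps onto the `FINETUNED_STYLE` prompt format, here is a minimal sketch; the field names and the prompt/completion split are assumptions for illustration, not the published preprocessing code.

```python
def format_example(example: dict) -> dict:
    """Serialize an (instruction, original_code, vulnerable_code) triplet
    into an assumed prompt/completion pair using the _BREAK_ separator."""
    prompt = f"{example['instruction']} _BREAK_ {example['original_code']}"
    return {"prompt": prompt, "completion": example["vulnerable_code"]}

sample = {
    "instruction": "Modify the function to introduce an OS Command Injection vulnerability.",
    "original_code": "import subprocess\ndef execute(cmd):\n    subprocess.run(cmd, shell=False)",
    "vulnerable_code": "import subprocess\ndef execute(cmd):\n    subprocess.run(cmd, shell=True)",
}
print(format_example(sample))
```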

### Training Hyperparameters

The model was fine-tuned using the following key hyperparameters:

- Framework: Hugging Face TRL
- Learning Rate: 2e-5
- Number of Epochs: 1
- Batch Size: 1
- Hardware: Google Colab (L4 GPU)
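
The exact training script is not reproduced here, but the following is a minimal sketch of how these hyperparameters could map onto TRL's `SFTTrainer`. The in-memory dataset, output directory, and text serialization are placeholders, and any settings not listed above (sequence length, optimizer, etc.) are left at TRL defaults.

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Placeholder stand-in for the 1,406-example corpus; each record is the serialized
# prompt ("{instruction} _BREAK_ {original_code}") followed by the target vulnerable code.
train_dataset = Dataset.from_list([
    {"text": "Modify the function ... _BREAK_ import subprocess\n... vulnerable version of the code ..."},
])

config = SFTConfig(
    output_dir="codepy-deepthink-3b-vaitp",  # assumed output path
    learning_rate=2e-5,                      # as listed above
    num_train_epochs=1,
    per_device_train_batch_size=1,
)

trainer = SFTTrainer(
    model="prithivMLmods/Codepy-Deepthink-3B",  # base model from this card
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```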

## Evaluation

(coming soon)

## Citation

If you use this model in your research, please cite our paper:
(BibTeX entry will be provided upon publication)