---
license: mit
base_model: prithivMLmods/Codepy-Deepthink-3B
language:
- en
library_name: transformers
tags:
- text-generation
- code-generation
- vulnerability-injection
- security
- vaitp
- finetuned
pretty_name: "FBogaerts/Codepy-Deepthink-3B Finetuned for Vulnerability Injection"
---

# FBogaerts/Codepy-Deepthink-3B Finetuned for Vulnerability Injection (VAITP)

This model is a fine-tuned version of **prithivMLmods/Codepy-Deepthink-3B** specialized for the task of security vulnerability injection in Python code. It has been trained to follow a specific instruction format so that it can precisely modify code snippets and introduce vulnerabilities.

This model was developed as part of the research for our paper: *(coming soon)*.

The VAITP CLI Framework and related resources can be found at our [GitHub repository](coming soon).

## Model Description

This model was fine-tuned to act as a "Coder" LLM: it takes a specific instruction set together with a piece of original Python code, and its objective is to return the modified code with the requested vulnerability injected.

The model performs best when prompted with the exact format it was trained on (see *How to Use* below).

## Intended Uses & Limitations

**Intended Use**

This model is intended for research in automated security testing, SAST/DAST tool evaluation, and the generation of training data for security-aware models. It should be used within a sandboxed environment to inject vulnerabilities into non-production code for analysis.

**Out-of-Scope Uses**

This model should **NOT** be used for:
- Generating malicious code for use in real-world attacks.
- Directly modifying production codebases.
- Any application outside of controlled, ethical security research.

The generated code should always be manually reviewed before use.

## How to Use

This model expects a very specific prompt format, which we call the `FINETUNED_STYLE` in our paper. The format is:

`{instruction} _BREAK_ {original_code}`

Here is an example using `transformers`:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "FBogaerts/Codepy-Deepthink-3B-Finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

instruction = "Modify the function to introduce an OS Command Injection vulnerability. The vulnerable code must contain the pattern: 'User-controlled input is used in a subprocess call with shell=True'."
original_code = "import subprocess\ndef execute(cmd):\n    subprocess.run(cmd, shell=False)"

# Build the FINETUNED_STYLE prompt: instruction and code joined by _BREAK_.
prompt = f"{instruction} _BREAK_ {original_code}"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)

# The model will output the full modified code block (including the echoed prompt).
# Further cleaning may be needed to extract only the code.
vulnerable_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(vulnerable_code)
```
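
Because `generate` returns the prompt tokens followed by the completion, the decoded string usually needs that cleaning step. Below is a minimal sketch of one way to do it; `extract_generated_code` is a hypothetical helper, and the fence-stripping heuristic is an assumption about how the model formats its output, not part of the released tooling.

```python
# Hypothetical post-processing helper -- not part of the released tooling.
FENCE = "`" * 3  # a markdown code fence

def extract_generated_code(full_output: str, prompt: str) -> str:
    """Strip the echoed prompt and any markdown fences from a generation."""
    # generate() returns the prompt tokens plus the completion, so drop the prompt first.
    completion = full_output[len(prompt):] if full_output.startswith(prompt) else full_output
    # If the model wrapped its answer in a fenced block, keep only the body.
    if FENCE in completion:
        completion = completion.split(FENCE)[1].removeprefix("python")
    return completion.strip()

print(extract_generated_code(vulnerable_code, prompt))
```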

## Training Procedure

### Training Data

The model was fine-tuned on a dataset of 1,406 examples derived from the DeVAITP Vulnerability Corpus. Each example consists of a triplet: (instruction, original_code, vulnerable_code). The instructions were generated using the meta-prompting technique described in our paper, with meta-llama/Meta-Llama-3.1-8B-Instruct serving as the Planner model. A sketch of how one triplet maps onto the prompt format follows.

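For illustration only, here is how a single triplet could be turned into a training pair under the `FINETUNED_STYLE` format; the concrete code and dictionary keys below are invented for this sketch, not entries from the corpus.

```python
# Illustrative triplet -- the concrete values are invented, not corpus entries.
triplet = {
    "instruction": "Modify the function to introduce a SQL Injection vulnerability.",
    "original_code": (
        "def get_user(db, name):\n"
        "    return db.execute('SELECT * FROM users WHERE name = ?', (name,))"
    ),
    "vulnerable_code": (
        "def get_user(db, name):\n"
        '    return db.execute(f"SELECT * FROM users WHERE name = \'{name}\'")'
    ),
}

# Under FINETUNED_STYLE, the model sees the prompt and learns to emit the target.
prompt = f"{triplet['instruction']} _BREAK_ {triplet['original_code']}"
target = triplet["vulnerable_code"]
```
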
### Training Hyperparameters

The model was fine-tuned with the following key hyperparameters (an illustrative TRL configuration follows the list):

- **Framework:** Hugging Face TRL
- **Learning Rate:** 2e-5
- **Number of Epochs:** 1
- **Batch Size:** 1
- **Hardware:** Google Colab (L4 GPU)

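As a rough guide to reproducing this setup, the hyperparameters above map onto TRL's `SFTTrainer` roughly as sketched below. The dataset file name, the preformatted `text` column, and the output directory are assumptions made for illustration; the actual training script is not published here.

```python
# Sketch of the fine-tuning setup using TRL's SFTTrainer.
# File names, the preformatted "text" column, and output_dir are placeholders.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumes each record's "text" field already holds the full
# "{instruction} _BREAK_ {original_code}" prompt plus the target code.
train_dataset = load_dataset("json", data_files="vaitp_triplets.jsonl", split="train")

config = SFTConfig(
    output_dir="codepy-deepthink-3b-vaitp",  # placeholder
    learning_rate=2e-5,             # as listed above
    num_train_epochs=1,             # as listed above
    per_device_train_batch_size=1,  # as listed above
)

trainer = SFTTrainer(
    model="prithivMLmods/Codepy-Deepthink-3B",  # base model from this card
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```
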
## Evaluation

(coming soon)

## Citation

If you use this model in your research, please cite our paper:

(BibTeX entry will be provided upon publication)