---

license: apache-2.0
language: c++
tags:
- code-generation
- codellama
- peft
- unit-tests
- causal-lm
- text-generation
- embedded-systems
base_model: codellama/CodeLlama-7b-hf
model_type: llama
pipeline_tag: text-generation
---
# 🧪 CodeLLaMA Unit Test Generator — Full Merged Model (v2)

This is a **merged model** that combines [`codellama/CodeLlama-7b-hf`](https://huggingface.co/codellama/CodeLlama-7b-hf) with a LoRA adapter
fine-tuned on embedded C/C++ code paired with unit tests written for GoogleTest and CppUTest. This version adds enhanced output formatting, stop tokens,
and test cleanup mechanisms.


---

## 🎯 Use Cases

- Generate comprehensive unit tests for embedded C/C++ functions
- Focus on edge cases, boundary conditions, and error handling

---

## 🧠 Training Summary

- Base model: `codellama/CodeLlama-7b-hf`
- LoRA fine-tuned with:
  - Special tokens: `<|system|>`, `<|user|>`, `<|assistant|>`, `// END_OF_TESTS`
  - Instruction-style prompts
  - Explicit test output formatting
  - Test labels cleaned with regexes that strip headers and `main` functions
- Dataset: [`athrv/Embedded_Unittest2`](https://huggingface.co/datasets/athrv/Embedded_Unittest2)
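
The label cleanup mentioned above can be illustrated with a minimal sketch. The regexes used during training are not published in this card, so the patterns and the helper names `strip_main` and `clean_test_label` below are assumptions chosen only to show the idea of removing `#include` lines and a `main` function from a test file.

```python
import re

# Hypothetical patterns: the actual regexes used for training are not documented.
INCLUDE_RE = re.compile(r"^[ \t]*#[ \t]*include[^\n]*\n?", re.MULTILINE)
MAIN_RE = re.compile(r"\bint\s+main\s*\([^)]*\)\s*\{")


def strip_main(code: str) -> str:
    """Remove the first `int main(...) { ... }` definition by counting braces.

    Braces inside string literals or comments are not handled; this is a sketch.
    """
    match = MAIN_RE.search(code)
    if match is None:
        return code
    depth = 1
    pos = match.end()
    while pos < len(code) and depth > 0:
        if code[pos] == "{":
            depth += 1
        elif code[pos] == "}":
            depth -= 1
        pos += 1
    return code[:match.start()] + code[pos:]


def clean_test_label(code: str) -> str:
    """Drop #include lines and main(), keeping only the TEST(...) bodies."""
    code = INCLUDE_RE.sub("", code)
    code = strip_main(code)
    return code.strip()
```

Applying `clean_test_label` to a full GoogleTest file should leave text that starts directly with `TEST(...)`, which matches the output constraints used in the prompt.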

---

## 📌 Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Utkarsh524/codellama_utests_full_new_ver2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

prompt = """<|system|>
Generate comprehensive unit tests for C/C++ code. Cover all edge cases, boundary conditions, and error scenarios.
Output Constraints:
1. ONLY include test code (no explanations, headers, or main functions)
2. Start directly with TEST(...)
3. End after last test case
4. Never include framework boilerplate
<|user|>
Create tests for:
int add(int a, int b) { return a + b; }
<|assistant|>
"""

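# Move inputs to the model's device (device_map="auto" decides placement).
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Stop at the end-of-tests marker, assumed to be a single special token in this tokenizer.
stop_id = tokenizer.convert_tokens_to_ids("// END_OF_TESTS")
outputs = model.generate(**inputs, max_new_tokens=512, eos_token_id=stop_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))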
```
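
For repeated use, the prompt template and the handling of the stop marker can be wrapped in small helpers. The sketch below is not part of the released code: `build_prompt` and `extract_tests` are hypothetical names, and `extract_tests` assumes the text was decoded either with `skip_special_tokens=False` (so the `<|assistant|>` tag is still present) or from the newly generated tokens only.

```python
SYSTEM_PROMPT = (
    "Generate comprehensive unit tests for C/C++ code. "
    "Cover all edge cases, boundary conditions, and error scenarios.\n"
    "Output Constraints:\n"
    "1. ONLY include test code (no explanations, headers, or main functions)\n"
    "2. Start directly with TEST(...)\n"
    "3. End after last test case\n"
    "4. Never include framework boilerplate"
)
END_MARKER = "// END_OF_TESTS"


def build_prompt(function_source: str) -> str:
    """Wrap a C/C++ function in the chat-style template shown in the example."""
    return (
        f"<|system|>\n{SYSTEM_PROMPT}\n"
        f"<|user|>\nCreate tests for:\n{function_source.strip()}\n"
        "<|assistant|>\n"
    )


def extract_tests(decoded: str) -> str:
    """Return only the generated tests from decoded model output.

    If the assistant tag is absent, the whole string is treated as the completion.
    """
    _, _, completion = decoded.rpartition("<|assistant|>")
    # Cut everything from the end-of-tests marker onward.
    completion = completion.split(END_MARKER, 1)[0]
    # Remove the base tokenizer's end-of-sequence token if it was kept.
    return completion.replace("</s>", "").strip()
```

An alternative that avoids string matching is to decode only the new tokens, for example `outputs[0][inputs["input_ids"].shape[1]:]`, and pass that text to `extract_tests`.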


---

## Training & Optimization Details

| Step                | Description                                                                 |
|---------------------|-----------------------------------------------------------------------------|
| **Dataset**         | athrv/Embedded_Unittest2 (filtered for valid code-test pairs)               |
| **Preprocessing**   | Token length filtering (≤4096), special token injection                     |
| **Quantization**    | 8-bit (BitsAndBytesConfig), llm_int8_threshold=6.0                         |
| **LoRA Config**     | r=64, alpha=32, dropout=0.1 on q_proj/v_proj/k_proj/o_proj                 |
| **Training**        | 4 epochs, batch=4 (effective 8), lr=2e-4, FP16                             |
| **Optimization**    | Paged AdamW 8-bit, gradient checkpointing, custom data collator            |
| **Special Tokens**  | Added `<\|system\|>`, `<\|user\|>`, `<\|assistant\|>`                      |

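
As a rough worked example of what the table implies, the arithmetic below derives the LoRA scaling factor, an approximate adapter parameter count, and the gradient accumulation steps. The hidden size (4096) and layer count (32) are assumptions taken from the published CodeLlama-7b architecture, not from this card, and the accumulation figure assumes a single training device. Any embedding rows trained for the added special tokens are not counted.

```python
# Values taken from the table above.
LORA_R = 64
LORA_ALPHA = 32
TARGET_MODULES = ("q_proj", "k_proj", "v_proj", "o_proj")
PER_DEVICE_BATCH = 4
EFFECTIVE_BATCH = 8

# Assumptions (not stated in this card): CodeLlama-7b dimensions.
HIDDEN_SIZE = 4096  # each targeted projection is HIDDEN_SIZE x HIDDEN_SIZE
NUM_LAYERS = 32

# Standard LoRA scales the low-rank update by alpha / r.
scaling = LORA_ALPHA / LORA_R

# Each adapted module adds A (r x in) and B (out x r).
params_per_module = LORA_R * (HIDDEN_SIZE + HIDDEN_SIZE)
trainable_params = params_per_module * len(TARGET_MODULES) * NUM_LAYERS

# Effective batch = per-device batch x accumulation steps (single device assumed).
grad_accum_steps = EFFECTIVE_BATCH // PER_DEVICE_BATCH

print(f"LoRA scaling: {scaling}")                      # 0.5
print(f"Adapter parameters: {trainable_params:,}")     # 67,108,864
print(f"Gradient accumulation steps: {grad_accum_steps}")  # 2
```

Under these assumptions the adapter holds roughly 67M parameters, about 1% of the 7B base model.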
---

## Tips for Best Results

- **Temperature:** 0.2–0.4  
- **Top-p:** 0.85–0.95  
- **Max New Tokens:** 256–2048 (for example 256, 512, 1024, or 2048)
- **Input Formatting:**
  - Include complete function signatures
  - Remove unnecessary comments
  - Keep functions under 200 lines
  - For long functions, split into logical units

---

## Feedback & Citation

**Dataset Credit:** `athrv/Embedded_Unittest2`  
**Report Issues:** [Model's Hugging Face page](https://huggingface.co/Utkarsh524/codellama_utests_full_new_ver2)

**Maintainer:** Utkarsh524  
**Model Version:** v2 (4-epoch trained)

---