---
license: apache-2.0
language: c++
tags:
- code-generation
- codellama
- peft
- unit-tests
- causal-lm
- text-generation
- embedded-systems
base_model: codellama/CodeLlama-7b-hf
model_type: llama
pipeline_tag: text-generation
---

# 🧪 CodeLLaMA Unit Test Generator: Full Merged Model (v2)

This is a **merged model** that combines [`codellama/CodeLlama-7b-hf`](https://huggingface.co/codellama/CodeLlama-7b-hf) with a LoRA adapter fine-tuned on embedded C/C++ code paired with high-quality unit tests written for GoogleTest and CppUTest. This version adds stricter output formatting, a `// END_OF_TESTS` stop token, and regex-based cleanup of the training labels.

---

## 🎯 Use Cases

- Generate comprehensive unit tests for embedded C/C++ functions
- Focus on edge cases, boundary conditions, and error handling

---

## 🧠 Training Summary

- Base model: `codellama/CodeLlama-7b-hf`
- LoRA fine-tuned with:
  - Special tokens: `<|system|>`, `<|user|>`, `<|assistant|>`, `// END_OF_TESTS`
  - Instruction-style prompts
  - Explicit test output formatting
  - Test labels cleaned with regexes that strip headers and `main` functions
- Dataset: [`athrv/Embedded_Unittest2`](https://huggingface.co/datasets/athrv/Embedded_Unittest2)

---

## 📌 Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Utkarsh524/codellama_utests_full_new_ver2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = """<|system|>
Generate comprehensive unit tests for C/C++ code. Cover all edge cases, boundary conditions, and error scenarios.
Output Constraints:
1. ONLY include test code (no explanations, headers, or main functions)
2. Start directly with TEST(...)
3. End after last test case
4. Never include framework boilerplate
<|user|>
Create tests for:
int add(int a, int b) { return a + b; }
<|assistant|>
"""

# Move the inputs to wherever device_map="auto" placed the model.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Stop generation at the "// END_OF_TESTS" special token.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    eos_token_id=tokenizer.convert_tokens_to_ids("// END_OF_TESTS"),
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training & Optimization Details

| Step               | Description                                                      |
|--------------------|------------------------------------------------------------------|
| **Dataset**        | athrv/Embedded_Unittest2 (filtered for valid code-test pairs)    |
| **Preprocessing**  | Token length filtering (≤4096), special token injection          |
| **Quantization**   | 8-bit (BitsAndBytesConfig), llm_int8_threshold=6.0               |
| **LoRA Config**    | r=64, alpha=32, dropout=0.1 on q_proj/v_proj/k_proj/o_proj       |
| **Training**       | 4 epochs, batch=4 (effective 8), lr=2e-4, FP16                   |
| **Optimization**   | Paged AdamW 8-bit, gradient checkpointing, custom data collator  |
| **Special Tokens** | Added `<\|system\|>`, `<\|user\|>`, `<\|assistant\|>`            |

---

## Tips for Best Results

- **Temperature:** 0.2–0.4
- **Top-p:** 0.85–0.95
- **Max New Tokens:** 256–2048 (the example above uses 512)
- **Input Formatting:**
  - Include complete function signatures
  - Remove unnecessary comments
  - Keep functions under 200 lines
  - For long functions, split into logical units

---

## Feedback & Citation

**Dataset Credit:** `athrv/Embedded_Unittest2`

**Report Issues:** [Model's Hugging Face page](https://huggingface.co/Utkarsh524/codellama_utests_full_new_ver2)

**Maintainer:** Utkarsh524

**Model Version:** v2 (4-epoch trained)

---
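The prompt layout, stop marker, and sampling tips described in this card could be wrapped in a few small helpers. The sketch below is illustrative only: `build_prompt`, `clean_generated_tests`, and `SAMPLING_KWARGS` are hypothetical names, not part of the model or its training code, and the cleanup rules are an assumption modeled on the "no headers" constraint in the system prompt. The sampling values are picked from inside the ranges suggested under Tips for Best Results.

```python
import re

# Stop marker named in this card's list of special tokens.
END_MARKER = "// END_OF_TESTS"

# Hypothetical defaults chosen from the recommended ranges in this card.
# do_sample=True is required for temperature and top_p to take effect
# in transformers' generate().
SAMPLING_KWARGS = {
    "do_sample": True,
    "temperature": 0.2,
    "top_p": 0.9,
    "max_new_tokens": 512,
}


def build_prompt(system_text: str, source_code: str) -> str:
    """Assemble a prompt in the <|system|>/<|user|>/<|assistant|> layout."""
    return (
        f"<|system|>\n{system_text.strip()}\n"
        f"<|user|>\nCreate tests for:\n{source_code.strip()}\n"
        "<|assistant|>\n"
    )


def clean_generated_tests(decoded: str) -> str:
    """Keep only the test code from a decoded generation (assumed rules)."""
    # Drop the echoed prompt if the assistant tag is still in the text.
    text = decoded.rsplit("<|assistant|>", 1)[-1]
    # Cut at the stop marker in case it survived decoding.
    text = text.split(END_MARKER, 1)[0]
    # Remove stray #include lines, which the system prompt asks to omit.
    text = re.sub(r"^[ \t]*#[ \t]*include\b.*\n?", "", text, flags=re.MULTILINE)
    return text.strip()
```

With the usage example from this card, these would be applied as `model.generate(**inputs, eos_token_id=..., **SAMPLING_KWARGS)` followed by `clean_generated_tests(...)` on the decoded string. Note that decoding with `skip_special_tokens=True` removes the `<|assistant|>` tag, so in that case it is safer to slice off the prompt tokens before decoding, for example `outputs[0][inputs["input_ids"].shape[1]:]`.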