llama-3.2-3B-GLR (GRPO Legal Reasoning)

This repository provides a Llama 3.2 3B model fine-tuned on a legal Q&A dataset with GRPO (Group Relative Policy Optimization) and LoRA adapters to produce structured legal-reasoning outputs.

Usage

Download the model files first, then run the following code from inference.py:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Local path to the downloaded checkpoint files.
tokenizer = AutoTokenizer.from_pretrained("legal-grpo/checkpoint-500")
model = AutoModelForCausalLM.from_pretrained("legal-grpo/checkpoint-500")

user_question = "What are the elements of a valid contract?"

# Llama 3 chat-format prompt; the question is interpolated into the user turn.
prompt = f"""
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a legal assistant. Provide legal information in this format:
<legal_analysis>...analysis...</legal_analysis>
<answer>...final answer...</answer>
<|eot_id|>
<|start_header_id|>user<|end_header_id|>
{user_question}
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, not the echoed prompt.
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(response)

Model Details

Base Model: meta-llama/Llama-3.2-3B-Instruct
Fine-tuning: GRPO with LoRA adapters
Dataset: Legal Q&A (axondendriteplus/legal-qna-dataset)
Output Format: Structured with <legal_analysis>...</legal_analysis> and <answer>...</answer> tags
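
Because the output uses fixed tags, the two parts can be pulled apart with a small parser. The snippet below is a minimal sketch, not part of this repository: the helper names `extract_tag` and `parse_response` are hypothetical, and it assumes each tag appears at most once in the generated text.

```python
import re
from typing import Dict, Optional


def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the stripped contents of the first <tag>...</tag> pair, or None."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_response(text: str) -> Dict[str, Optional[str]]:
    """Split a model response into its analysis and final-answer parts."""
    return {
        "legal_analysis": extract_tag(text, "legal_analysis"),
        "answer": extract_tag(text, "answer"),
    }


# Hypothetical sample output, written by hand for illustration only.
sample = (
    "<legal_analysis>Offer, acceptance, and consideration are required."
    "</legal_analysis>\n"
    "<answer>Offer, acceptance, consideration.</answer>"
)
parsed = parse_response(sample)
print(parsed["answer"])
```

Apply the parser to the generated continuation only. The system prompt itself contains both tags, so parsing a decoded string that still includes the prompt would match the placeholder text first.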

Training

The reward model uses Gemini (gemini-2.5-flash-preview-04-17) to evaluate accuracy, completeness, and quality during training.
LoRA adapters are used for efficient fine-tuning.
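
This card does not say how the three judged criteria are turned into a single reward. As an illustration only, the sketch below assumes the judge returns one score per criterion on a 0 to 10 scale and that the scores are averaged with equal weight. The scale, the weighting, and the name `combine_scores` are all assumptions, and the call to Gemini itself is left out.

```python
from typing import Dict

# The three criteria named in this card.
CRITERIA = ("accuracy", "completeness", "quality")


def combine_scores(scores: Dict[str, float], max_score: float = 10.0) -> float:
    """Average per-criterion judge scores and rescale to [0, 1].

    Assumption: equal weights and a 0..max_score scale; the actual
    training setup may differ.
    """
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    # Clip each score into the valid range before averaging.
    clipped = [min(max(scores[c], 0.0), max_score) for c in CRITERIA]
    return sum(clipped) / (len(CRITERIA) * max_score)


# Hypothetical judge scores for one sampled completion.
example = {"accuracy": 8.0, "completeness": 6.0, "quality": 7.0}
reward = combine_scores(example)
print(reward)
```

Keeping the reward in a bounded range makes rewards comparable across prompts, although GRPO normalizes rewards within each group in any case.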

Inference

See inference.py for a ready-to-use example.

Training procedure

This model was trained with GRPO, a method introduced in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.
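
The core idea of GRPO is to sample a group of completions for the same prompt and score each one relative to the rest of the group, rather than against a learned value function. The sketch below shows that normalization step only, in plain Python. The function name, the `eps` term, and the use of the population standard deviation are illustrative assumptions, not this repository's training code.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group: (r_i - mean) / (std + eps).

    `rewards` holds the scalar reward of each completion sampled for the
    same prompt. `eps` (an assumption here) avoids division by zero when
    every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of one prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.5, 0.8])
print(advantages)
```

Completions that score above the group mean get a positive advantage and are reinforced; those below the mean get a negative one. When all rewards in a group are equal, every advantage is zero and that group contributes no learning signal.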

Citations

@article{zhihong2024deepseekmath,
    title        = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
    author       = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
    year         = 2024,
    eprint       = {arXiv:2402.03300},
}

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
