πŸ›οΈ Caselaw-CPT-7B

This model is a continued-pretraining (CPT) checkpoint of an 8B-parameter LLaMA-3 model, further trained on the common-pile/caselaw_access_project dataset, a large corpus of U.S. court decisions.

The goal is to adapt a general-purpose LLM to legal language, style, and reasoning patterns, enabling more accurate and fluent completions, clause generation, and preliminary legal Q&A.


🧠 Model Summary

| Feature | Value |
|---|---|
| Base model | LLaMA-3 8B / Unsloth (4-bit) |
| Type | Continued Pretraining (CPT) |
| Domain | Legal (U.S. caselaw) |
| Dataset | Caselaw Access Project |
| Context length | 2048 tokens |
| Quantization | 4-bit via Unsloth |
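
For reference, continued pretraining with this stack typically looks like the sketch below. This is a minimal, hedged reconstruction assuming the standard Unsloth + TRL recipe; the base checkpoint name, LoRA settings, and training hyperparameters are assumptions, not the exact values used for this model, and the SFTTrainer keyword arguments vary slightly across TRL versions.

```python
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Assumed 4-bit LLaMA-3 8B base checkpoint from Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small set of weights is updated during CPT
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Raw caselaw text; the "text" field name is an assumption
dataset = load_dataset("common-pile/caselaw_access_project", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-5,
        max_steps=1000,          # illustrative; the real run length is unknown
        output_dir="caselaw-cpt",
    ),
)
trainer.train()
```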

📜 Intended Use

This model is well suited for:

  • 📄 Legal clause completion (see the sketch after this list)
  • 🤖 Legal-style generation (opinions, summaries)
  • ❓ Basic legal Q&A (zero-shot, surprisingly strong)
  • 📚 A starting point for legal instruction tuning
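
As an illustration of the clause-completion use case, here is a minimal sketch using the transformers pipeline API; the clause text and sampling parameters are illustrative, not recommendations:

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="yasserrmd/caselaw-cpt-8b",
    device_map="auto",
)

# Let the model complete a partial contract clause
clause = "The parties agree that any dispute arising under this Agreement shall be"
result = generator(clause, max_new_tokens=60, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
```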

🚀 Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the model in half precision on the GPU
model = AutoModelForCausalLM.from_pretrained(
    "yasserrmd/caselaw-cpt-8b",
    torch_dtype=torch.float16,
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("yasserrmd/caselaw-cpt-8b")

prompt = "Q: What are the three conditions for res ipsa loquitur to apply?\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,   # required for temperature/top_p to take effect
        temperature=0.7,
        top_p=0.9,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

🧪 Example Output

“The three conditions are that the defendant had exclusive control over the instrumentality that caused the plaintiff's injury, the plaintiff's injury was the type of injury that would not occur in the absence of negligence, and the plaintiff was not guilty of contributory negligence.”


📊 Evaluation

While not instruction-tuned, the model demonstrates:

  • ✅ Strong fluency in legal-domain terminology
  • ✅ Structured legal logic in completions
  • ✅ Promising zero-shot performance on legal questions

For downstream evaluation, consider testing on the following (a quick perplexity check is sketched after this list):

  • LegalBench QA
  • SODA Law
  • Instruct-tuned legal tasks (summarization, classification)
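
Before running full benchmarks, a quick sanity check is perplexity on held-out caselaw text. A minimal sketch, assuming the dataset exposes a "text" field and that a small sample gives a rough signal:

```python
import math
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "yasserrmd/caselaw-cpt-8b", torch_dtype=torch.float16
).to("cuda").eval()
tokenizer = AutoTokenizer.from_pretrained("yasserrmd/caselaw-cpt-8b")

# Small sample of raw caselaw text; the "text" field name is an assumption
texts = load_dataset("common-pile/caselaw_access_project", split="train") \
    .select(range(32))["text"]

total_nll, total_tokens = 0.0, 0
for text in texts:
    enc = tokenizer(text, return_tensors="pt",
                    truncation=True, max_length=2048).to("cuda")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    n = enc["input_ids"].numel()
    total_nll += out.loss.item() * n   # loss is mean NLL per predicted token
    total_tokens += n

print(f"Perplexity over sample: {math.exp(total_nll / total_tokens):.2f}")
```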

📦 Files Included

  • config.json
  • model.safetensors or pytorch_model.bin (full weights), or adapter_model.bin (if only LoRA adapters are shipped; see the loading sketch after this list)
  • tokenizer_config.json
  • tokenizer.model / tokenizer.json
  • (Optional) generation_config.json
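
If the repository ships only LoRA adapters rather than merged weights, they can be applied on top of the base model with peft. A minimal sketch; the base checkpoint name is an assumption:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Assumed base checkpoint; substitute whatever base the adapters were trained on
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base, "yasserrmd/caselaw-cpt-8b")  # load adapters
model = model.merge_and_unload()  # fold the LoRA deltas into the base weights
```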

📄 License & Use

The base model is covered by its original license (Meta AI / Unsloth), and the Caselaw Access Project corpus is in the public domain. This checkpoint is provided for research and educational use only; it is not intended for actual legal advice or decisions.


πŸ™ Acknowledgments


This llama model was trained 2x faster with Unsloth and Huggingface's TRL library.
