---
language:
- pt
metrics:
- accuracy
base_model:
- mistralai/Mistral-7B-v0.3
pipeline_tag: text-generation
library_name: transformers
tags:
- legal
- portuguese
- Brazil
---

# Juru: Legal Brazilian Large Language Model from Reputable Sources

This repository hosts the public checkpoints for **Juru-7B**, a Mistral-7B model specialised in the Brazilian legal domain. The model underwent continued pretraining on **1.9 billion** unique tokens from reputable academic and legal sources in Portuguese. For full details on data curation, training, and evaluation, see our paper: https://arxiv.org/abs/2403.18140.

## Checkpoints

* Checkpoints were saved every **200** optimization steps up to step **3,800**.
* Each 200-step interval adds **~0.4 billion** tokens of continued pretraining.
* We refer to **Juru-7B** as checkpoint **3,400** (~7.1 billion tokens), which achieved the best score on our Brazilian legal knowledge benchmarks.

> **Note:** The model has **not** been instruction finetuned. For best results, use few-shot inference or perform additional finetuning on your specific task.

## Citation information

```bibtex
@misc{junior2024jurulegalbrazilianlarge,
      title={Juru: Legal Brazilian Large Language Model from Reputable Sources},
      author={Roseval Malaquias Junior and Ramon Pires and Roseli Romero and Rodrigo Nogueira},
      year={2024},
      eprint={2403.18140},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2403.18140},
}
```
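Since the model is a base model without instruction finetuning, prompts typically work best as a concatenation of worked examples followed by the target query. The sketch below illustrates this few-shot pattern with the `transformers` library; it is an illustration under assumptions, not an official recipe, and the checkpoint identifier is a placeholder to be replaced with the actual repository path.

```python
def build_few_shot_prompt(examples, question):
    """Build a few-shot prompt: worked question/answer pairs followed by
    the target question, the usual pattern for base (non-instruction-tuned)
    models. The "Pergunta"/"Resposta" labels are an assumed Portuguese
    template, not a format prescribed by the model authors."""
    parts = [f"Pergunta: {q}\nResposta: {a}" for q, a in examples]
    parts.append(f"Pergunta: {question}\nResposta:")
    return "\n\n".join(parts)


def generate(prompt, model_id="<path-to-juru-checkpoint>"):
    """Greedy decoding with transformers; model_id is a placeholder."""
    # Imports are deferred so the prompt builder stays dependency-free.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Decode only the newly generated tokens, skipping the prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    prompt = build_few_shot_prompt(
        [("O que é habeas corpus?",
          "É uma garantia constitucional contra prisão ilegal.")],
        "O que é um mandado de segurança?",
    )
    print(generate(prompt))
```

Stopping generation at the next `"Pergunta:"` occurrence in the output is a common refinement for this prompt style, since base models tend to continue the pattern with further invented examples.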