---
language:
- pt
metrics:
- accuracy
base_model:
- mistralai/Mistral-7B-v0.3
pipeline_tag: text-generation
library_name: transformers
tags:
- legal
- portuguese
- Brazil
---
# Juru: Legal Brazilian Large Language Model from Reputable Sources
This repository hosts the public checkpoints for **Juru-7B**, a Mistral-7B model specialised in the Brazilian legal domain. The model underwent continued pretraining on **1.9 billion** unique tokens from reputable academic and legal sources in Portuguese. For full details on data curation, training, and evaluation, see our paper: <https://arxiv.org/abs/2403.18140>.
## Checkpoints

* Checkpoints were saved every **200** optimization steps up to step **3,800**.
* Each 200-step interval adds **~0.4 billion** tokens of continued pretraining.
* We refer to **Juru-7B** as checkpoint **3,400** (~7.1 billion tokens), which achieved the best score on our Brazilian legal knowledge benchmarks.
> **Note:** The model has **not** been instruction-finetuned. For best results, use few-shot inference or perform additional finetuning on your specific task.
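Since the model is a base (non-instruction-tuned) checkpoint, prompting it with a few in-context examples tends to work better than a bare question. The sketch below shows one way to do this with the `transformers` library; the repository id is left as a placeholder (these checkpoints' exact Hub id is not stated here), and the prompt template (`Pergunta:`/`Resposta:`) is an illustrative assumption, not a format the model was trained on.

```python
def build_few_shot_prompt(examples, question):
    """Join (question, answer) demonstration pairs and a final open
    question into a single few-shot prompt string."""
    parts = [f"Pergunta: {q}\nResposta: {a}" for q, a in examples]
    parts.append(f"Pergunta: {question}\nResposta:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "..."  # placeholder: fill in the Juru-7B repository id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # device_map="auto" requires the `accelerate` package
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    prompt = build_few_shot_prompt(
        [("O que é habeas corpus?", "...")],  # add real demonstrations here
        "O que é um mandado de segurança?",
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The generation continues after the final `Resposta:`, so the demonstrations both steer the answer style and mark where the model should stop elaborating.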
## Citation information
```bibtex
@misc{junior2024jurulegalbrazilianlarge,
      title={Juru: Legal Brazilian Large Language Model from Reputable Sources},
      author={Roseval Malaquias Junior and Ramon Pires and Roseli Romero and Rodrigo Nogueira},
      year={2024},
      eprint={2403.18140},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2403.18140},
}
```