---
language:
- pt
metrics:
- accuracy
base_model:
- mistralai/Mistral-7B-v0.3
pipeline_tag: text-generation
library_name: transformers
tags:
- legal
- portuguese
- Brazil
---

# Juru: Legal Brazilian Large Language Model from Reputable Sources

This repository hosts the public checkpoints for **Juru-7B**, a Mistral-7B model specialised in the Brazilian legal domain. The model underwent continued pretraining on **1.9 billion** unique tokens from reputable academic and legal sources in Portuguese. For full details on data curation, training, and evaluation, see our paper: https://arxiv.org/abs/2403.18140.

## Checkpoints

* Checkpoints were saved every **200** optimization steps up to step **3,800**.
* Each 200-step interval adds **~0.4 billion** tokens of continued pretraining.
* We refer to **Juru-7B** as checkpoint **3,400** (~7.1 billion tokens), which achieved the best score on our Brazilian legal knowledge benchmarks.

> **Note:** The model has **not** been instruction finetuned. For best results, use few-shot inference or perform additional finetuning on your specific task.

## Citation information

```bibtex
@misc{junior2024jurulegalbrazilianlarge,
      title={Juru: Legal Brazilian Large Language Model from Reputable Sources},
      author={Roseval Malaquias Junior and Ramon Pires and Roseli Romero and Rodrigo Nogueira},
      year={2024},
      eprint={2403.18140},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2403.18140},
}
```
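Since the model is a base model without instruction finetuning, prompts typically work best as a concatenation of worked examples followed by the target query. The sketch below illustrates this few-shot pattern with the `transformers` library; it is an illustration under assumptions, not an official recipe, and the checkpoint identifier is a placeholder to be replaced with the actual repository path.

```python
def build_few_shot_prompt(examples, question):
    """Build a few-shot prompt: worked question/answer pairs followed by
    the target question, the usual pattern for base (non-instruction-tuned)
    models. The "Pergunta"/"Resposta" labels are an assumed Portuguese
    template, not a format prescribed by the model authors."""
    parts = [f"Pergunta: {q}\nResposta: {a}" for q, a in examples]
    parts.append(f"Pergunta: {question}\nResposta:")
    return "\n\n".join(parts)


def generate(prompt, model_id="<path-to-juru-checkpoint>"):
    """Greedy decoding with transformers; model_id is a placeholder."""
    # Imports are deferred so the prompt builder stays dependency-free.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Decode only the newly generated tokens, skipping the prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    prompt = build_few_shot_prompt(
        [("O que é habeas corpus?",
          "É uma garantia constitucional contra prisão ilegal.")],
        "O que é um mandado de segurança?",
    )
    print(generate(prompt))
```

Stopping generation at the next `"Pergunta:"` occurrence in the output is a common refinement for this prompt style, since base models tend to continue the pattern with further invented examples.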