Model Card for Periquito-3B

Model Details

Model Description

Periquito-3B is a large language model (LLM) developed by Wandemberg Gibaut (wandgibaut). It is built on OpenLLaMA-3B (openlm-research/open_llama_3b) and fine-tuned on Portuguese Wikipedia (pt-br) data, with the goal of improving its understanding and generation of Brazilian Portuguese text.

  • Developed by: Wandemberg Gibaut
  • Model type: Llama
  • Language(s) (NLP): Portuguese
  • License: Apache License 2.0
  • Finetuned from model: openlm-research/open_llama_3b

Loading the Weights with Hugging Face Transformers

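import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

model_path = 'wandgibaut/periquito-3B'

# Load the tokenizer and the half-precision weights. device_map='auto'
# places the model on the available device(s) and requires `accelerate`.
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map='auto',
)

# Tokenize a question-answer prompt and move it to the model's device.
prompt = 'Q: Qual o maior animal terrestre?\nA:'
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Generate up to 32 new tokens and decode the full sequence (prompt + answer).
generation_output = model.generate(
    input_ids=input_ids, max_new_tokens=32
)
print(tokenizer.decode(generation_output[0]))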

For more advanced usage, please follow the transformers LLaMA documentation.
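
The prompt in the loading snippet follows a `Q: ...\nA:` pattern. As a minimal sketch, the same pattern can be extended with a few worked examples before the final question (few-shot prompting). The helper name `build_few_shot_prompt` and the example question-answer pairs are illustrative assumptions; they are not taken from the model's training or evaluation data.

```python
def build_few_shot_prompt(examples, question):
    """Join solved (question, answer) pairs and a final open question.

    Each solved pair is rendered as 'Q: ...\\nA: ...'; the final question
    ends with a bare 'A:' so that the model continues with the answer.
    """
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Illustrative examples only (hypothetical, not from any benchmark).
examples = [
    ("Qual é a capital do Brasil?", "Brasília"),
    ("Quantos dias tem uma semana?", "Sete"),
]
prompt = build_few_shot_prompt(examples, "Qual o maior animal terrestre?")
print(prompt)
```

The resulting string can be passed to the tokenizer in place of the single-question prompt.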

Evaluating with LM-Eval-Harness

The model can be evaluated with lm-eval-harness. However, we used a custom version that includes some translated tasks and the ENEM suite. It is available at wandgibaut/lm-evaluation-harness-PTBR.
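
As a rough sketch, an evaluation run could be assembled as below. It assumes the fork keeps the `main.py` entry point and the `--model`, `--model_args`, `--tasks`, and `--num_fewshot` flags of the upstream harness version that used the `hf-causal` model type; this has not been checked against the fork, so treat the flags as assumptions. The helper name `build_eval_command` is hypothetical, and the task names are the ones that appear in the results tables of this card.

```python
import shlex


def build_eval_command(model_id, tasks, num_fewshot=0):
    """Build an argument list for the harness CLI.

    Flag names are assumed from the upstream lm-evaluation-harness CLI
    of the 'hf-causal' era and may differ in the PT-BR fork.
    """
    return [
        "python", "main.py",
        "--model", "hf-causal",
        "--model_args", f"pretrained={model_id}",
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
    ]


tasks = ["agnews_pt", "boolq_pt", "faquad", "imdb_pt", "sst2_pt", "toldbr"]
cmd = build_eval_command("wandgibaut/periquito-3B", tasks, num_fewshot=3)

# Print a shell-safe version of the command for copy and paste.
print(shlex.join(cmd))
```

The list form can also be passed directly to `subprocess.run(cmd, check=True)` from inside a checkout of the harness.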

Dataset and Training

We fine-tuned the model on the Wikipedia-pt dataset with LoRA, using Google TPU v3 hardware provided through Google's TPU Research Cloud program.
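
To illustrate why LoRA is a lightweight way to fine-tune, the sketch below compares the number of trainable parameters in one full weight matrix with the number in its low-rank adapter. LoRA replaces the update to a `d_out x d_in` matrix with the product of a `d_out x r` and an `r x d_in` matrix. The hidden size of 3200 is an assumption about the open_llama_3b configuration, and the rank of 8 is purely hypothetical, since this card does not state the LoRA hyperparameters that were used.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    full: parameters updated when the whole d_out x d_in matrix is trained.
    lora: parameters in the two low-rank factors B (d_out x rank) and
          A (rank x d_in) that LoRA trains instead.
    """
    full = d_out * d_in
    lora = rank * (d_in + d_out)
    return full, lora


HIDDEN_SIZE = 3200  # assumed hidden size of open_llama_3b
RANK = 8            # hypothetical LoRA rank, not stated in this card

full, lora = lora_param_counts(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
print(f"full matrix: {full:,} parameters")
print(f"LoRA adapter: {lora:,} parameters ({lora / full:.2%} of full)")
```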

Evaluation

We evaluated Periquito-3B with lm-evaluation-harness in zero-shot and few-shot settings. Each results table below is preceded by the harness configuration line that produced it, including the number of few-shot examples (num_fewshot).

hf-causal (pretrained=wandgibaut/periquito-3B), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

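| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agnews_pt | 0 | acc | 0.6184 | ± 0.0056 |
| boolq_pt | 1 | acc | 0.6333 | ± 0.0084 |
| faquad | 1 | exact | 7.9365 | |
| | | f1 | 45.6971 | |
| | | HasAns_exact | 7.9365 | |
| | | HasAns_f1 | 45.6971 | |
| | | NoAns_exact | 0.0000 | |
| | | NoAns_f1 | 0.0000 | |
| | | best_exact | 7.9365 | |
| | | best_f1 | 45.6971 | |
| imdb_pt | 0 | acc | 0.6338 | ± 0.0068 |
| sst2_pt | 1 | acc | 0.6823 | ± 0.0158 |
| toldbr | 0 | acc | 0.4629 | ± 0.0109 |
| | | f1_macro | 0.3164 | |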
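
The Stderr column can be turned into an approximate 95% confidence interval by taking the reported value plus or minus 1.96 standard errors. The sketch below does this for the zero-shot accuracies in the table above; it assumes a normal approximation, and the helper name `confidence_interval` is illustrative.

```python
def confidence_interval(value, stderr, z=1.96):
    """Return (low, high) bounds of a normal-approximation interval."""
    return value - z * stderr, value + z * stderr


# Zero-shot accuracy and standard error, copied from the table above.
zero_shot = {
    "agnews_pt": (0.6184, 0.0056),
    "boolq_pt": (0.6333, 0.0084),
    "imdb_pt": (0.6338, 0.0068),
    "sst2_pt": (0.6823, 0.0158),
    "toldbr": (0.4629, 0.0109),
}

intervals = {
    task: confidence_interval(acc, se) for task, (acc, se) in zero_shot.items()
}
for task, (low, high) in intervals.items():
    print(f"{task}: [{low:.4f}, {high:.4f}]")
```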

hf-causal (pretrained=wandgibaut/periquito-3B,dtype=float), limit: None, provide_description: False, num_fewshot: 3, batch_size: None

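| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agnews_pt | 0 | acc | 0.6242 | ± 0.0056 |
| boolq_pt | 1 | acc | 0.6477 | ± 0.0084 |
| faquad | 1 | exact | 34.9206 | |
| | | f1 | 70.3968 | |
| | | HasAns_exact | 34.9206 | |
| | | HasAns_f1 | 70.3968 | |
| | | NoAns_exact | 0.0000 | |
| | | NoAns_f1 | 0.0000 | |
| | | best_exact | 34.9206 | |
| | | best_f1 | 70.3968 | |
| imdb_pt | 0 | acc | 0.8408 | ± 0.0052 |
| sst2_pt | 1 | acc | 0.7775 | ± 0.0141 |
| toldbr | 0 | acc | 0.5143 | ± 0.0109 |
| | | f1_macro | 0.5127 | |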
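
To make the effect of few-shot prompting easier to see, the following sketch computes the change in accuracy between the zero-shot run (num_fewshot: 0) and the three-shot run (num_fewshot: 3) for the tasks that report `acc`. The numbers are copied from the two tables above; the variable names are illustrative.

```python
# Accuracy per task, copied from the zero-shot and three-shot tables.
acc_zero_shot = {
    "agnews_pt": 0.6184,
    "boolq_pt": 0.6333,
    "imdb_pt": 0.6338,
    "sst2_pt": 0.6823,
    "toldbr": 0.4629,
}
acc_three_shot = {
    "agnews_pt": 0.6242,
    "boolq_pt": 0.6477,
    "imdb_pt": 0.8408,
    "sst2_pt": 0.7775,
    "toldbr": 0.5143,
}

# Positive deltas mean the three-shot run scored higher.
deltas = {task: acc_three_shot[task] - acc_zero_shot[task] for task in acc_zero_shot}

# List tasks from largest to smallest gain.
for task, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{task}: {delta:+.4f}")
```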

hf-causal (pretrained=wandgibaut/periquito-3B), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| enem | 0 | acc | 0.1976 | ± 0.0132 |
| | | 2009 | 0.2022 | ± 0.0428 |
| | | 2016 | 0.1809 | ± 0.0399 |
| | | 2015 | 0.1348 | ± 0.0364 |
| | | 2016_2_ | 0.2366 | ± 0.0443 |
| | | 2017 | 0.2022 | ± 0.0428 |
| | | 2013 | 0.1647 | ± 0.0405 |
| | | 2012 | 0.2174 | ± 0.0432 |
| | | 2011 | 0.2292 | ± 0.0431 |
| | | 2010 | 0.2157 | ± 0.0409 |
| | | 2014 | 0.1839 | ± 0.0418 |
| enem_2022 | 0 | acc | 0.2373 | ± 0.0393 |
| | | 2022 | 0.2373 | ± 0.0393 |
| | | human-sciences | 0.2703 | ± 0.0740 |
| | | mathematics | 0.1818 | ± 0.0842 |
| | | natural-sciences | 0.1538 | ± 0.0722 |
| | | languages | 0.3030 | ± 0.0812 |
| enem_CoT | 0 | acc | 0.1812 | ± 0.0127 |
| | | 2009 | 0.1348 | ± 0.0364 |
| | | 2016 | 0.1596 | ± 0.0380 |
| | | 2015 | 0.1124 | ± 0.0337 |
| | | 2016_2_ | 0.1290 | ± 0.0350 |
| | | 2017 | 0.2247 | ± 0.0445 |
| | | 2013 | 0.1765 | ± 0.0416 |
| | | 2012 | 0.2391 | ± 0.0447 |
| | | 2011 | 0.1979 | ± 0.0409 |
| | | 2010 | 0.2451 | ± 0.0428 |
| | | 2014 | 0.1839 | ± 0.0418 |
| enem_CoT_2022 | 0 | acc | 0.2119 | ± 0.0378 |
| | | 2022 | 0.2119 | ± 0.0378 |
| | | human-sciences | 0.2703 | ± 0.0740 |
| | | mathematics | 0.1818 | ± 0.0842 |
| | | natural-sciences | 0.2308 | ± 0.0843 |
| | | languages | 0.1515 | ± 0.0634 |

hf-causal (pretrained=wandgibaut/periquito-3B,dtype=float), limit: None, provide_description: False, num_fewshot: 1, batch_size: None

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| enem | 0 | acc | 0.1790 | ± 0.0127 |
| | | 2009 | 0.1573 | ± 0.0388 |
| | | 2016 | 0.2021 | ± 0.0416 |
| | | 2015 | 0.1573 | ± 0.0388 |
| | | 2016_2_ | 0.1935 | ± 0.0412 |
| | | 2017 | 0.2247 | ± 0.0445 |
| | | 2013 | 0.1412 | ± 0.0380 |
| | | 2012 | 0.1739 | ± 0.0397 |
| | | 2011 | 0.1979 | ± 0.0409 |
| | | 2010 | 0.1961 | ± 0.0395 |
| | | 2014 | 0.1379 | ± 0.0372 |
| enem_2022 | 0 | acc | 0.1864 | ± 0.0360 |
| | | 2022 | 0.1864 | ± 0.0360 |
| | | human-sciences | 0.2432 | ± 0.0715 |
| | | mathematics | 0.1364 | ± 0.0749 |
| | | natural-sciences | 0.1154 | ± 0.0639 |
| | | languages | 0.2121 | ± 0.0723 |
| enem_CoT | 0 | acc | 0.2009 | ± 0.0132 |
| | | 2009 | 0.2135 | ± 0.0437 |
| | | 2016 | 0.2340 | ± 0.0439 |
| | | 2015 | 0.1348 | ± 0.0364 |
| | | 2016_2_ | 0.2258 | ± 0.0436 |
| | | 2017 | 0.2360 | ± 0.0453 |
| | | 2013 | 0.1529 | ± 0.0393 |
| | | 2012 | 0.1957 | ± 0.0416 |
| | | 2011 | 0.2500 | ± 0.0444 |
| | | 2010 | 0.1667 | ± 0.0371 |
| | | 2014 | 0.1954 | ± 0.0428 |
| enem_CoT_2022 | 0 | acc | 0.2542 | ± 0.0403 |
| | | 2022 | 0.2542 | ± 0.0403 |
| | | human-sciences | 0.2703 | ± 0.0740 |
| | | mathematics | 0.2273 | ± 0.0914 |
| | | natural-sciences | 0.3846 | ± 0.0973 |
| | | languages | 0.1515 | ± 0.0634 |
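
One way to read the ENEM rows is to compare each overall accuracy with the accuracy of uniform random guessing. The sketch below assumes that every ENEM question has five answer options, which gives a chance level of 0.20; that assumption is ours and is not stated in the tables. It reports how many standard errors each score lies from chance, using the `enem` and `enem_CoT` rows of the zero-shot and one-shot tables above.

```python
CHANCE_LEVEL = 1 / 5  # assumes five answer options per ENEM question

# (accuracy, standard error) for the overall ENEM rows in the tables above.
enem_results = {
    "enem, 0-shot": (0.1976, 0.0132),
    "enem_CoT, 0-shot": (0.1812, 0.0127),
    "enem, 1-shot": (0.1790, 0.0127),
    "enem_CoT, 1-shot": (0.2009, 0.0132),
}

# Distance from chance, measured in standard errors (a z-score).
z_scores = {
    name: (acc - CHANCE_LEVEL) / se for name, (acc, se) in enem_results.items()
}
for name, z in z_scores.items():
    print(f"{name}: {z:+.2f} standard errors from chance")
```

Under a normal approximation, an absolute value below roughly 1.96 means the score is not distinguishable from chance at the 95% level.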

Use Cases:

The model is suitable for text generation, language understanding, and various natural language processing tasks in Brazilian Portuguese.

Limitations:

Like many language models, Periquito-3B might exhibit biases present in its training data. Additionally, its performance is primarily optimized for Portuguese, potentially limiting its effectiveness with other languages.

Ethical Considerations:

Users are encouraged to use the model ethically, particularly by avoiding the generation of harmful or biased content.

Acknowledgment

We thank the Google TPU Research Cloud program for providing part of the computation resources.

Citation

If you find Periquito-3B useful in your research or applications, please cite it using the following BibTeX entry:

BibTeX:

@software{wandgibautperiquito3B,
  author = {Gibaut, Wandemberg},
  title = {Periquito-3B},
  month = Sep,
  year = 2023,
  url = {https://huggingface.co/wandgibaut/periquito-3B}
}

Open Portuguese LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---|
| Average | 33.04 |
| ENEM Challenge (No Images) | 17.98 |
| BLUEX (No Images) | 21.14 |
| OAB Exams | 22.69 |
| Assin2 RTE | 43.01 |
| Assin2 STS | 8.92 |
| FaQuAD NLI | 43.97 |
| HateBR Binary | 50.46 |
| PT Hate Speech Binary | 41.19 |
| tweetSentBR | 47.96 |
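
The Average row is consistent with an unweighted mean of the nine task scores. The short check below recomputes it from the values listed above; treating the average as an unweighted mean is an assumption about how the leaderboard aggregates scores.

```python
# Per-task leaderboard scores, copied from the list above.
scores = {
    "ENEM Challenge (No Images)": 17.98,
    "BLUEX (No Images)": 21.14,
    "OAB Exams": 22.69,
    "Assin2 RTE": 43.01,
    "Assin2 STS": 8.92,
    "FaQuAD NLI": 43.97,
    "HateBR Binary": 50.46,
    "PT Hate Speech Binary": 41.19,
    "tweetSentBR": 47.96,
}

# Unweighted mean over all tasks.
average = sum(scores.values()) / len(scores)
print(f"Average: {average:.2f}")
```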