This model is a derivative of ModernBERT specialized for Brazilian Portuguese, pre-trained exclusively on data in that language.

The training data were gathered from the BrWac corpus and the Portuguese subset of the Wikipedia dataset.

In addition, a custom tokenizer was implemented to support ModBERTBr. This tokenizer uses the Unigram algorithm as the backbone model for building its vocabulary.
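
For reference, a Unigram-based tokenizer can be built with the Hugging Face tokenizers library roughly as sketched below. The corpus file, vocabulary size, and special tokens are illustrative assumptions, not the exact configuration used to train the ModBERTBr tokenizer.

# Sketch: training a Unigram tokenizer with the `tokenizers` library.
# File paths, vocab size, and special tokens are assumptions for illustration,
# not the published ModBERTBr settings.
from tokenizers import Tokenizer, normalizers, pre_tokenizers, trainers
from tokenizers.models import Unigram

tokenizer = Tokenizer(Unigram())
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

trainer = trainers.UnigramTrainer(
    vocab_size=30_000,  # assumed value for illustration
    special_tokens=["[CLS]", "[SEP]", "[MASK]", "[PAD]", "[UNK]"],
    unk_token="[UNK]",
)

# "corpus.txt" stands in for the pre-training text (BrWac + Portuguese Wikipedia).
tokenizer.train(["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")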

Usage

You can use these models directly with the transformers library starting from v4.48.0:

pip install -U "transformers>=4.48.0"

Since ModBERTBr is a Masked Language Model (MLM), you can use the fill-mask pipeline or load it via AutoModelForMaskedLM. To use ModBERTBr for downstream tasks like classification, retrieval, or QA, fine-tune it following standard BERT fine-tuning recipes.
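
As a rough sketch of such a recipe, a classification head can be attached with AutoModelForSequenceClassification and trained with the Trainer API. The CSV files, label count, and hyperparameters below are placeholder assumptions, not a published ModBERTBr fine-tuning setup.

# Sketch: fine-tuning ModBERTBr for text classification with the Trainer API.
# Data files, num_labels, and hyperparameters are placeholder assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "wallacelw/ModBERTBr"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Hypothetical CSV files with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="modbertbr-classifier",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=2e-5,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()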

โš ๏ธ If your GPU supports it, we recommend using ModBERTBr with Flash Attention 2 to reach the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:

pip install flash-attn
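
For instance, the Flash Attention 2 kernels can be requested through the attn_implementation argument when loading the model. This is only a sketch; it assumes a GPU supported by flash-attn and half-precision weights.

# Sketch: loading ModBERTBr with Flash Attention 2 enabled.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "wallacelw/ModBERTBr"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",  # requires the flash-attn package and a supported GPU
    torch_dtype=torch.bfloat16,               # flash-attn kernels require fp16/bf16
).to("cuda")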

Using AutoModelForMaskedLM:

from transformers import AutoTokenizer, AutoModelForMaskedLM
model_id = "wallacelw/ModBERTBr"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
# To get predictions for the mask:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(axis=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token: Paris

Using a pipeline:

from transformers import pipeline
from pprint import pprint
pipe = pipeline(
    "fill-mask",
    model="wallacelw/ModBERTBr",
)
input_text = "He walked to the [MASK]."
results = pipe(input_text)
pprint(results)

Note: ModBERTBr does not use token type IDs, unlike some earlier BERT models. Most downstream usage is identical to standard BERT models on the Hugging Face Hub, except you can omit the token_type_ids parameter.
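
As a quick illustrative check (the sentence pair below is just an example), you can inspect the tokenizer's encoding to confirm which inputs the model expects:

# Illustrative check of the tokenizer output for a sentence pair.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("wallacelw/ModBERTBr")
encoded = tokenizer("Primeira frase.", "Segunda frase.", return_tensors="pt")
print(encoded.keys())  # the model only needs input_ids and attention_mask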
