|
This model was pretrained on the BookCorpus dataset using knowledge distillation.
|
|
|
The particularity of this model is that, even though it shares the same architecture as BERT, it has a hidden size of 240. Since it has 12 attention heads, the head size (20) differs from that of the BERT base model (64).
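These dimensions can be checked directly from the model configuration; a quick sketch, assuming the hosted config exposes the standard BERT fields:

````python
from transformers import AutoConfig

# Load the configuration of the distilled model from the Hub
config = AutoConfig.from_pretrained("eli4s/Bert-L12-h240-A12")

print(config.hidden_size)            # expected: 240
print(config.num_attention_heads)    # expected: 12
print(config.hidden_size // config.num_attention_heads)  # head size, expected: 20
````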
|
|
|
The knowledge distillation was performed using multiple loss functions. |
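The exact losses are not listed here. A common recipe for BERT-style distillation (as popularized by DistilBERT) combines a soft-target loss against the teacher's logits with the standard masked-language-modelling loss; the sketch below only illustrates that general idea and is not the training code of this model:

````python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Illustrative combination of two common distillation losses (hypothetical).

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    labels: (batch, seq_len), with -100 on positions that are not masked
    """
    # Soft-target loss: KL divergence between temperature-scaled distributions
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)

    # Hard-target loss: regular MLM cross-entropy on the masked positions
    mlm_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    return alpha * soft_loss + (1 - alpha) * mlm_loss
````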
|
|
|
The weights of the model were initialized from scratch. |
|
|
|
PS: the tokenizer is the same as that of bert-base-uncased.
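Since the vocabularies match, loading the tokenizer from either repository should give the same result; a quick check based on the note above, not taken from the original card:

````python
from transformers import BertTokenizer

tok_a = BertTokenizer.from_pretrained("eli4s/Bert-L12-h240-A12")
tok_b = BertTokenizer.from_pretrained("bert-base-uncased")

# Expected to print True, per the note above
print(tok_a.get_vocab() == tok_b.get_vocab())
````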
|
|
|
|
|
To load the model & tokenizer:
|
|
|
````python |
|
from transformers import AutoModelForMaskedLM, BertTokenizer

# Load the distilled model and its tokenizer from the Hugging Face Hub
model_name = "eli4s/Bert-L12-h240-A12"
model = AutoModelForMaskedLM.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)
|
```` |
|
|
|
To use it as a masked language model:
|
|
|
````python |
|
import torch

sentence = "Let's have a [MASK]."

model.eval()
encoded_inputs = tokenizer([sentence], padding='longest')
input_ids = torch.tensor(encoded_inputs['input_ids'])
attention_mask = torch.tensor(encoded_inputs['attention_mask'])

# Run the forward pass without tracking gradients
with torch.no_grad():
    output = model(input_ids, attention_mask=attention_mask)

# Locate the [MASK] token and take the highest-scoring prediction for it
mask_index = input_ids.tolist()[0].index(tokenizer.mask_token_id)
masked_token = output['logits'][0][mask_index].argmax(dim=-1)
predicted_token = tokenizer.decode(masked_token)

print(predicted_token)
|
```` |
|
|
|
We can also retrieve the n most likely predictions:
|
|
|
````python |
|
top_n = 5

# Rank every vocabulary id by its logit at the masked position and keep the top n
vocab_size = model.config.vocab_size
logits = output['logits'][0][mask_index].tolist()
top_tokens = sorted(range(vocab_size), key=lambda i: logits[i], reverse=True)[:top_n]

print(tokenizer.decode(top_tokens))
|
```` |
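Equivalently, continuing from the snippet above, the same top-n tokens can be obtained with torch.topk, which avoids sorting the whole vocabulary in Python; a minor variant with the same result:

````python
# Same result using torch.topk on the logits at the masked position
top_token_ids = torch.topk(output['logits'][0][mask_index], k=top_n).indices
print(tokenizer.decode(top_token_ids))
````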
|
|
|
|
|
|