---
library_name: transformers
base_model: airesearch/wangchanberta-base-att-spm-uncased
tags:
- generated_from_trainer
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: wangchanberta-thainer-corpus-v2-2
  results: []
datasets:
- pythainlp/thainer-corpus-v2.2
---


# wangchanberta-thainer-corpus-v2-2

This model is a fine-tuned version of [airesearch/wangchanberta-base-att-spm-uncased](https://huggingface.co/airesearch/wangchanberta-base-att-spm-uncased) on the [pythainlp/thainer-corpus-v2.2](https://huggingface.co/datasets/pythainlp/thainer-corpus-v2.2) dataset.
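It achieves the following results on the evaluation set (the values match the epoch 4 row of the training results table, which has the lowest validation loss):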
- Loss: 0.1053
- Precision: 0.8300
- Recall: 0.8870
- F1: 0.8575
- Accuracy: 0.9717

## Model description

## Training and evaluation data

Results on the validation set:
```
{'eval_loss': 0.10526859760284424,
 'eval_precision': 0.8299675891298928,
 'eval_recall': 0.8870237143618439,
 'eval_f1': 0.8575476558475013,
 'eval_accuracy': 0.9717195641875889,
 'eval_runtime': 18.2172,
 'eval_samples_per_second': 80.967,
 'eval_steps_per_second': 5.105,
 'epoch': 10.0}
```

Results on the test set:
```
{'eval_loss': 0.11170374602079391,
 'eval_precision': 0.8178285159096429,
 'eval_recall': 0.8823375262054507,
 'eval_f1': 0.8488591957645278,
 'eval_accuracy': 0.968742017138478,
 'eval_runtime': 18.6202,
 'eval_samples_per_second': 79.054,
 'eval_steps_per_second': 4.941,
 'epoch': 10.0}
```
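As a quick sanity check, the reported F1 values can be reproduced from the reported precision and recall. The sketch below assumes F1 is the usual harmonic mean of the two; the helper name `f1_score` is illustrative and not part of the training code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (illustrative helper)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the evaluation dictionaries above
validation_f1 = f1_score(0.8299675891298928, 0.8870237143618439)
test_f1 = f1_score(0.8178285159096429, 0.8823375262054507)

print(f"validation F1: {validation_f1:.4f}")
print(f"test F1: {test_f1:.4f}")
```

Both values agree with the `eval_f1` entries to the printed precision.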

## How to use

Inference

Hugging Face's built-in token classification inference does not handle Thai correctly and will return wrong tags. Use the following code instead.

```python
from transformers import AutoTokenizer
from transformers import AutoModelForTokenClassification
from pythainlp.tokenize import word_tokenize # pip install pythainlp
import torch

name="Porameht/wangchanberta-thainer-corpus-v2-2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name)

sentence="นายปรเมศ คุ้มสมบัติ 552/44 หมู่ 1 บ้านหนองบัว ต.ภูหอ อ.ภูหลวง จ.เลย 42230"
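# WangchanBERTa represents spaces with the "<_>" token, so substitute them before word segmentation
cut = word_tokenize(sentence.replace(" ", "<_>"))
inputs = tokenizer(cut, is_split_into_words=True, return_tensors="pt")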

ids = inputs["input_ids"]
mask = inputs["attention_mask"]
# forward pass (no gradients are needed for inference)
with torch.no_grad():
    outputs = model(ids, attention_mask=mask)
logits = outputs.logits

predictions = torch.argmax(logits, dim=2)
predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]]

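def fix_span_error(words, ner):
    """Pair each decoded subword token with its predicted tag, dropping special tokens."""
    _new_tag = []
    for i, j in zip(words, ner):
        i = tokenizer.decode(i)
        # A whitespace-only token cannot start an entity span
        if i.isspace() and j.startswith("B-"):
            j = "O"
        # Skip empty pieces and the <s> / </s> sentence markers
        if i == '' or i == '<s>' or i == '</s>':
            continue
        # Restore the original space from the "<_>" placeholder
        if i == "<_>":
            i = " "
        _new_tag.append((i, j))
    return _new_tag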

ner_tag=fix_span_error(inputs['input_ids'][0],predicted_token_class)
ner_tag
```
Output:
```
[('นาย', 'B-PERSON'),
 ('ปร', 'I-PERSON'),
 ('เม', 'I-PERSON'),
 ('ศ', 'I-PERSON'),
 (' ', 'B-LOCATION'),
 ('คุ้ม', 'I-PERSON'),
 ('สมบัติ', 'I-PERSON'),
 (' ', 'O'),
 ('55', 'O'),
 ('2/', 'O'),
 ('44', 'O'),
 (' ', 'B-LOCATION'),
 ('หมู่', 'B-LOCATION'),
 (' ', 'I-LOCATION'),
 ('1', 'I-LOCATION'),
 (' ', 'B-LOCATION'),
 ('บ้าน', 'B-LOCATION'),
 ('หนอง', 'I-LOCATION'),
 ('บัว', 'I-LOCATION'),
 (' ', 'B-LOCATION'),
 ('ต', 'B-LOCATION'),
 ('.', 'I-LOCATION'),
 ('ภู', 'I-LOCATION'),
 ('หอ', 'I-LOCATION'),
 (' ', 'B-LOCATION'),
 ('อ', 'B-LOCATION'),
 ('.', 'I-LOCATION'),
 ('ภู', 'I-LOCATION'),
 ('หลวง', 'I-LOCATION'),
 (' ', 'B-LOCATION'),
 ('จ', 'B-LOCATION'),
 ('.', 'I-LOCATION'),
 ('เลย', 'I-LOCATION'),
 (' ', 'B-ZIP'),
 ('4', 'B-ZIP'),
 ('22', 'B-ZIP'),
 ('30', 'B-ZIP')]
```
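The result is a list of subword-level `(token, tag)` pairs. To work with whole entities, consecutive tokens can be merged by their BIO tags. The following is a minimal sketch of one way to do that; `group_entities` is a hypothetical helper that is not part of this model or of PyThaiNLP, and the sample input is copied from the output above.

```python
def group_entities(tagged_tokens):
    """Merge (token, BIO-tag) pairs into (text, entity_type) spans.

    Hypothetical post-processing helper, not part of the model card's code.
    """
    spans = []
    current_text, current_type = "", None
    for token, tag in tagged_tokens:
        prefix, _, entity_type = tag.partition("-")
        # An I- tag of the same type extends the open span
        if prefix == "I" and entity_type == current_type:
            current_text += token
            continue
        # Anything else closes the open span, if there is one
        if current_type is not None:
            spans.append((current_text, current_type))
            current_text, current_type = "", None
        # B- (or an I- with no matching open span) starts a new span
        if prefix in ("B", "I"):
            current_text, current_type = token, entity_type
    if current_type is not None:
        spans.append((current_text, current_type))
    return spans

sample = [
    ("หมู่", "B-LOCATION"), (" ", "I-LOCATION"), ("1", "I-LOCATION"),
    (" ", "B-LOCATION"),
    ("บ้าน", "B-LOCATION"), ("หนอง", "I-LOCATION"), ("บัว", "I-LOCATION"),
]
print(group_entities(sample))
```

Spans that consist only of whitespace, such as the second one produced here, can be filtered out afterwards if they are not wanted.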

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: linear
- num_epochs: 10
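
To illustrate what the `linear` scheduler implies for this run, the sketch below computes the learning rate at a given optimizer step. It assumes 2,740 total steps (274 steps per epoch over 10 epochs, as reported in the training results) and zero warmup steps; the warmup value is an assumption, since this card does not list one, and `linear_lr` is an illustrative function rather than the Trainer's own code.

```python
def linear_lr(step, base_lr=2e-05, total_steps=2740, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero.

    warmup_steps=0 is an assumption; the card does not report a warmup setting.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Start of training, end of epoch 1, midpoint, and final step
for step in (0, 274, 1370, 2740):
    print(step, linear_lr(step))
```

Under these assumptions the rate starts at 2e-05, is halved by the midpoint of training, and reaches zero at the final step.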

### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| No log        | 1.0   | 274  | 0.1790          | 0.6867    | 0.7908 | 0.7351 | 0.9484   |
| 0.2681        | 2.0   | 548  | 0.1331          | 0.7788    | 0.8463 | 0.8111 | 0.9650   |
| 0.2681        | 3.0   | 822  | 0.1135          | 0.8082    | 0.8766 | 0.8410 | 0.9692   |
| 0.0829        | 4.0   | 1096 | 0.1053          | 0.8300    | 0.8870 | 0.8575 | 0.9717   |
| 0.0829        | 5.0   | 1370 | 0.1136          | 0.8175    | 0.8868 | 0.8507 | 0.9704   |
| 0.0512        | 6.0   | 1644 | 0.1135          | 0.8408    | 0.8836 | 0.8616 | 0.9723   |
| 0.0512        | 7.0   | 1918 | 0.1162          | 0.8429    | 0.8894 | 0.8656 | 0.9725   |
| 0.037         | 8.0   | 2192 | 0.1205          | 0.8475    | 0.8916 | 0.8690 | 0.9730   |
| 0.037         | 9.0   | 2466 | 0.1237          | 0.8490    | 0.8942 | 0.8710 | 0.9732   |
| 0.0275        | 10.0  | 2740 | 0.1222          | 0.8480    | 0.8934 | 0.8701 | 0.9733   |


### Framework versions

- Transformers 4.47.1
- Pytorch 2.5.1+cu121
- Datasets 3.2.0
- Tokenizers 0.21.0