---
library_name: transformers
base_model: airesearch/wangchanberta-base-att-spm-uncased
tags:
- generated_from_trainer
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: wangchanberta-thainer-corpus-v2-2
  results: []
datasets:
- pythainlp/thainer-corpus-v2.2
---

# wangchanberta-thainer-corpus-v2-2

This model is a fine-tuned version of [airesearch/wangchanberta-base-att-spm-uncased](https://huggingface.co/airesearch/wangchanberta-base-att-spm-uncased) on the [pythainlp/thainer-corpus-v2.2](https://huggingface.co/datasets/pythainlp/thainer-corpus-v2.2) dataset.
It achieves the following results on the evaluation set:

- Loss: 0.1053
- Precision: 0.8300
- Recall: 0.8870
- F1: 0.8575
- Accuracy: 0.9717

## Model description

## Training and evaluation data

Results on the validation set:

```
{'eval_loss': 0.10526859760284424,
 'eval_precision': 0.8299675891298928,
 'eval_recall': 0.8870237143618439,
 'eval_f1': 0.8575476558475013,
 'eval_accuracy': 0.9717195641875889,
 'eval_runtime': 18.2172,
 'eval_samples_per_second': 80.967,
 'eval_steps_per_second': 5.105,
 'epoch': 10.0}
```

Results on the test set:

```
{'eval_loss': 0.11170374602079391,
 'eval_precision': 0.8178285159096429,
 'eval_recall': 0.8823375262054507,
 'eval_f1': 0.8488591957645278,
 'eval_accuracy': 0.968742017138478,
 'eval_runtime': 18.6202,
 'eval_samples_per_second': 79.054,
 'eval_steps_per_second': 4.941,
 'epoch': 10.0}
```

## How to use

Hugging Face's hosted inference does not support token classification for Thai and will return wrong tags. Use the following code instead.

```python
from transformers import AutoTokenizer
from transformers import AutoModelForTokenClassification
from pythainlp.tokenize import word_tokenize  # pip install pythainlp
import torch

name = "Porameht/wangchanberta-thainer-corpus-v2-2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name)

sentence = "นายปรเมศ คุ้มสมบัติ 552/44 หมู่ 1 บ้านหนองบัว ต.ภูหอ อ.ภูหลวง จ.เลย 42230"

# Spaces are passed to the model as the "<_>" token, so replace them before word segmentation.
cut = word_tokenize(sentence.replace(" ", "<_>"))
inputs = tokenizer(cut, is_split_into_words=True, return_tensors="pt")

ids = inputs["input_ids"]
mask = inputs["attention_mask"]

# forward pass (no gradients are needed for inference)
with torch.no_grad():
    outputs = model(ids, attention_mask=mask)
logits = outputs[0]

# pick the highest-scoring label for every subword token
predictions = torch.argmax(logits, dim=2)
predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]]


def fix_span_error(words, ner):
    """Pair each decoded subword token with its predicted tag, dropping sequence markers."""
    _new_tag = []
    for i, j in zip(words, ner):
        i = tokenizer.decode(i)
        # A token that decodes to literal whitespace never starts an entity.
        if i.isspace() and j.startswith("B-"):
            j = "O"
        # Skip empty strings and the sequence start/end markers.
        if i == "" or i == "<s>" or i == "</s>":
            continue
        # Map the space token back to a real space; its predicted tag is kept.
        if i == "<_>":
            i = " "
        _new_tag.append((i, j))
    return _new_tag


ner_tag = fix_span_error(inputs["input_ids"][0], predicted_token_class)
ner_tag
```

Output:

```
[('นาย', 'B-PERSON'), ('ปร', 'I-PERSON'), ('เม', 'I-PERSON'), ('ศ', 'I-PERSON'),
 (' ', 'B-LOCATION'), ('คุ้ม', 'I-PERSON'), ('สมบัติ', 'I-PERSON'), (' ', 'O'),
 ('55', 'O'), ('2/', 'O'), ('44', 'O'), (' ', 'B-LOCATION'),
 ('หมู่', 'B-LOCATION'), (' ', 'I-LOCATION'), ('1', 'I-LOCATION'), (' ', 'B-LOCATION'),
 ('บ้าน', 'B-LOCATION'), ('หนอง', 'I-LOCATION'), ('บัว', 'I-LOCATION'), (' ', 'B-LOCATION'),
 ('ต', 'B-LOCATION'), ('.', 'I-LOCATION'), ('ภู', 'I-LOCATION'), ('หอ', 'I-LOCATION'),
 (' ', 'B-LOCATION'), ('อ', 'B-LOCATION'), ('.', 'I-LOCATION'), ('ภู', 'I-LOCATION'),
 ('หลวง', 'I-LOCATION'), (' ', 'B-LOCATION'), ('จ', 'B-LOCATION'), ('.', 'I-LOCATION'),
 ('เลย', 'I-LOCATION'), (' ', 'B-ZIP'), ('4', 'B-ZIP'), ('22', 'B-ZIP'), ('30', 'B-ZIP')]
```

## Training procedure

### Training hyperparameters

The following hyperparameters were
used during training:

- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Use adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 10

### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| No log        | 1.0   | 274  | 0.1790          | 0.6867    | 0.7908 | 0.7351 | 0.9484   |
| 0.2681        | 2.0   | 548  | 0.1331          | 0.7788    | 0.8463 | 0.8111 | 0.9650   |
| 0.2681        | 3.0   | 822  | 0.1135          | 0.8082    | 0.8766 | 0.8410 | 0.9692   |
| 0.0829        | 4.0   | 1096 | 0.1053          | 0.8300    | 0.8870 | 0.8575 | 0.9717   |
| 0.0829        | 5.0   | 1370 | 0.1136          | 0.8175    | 0.8868 | 0.8507 | 0.9704   |
| 0.0512        | 6.0   | 1644 | 0.1135          | 0.8408    | 0.8836 | 0.8616 | 0.9723   |
| 0.0512        | 7.0   | 1918 | 0.1162          | 0.8429    | 0.8894 | 0.8656 | 0.9725   |
| 0.037         | 8.0   | 2192 | 0.1205          | 0.8475    | 0.8916 | 0.8690 | 0.9730   |
| 0.037         | 9.0   | 2466 | 0.1237          | 0.8490    | 0.8942 | 0.8710 | 0.9732   |
| 0.0275        | 10.0  | 2740 | 0.1222          | 0.8480    | 0.8934 | 0.8701 | 0.9733   |

### Framework versions

- Transformers 4.47.1
- Pytorch 2.5.1+cu121
- Datasets 3.2.0
- Tokenizers 0.21.0
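
## Grouping tokens into entities

The snippet in "How to use" returns one `(token, tag)` pair per subword. The following is a minimal sketch of merging such pairs into entity spans, assuming standard BIO tags. `group_entities` is a hypothetical helper written for illustration; it is not part of this model, `transformers`, or PyThaiNLP. The `sample` list is a shortened excerpt of the output shown in that section.

```python
from typing import List, Optional, Tuple


def group_entities(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge (token, BIO tag) pairs into (entity text, entity type) spans.

    Hypothetical helper: a B- tag, or an I- tag of a different type,
    starts a new span; an O tag closes the current span.
    """
    spans: List[Tuple[str, str]] = []
    text: str = ""
    label: Optional[str] = None
    for token, tag in pairs:
        prefix, _, kind = tag.partition("-")
        # Continue the open span when the I- tag matches its type.
        if prefix == "I" and kind == label:
            text += token
            continue
        # Close the open span, ignoring spans that contain only whitespace.
        if label is not None and text.strip():
            spans.append((text.strip(), label))
        if prefix in ("B", "I"):
            text, label = token, kind
        else:
            text, label = "", None
    if label is not None and text.strip():
        spans.append((text.strip(), label))
    return spans


# Shortened excerpt of the (token, tag) pairs from the "How to use" output.
sample = [
    ("นาย", "B-PERSON"), ("ปร", "I-PERSON"), ("เม", "I-PERSON"), ("ศ", "I-PERSON"),
    (" ", "O"), ("55", "O"), ("2/", "O"), ("44", "O"),
    (" ", "B-LOCATION"), ("หมู่", "B-LOCATION"), (" ", "I-LOCATION"), ("1", "I-LOCATION"),
]

entities = group_entities(sample)
print(entities)
# [('นายปรเมศ', 'PERSON'), ('หมู่ 1', 'LOCATION')]
```

Because every `B-` tag opens a new span, consecutive `B-` tokens (such as the `B-ZIP` pieces in the full output) come out as separate entities; merging adjacent spans of the same type would be a further design choice.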