---
library_name: transformers
language:
- en
- zh
license: cc-by-4.0
base_model: Helsinki-NLP/opus-mt-zh-en
tags:
- generated_from_trainer
model-index:
- name: zhtw-en
  results: []
datasets:
- zetavg/coct-en-zh-tw-translations-twp-300k
pipeline_tag: translation
---

# zhtw-en

<details>
  <summary>English</summary>
This model translates Traditional Chinese sentences into English, with a focus on understanding Taiwanese-style Traditional Chinese and producing more accurate English translations.

This model is a fine-tuned version of [Helsinki-NLP/opus-mt-zh-en](https://huggingface.co/Helsinki-NLP/opus-mt-zh-en) on the [zetavg/coct-en-zh-tw-translations-twp-300k](https://huggingface.co/datasets/zetavg/coct-en-zh-tw-translations-twp-300k) dataset.

It achieves the following results on the evaluation set:
- Loss: 2.4350
- Num Input Tokens Seen: 55653732
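
Since the reported loss is an average per-token cross-entropy (in nats), it can be read as a perplexity of roughly 11.4. A back-of-the-envelope conversion:

```python
import math

eval_loss = 2.4350          # evaluation loss reported above
perplexity = math.exp(eval_loss)
print(f"{perplexity:.2f}")  # roughly 11.4
```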

## Intended Uses & Limitations

### Intended Use Cases

- Translating single sentences from Chinese to English.
- Applications requiring understanding of the Chinese language as spoken in Taiwan.

### Limitations

- Designed for single-sentence translation, so it will not perform well on longer texts without pre-processing (e.g. sentence splitting)
- May hallucinate or omit information, especially on very short or very long inputs
- Further fine-tuning may mitigate these issues
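
One simple pre-processing strategy for longer texts is to split them on terminal punctuation and translate sentence by sentence. A minimal sketch (the `split_sentences` helper and its punctuation set are illustrative, not part of the model):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split Chinese text after terminal punctuation (。!?),
    keeping the punctuation attached to each sentence."""
    parts = re.split(r"(?<=[。!?])", text)
    return [p.strip() for p in parts if p.strip()]

sentences = split_sentences("今天天氣很好。我們去公園吧!你想一起來嗎?")
# Each sentence can then be passed to the translation pipeline individually.
```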

## Training and Evaluation Data

This model was trained and evaluated on the [Corpus of Contemporary Taiwanese Mandarin (COCT) translations](https://huggingface.co/datasets/zetavg/coct-en-zh-tw-translations-twp-300k) dataset.

- **Training Data:** 80% of the COCT dataset
- **Validation Data:** 20% of the COCT dataset
</details>

<details>
  <summary>Chinese</summary>
該模型旨在將繁體中文翻譯成英文,重點是理解台灣風格的繁體中文並產生更準確的英文翻譯。

模型基於 [Helsinki-NLP/opus-mt-zh-en](https://huggingface.co/Helsinki-NLP/opus-mt-zh-en) 並在 [zetavg/coct-en-zh-tw-translations-twp-300k](https://huggingface.co/datasets/zetavg/coct-en-zh-tw-translations-twp-300k) 資料集上進行微調。

在評估集上,模型取得了以下結果:
- **損失**:2.4350
- **處理的輸入標記數量**:55,653,732

## 預期用途與限制

### 預期用途
- 將單一中文句子翻譯為英文。
- 適用於需要理解台灣中文的應用程式。

### 限制
- 本模型專為單句翻譯設計,若未經預處理,處理較長文本時可能表現不佳。
- 模型有時會產生幻覺或遺漏資訊,特別是在輸入過短或過長時。
- 進一步的微調可能有助於改善這些問題。

## 訓練與評估數據

該模型使用 [當代台灣普通話語料庫 (COCT)](https://huggingface.co/datasets/zetavg/coct-en-zh-tw-translations-twp-300k) 資料集進行訓練和評估。

- **訓練資料**:COCT 資料集的 80%
- **驗證資料**:COCT 資料集的 20%
</details>

## Example

```python
from transformers import pipeline

model_checkpoint = "agentlans/zhtw-en"
translator = pipeline("translation", model=model_checkpoint)

# 摘自中文維基百科的今日文章
# From Chinese Wikipedia's article of the day
translator("《阿奇大戰鐵血戰士》是2015年4至7月黑馬漫畫和阿奇漫畫在美國發行的四期限量連環漫畫圖書,由亞歷克斯·德坎皮創作,費爾南多·魯伊斯繪圖,屬跨公司跨界作品。")[0]['translation_text']

# 輸出
# Output
# Acer's Iron Blood Fighter is a four-year series of comic books published in the United States by Black Horse and Ah Chi comics from April to July of that year. The book was created by Alexander d'Campie and painted by Philnanto Ruiz. It is a cross-firm work.

# 與我自己的黃金標準翻譯比較:
# Compare with my own gold standard translation:
# "Archie vs. Predator" is a limited four-issue comic book series published by Black Horse and Archie Comics in the United States from April to July 2015. It was created by Alex de Campi and drawn by Fernando Ruiz. It's a crossover work.
```

## Training Procedure

### Training Hyperparameters

The following hyperparameters were used during training:

- **Learning Rate:** 5e-05
- **Train Batch Size:** 8
- **Eval Batch Size:** 8
- **Seed:** 42
- **Optimizer:** `adamw_torch` with betas=(0.9, 0.999) and epsilon=1e-08
- **LR Scheduler Type:** linear
- **Number of Epochs:** 3.0
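
For reference, a linear schedule decays the learning rate from its peak to zero over the full run. A minimal stand-in for what the scheduler computes (no warmup; the step counts below are illustrative, not the actual training schedule):

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 5e-5) -> float:
    """Linearly decay the learning rate from base_lr to 0 (no warmup)."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)

# Illustrative values for a 100-step run:
lr_start = linear_lr(0, 100)    # peak learning rate at the start
lr_mid = linear_lr(50, 100)     # half the peak, halfway through
lr_end = linear_lr(100, 100)    # 0.0 at the end
```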

### Training Results

<details>
<summary>Click here to see the training and validation losses</summary>

| Training Loss | Epoch  | Step  | Validation Loss | Input Tokens Seen |
|:-------------:|:------:|:-----:|:---------------:|:-----------------:|
| 3.2254        | 0.0804 | 2500  | 2.9105          | 1493088           |
| 3.0946        | 0.1608 | 5000  | 2.8305          | 2990968           |
| 3.0473        | 0.2412 | 7500  | 2.7737          | 4477792           |
| 2.9633        | 0.3216 | 10000 | 2.7307          | 5967560           |
| 2.9355        | 0.4020 | 12500 | 2.6843          | 7463192           |
| 2.9076        | 0.4824 | 15000 | 2.6587          | 8950264           |
| 2.8714        | 0.5628 | 17500 | 2.6304          | 10443344          |
| 2.8716        | 0.6433 | 20000 | 2.6025          | 11951096          |
| 2.7989        | 0.7237 | 22500 | 2.5822          | 13432464          |
| 2.7941        | 0.8041 | 25000 | 2.5630          | 14919424          |
| 2.7692        | 0.8845 | 27500 | 2.5497          | 16415080          |
| 2.757         | 0.9649 | 30000 | 2.5388          | 17897832          |
| 2.7024        | 1.0453 | 32500 | 2.6006          | 19384812          |
| 2.7248        | 1.1257 | 35000 | 2.6042          | 20876844          |
| 2.6764        | 1.2061 | 37500 | 2.5923          | 22372340          |
| 2.6854        | 1.2865 | 40000 | 2.5793          | 23866100          |
| 2.683         | 1.3669 | 42500 | 2.5722          | 25348084          |
| 2.6871        | 1.4473 | 45000 | 2.5538          | 26854100          |
| 2.6551        | 1.5277 | 47500 | 2.5443          | 28332612          |
| 2.661         | 1.6081 | 50000 | 2.5278          | 29822156          |
| 2.6497        | 1.6885 | 52500 | 2.5266          | 31319476          |
| 2.6281        | 1.7689 | 55000 | 2.5116          | 32813220          |
| 2.6067        | 1.8494 | 57500 | 2.5047          | 34298052          |
| 2.6112        | 1.9298 | 60000 | 2.4935          | 35783604          |
| 2.5207        | 2.0102 | 62500 | 2.4946          | 37281092          |
| 2.4799        | 2.0906 | 65000 | 2.4916          | 38768588          |
| 2.4727        | 2.1710 | 67500 | 2.4866          | 40252972          |
| 2.4719        | 2.2514 | 70000 | 2.4760          | 41746300          |
| 2.4738        | 2.3318 | 72500 | 2.4713          | 43241188          |
| 2.4629        | 2.4122 | 75000 | 2.4630          | 44730244          |
| 2.4524        | 2.4926 | 77500 | 2.4575          | 46231060          |
| 2.435         | 2.5730 | 80000 | 2.4553          | 47718964          |
| 2.4621        | 2.6534 | 82500 | 2.4475          | 49209724          |
| 2.4492        | 2.7338 | 85000 | 2.4440          | 50712980          |
| 2.4536        | 2.8142 | 87500 | 2.4394          | 52204380          |
| 2.4148        | 2.8946 | 90000 | 2.4360          | 53695620          |
| 2.4243        | 2.9750 | 92500 | 2.4350          | 55190020          |

</details>

### Framework Versions

- Transformers 4.48.1
- Pytorch 2.3.0+cu121
- Datasets 3.2.0
- Tokenizers 0.21.0