	---
	library_name: transformers
	language:
	- en
	- zh
	license: cc-by-4.0
	base_model: Helsinki-NLP/opus-mt-zh-en
	tags:
	- generated_from_trainer
	model-index:
	- name: zhtw-en
	  results: []
	datasets:
	- zetavg/coct-en-zh-tw-translations-twp-300k
	pipeline_tag: translation
	---

	# zhtw-en

	<details>
	<summary>English</summary>
	This model translates Traditional Chinese sentences into English, with a focus on understanding Taiwanese-style Traditional Chinese and producing more accurate English translations.

	This model is a fine-tuned version of [Helsinki-NLP/opus-mt-zh-en](https://huggingface.co/Helsinki-NLP/opus-mt-zh-en) on the [zetavg/coct-en-zh-tw-translations-twp-300k](https://huggingface.co/datasets/zetavg/coct-en-zh-tw-translations-twp-300k) dataset.

	It achieves the following results on the evaluation set:
	- Loss: 2.4350
	- Number of Input Tokens Seen: 55,653,732

	## Intended Uses & Limitations

	### Intended Use Cases

	- Translating single sentences from Chinese to English.
	- Applications requiring understanding of the Chinese language as spoken in Taiwan.
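
A minimal usage sketch, assuming the checkpoint is loaded through the `transformers` translation pipeline (matching this card's `pipeline_tag`). `MODEL_ID` is a placeholder, because this card does not spell out the full Hub repository id, and `translate_sentence` is a hypothetical helper name used only for illustration.

```python
# Placeholder (assumption): replace with the actual Hub repository id of this model.
MODEL_ID = "your-namespace/zhtw-en"


def translate_sentence(sentence: str, model_id: str = MODEL_ID) -> str:
    """Translate one Traditional Chinese sentence into English."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import pipeline

    translator = pipeline("translation", model=model_id)
    # The translation pipeline returns a list of dicts, one per input sentence.
    result = translator(sentence)
    return result[0]["translation_text"]
```

Calling `translate_sentence("今天天氣很好。")` downloads the checkpoint on first use. When translating many sentences, it is probably better to build the pipeline once and reuse it rather than recreate it per call as this sketch does.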

	### Limitations

	- Designed for single-sentence translation, so it will not perform well on longer texts unless they are pre-processed (for example, split into sentences).
	- Sometimes hallucinates or omits information, especially when the input is very short or very long.
	- Further fine-tuning may reduce these issues.
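
One possible form of the pre-processing mentioned above is to split longer text into sentences and translate them one at a time. The sketch below is an assumption, not the author's pipeline: the delimiter set, `split_sentences`, and `translate_long_text` are all hypothetical, and `translate_fn` stands in for whatever single-sentence translation call you use.

```python
import re
from typing import Callable, List

# Split right after Chinese or ASCII sentence-ending punctuation
# (an assumed delimiter set; extend it for your own data).
_SENTENCE_END = re.compile(r"(?<=[。！？!?])")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the closing punctuation."""
    parts = _SENTENCE_END.split(text)
    return [part.strip() for part in parts if part.strip()]


def translate_long_text(text: str, translate_fn: Callable[[str], str]) -> str:
    """Translate each sentence separately and join the results with spaces."""
    return " ".join(translate_fn(sentence) for sentence in split_sentences(text))
```

This simple splitter does not handle closing quotation marks that follow the punctuation (such as `。」`), so real text may need a more careful rule.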

	## Training and Evaluation Data

	This model was trained and evaluated on the [Corpus of Contemporary Taiwanese Mandarin (COCT) translations](https://huggingface.co/datasets/zetavg/coct-en-zh-tw-translations-twp-300k) dataset.

	- Training Data: 80% of the COCT dataset
	- Validation Data: 20% of the COCT dataset
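
The card gives only the proportions of the split, not how it was produced. As an illustration, the sketch below assumes a seeded random shuffle; the function name `split_80_20` and the choice of seed are hypothetical. With the `datasets` library, `Dataset.train_test_split(test_size=0.2, seed=...)` would give a comparable result.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_80_20(examples: Sequence[T], seed: int = 42) -> Tuple[List[T], List[T]]:
    """Shuffle indices with a fixed seed, then take 80% for training and 20% for validation."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * 0.8)  # index where the validation portion starts
    train = [examples[i] for i in indices[:cut]]
    valid = [examples[i] for i in indices[cut:]]
    return train, valid
```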
	</details>

	## Training Procedure

	### Training Hyperparameters

	The following hyperparameters were used during training:

	- Learning Rate: 5e-05
	- Train Batch Size: 8
	- Eval Batch Size: 8
	- Seed: 42
	- Optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08
	- LR Scheduler Type: linear
	- Number of Epochs: 3.0
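
To make the linear scheduler concrete, the sketch below computes the learning rate at a given optimizer step. It assumes zero warmup steps (the card lists none) and estimates the steps per epoch from the training log, where step 92500 is reported at epoch 2.9750. The total step count is therefore an estimate, and `linear_lr` is a hypothetical helper, not a library function.

```python
BASE_LR = 5e-05
NUM_EPOCHS = 3

# Estimated from the training log (step 92500 at epoch 2.9750); not stated in the card.
STEPS_PER_EPOCH = round(92500 / 2.9750)
TOTAL_STEPS = NUM_EPOCHS * STEPS_PER_EPOCH


def linear_lr(step: int, total_steps: int = TOTAL_STEPS, base_lr: float = BASE_LR) -> float:
    """Learning rate under linear decay from base_lr to 0, assuming no warmup."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps
```

Under these assumptions the learning rate is halved at the midpoint of training and reaches zero at the final step.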

	### Training Results

	<details>
	<summary>Click here to see the training and validation losses</summary>

	| Training Loss | Epoch | Step | Validation Loss | Input Tokens Seen |
	|:-------------:|:------:|:-----:|:---------------:|:-----------------:|
	| 3.2254 | 0.0804 | 2500 | 2.9105 | 1493088 |
	| 3.0946 | 0.1608 | 5000 | 2.8305 | 2990968 |
	| 3.0473 | 0.2412 | 7500 | 2.7737 | 4477792 |
	| 2.9633 | 0.3216 | 10000 | 2.7307 | 5967560 |
	| 2.9355 | 0.4020 | 12500 | 2.6843 | 7463192 |
	| 2.9076 | 0.4824 | 15000 | 2.6587 | 8950264 |
	| 2.8714 | 0.5628 | 17500 | 2.6304 | 10443344 |
	| 2.8716 | 0.6433 | 20000 | 2.6025 | 11951096 |
	| 2.7989 | 0.7237 | 22500 | 2.5822 | 13432464 |
	| 2.7941 | 0.8041 | 25000 | 2.5630 | 14919424 |
	| 2.7692 | 0.8845 | 27500 | 2.5497 | 16415080 |
	| 2.757 | 0.9649 | 30000 | 2.5388 | 17897832 |
	| 2.7024 | 1.0453 | 32500 | 2.6006 | 19384812 |
	| 2.7248 | 1.1257 | 35000 | 2.6042 | 20876844 |
	| 2.6764 | 1.2061 | 37500 | 2.5923 | 22372340 |
	| 2.6854 | 1.2865 | 40000 | 2.5793 | 23866100 |
	| 2.683 | 1.3669 | 42500 | 2.5722 | 25348084 |
	| 2.6871 | 1.4473 | 45000 | 2.5538 | 26854100 |
	| 2.6551 | 1.5277 | 47500 | 2.5443 | 28332612 |
	| 2.661 | 1.6081 | 50000 | 2.5278 | 29822156 |
	| 2.6497 | 1.6885 | 52500 | 2.5266 | 31319476 |
	| 2.6281 | 1.7689 | 55000 | 2.5116 | 32813220 |
	| 2.6067 | 1.8494 | 57500 | 2.5047 | 34298052 |
	| 2.6112 | 1.9298 | 60000 | 2.4935 | 35783604 |
	| 2.5207 | 2.0102 | 62500 | 2.4946 | 37281092 |
	| 2.4799 | 2.0906 | 65000 | 2.4916 | 38768588 |
	| 2.4727 | 2.1710 | 67500 | 2.4866 | 40252972 |
	| 2.4719 | 2.2514 | 70000 | 2.4760 | 41746300 |
	| 2.4738 | 2.3318 | 72500 | 2.4713 | 43241188 |
	| 2.4629 | 2.4122 | 75000 | 2.4630 | 44730244 |
	| 2.4524 | 2.4926 | 77500 | 2.4575 | 46231060 |
	| 2.435 | 2.5730 | 80000 | 2.4553 | 47718964 |
	| 2.4621 | 2.6534 | 82500 | 2.4475 | 49209724 |
	| 2.4492 | 2.7338 | 85000 | 2.4440 | 50712980 |
	| 2.4536 | 2.8142 | 87500 | 2.4394 | 52204380 |
	| 2.4148 | 2.8946 | 90000 | 2.4360 | 53695620 |
	| 2.4243 | 2.9750 | 92500 | 2.4350 | 55190020 |

	</details>
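
For readers who prefer perplexity to raw loss, the final validation loss can be converted with a single exponential. This assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for the `transformers` Trainer but is not stated in this card. The `perplexity` helper is a hypothetical name.

```python
import math

# Final validation loss reported in the training log (step 92500).
FINAL_VALIDATION_LOSS = 2.4350


def perplexity(loss: float) -> float:
    """Convert a mean cross-entropy loss in nats into perplexity."""
    return math.exp(loss)


final_ppl = perplexity(FINAL_VALIDATION_LOSS)
print(f"Validation perplexity: {final_ppl:.2f}")
```

Under that assumption, the final loss corresponds to a perplexity of roughly 11.4.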

	### Framework Versions

	- Transformers 4.48.1
	- PyTorch 2.3.0+cu121
	- Datasets 3.2.0
	- Tokenizers 0.21.0