Qwen2.5-7B Traditional Chinese ↔ Indonesian Translation Model
This model is a fine-tuned version of Qwen/Qwen2.5-7B-Instruct specifically optimized for Traditional Chinese ↔ Indonesian translation tasks.
Model Description
This model specializes in translation between Traditional Chinese and Indonesian and was trained on a Taiwan news corpus. It is particularly effective for news, formal documents, and general text translation for this language pair.
Key Features
- 🔄 Bidirectional Translation: Traditional Chinese ↔ Indonesian
- 📰 News Domain Optimized: Trained on a Taiwan news corpus
- ⚡ Efficient Fine-tuning: Uses LoRA (Low-Rank Adaptation) for faster training
- 🎯 Specialized Vocabulary: Enhanced for Taiwan-specific terms and their Indonesian equivalents
Training Details
Base Model
- Base Model: Qwen/Qwen2.5-7B-Instruct
- Model Type: Causal Language Model with Translation Capabilities
Fine-tuning Configuration
- Method: LoRA (Low-Rank Adaptation)
- LoRA Rank: 8
- LoRA Alpha: 32
- Learning Rate: 2e-4
- Training Epochs: 3
- Max Samples: 1,000 (initial validation)
- Template: Qwen conversation format (see the configuration sketch below)
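The card does not state which modules the adapter targets or the exact training framework, so the following is only a rough PEFT-based sketch of the listed adapter hyperparameters; the target modules and dropout value are assumptions, not confirmed details of the original run.

```python
# Hedged sketch: reproduces the listed LoRA hyperparameters with PEFT.
# target_modules and lora_dropout below are assumptions.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=8,                      # LoRA rank (from the card)
    lora_alpha=32,            # LoRA alpha (from the card)
    lora_dropout=0.05,        # assumption: not stated in the card
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # report how many parameters the adapter trains
```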
Dataset
- Source: Taiwan news articles in Traditional Chinese with Indonesian translations
- Editor: Chang, Yo Han
- Domain: News articles and formal text
- Language Pair: Traditional Chinese (zh-TW) ↔ Indonesian (id)
- Note: Dataset is proprietary and not publicly available on HuggingFace
Usage
Installation
```bash
pip install transformers torch peft
```
Basic Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load the LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "roylin1003/Royal_ZhTW-ID_finetuned_101"
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Translation function
def translate_text(text, source_lang="zh", target_lang="id"):
    if source_lang == "zh" and target_lang == "id":
        # "Please translate the following Chinese text into Indonesian:"
        prompt = f"請將以下中文翻譯成印尼文：{text}"
    elif source_lang == "id" and target_lang == "zh":
        prompt = f"Terjemahkan teks bahasa Indonesia berikut ke bahasa Tionghoa: {text}"
    else:
        raise ValueError("Unsupported language pair")

    messages = [
        {"role": "user", "content": prompt}
    ]
    chat_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([chat_text], return_tensors="pt").to(model.device)

    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id
    )
    # Strip the prompt tokens, keeping only the newly generated continuation
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response

# Example usage
chinese_text = "台灣的科技產業發展迅速，特別是在半導體領域。"
indonesian_translation = translate_text(chinese_text, "zh", "id")
print(f"Chinese: {chinese_text}")
print(f"Indonesian: {indonesian_translation}")

indonesian_text = "Indonesia adalah negara kepulauan terbesar di dunia."
chinese_translation = translate_text(indonesian_text, "id", "zh")
print(f"Indonesian: {indonesian_text}")
print(f"Chinese: {chinese_translation}")
```
Advanced Usage with Custom Parameters
```python
def translate_with_options(text, source_lang="zh", target_lang="id", temperature=0.7, max_tokens=512):
    # ... (same prompt construction and tokenization as translate_text above)
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=max_tokens,
        do_sample=True,
        temperature=temperature,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id
    )
    # ... (same decoding as above)
    return response
```
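As a usage note, lowering the temperature and output budget tends to give more literal, deterministic translations. A hypothetical call, reusing `chinese_text` from the basic example above:

```python
# Hypothetical call with more conservative decoding settings
literal_translation = translate_with_options(chinese_text, "zh", "id", temperature=0.3, max_tokens=256)
print(literal_translation)
```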
Model Performance
Training Metrics
- Training Loss: Converged after 3 epochs
- Learning Rate: 2e-4 with linear decay
- Batch Size: Tuned to the available GPU memory (see the illustrative sketch below)
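For reference, the schedule above roughly corresponds to the following `TrainingArguments`; the output path, batch size, and gradient accumulation are illustrative assumptions, since the card only says the batch size was tuned to GPU memory.

```python
# Illustrative sketch of the training schedule described above.
# output_dir, batch size, and accumulation steps are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen25-zhtw-id-lora",   # hypothetical output path
    learning_rate=2e-4,                 # from the card
    num_train_epochs=3,                 # from the card
    lr_scheduler_type="linear",         # linear decay, as stated
    per_device_train_batch_size=2,      # assumption
    gradient_accumulation_steps=8,      # assumption
    fp16=True,
    logging_steps=10,
)
```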
Evaluation
This model has been trained on a curated dataset of Taiwan news articles with Indonesian translations. Performance evaluation is ongoing.
Limitations and Considerations
Known Limitations
- Domain Specificity: Optimized for news and formal text; may not perform as well on casual conversation
- Training Data Size: Initial training used 1,000 samples for quick validation
- Cultural Context: May require additional fine-tuning for region-specific terminology
Recommended Use Cases
- 📰 News article translation
- 📄 Formal document translation
- 🏢 Business communication between Taiwan and Indonesia
- 📚 Educational content translation
Not Recommended For
- Real-time conversation (use specialized conversational models)
- Medical or legal documents (requires domain-specific models)
- Creative writing (may lack stylistic nuance)
Training Infrastructure
Hardware Requirements
- Minimum: GPU with 16GB VRAM
- Recommended: GPU with 24GB+ VRAM for optimal performance
- Training Time: Approximately 2-3 hours on modern GPUs
Software Dependencies
```
transformers>=4.36.0
torch>=2.0.0
peft>=0.7.0
datasets>=2.15.0
```
Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{Royal_ZhTW-ID_finetuned_101,
  title        = {Qwen2.5-7B Traditional Chinese-Indonesian Translation Model},
  author       = {Roy Lin},
  year         = {2024},
  howpublished = {\url{https://huggingface.co/roylin1003/Royal_ZhTW-ID_finetuned_101}},
  note         = {Fine-tuned on Taiwan news corpus edited by Chang, Yo Han}
}
```
Acknowledgments
- Base Model: Thanks to the Qwen team for the excellent Qwen2.5-7B-Instruct model
- Dataset: Taiwan news corpus with Indonesian translations edited by Chang, Yo Han
- Framework: Built using Hugging Face Transformers and PEFT libraries
License
This model is released under the Apache 2.0 License, consistent with the base Qwen2.5-7B-Instruct model.
Contact
For questions, issues, or collaborations, please open an issue in this repository or contact [your contact information].