Qwen2.5-7B Traditional Chinese ↔ Indonesian Translation Model

This model is a fine-tuned version of Qwen/Qwen2.5-7B-Instruct specifically optimized for Traditional Chinese ↔ Indonesian translation tasks.

Model Description

This model specializes in translating between Traditional Chinese and Indonesian and was trained on a Taiwan news corpus. It is particularly effective for news articles, formal documents, and general text in this language pair.

Key Features

  • 🌏 Bidirectional Translation: Traditional Chinese ↔ Indonesian
  • 📰 News Domain Optimized: Trained on Taiwan news corpus
  • ⚡ Efficient Fine-tuning: Uses LoRA (Low-Rank Adaptation) for faster training
  • 🎯 Specialized Vocabulary: Enhanced for Taiwan-specific terms and Indonesian equivalents

Training Details

Base Model

  • Base Model: Qwen/Qwen2.5-7B-Instruct
  • Model Type: Causal Language Model with Translation Capabilities

Fine-tuning Configuration

  • Method: LoRA (Low-Rank Adaptation); see the configuration sketch after this list
  • LoRA Rank: 8
  • LoRA Alpha: 32
  • Learning Rate: 2e-4
  • Training Epochs: 3
  • Max Samples: 1,000 (initial validation)
  • Template: Qwen conversation format
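
For reference, a minimal PEFT sketch matching the rank and alpha listed above is shown below. The target_modules and dropout are assumptions (common choices for Qwen2.5 models), not values recovered from the actual training run:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical reconstruction of the LoRA setup described above.
# target_modules and lora_dropout are assumptions, not a training log.
lora_config = LoraConfig(
    r=8,                      # LoRA rank, as listed above
    lora_alpha=32,            # LoRA alpha, as listed above
    lora_dropout=0.05,        # assumption; not documented above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()  # only the adapter weights are trainable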

Dataset

  • Source: Taiwan news articles in Traditional Chinese with Indonesian translations
  • Editor: Chang, Yo Han
  • Domain: News articles and formal text
  • Language Pair: Traditional Chinese (zh-TW) ↔ Indonesian (id)
  • Note: The dataset is proprietary and not publicly available on Hugging Face; an illustrative record format is sketched after this list
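
Because the corpus is not released, the record below is purely illustrative: the schema and the sentence pair are invented to show what a conversation-style sample compatible with the Qwen template might look like.

# Hypothetical training record in a conversation-style format.
# The schema and the sentence pair are invented for illustration;
# the proprietary dataset may use a different structure.
example_record = {
    "messages": [
        {"role": "user",
         "content": "請將以下中文翻譯成印尼文：今天台北的天氣很好。"},
        {"role": "assistant",
         "content": "Cuaca di Taipei hari ini sangat bagus."},
    ]
}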

Usage

Installation

pip install transformers torch peft

Basic Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load the LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "roylin1003/Royal_ZhTW-ID_finetuned_101"
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Translation function
def translate_text(text, source_lang="zh", target_lang="id"):
    if source_lang == "zh" and target_lang == "id":
        prompt = f"請將以下中文翻譯成印尼文：{text}"
    elif source_lang == "id" and target_lang == "zh":
        prompt = f"Terjemahkan teks bahasa Indonesia berikut ke bahasa Tionghoa: {text}"
    else:
        raise ValueError(f"Unsupported language pair: {source_lang} -> {target_lang}")

    messages = [
        {"role": "user", "content": prompt}
    ]

    chat_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    model_inputs = tokenizer([chat_text], return_tensors="pt").to(model.device)
    
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id
    )
    
    # Keep only the newly generated tokens (drop the echoed prompt)
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response

# Example usage
chinese_text = "台灣的科技產業發展迅速，特別是在半導體領域。"
indonesian_translation = translate_text(chinese_text, "zh", "id")
print(f"Chinese: {chinese_text}")
print(f"Indonesian: {indonesian_translation}")

indonesian_text = "Indonesia adalah negara kepulauan terbesar di dunia."
chinese_translation = translate_text(indonesian_text, "id", "zh")
print(f"Indonesian: {indonesian_text}")
print(f"Chinese: {chinese_translation}")

Advanced Usage with Custom Parameters

def translate_with_options(text, source_lang="zh", target_lang="id", temperature=0.7, max_tokens=512):
    # Build the same prompt and chat template as translate_text above
    if source_lang == "zh" and target_lang == "id":
        prompt = f"請將以下中文翻譯成印尼文：{text}"
    elif source_lang == "id" and target_lang == "zh":
        prompt = f"Terjemahkan teks bahasa Indonesia berikut ke bahasa Tionghoa: {text}"
    else:
        raise ValueError(f"Unsupported language pair: {source_lang} -> {target_lang}")

    messages = [{"role": "user", "content": prompt}]
    chat_text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    model_inputs = tokenizer([chat_text], return_tensors="pt").to(model.device)

    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=max_tokens,
        do_sample=True,
        temperature=temperature,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id
    )

    # Keep only the newly generated tokens, then decode
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
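
If you prefer to deploy without the peft dependency, the adapter can be merged into the base weights and saved as a standalone model. A minimal sketch, assuming model and tokenizer were loaded as in the Basic Usage section; the output directory name is arbitrary:

# Merge the LoRA weights into the base model and save a standalone copy.
# "Royal_ZhTW-ID_merged" is an arbitrary output directory for this example.
merged = model.merge_and_unload()  # returns a plain transformers model
merged.save_pretrained("Royal_ZhTW-ID_merged")
tokenizer.save_pretrained("Royal_ZhTW-ID_merged")

# Afterwards the merged model loads directly with transformers, without peft:
# AutoModelForCausalLM.from_pretrained("Royal_ZhTW-ID_merged", device_map="auto")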

Model Performance

Training Metrics

  • Training Loss: Converged after 3 epochs
  • Learning Rate: 2e-4 with linear decay
  • Batch Size: Optimized for available GPU memory

Evaluation

This model has been trained on a curated dataset of Taiwan news articles with Indonesian translations. Performance evaluation is ongoing.
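
Until formal numbers are available, a quick way to sanity-check translation quality is corpus-level BLEU on a small held-out set. A minimal sketch, assuming the sacrebleu package is installed and the translate_text function from the Usage section is in scope; the sentence pairs below are invented examples, not drawn from the training data:

import sacrebleu

# Invented held-out pairs; replace with a real zh-TW -> id test set.
sources = [
    "台北的捷運系統非常方便。",
    "今年的稻米產量比去年高。",
]
references = [[
    "Sistem MRT di Taipei sangat praktis.",
    "Produksi beras tahun ini lebih tinggi daripada tahun lalu.",
]]

hypotheses = [translate_text(s, "zh", "id") for s in sources]

# corpus_bleu expects one list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")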

Limitations and Considerations

Known Limitations

  • Domain Specificity: Optimized for news and formal text; may not perform as well on casual conversation
  • Training Data Size: Initial training used 1,000 samples for quick validation
  • Cultural Context: May require additional fine-tuning for region-specific terminology

Recommended Use Cases

  • 📰 News article translation
  • 📄 Formal document translation
  • 🏢 Business communication between Taiwan and Indonesia
  • 📚 Educational content translation

Not Recommended For

  • Real-time conversation (use specialized conversational models)
  • Medical or legal documents (requires domain-specific models)
  • Creative writing (may lack stylistic nuance)

Training Infrastructure

Hardware Requirements

  • Minimum: GPU with 16GB VRAM; see the quantized-loading sketch after this list
  • Recommended: GPU with 24GB+ VRAM for optimal performance
  • Training Time: Approximately 2-3 hours on modern GPUs
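
On cards near the 16GB minimum, memory pressure can be reduced by loading the 7B base model in 4-bit before attaching the adapter. A minimal sketch, assuming the optional bitsandbytes package is installed; this is a memory optimization, not a statement about how the released adapter was trained:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Optional 4-bit quantized loading to fit the 7B base model on ~16GB GPUs.
# The settings below are common defaults, not the released training setup.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)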

Software Dependencies

transformers>=4.36.0
torch>=2.0.0
peft>=0.7.0
datasets>=2.15.0

Citation

If you use this model in your research or applications, please cite:

@misc{Royal_ZhTW-ID_finetuned_101,
  title={Qwen2.5-7B Traditional Chinese-Indonesian Translation Model},
  author={Roy Lin},
  year={2024},
  howpublished={\url{https://huggingface.co/roylin1003/Royal_ZhTW-ID_finetuned_101}},
  note={Fine-tuned on Taiwan news corpus edited by Chang, Yo Han}
}

Acknowledgments

  • Base Model: Thanks to the Qwen team for the excellent Qwen2.5-7B-Instruct model
  • Dataset: Taiwan news corpus with Indonesian translations edited by Chang, Yo Han
  • Framework: Built using Hugging Face Transformers and PEFT libraries

License

This model is released under the Apache 2.0 License, consistent with the base Qwen2.5-7B-Instruct model.

Contact

For questions, issues, or collaborations, please open an issue in this repository or contact [your contact information].


Model Version: 1.0
Last Updated: [Current Date]
Status: Initial Release - Validation Phase
