TC-MoE: Augmenting Mixture of Experts with Ternary Expert Choice
Model Description
TC-MoE is a novel Mixture-of-Experts (MoE) architecture that enhances traditional MoE models through expert space expansion. By applying the ternary set {-1, 0, 1} to each original expert, TC-MoE achieves:
- ~9% reduction in activated experts compared to Top-K routing
- ~1.1% average performance gain on language understanding benchmarks
- Flexible efficiency-effectiveness trade-off via reward mechanism
Key innovations:
- Ternary Expert Expansion: creates parameter-sharing expert variants (-1, 0, +1) without significant computational overhead (see the sketch after this list)
- Adaptive Load Balancing: a novel load balance loss for even expert workload distribution
- Reward-Driven Routing: dynamic control of the expert activation ratio
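The expanded routing space can be pictured with the short sketch below. This is a hypothetical PyTorch illustration, not the released implementation: the class name, layer layout, and softmax-then-top-k gating are assumptions. The point is only that three ternary variants per expert share one set of expert weights, and a variant chosen with coefficient 0 never runs its expert.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TernaryChoiceMoE(nn.Module):
    """Illustrative sketch of ternary expert choice (names are hypothetical).

    Each of the `num_experts` FFN experts is expanded into three routing
    candidates with shared weights and coefficients -1, 0, +1. The router
    scores all 3 * num_experts candidates and picks the top-k per token;
    a candidate chosen with coefficient 0 contributes nothing, so its
    expert is never executed, which is where the reduction in activated
    experts comes from.
    """

    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # Router scores the expanded space: 3 ternary variants per expert.
        self.router = nn.Linear(d_model, 3 * num_experts)
        self.register_buffer("coeffs", torch.tensor([-1.0, 0.0, 1.0]))
        self.num_experts = num_experts
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)   # (tokens, 3 * num_experts)
        gate, idx = probs.topk(self.top_k, dim=-1)  # top-k over the expanded space
        expert_idx = idx // 3                       # which shared expert
        coeff = self.coeffs[idx % 3]                # ternary coefficient -1 / 0 / +1

        out = torch.zeros_like(x)
        for e in range(self.num_experts):
            for k in range(self.top_k):
                # Skip tokens whose chosen coefficient is 0: that expert is not activated.
                mask = (expert_idx[:, k] == e) & (coeff[:, k] != 0)
                if mask.any():
                    weight = (gate[mask, k] * coeff[mask, k]).unsqueeze(-1)
                    out[mask] = out[mask] + weight * self.experts[e](x[mask])
        return out
```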
Model Overview
- Architecture: Decoder-only transformer based on LLaMA
- Pretraining Data: RedPajama (100B tokens)
- Model Size: Base (681M/2.3B params)
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the pretrained TC-MoE checkpoint and its tokenizer from the Hub
model = AutoModelForCausalLM.from_pretrained("stiger1000/TC-MoE")
tokenizer = AutoTokenizer.from_pretrained("stiger1000/TC-MoE")

# Generate a short completion from a prompt
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
```
Training Details
- Optimizer: AdamW (β₁=0.9, β₂=0.95)
- Learning Rate: 1e-4 with cosine decay
- Batch Size: 4M tokens
- Loss Components (combined as sketched below):
  - Language Modeling Loss
  - Load Balance Loss (α₁=0.01)
  - Reward Loss (α₂=0.0)
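The loss components above combine into a single weighted objective. The sketch below is a minimal illustration with hypothetical function and argument names; only the coefficient values come from the list.

```python
import torch


def tc_moe_loss(lm_loss: torch.Tensor,
                load_balance_loss: torch.Tensor,
                reward_loss: torch.Tensor,
                alpha_1: float = 0.01,
                alpha_2: float = 0.0) -> torch.Tensor:
    """Weighted sum of the three listed loss components (hypothetical helper).

    lm_loss:           next-token cross-entropy (language modeling loss)
    load_balance_loss: auxiliary term spreading tokens evenly across experts
    reward_loss:       term steering the ratio of activated experts
    alpha_1, alpha_2:  coefficients from the Training Details list
    """
    return lm_loss + alpha_1 * load_balance_loss + alpha_2 * reward_loss
```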
Citation
```bibtex
@inproceedings{yan2025tcmoe,
  title={TC-MoE: Augmenting Mixture of Experts with Ternary Expert Choice},
  author={Yan, Shen and Bin, Xingyan and Zhang, Sijun and Wang, Yisen and Lin, Zhouchen},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}
```
Repository: GitHub