# LoggenixMoE133M: A Lightweight Mixture-of-Experts Language Model (8E2A)
## Model Card
LoggenixMoE133M is a small Mixture-of-Experts (MoE) causal language model trained from scratch on a custom dataset covering root cause analysis (RCA), code generation, and reasoning tasks.
- Architecture: A lightweight transformer with Mixture-of-Experts routing, inspired by the architectural design of the Qwen3 models.
- Parameter Count: 133M total, with 2 experts active per token (approximately 80M active parameters per step).
- Experts: 8 total, gated per token via router logits.
- Activation Strategy: Top-2 routing with an auxiliary routing loss.
- Tokenizer Features: The tokenizer includes dedicated special tokens for agentic use: `<tool_call>` and `<think>`. These tokens are designed to support reasoning, planning, and interaction with external tools, so the model can serve as a foundational component for building AI agents (a quick check is sketched right after this list).
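A minimal sketch for confirming that these tokens are registered as single tokens rather than being split by the tokenizer (the token names are taken from the list above; an unknown-token id would indicate they are missing from the vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060")

# Each agentic special token should map to a single id of its own.
for token in ["<think>", "<tool_call>"]:
    token_id = tokenizer.convert_tokens_to_ids(token)
    print(f"{token}: id={token_id}, is_unk={token_id == tokenizer.unk_token_id}")
```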
## Training Details
| Attribute | Value |
|---|---|
| Total Params | 133M |
| MoE Config | 8 experts, top-2 gating |
| Dataset Type | RCA, code, and logic prompts (15+ task splits) |
| Training Epochs | 5 |
| Effective Tokens Seen | 1.5 billion |
| Train Loss (final) | 3.263 |
| Val Loss (final) | 3.327 |
| Mean Token Accuracy | ~48% |
| Optimizer | AdamW |
| Scheduler | Linear warmup + cosine decay |
| Precision | FP16 with GradScaler |
| Checkpoint Format | HF-compatible |
| Training Cost | $94 across Modal (A100 40GB) + Hyperbolic (RTX 4090) |
| Context Length | 1024 tokens |
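For reference, the following is a minimal sketch of the optimization setup listed above (AdamW, linear warmup with cosine decay, FP16 with `GradScaler`). The learning rate, warmup length, and step count are illustrative assumptions rather than the exact training hyperparameters, and `model` / `train_loader` are assumed to be defined elsewhere.

```python
import torch
from torch.cuda.amp import GradScaler, autocast
from transformers import get_cosine_schedule_with_warmup

# Illustrative values; the actual learning rate, warmup, and total steps are not documented here.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=100_000
)
scaler = GradScaler()  # FP16 mixed precision, as listed in the table

for batch in train_loader:
    optimizer.zero_grad()
    with autocast(dtype=torch.float16):
        loss = model(**batch).loss
    scaler.scale(loss).backward()  # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
```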
## Intended Use
### Suitable for:
- Instruction-following tasks
- Root cause analysis (RCA) and structured summarization
- Lightweight code generation (Python)
- Chain-of-thought style reasoning prompts
- Fine-tuning for specific tasks on edge devices (e.g., smart home voice assistants, mobile offline chatbots, industrial IoT anomaly detection)
- Building specialized AI agents that can reason, plan, and interact with external tools (e.g., automated customer support, workflow automation, personalized learning agents)
### Not suitable for:
- Long-context tasks (>4K tokens)
- High-stakes factual QA
- Safety-critical decision-making without oversight
## Example Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "Write a Python function to compute factorial."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

# Model
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
memory = model.get_memory_footprint() / 1e6
print(f"Memory footprint: {memory:,.1f} MB")

outputs = model.generate(inputs, do_sample=True, use_cache=False, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
```
### Alternatively
```python
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=50,       # reduced for quick testing
        do_sample=True,
        temperature=0.5,
        top_p=0.95,
        return_dict_in_generate=True,
        use_cache=False,         # disable caching to avoid potential issues
    )
generated_text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
print(generated_text)
```
## Expert Routing
This model uses a top-2 gating mechanism where, for each token, two of the eight experts are selected based on learned router logits.
During training, a light auxiliary loss was applied to encourage balanced expert usage and improve routing stability.
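The snippet below is a minimal sketch of what such a routing step looks like, paired with a Switch-style load-balancing auxiliary term. The dimensions, loss coefficient, and exact loss formulation are illustrative assumptions; the card does not specify the precise formulation used in training.

```python
import torch
import torch.nn.functional as F

def top2_route(hidden, router_weight, num_experts=8, aux_coef=0.01):
    """Illustrative top-2 routing: hidden is (tokens, d_model), router_weight is (d_model, num_experts)."""
    logits = hidden @ router_weight                       # router logits, (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    top2_vals, top2_idx = probs.topk(2, dim=-1)           # pick 2 of the 8 experts per token
    gates = top2_vals / top2_vals.sum(-1, keepdim=True)   # renormalize the two gate weights

    # Load-balancing term: routed share per expert times mean router probability per expert.
    dispatch = F.one_hot(top2_idx, num_experts).float().sum(dim=1)  # (tokens, num_experts)
    load = dispatch.mean(dim=0)        # average assignments per expert
    importance = probs.mean(dim=0)     # mean router probability per expert
    aux_loss = aux_coef * num_experts * (load * importance).sum()
    return top2_idx, gates, aux_loss
```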
Note: Routing logits are optionally available in the model outputs via `output_router_logits=True`.
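A minimal sketch of retrieving them at inference time is shown below. It assumes the outputs expose a `router_logits` tuple with one tensor of shape `(num_tokens, num_experts)` per MoE layer, following the convention of other Hugging Face MoE implementations; check the model's output class for the exact field name and shape.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

inputs = tokenizer("Disk latency spiked after the deploy because", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_router_logits=True)

# Assumed: router_logits is a tuple with one (num_tokens, num_experts) tensor per MoE layer.
first_layer_logits = out.router_logits[0]
top2 = first_layer_logits.topk(2, dim=-1).indices
print("Experts chosen per token in layer 0:", top2.tolist())
```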
## License
This model is released under the Apache 2.0 License.
## Acknowledgements
Trained using:
- Hugging Face Transformers
- Custom training loop with gradient checkpointing
- NVIDIA RTX 4090 (24GB VRAM) / A100 (40GB)
- Logged and tracked via Weights & Biases
## Citation
```bibtex
@misc{loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060,
  title  = {loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060: A Lightweight Mixture-of-Experts Model},
  author = {kshitijthakkar},
  year   = {2025},
  url    = {https://huggingface.co/kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060},
  note   = {Trained from scratch on RCA + code + reasoning dataset.}
}
```