🧠 LoggenixMoE133M: A Lightweight Mixture-of-Experts Language Model (8E2A)



๐Ÿ“ Model Card

LoggenixMoE133M is a small Mixture-of-Experts (MoE) Causal Language Model trained from scratch on a custom dataset containing root cause analysis (RCA), code generation, and reasoning tasks.

  • Architecture: A lightweight transformer with Mixture-of-Experts routing, inspired by the architectural design of the Qwen3 models.
  • Parameter Count: 133M total, with 2 experts active per token (approx. 80M active per step).
  • Experts: 8 total, gated per token with router logits.
  • Activation Strategy: Top-2 routing with auxiliary routing loss.
  • Tokenizer Features: The tokenizer includes dedicated special tokens for agentic use: `<tool_call>` and `<think>`. They are intended to support reasoning, planning, and interaction with external tools, so the model can serve as a foundation for building AI agents.
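
A minimal check that these special tokens are registered in the vocabulary (the repo id is the same one used in the usage example later in this card):

```python
# Verify the agentic special tokens exist in the tokenizer's vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"
)
for tok in ["<tool_call>", "<think>"]:
    tok_id = tokenizer.convert_tokens_to_ids(tok)
    # An id different from the unk id means the token is registered.
    print(tok, "->", tok_id)
```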

📊 Training Details

| Attribute | Value |
|---|---|
| Total Params | 133M |
| MoE Config | 8 experts, top-2 gating |
| Dataset Type | RCA, code, and logic prompts (15+ task splits) |
| Training Epochs | 5 |
| Effective Tokens Seen | 1.5 billion |
| Train Loss (final) | 3.263 |
| Val Loss (final) | 3.327 |
| Mean Token Accuracy | ~48% |
| Optimizer | AdamW |
| Scheduler | Linear warmup + cosine decay |
| Precision | FP16 with GradScaler |
| Checkpoint Format | HF-compatible |
| Training Cost | $94 across Modal (A100 40GB) + Hyperbolic (RTX 4090) |
| Context Length | 1024 tokens |
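
The training loop itself was custom (see Acknowledgements). A minimal sketch of the mixed-precision setup described above, assuming standard PyTorch AMP and the `transformers` cosine-with-warmup scheduler; the hyperparameter values are placeholders, not the exact ones used:

```python
# Sketch of FP16 training with GradScaler and linear warmup + cosine decay.
# Illustrative only; not the model's actual training script.
import torch
from torch.cuda.amp import GradScaler, autocast
from transformers import get_cosine_schedule_with_warmup

def make_training_state(model, total_steps, lr=5e-4, warmup_steps=500):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
    )
    scaler = GradScaler()
    return optimizer, scheduler, scaler

def training_step(model, batch, optimizer, scheduler, scaler):
    optimizer.zero_grad(set_to_none=True)
    with autocast(dtype=torch.float16):
        # Causal-LM loss; an MoE model also folds in its auxiliary routing loss here.
        loss = model(**batch).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    return loss.item()
```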

🧪 Intended Use

✅ Suitable for:

  • Instruction-following tasks
  • Root cause analysis (RCA) and structured summarization
  • Lightweight code generation (Python)
  • Chain-of-thought style reasoning prompts
  • Fine-tuning for specific tasks on edge devices (e.g., smart home voice assistants, mobile offline chatbots, industrial IoT anomaly detection); see the sketch after this list
  • Building specialized AI agents that can reason, plan, and interact with external tools (e.g., automated customer support, workflow automation, personalized learning agents)
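
For the fine-tuning use case above, a minimal sketch using the Hugging Face Trainer; the toy single-example dataset, column names, and hyperparameters are placeholders for your own task data:

```python
# Minimal fine-tuning sketch with the Hugging Face Trainer (illustrative values).
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Toy dataset; replace with your task-specific examples.
raw = Dataset.from_dict({"text": ["Instruction: summarize this log.\nResponse: ..."]})
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="loggenix-finetuned",
        per_device_train_batch_size=2,
        num_train_epochs=3,
        learning_rate=2e-5,
        fp16=True,  # use on GPU; drop on CPU-only machines
        logging_steps=10,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```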

🚫 Not suitable for:

  • Long-context tasks (>4K tokens; the model was trained with a 1024-token context)
  • High-stakes factual QA
  • Safety-critical decision-making without oversight

🧨 Example Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Model
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
memory = model.get_memory_footprint() / 1e6
print(f"Memory footprint: {memory:,.1f} MB")

messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "Write a Python function to compute factorial."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, do_sample=True, use_cache=False, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
```


Alternatively, with explicit sampling parameters:

```python
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=50,  # reduced for testing
        do_sample=True,
        temperature=0.5,
        top_p=0.95,
        return_dict_in_generate=True,
        use_cache=False,  # disable caching to avoid potential issues
    )
generated_text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
print(generated_text)
```

🔧 Expert Routing

This model uses a top-2 gating mechanism where, for each token, two of the eight experts are selected based on learned router logits.

During training, a light auxiliary loss was applied to encourage balanced expert usage and improve routing stability.
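
As an illustration only (this is a generic sketch, not the model's exact routing code), top-2 gating with a load-balancing auxiliary loss can be written as:

```python
# Generic top-2 router with a Switch-Transformer-style load-balancing loss.
import torch
import torch.nn.functional as F

def top2_route(hidden, router_weight, num_experts=8):
    # hidden: (tokens, d_model); router_weight: (num_experts, d_model)
    logits = hidden @ router_weight.t()                          # (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    top2_probs, top2_idx = probs.topk(2, dim=-1)                 # pick 2 experts per token
    top2_probs = top2_probs / top2_probs.sum(-1, keepdim=True)   # renormalize gate weights

    # Auxiliary loss: fraction of tokens dispatched to each expert times the
    # mean router probability per expert, encouraging balanced expert usage.
    dispatch = F.one_hot(top2_idx, num_experts).float().sum(1)   # (tokens, num_experts)
    tokens_per_expert = dispatch.mean(0)
    mean_probs = probs.mean(0)
    aux_loss = num_experts * (tokens_per_expert * mean_probs).sum()
    return top2_idx, top2_probs, aux_loss
```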

Note: Routing logits are optionally available in the model outputs via `output_router_logits=True`.
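
For example, continuing from the usage example above and assuming the usual Hugging Face MoE output convention (per-layer router logits returned on the output object):

```python
# Inspect routing decisions for a prompt.
with torch.no_grad():
    out = model(inputs, output_router_logits=True)

# One tensor per MoE layer, each of shape (tokens, num_experts).
print(len(out.router_logits), out.router_logits[0].shape)

# Which two experts handled each token in the first MoE layer.
top2 = out.router_logits[0].topk(2, dim=-1).indices
print(top2)
```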


📃 License

This model is released under the Apache 2.0 License.


🙌 Acknowledgements

Trained using:

🧨 Hugging Face Transformers

🧠 Custom training loop with gradient checkpointing

🧮 NVIDIA RTX 4090 (24GB VRAM) / A100 (40GB)

📦 Logged and tracked via Weights & Biases


๐Ÿ—ฃ๏ธ Citation


```bibtex
@misc{loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060,
  title  = {loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060: A Lightweight Mixture-of-Experts Model},
  author = {kshitijthakkar},
  year   = {2025},
  url    = {https://huggingface.co/kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060},
  note   = {Trained from scratch on RCA + code + reasoning dataset.}
}
```
