🧠 LoggenixMoE133M: A Lightweight Mixture-of-Experts Language Model (8E2A)



๐Ÿ“ Model Card

LoggenixMoE133M is a small Mixture-of-Experts (MoE) Causal Language Model trained from scratch on a custom dataset containing root cause analysis (RCA), code generation, and reasoning tasks.

  • Architecture: A lightweight transformer with Mixture-of-Experts routing, inspired by the architectural design of the Qwen3 models.
  • Parameter Count: 133M total, with 2 experts active per token (approx. 80M active per step).
  • Experts: 8 total, gated per token with router logits.
  • Activation Strategy: Top-2 routing with auxiliary routing loss.
  • Tokenizer Features: The tokenizer includes dedicated special tokens for agentic use: `<tool_call>` and `<think>`. They are intended to support reasoning, planning, and interaction with external tools, so the model can serve as a foundation for building AI agents.
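
A minimal check that these special tokens are registered in the vocabulary (the repo id is the same one used in the usage example later in this card):

```python
# Verify the agentic special tokens exist in the tokenizer's vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"
)
for tok in ["<tool_call>", "<think>"]:
    tok_id = tokenizer.convert_tokens_to_ids(tok)
    # An id different from the unk id means the token is registered.
    print(tok, "->", tok_id)
```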

📊 Training Details

| Attribute | Value |
|---|---|
| Total Params | 133M |
| MoE Config | 8 experts, top-2 gating |
| Dataset Type | RCA, code, and logic prompts (15+ task splits) |
| Training Epochs | 5 |
| Effective Tokens Seen | 1.5 billion |
| Train Loss (final) | 3.263 |
| Val Loss (final) | 3.327 |
| Mean Token Accuracy | ~48% |
| Optimizer | AdamW |
| Scheduler | Linear warmup + cosine decay |
| Precision | FP16 with GradScaler |
| Checkpoint Format | HF-compatible |
| Training Cost | $94 across Modal (A100 40GB) + Hyperbolic (RTX 4090) |
| Context Length | 1024 tokens |
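
The training loop itself was custom (see Acknowledgements). A minimal sketch of the mixed-precision setup described above, assuming standard PyTorch AMP and the `transformers` cosine-with-warmup scheduler; the hyperparameter values are placeholders, not the exact ones used:

```python
# Sketch of FP16 training with GradScaler and linear warmup + cosine decay.
# Illustrative only; not the model's actual training script.
import torch
from torch.cuda.amp import GradScaler, autocast
from transformers import get_cosine_schedule_with_warmup

def make_training_state(model, total_steps, lr=5e-4, warmup_steps=500):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
    )
    scaler = GradScaler()
    return optimizer, scheduler, scaler

def training_step(model, batch, optimizer, scheduler, scaler):
    optimizer.zero_grad(set_to_none=True)
    with autocast(dtype=torch.float16):
        # Causal-LM loss; an MoE model also folds in its auxiliary routing loss here.
        loss = model(**batch).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    return loss.item()
```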

🧪 Intended Use

✅ Suitable for:

  • Instruction-following tasks
  • Root cause analysis (RCA) and structured summarization
  • Lightweight code generation (Python)
  • Chain-of-thought style reasoning prompts
  • Fine-tuning for specific tasks on edge devices (e.g., smart home voice assistants, mobile offline chatbots, industrial IoT anomaly detection); see the sketch after this list
  • Building specialized AI agents that can reason, plan, and interact with external tools (e.g., automated customer support, workflow automation, personalized learning agents)
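
For the fine-tuning use case above, a minimal sketch using the Hugging Face Trainer; the toy single-example dataset, column names, and hyperparameters are placeholders for your own task data:

```python
# Minimal fine-tuning sketch with the Hugging Face Trainer (illustrative values).
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Toy dataset; replace with your task-specific examples.
raw = Dataset.from_dict({"text": ["Instruction: summarize this log.\nResponse: ..."]})
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="loggenix-finetuned",
        per_device_train_batch_size=2,
        num_train_epochs=3,
        learning_rate=2e-5,
        fp16=True,  # use on GPU; drop on CPU-only machines
        logging_steps=10,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```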

🚫 Not suitable for:

  • Long-context tasks (>4K tokens; the model was trained with a 1024-token context)
  • High-stakes factual QA
  • Safety-critical decision-making without oversight

🧨 Example Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Model
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
memory = model.get_memory_footprint() / 1e6
print(f"Memory footprint: {memory:,.1f} MB")

messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "Write a Python function to compute factorial."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, do_sample=True, use_cache=False, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
```


Alternatively, with explicit sampling parameters:

```python
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=50,  # reduced for testing
        do_sample=True,
        temperature=0.5,
        top_p=0.95,
        return_dict_in_generate=True,
        use_cache=False,  # disable caching to avoid potential issues
    )
generated_text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
print(generated_text)
```

🔧 Expert Routing

This model uses a top-2 gating mechanism where, for each token, two of the eight experts are selected based on learned router logits.

During training, a light auxiliary loss was applied to encourage balanced expert usage and improve routing stability.
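
As an illustration only (this is a generic sketch, not the model's exact routing code), top-2 gating with a load-balancing auxiliary loss can be written as:

```python
# Generic top-2 router with a Switch-Transformer-style load-balancing loss.
import torch
import torch.nn.functional as F

def top2_route(hidden, router_weight, num_experts=8):
    # hidden: (tokens, d_model); router_weight: (num_experts, d_model)
    logits = hidden @ router_weight.t()                          # (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    top2_probs, top2_idx = probs.topk(2, dim=-1)                 # pick 2 experts per token
    top2_probs = top2_probs / top2_probs.sum(-1, keepdim=True)   # renormalize gate weights

    # Auxiliary loss: fraction of tokens dispatched to each expert times the
    # mean router probability per expert, encouraging balanced expert usage.
    dispatch = F.one_hot(top2_idx, num_experts).float().sum(1)   # (tokens, num_experts)
    tokens_per_expert = dispatch.mean(0)
    mean_probs = probs.mean(0)
    aux_loss = num_experts * (tokens_per_expert * mean_probs).sum()
    return top2_idx, top2_probs, aux_loss
```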

Note: Routing logits are optionally available in the model outputs via `output_router_logits=True`.
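
For example, continuing from the usage example above and assuming the usual Hugging Face MoE output convention (per-layer router logits returned on the output object):

```python
# Inspect routing decisions for a prompt.
with torch.no_grad():
    out = model(inputs, output_router_logits=True)

# One tensor per MoE layer, each of shape (tokens, num_experts).
print(len(out.router_logits), out.router_logits[0].shape)

# Which two experts handled each token in the first MoE layer.
top2 = out.router_logits[0].topk(2, dim=-1).indices
print(top2)
```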


📃 License

This model is released under the Apache 2.0 License.


🙌 Acknowledgements

Trained using:

🧨 Hugging Face Transformers

🧠 Custom training loop with gradient checkpointing

🧮 NVIDIA RTX 4090 (24GB VRAM) / A100 (40GB)

📦 Logged and tracked via Weights & Biases


๐Ÿ—ฃ๏ธ Citation


```bibtex
@misc{loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060,
  title  = {loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060: A Lightweight Mixture-of-Experts Model},
  author = {kshitijthakkar},
  year   = {2025},
  url    = {https://huggingface.co/kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060},
  note   = {Trained from scratch on RCA + code + reasoning dataset.}
}
```
