---
base_model: unsloth/qwen3-1.7b
tags:
- text-generation-inference
- transformers
- unsloth
- qwen3
- small-language-model
- edge-deployment
- reasoning
- efficient-llm
license: apache-2.0
language:
- en
library_name: transformers
model_name: Daemontatox/Droidz
---

# 🧠 Model Card: **Daemontatox/Droidz**

**Daemontatox/Droidz** is a compact, highly optimized language model built on top of `unsloth/qwen3-1.7b`, engineered for fast, intelligent inference on **consumer-grade devices**. It is part of an **ongoing research effort** to close the performance gap between small and large language models using architectural efficiency, reflective reasoning techniques, and lightweight distributed training.

---

## 🧬 Objective

The goals of Droidz are to:

* Achieve **close-to-7B model quality** with a <2B-parameter model.
* Support **edge deployment**: mobile, CPU, and small GPUs.
* Provide **accurate, fast, reflective** generation in constrained environments.
* Enable **scalable fine-tuning** through efficient, distributed training pipelines.

---

## 🛠️ Model Overview

| Field           | Detail                                                        |
| --------------- | ------------------------------------------------------------- |
| Base model      | `unsloth/qwen3-1.7b`                                          |
| Architecture    | Transformer, Qwen3 architecture (2.7× faster RoPE)            |
| Finetuned on    | Proprietary curated instruction + reasoning dataset           |
| Training method | Distributed LoRA + FlashAttention-2 + PEFT + DDP              |
| Model size      | ~1.7B parameters                                              |
| Precision       | bfloat16 (training); supports int4/int8 (inference)           |
| Language        | English only (monolingual)                                    |
| License         | Apache-2.0                                                    |
| Intended use    | Conversational AI, edge agents, assistants, embedded systems  |

---

## 🏗️ Training Details

### Training Infrastructure

* **Frameworks:** `transformers`, `unsloth`, `accelerate`, `peft`
* **Backends:** Fully distributed with DeepSpeed ZeRO-2, DDP, FSDP, and FlashAttention-2
* **Devices:** A100 (80 GB), RTX 3090 clusters, TPU v5e (mixed)
* **Optimizer:** AdamW with cosine LR schedule and warmup steps
* **Batching:** Dynamic packing enabled, up to 2048 context tokens
* **Checkpointing:** Async gradient checkpointing for memory efficiency
* **Duration:** ~1.2M steps across multiple domains

### Finetuning Methodology

* **Reflection prompting:** The model is trained to self-verify and revise its outputs.
* **Instruction tuning:** Curated prompt-response pairs across diverse reasoning domains.
* **Multi-domain generalization:** Code, logic puzzles, philosophy, and conversational tasks.
* **Optimization:** Gradient accumulation + progressive layer freezing.

---

## 🔮 Example Use Cases

* **Conversational AI** for mobile and web apps
* **Offline reasoning agents** (Raspberry Pi, Jetson Nano, etc.)
* **Embedded chatbots** with local-only privacy
* **Edge-side logic assistants** for industry-specific workflows
* **Autonomous tools** for summarization, code suggestion, and self-verification

---

## ⚡ Inference Code

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_id = "Daemontatox/Droidz"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # or {"": "cuda:0"} for manual placement
    torch_dtype="auto",  # uses bf16/fp16 if available
)

streamer = TextStreamer(tokenizer)

prompt = "Explain the concept of reinforcement learning simply."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

_ = model.generate(**inputs, max_new_tokens=200, streamer=streamer)
```
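
The precision row above notes int4/int8 inference support. A minimal sketch of 4-bit loading via transformers' `BitsAndBytesConfig`, assuming the `bitsandbytes` package is installed and a CUDA GPU is available (exact memory savings will vary):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Daemontatox/Droidz"

# NF4 4-bit quantization; requires `bitsandbytes` and a CUDA GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_config,
)
```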
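
For CPU-only deployment (see the Quantization Support table below), a GGUF export can be run through `llama-cpp-python`. A sketch, assuming a Q4 GGUF file has already been produced with llama.cpp's conversion tools; the filename here is illustrative:

```python
from llama_cpp import Llama

# Load a local 4-bit GGUF export; n_ctx matches the model's 2048-token context cap.
llm = Llama(model_path="droidz-q4_k_m.gguf", n_ctx=2048)

out = llm("Explain the concept of reinforcement learning simply.", max_tokens=200)
print(out["choices"][0]["text"])
```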
---

## 🧪 Performance Benchmarks

| Hardware                   | Mode         | Throughput   | VRAM / RAM | Notes                            |
| -------------------------- | ------------ | ------------ | ---------- | -------------------------------- |
| RTX 3060 12GB (FP16)       | Transformers | ~37 tokens/s | ~5.1 GB    | Good for batch inference         |
| MacBook M2 (Metal backend) | Transformers | ~23 tokens/s | ~3.6 GB    | Works well on 8-core M2          |
| Intel i7-12700H (CPU-only) | GGUF (Q4)    | ~8 tokens/s  | ~4.1 GB    | llama.cpp via `llm` or KoboldCpp |
| Jetson Orin Nano (8GB)     | INT4 GGUF    | ~6 tokens/s  | ~3.2 GB    | Embedded/IoT ready               |

---

## 🧠 Prompt Samples

### ❓ Prompt: *"What is backpropagation in neural networks?"*

> Backpropagation is a training algorithm that adjusts a neural network's weights by computing gradients of the error from the output layer back to the input layers using the chain rule. It is the core of how neural networks learn.

### 🔧 Prompt: *"Fix the bug: `print('Score:' + 100)`"*

> You're trying to concatenate a string with an integer. Use: `print('Score:' + str(100))`

### 🔍 Prompt: *"Summarize the Stoic concept of control."*

> Stoics believe in focusing only on what you can control (your actions and thoughts) while accepting what you cannot control with calm detachment.

---

## 🔐 Quantization Support (Deployment-Ready)

| Format   | Status   | Tool         | Notes                       |
| -------- | -------- | ------------ | --------------------------- |
| GGUF     | ✅ Stable | llama.cpp    | Works on CPUs, Android, Web |
| GPTQ     | ✅ Stable | AutoGPTQ     | For fast GPU inference      |
| AWQ      | ✅ Tested | AutoAWQ      | 4-bit low-latency inference |
| FP16     | ✅ Native | Transformers | RTX/Apple Metal ready       |
| bfloat16 | ✅        | Transformers | For A100/TPU-friendly runs  |

---

## 🧱 Architecture Enhancements

* **FlashAttention-2:** Fused softmax and dropout for a 2–3× attention speedup.
* **Unsloth patches:** Accelerated training/inference kernel replacements.
* **RoPE scaling:** Extended context-window support for long-input reasoning.
* **Rotary embedding interpolation:** Improves generalization beyond the pretraining length.
* **LayerDrop + activation checkpointing:** For memory-efficient training.

---

## ✅ Intended Use

| Use Case                    | Suitable |
| --------------------------- | -------- |
| Local chatbots / assistants | ✅        |
| Developer coding copilots   | ✅        |
| Offline reasoning agents    | ✅        |
| Educational agents          | ✅        |
| Legal / financial advice    | ❌        |
| Medical diagnosis           | ❌        |

> The model is not suitable for domains where accuracy or factual correctness is critical without human verification.

---

## 🚫 Known Limitations

* Context length is currently capped at 2048 tokens (it can be extended via RoPE interpolation; see the sketch below).
* Struggles with long-form generation (>1024 tokens).
* Not multilingual (yet).
* Sensitive to prompt phrasing without a CoT or self-correction format.
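
A hedged sketch of the RoPE-interpolation workaround mentioned above. Note that the `rope_scaling` schema differs across transformers versions, and output quality beyond the trained 2048-token window is not guaranteed:

```python
from transformers import AutoModelForCausalLM

# Illustrative only: linear RoPE interpolation to roughly double the usable context.
# Recent transformers versions expect {"rope_type": ..., "factor": ...};
# older ones use {"type": ..., "factor": ...}.
model = AutoModelForCausalLM.from_pretrained(
    "Daemontatox/Droidz",
    device_map="auto",
    rope_scaling={"rope_type": "linear", "factor": 2.0},
)
```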
---

## 📍 Roadmap

* [ ] Expand to multilingual support via cross-lingual bootstrapping.
* [ ] Integrate Mamba-style recurrence for long-context inference.
* [ ] Release optimized GGUF + quantized weights for browser/Android.
* [ ] Explore retrieval-augmented reflection (RAR) capabilities.

---

## 👨‍💻 Author

* **Name:** Daemontatox
* **Affiliation:** Independent Researcher
* **Contact:** [Hugging Face profile](https://huggingface.co/Daemontatox)
* **Focus:** LLM compression, theory of mind, agent intelligence on the edge

---

## 📖 Citation

```bibtex
@misc{daemontatox2025droidz,
  title={Droidz: A Fast, Reflective Small Language Model for Reasoning on Edge Devices},
  author={Daemontatox},
  year={2025},
  howpublished={\url{https://huggingface.co/Daemontatox/Droidz}},
  note={Ongoing Research}
}
```