|
--- |
|
base_model: unsloth/qwen3-1.7b |
|
tags: |
|
- text-generation-inference |
|
- transformers |
|
- unsloth |
|
- qwen3 |
|
- small-language-model |
|
- edge-deployment |
|
- reasoning |
|
- efficient-llm |
|
license: apache-2.0 |
|
language: |
|
- en |
|
library_name: transformers |
|
model_name: Daemontatox/Droidz |
|
--- |
|
|
|
|
|
|
|
|
|
# 🧠 Model Card: **Daemontatox/Droidz** |
|
|
|
**Daemontatox/Droidz** is a highly optimized, compact language model built on top of `unsloth/qwen3-1.7b`, engineered for fast, intelligent inference on **consumer-grade devices**. It is part of an **ongoing research effort** to close the performance gap between small and large language models through architectural efficiency, reflective reasoning techniques, and lightweight distributed training.
|
|
|
--- |
|
|
|
## 🧬 Objective |
|
|
|
The goal of Droidz is to: |
|
|
|
* Achieve **close-to-7B model quality** with <2B parameter models. |
|
* Support **edge deployment**: mobile, CPU, small GPU. |
|
* Provide **accurate, fast, reflective** generation in constrained environments. |
|
* Enable **scalable fine-tuning** through efficient, distributed training pipelines. |
|
|
|
--- |
|
|
|
## 🛠️ Model Overview |
|
|
|
| Field           | Detail                                                          |
| --------------- | --------------------------------------------------------------- |
| Base model      | `unsloth/qwen3-1.7b`                                             |
| Architecture    | Transformer (Qwen3 architecture) with optimized RoPE kernels (\~2.7x faster) |
| Finetuned on    | Proprietary curated instruction + reasoning dataset              |
| Training Method | Distributed LoRA + FlashAttention-2 + PEFT + DDP                 |
| Model Size      | \~1.7B parameters                                                |
| Precision       | bfloat16 (training), supports int4/int8 (inference)              |
| Language        | English only (monolingual)                                       |
| License         | Apache-2.0                                                       |
| Intended Use    | Conversational AI, edge agents, assistants, embedded systems     |
|
|
|
--- |
|
|
|
## 🏗️ Training Details |
|
|
|
### Training Infrastructure |
|
|
|
* **Frameworks:** `transformers`, `unsloth`, `accelerate`, `PEFT` |
|
* **Backends:** Fully distributed with `DeepSpeed ZeRO-2`, `DDP`, `FSDP`, and `FlashAttention-2`
|
* **Devices:** A100 (80GB), RTX 3090 clusters, TPU v5e (mixed) |
|
* **Optimizer:** AdamW with a cosine LR schedule and warmup steps (see the configuration sketch after this list)
|
* **Batching:** Dynamic packing enabled, up to 2048 context tokens |
|
* **Checkpointing:** Async gradient checkpointing for memory efficiency |
|
* **Duration:** \~1.2M steps across multiple domains |
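
The exact training scripts and hyperparameters for Droidz are not published. The snippet below is a minimal sketch of how a comparable LoRA setup could be wired together with `transformers` and `peft`; the LoRA rank, target modules, learning rate, and warmup values are illustrative assumptions, not the values actually used.

```python
# Minimal sketch of a LoRA fine-tuning setup in the spirit of the stack above.
# All hyperparameters are illustrative assumptions, not Droidz's actual values.
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

base_id = "unsloth/qwen3-1.7b"
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype="auto",                        # bf16 on A100-class hardware
    attn_implementation="flash_attention_2",   # requires flash-attn to be installed
)

# Attach LoRA adapters to the attention and MLP projections (illustrative targets).
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

# Training arguments mirroring the list above: AdamW (the default optimizer),
# cosine LR schedule with warmup, bf16, and gradient checkpointing.
args = TrainingArguments(
    output_dir="droidz-lora",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=500,
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=50,
)
# A Trainer (or an Unsloth/TRL SFTTrainer) would then consume `args`, the PEFT
# model, and a packed 2048-token dataset; distributed execution is handled by
# accelerate/DeepSpeed launchers rather than by this script.
```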
|
|
|
### Finetuning Methodology |
|
|
|
* **Reflection prompting**: The model is trained to self-verify and revise its outputs (an illustrative sample format follows this list).

* **Instruction tuning**: Curated prompt-response pairs across diverse reasoning domains.

* **Multi-domain generalization**: Code, logic puzzles, philosophy, and conversational tasks.

* **Optimization**: Gradient accumulation and progressive layer freezing.
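
The underlying dataset is proprietary and its schema is not published; the record below is a purely hypothetical illustration, meant only to make the idea of a self-verify-and-revise sample concrete. The field names and content are invented.

```python
# Purely hypothetical reflection-style training record (invented for illustration;
# the real Droidz dataset and its schema are proprietary and unpublished).
reflection_sample = {
    "instruction": "What is 17 * 24?",
    "draft": "17 * 24 = 398.",
    "reflection": "Check: 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408, so the draft is wrong.",
    "final": "17 * 24 = 408.",
}
```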
|
|
|
--- |
|
|
|
## 🔮 Example Use Cases |
|
|
|
* **Conversational AI** for mobile and web apps |
|
* **Offline reasoning agents** (Raspberry Pi, Jetson Nano, etc.) |
|
* **Embedded chatbots** with local-only privacy |
|
* **Edge-side logic assistants** for industry-specific workflows |
|
* **Autonomous tools** for summarization, code suggestion, self-verification |
|
|
|
--- |
|
|
|
## ⚡ Inference Code |
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_id = "Daemontatox/Droidz"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # or {"": "cuda:0"} for manual placement
    torch_dtype="auto",  # uses bf16/fp16 if available
)

streamer = TextStreamer(tokenizer)

prompt = "Explain the concept of reinforcement learning simply."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
_ = model.generate(**inputs, max_new_tokens=200, streamer=streamer)
```
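
Qwen3-based checkpoints are usually driven through the tokenizer's chat template rather than raw prompts. The variant below reuses the `tokenizer`, `model`, and `streamer` objects created above and assumes the Droidz tokenizer bundles the standard Qwen3 chat template (not separately verified here).

```python
# Chat-style inference via the tokenizer's chat template (assumes the standard
# Qwen3 template ships with the Droidz tokenizer).
messages = [
    {"role": "system", "content": "You are a concise, careful assistant."},
    {"role": "user", "content": "Explain the concept of reinforcement learning simply."},
]
chat_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(chat_prompt, return_tensors="pt").to(model.device)
_ = model.generate(**inputs, max_new_tokens=200, streamer=streamer)
```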
|
|
|
--- |
|
|
|
## 🧪 Performance Benchmarks |
|
|
|
| Hardware                   | Mode         | Throughput    | VRAM / RAM | Notes                            |
| -------------------------- | ------------ | ------------- | ---------- | -------------------------------- |
| RTX 3060 12GB (FP16)       | Transformers | \~37 tokens/s | \~5.1 GB   | Good for batch inference         |
| MacBook M2 (Metal backend) | Transformers | \~23 tokens/s | \~3.6 GB   | Works well on 8-core M2          |
| Intel i7-12700H (CPU-only) | GGUF (Q4)    | \~8 tokens/s  | \~4.1 GB   | llama.cpp via `llm` or KoboldCpp |
| Jetson Orin Nano (8GB)     | INT4 GGUF    | \~6 tokens/s  | \~3.2 GB   | Embedded/IoT ready               |
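
For the CPU and embedded rows above, a common route is llama.cpp. The sketch below uses the `llama-cpp-python` bindings and assumes you have already produced a Q4 GGUF export of the model locally; the file name is a placeholder, not an official artifact.

```python
# CPU inference through llama.cpp bindings (pip install llama-cpp-python).
# "droidz-q4_k_m.gguf" is a placeholder for a locally converted/quantized GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="./droidz-q4_k_m.gguf",
    n_ctx=2048,     # matches the model's current context cap
    n_threads=8,    # tune to the host CPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain reinforcement learning simply."}],
    max_tokens=200,
)
print(out["choices"][0]["message"]["content"])
```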
|
|
|
--- |
|
|
|
## 🧠 Prompt Samples |
|
|
|
### ❓ Prompt: *"What is backpropagation in neural networks?"* |
|
|
|
> Backpropagation is a training algorithm that adjusts a neural network’s weights by computing gradients of error from output to input layers using the chain rule. It’s the core of how neural networks learn. |
|
|
|
### 🔧 Prompt: *"Fix the bug: `print('Score:' + 100)`"*
|
|
|
> You’re trying to concatenate a string with an integer. Use: `print('Score:' + str(100))` |
|
|
|
### 🔍 Prompt: *"Summarize the Stoic concept of control."* |
|
|
|
> Stoics believe in focusing only on what you can control—your actions and thoughts—while accepting what you cannot control with calm detachment. |
|
|
|
--- |
|
|
|
## 🔐 Quantization Support (Deployment-Ready) |
|
|
|
| Format   | Status   | Tool         | Notes                       |
| -------- | -------- | ------------ | --------------------------- |
| GGUF     | ✅ Stable | llama.cpp    | Works on CPUs, Android, Web |
| GPTQ     | ✅ Stable | AutoGPTQ     | For fast GPU inference      |
| AWQ      | ✅ Tested | AutoAWQ      | 4-bit low-latency inference |
| FP16     | ✅ Native | Transformers | RTX/Apple Metal ready       |
| bfloat16 | ✅        | Transformers | For A100/TPU-friendly runs  |
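
Besides the prequantized formats above, the full-precision checkpoint can also be loaded in 4-bit on the fly with bitsandbytes. The sketch below shows that route; it requires a CUDA GPU with the `bitsandbytes` package installed and is an alternative to, not a description of, the GPTQ/AWQ artifacts.

```python
# On-the-fly 4-bit (NF4) loading with bitsandbytes; requires a CUDA GPU.
# This is an alternative to the prequantized GPTQ/AWQ/GGUF routes listed above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "Daemontatox/Droidz"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=bnb_cfg,
)
```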
|
|
|
--- |
|
|
|
## 🧱 Architecture Enhancements |
|
|
|
* **FlashAttention-2**: Fused softmax and dropout kernels for a 2–3x attention speed-up.

* **Unsloth Patch**: Accelerated training/inference kernel replacements.

* **RoPE Scaling**: Extended context-window support for long-input reasoning (a loading sketch follows this list).

* **Rotary Embedding Interpolation**: Improves generalization beyond the pretraining context length.

* **LayerDrop + Activation Checkpointing**: Memory-efficient training.
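
The RoPE scaling mentioned above can be applied at load time through the model config. The sketch below shows one way to do that with `transformers`; the scaling type and the factor of 2 are illustrative assumptions, and the exact options supported depend on the checkpoint's config and your `transformers` version.

```python
# Sketch: extend the usable context window via RoPE interpolation.
# The scaling type and factor are illustrative; check the checkpoint's config
# and your transformers version for the options it actually supports.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "Daemontatox/Droidz"
config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {"rope_type": "linear", "factor": 2.0}  # ~2x the base context

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    device_map="auto",
    torch_dtype="auto",
)
```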
|
|
|
--- |
|
|
|
## ✅ Intended Use |
|
|
|
| Use Case                    | Suitable |
| --------------------------- | -------- |
| Local chatbots / assistants | ✅        |
| Developer coding copilots   | ✅        |
| Offline reasoning agents    | ✅        |
| Educational agents          | ✅        |
| Legal / financial advisors  | ❌        |
| Medical diagnosis           | ❌        |
|
|
|
> The model is not suitable for domains where accuracy or factual correctness is critical unless its outputs are independently verified.
|
|
|
--- |
|
|
|
## 🚫 Known Limitations |
|
|
|
* Context length is currently capped at 2048 tokens (it can be extended via RoPE interpolation; see the sketch under Architecture Enhancements).
|
* Struggles with long-form generation (>1024 tokens). |
|
* Not multilingual (yet). |
|
* Sensitive to prompt phrasing when a chain-of-thought (CoT) or self-correction format is not used.
|
|
|
--- |
|
|
|
## 📍 Roadmap |
|
|
|
* [ ] Expand to multilingual support via cross-lingual bootstrapping. |
|
* [ ] Integrate Mamba-style recurrence for long-context inference. |
|
* [ ] Release optimized GGUF + quantized weights for browser/Android. |
|
* [ ] Explore retrieval-augmented reflection (RAR) capabilities. |
|
|
|
--- |
|
|
|
## 👨‍💻 Author
|
|
|
* **Name**: Daemontatox |
|
* **Affiliation**: Independent Researcher |
|
* **Contact**: [HuggingFace Profile](https://huggingface.co/Daemontatox) |
|
* **Focus**: LLM compression, theory of mind, agent intelligence on the edge |
|
|
|
--- |
|
|
|
## 📖 Citation |
|
|
|
```bibtex |
|
@misc{daemontatox2025droidz, |
|
title={Droidz: A Fast, Reflective Small Language Model for Reasoning on Edge Devices}, |
|
author={Daemontatox}, |
|
year={2025}, |
|
howpublished={\url{https://huggingface.co/Daemontatox/Droidz}}, |
|
note={Ongoing Research} |
|
} |
|
``` |
|
|