---
base_model: unsloth/qwen3-1.7b
tags:
- text-generation-inference
- transformers
- unsloth
- qwen3
- small-language-model
- edge-deployment
- reasoning
- efficient-llm
license: apache-2.0
language:
- en
library_name: transformers
model_name: Daemontatox/Droidz
---
# 🧠 Model Card: **Daemontatox/Droidz**
**Daemontatox/Droidz** is a highly optimized, compact language model built on top of `unsloth/qwen3-1.7b`, engineered for fast, intelligent inference on **consumer-grade devices**. It is part of an **ongoing research effort** to close the performance gap between small and large language models through architectural efficiency, reflective reasoning techniques, and lightweight distributed training.
---
## 🧬 Objective
The goal of Droidz is to:
* Achieve **close-to-7B model quality** with <2B parameter models.
* Support **edge deployment**: mobile, CPU, small GPU.
* Provide **accurate, fast, reflective** generation in constrained environments.
* Enable **scalable fine-tuning** through efficient, distributed training pipelines.
---
## 🛠️ Model Overview
| Field | Detail |
| --------------- | ------------------------------------------------------------ |
| Base model | `unsloth/qwen3-1.7b` |
| Architecture    | Transformer (Qwen3 architecture) with optimized RoPE kernels (~2.7x faster) |
| Finetuned on | Proprietary curated instruction + reasoning dataset |
| Training Method | Distributed LoRA + Flash-Attn2 + PEFT + DDP |
| Model Size | \~1.7B params |
| Precision | bfloat16 (training), supports int4/int8 (inference) |
| Language | English only (monolingual) |
| License | Apache-2.0 |
| Intended Use | Conversational AI, edge agents, assistants, embedded systems |
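
The Precision row above lists int4/int8 inference support. As a minimal sketch (not an official recipe), 4-bit loading through `bitsandbytes` in `transformers` could look like the following; the quantization settings here are assumptions, so tune them for your hardware:

```python
# Illustrative 4-bit (int4) loading via bitsandbytes; requires a CUDA GPU and
# the `bitsandbytes` package. The config values below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Daemontatox/Droidz"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # int4 weights
    bnb_4bit_quant_type="nf4",              # assumed quantization type
    bnb_4bit_compute_dtype=torch.bfloat16,  # matches the bfloat16 training precision
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```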
---
## 🏗️ Training Details
### Training Infrastructure
* **Frameworks:** `transformers`, `unsloth`, `accelerate`, `PEFT`
* **Backends:** Fully distributed with `DeepSpeed ZeRO-2`, `DDP`, `FSDP`, and `FlashAttention-2`
* **Devices:** A100 (80GB), RTX 3090 clusters, TPU v5e (mixed)
* **Optimizer:** AdamW + cosine LR schedule + warmup steps (see the sketch after this list)
* **Batching:** Dynamic packing enabled, up to 2048 context tokens
* **Checkpointing:** Async gradient checkpointing for memory efficiency
* **Duration:** \~1.2M steps across multiple domains
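
The exact recipe and dataset are not published, so the following is only an illustrative sketch of how a LoRA + PEFT setup with AdamW, a cosine schedule, warmup, gradient accumulation, and activation checkpointing might be wired; every hyperparameter value is an assumption:

```python
# Hypothetical LoRA fine-tuning setup; the real dataset and hyperparameters
# are proprietary, so all values here are placeholders, not the actual recipe.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained("unsloth/qwen3-1.7b")
base.gradient_checkpointing_enable()             # activation checkpointing for memory efficiency

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,      # assumed adapter sizes
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)

args = TrainingArguments(
    output_dir="droidz-lora",
    bf16=True,                                   # bfloat16 training precision
    optim="adamw_torch",                         # AdamW optimizer
    lr_scheduler_type="cosine",                  # cosine LR schedule
    warmup_steps=500,                            # warmup (value assumed)
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,               # gradient accumulation (value assumed)
    max_steps=1_200_000,                         # order of the ~1.2M steps reported above
)
# `args` would then be handed to a Trainer/SFTTrainer together with the curated dataset (not released).
```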
### Finetuning Methodology
* **Reflection prompting**: The model is trained to self-verify and revise its outputs (an assumed prompt format is sketched after this list).
* **Instruction tuning**: Curated prompt-response pairs across diverse reasoning domains.
* **Multi-domain generalization**: Code, logic puzzles, philosophy, and conversational tasks.
* **Optimization:** Gradient accumulation + progressive layer freezing.
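
The actual reflection template used in training is not published; the snippet below is only an assumed illustration of what a self-verify-and-revise prompt format might look like:

```python
# Hypothetical reflection-style prompt; the real training template is not released.
REFLECTION_TEMPLATE = (
    "Question: {question}\n"
    "Draft: reason step by step and write a first answer.\n"
    "Reflection: check the draft for errors or missing cases.\n"
    "Final answer: revise the draft based on the reflection."
)

prompt = REFLECTION_TEMPLATE.format(question="Is 51 a prime number?")
```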
---
## 🔮 Example Use Cases
* **Conversational AI** for mobile and web apps
* **Offline reasoning agents** (Raspberry Pi, Jetson Nano, etc.)
* **Embedded chatbots** with local-only privacy
* **Edge-side logic assistants** for industry-specific workflows
* **Autonomous tools** for summarization, code suggestion, self-verification
---
## ⚡ Inference Code
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
model_id = "Daemontatox/Droidz"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # or {"": "cuda:0"} to pin to a single GPU manually
    torch_dtype="auto",  # uses bf16/fp16 if available
)
streamer = TextStreamer(tokenizer)
prompt = "Explain the concept of reinforcement learning simply."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
_ = model.generate(**inputs, max_new_tokens=200, streamer=streamer)
```
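
For chat-style use, the Qwen3 tokenizer that Droidz inherits is expected to ship a chat template; verify this on the model repo before relying on it. Reusing `tokenizer`, `model`, and `streamer` from the block above:

```python
# Chat-template usage (assumes the repo ships a Qwen3-style chat template).
messages = [{"role": "user", "content": "Explain reinforcement learning simply."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
_ = model.generate(input_ids, max_new_tokens=200, streamer=streamer)
```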
---
## 🧪 Performance Benchmarks
| Hardware | Mode | Throughput | VRAM / RAM | Notes |
| -------------------------- | ------------ | ------------- | ---------- | -------------------------------- |
| RTX 3060 12GB (FP16) | Transformers | \~37 tokens/s | \~5.1 GB | Good for batch inference |
| MacBook M2 (Metal backend) | Transformers | \~23 tokens/s | \~3.6 GB | Works well on 8-core M2 |
| Intel i7-12700H (CPU-only) | GGUF (Q4) | \~8 tokens/s | \~4.1 GB | Llama.cpp via `llm` or Koboldcpp |
| Jetson Orin Nano (8GB) | INT4 GGUF | \~6 tokens/s | \~3.2 GB | Embedded/IoT ready |
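
Throughput varies with dtype, batch size, and prompt length, so treat the numbers above as indicative. A rough way to measure tokens/s with the `transformers` setup from the inference section:

```python
# Rough tokens-per-second measurement; results depend heavily on hardware and settings.
import time

inputs = tokenizer("Explain backpropagation briefly.", return_tensors="pt").to(model.device)
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start
generated = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{generated / elapsed:.1f} tokens/s")
```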
---
## 🧠 Prompt Samples
### ❓ Prompt: *"What is backpropagation in neural networks?"*
> Backpropagation is a training algorithm that adjusts a neural network’s weights by computing gradients of error from output to input layers using the chain rule. It’s the core of how neural networks learn.
### 🔧 Prompt: *"Fix the bug: `print('Score:' + 100)`"*
> You’re trying to concatenate a string with an integer. Use: `print('Score:' + str(100))`
### 🔍 Prompt: *"Summarize the Stoic concept of control."*
> Stoics believe in focusing only on what you can control—your actions and thoughts—while accepting what you cannot control with calm detachment.
---
## 🔐 Quantization Support (Deployment-Ready)
| Format | Status | Tool | Notes |
| -------- | -------- | ------------ | --------------------------- |
| GGUF | ✅ Stable | llama.cpp | Works on CPUs, Android, Web |
| GPTQ | ✅ Stable | AutoGPTQ | For fast GPU inference |
| AWQ | ✅ Tested | AutoAWQ | 4-bit low-latency inference |
| FP16 | ✅ Native | Transformers | RTX/Apple Metal ready |
| bfloat16 | ✅ | Transformers | For A100/TPU-friendly runs |
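
For the GGUF path, `llama-cpp-python` offers a minimal way to run a quantized file; the filename below is hypothetical, so check the repo (or convert the weights yourself) for the actual artifact:

```python
# Minimal GGUF inference sketch via llama-cpp-python; the model filename is hypothetical.
from llama_cpp import Llama

llm = Llama(model_path="droidz-q4_k_m.gguf", n_ctx=2048)
out = llm("Summarize the Stoic concept of control.", max_tokens=128)
print(out["choices"][0]["text"])
```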
---
## 🧱 Architecture Enhancements
* **FlashAttention2**: Fused softmax and dropout for 2–3x attention speed boost.
* **Unsloth Patch**: Accelerated training/inference kernel replacements
* **RoPE Scaling**: Extended context window support for long-input reasoning (see the sketch after this list)
* **Rotary Embedding Interpolation**: Improves generalization beyond pretraining length
* **LayerDrop + Activation Checkpointing**: For ultra-efficient memory training
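
As referenced in the RoPE Scaling item above, context extension via interpolation can be requested at load time in `transformers`. The exact `rope_scaling` schema varies across library versions, and quality beyond the trained length is not guaranteed, so treat this as a sketch:

```python
# Hedged RoPE-interpolation sketch for stretching the 2048-token window (~2x).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Daemontatox/Droidz",
    rope_scaling={"type": "linear", "factor": 2.0},  # schema may differ by transformers version
    device_map="auto",
)
```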
---
## ✅ Intended Use
| Use Case | Suitable |
| --------------------------- | -------- |
| Local chatbots / assistants | ✅ |
| Developer coding copilots | ✅ |
| Offline reasoning agents | ✅ |
| Educational agents | ✅ |
| Legal / financial advisors | ❌ |
| Medical diagnosis | ❌ |
> The model is not suitable for domains where accuracy or factual correctness is critical unless its outputs are independently verified.
---
## 🚫 Known Limitations
* Context length is currently capped at 2048 tokens (it can be extended via RoPE interpolation, as sketched under Architecture Enhancements).
* Struggles with long-form generation (>1024 tokens).
* Not multilingual (yet).
* Sensitive to prompt phrasing without CoT or self-correction format.
---
## 📍 Roadmap
* [ ] Expand to multilingual support via cross-lingual bootstrapping.
* [ ] Integrate Mamba-style recurrence for long-context inference.
* [ ] Release optimized GGUF + quantized weights for browser/Android.
* [ ] Explore retrieval-augmented reflection (RAR) capabilities.
---
## 👨‍💻 Author
* **Name**: Daemontatox
* **Affiliation**: Independent Researcher
* **Contact**: [HuggingFace Profile](https://huggingface.co/Daemontatox)
* **Focus**: LLM compression, theory of mind, agent intelligence on the edge
---
## 📖 Citation
```bibtex
@misc{daemontatox2025droidz,
  title={Droidz: A Fast, Reflective Small Language Model for Reasoning on Edge Devices},
  author={Daemontatox},
  year={2025},
  howpublished={\url{https://huggingface.co/Daemontatox/Droidz}},
  note={Ongoing Research}
}
```