---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
---
> [!NOTE]
> Hi! This is a side project, so it is not the best.

# AURORA-Tiny 🌅✨
*Adaptive Unified Reasoning and Organized Reasoning Architecture - Tiny*

An ultra-lightweight text diffusion model that generates coherent text through iterative denoising. AURORA-Tiny combines a transformer architecture with a diffusion process in a compact, efficient design suited to local training and experimentation.
> [!NOTE]
> The model has 6M parameters.
## ✨ Features
- **Ultra-Compact Design**: Optimized for local training with minimal hardware requirements
- **Transformer-based Architecture**: Multi-head attention with time conditioning in a tiny footprint
- **Diffusion Process**: Iterative denoising for high-quality text generation
- **Flexible Training**: Works with any plain text dataset from Hugging Face
- **Efficient Training**: Train on CPU or modest GPUs in minutes, not hours
- **Prompt-based Generation**: Support for both conditional and unconditional generation
## 🚀 Quick Start
### Installation
```bash
pip install torch torchvision torchaudio
pip install datasets matplotlib tqdm numpy
```
### Basic Usage
```python
from aurora import (
    DiffusionSchedule,
    DiffusionTrainer,
    DiffusionTransformer,
    TextTokenizer,
    load_hf_dataset,  # assumed to be exported by aurora alongside the classes
)
# Load your dataset (or use built-in loader)
texts = load_hf_dataset("rotten_tomatoes", max_samples=3000)
# Build tokenizer
tokenizer = TextTokenizer(vocab_size=2000)
tokenizer.fit(texts)
# Initialize model
model = DiffusionTransformer(
    vocab_size=len(tokenizer.word_to_id),
    d_model=256,
    n_heads=8,
    n_layers=6,
)
# Noise schedule (the `timesteps` argument name is an assumption)
schedule = DiffusionSchedule(timesteps=100)
# Train
# train_loader and val_loader are data loaders over the tokenized texts
# (their construction is not shown here)
trainer = DiffusionTrainer(model, tokenizer, schedule, device='cuda')
trainer.train(train_loader, val_loader, epochs=15)
# Generate text
generated_text = trainer.generate("This movie is", max_length=30)
print(generated_text)
```
## 🏗️ Architecture
AURORA-Tiny uses a novel combination of:
1. **Time-Conditioned Transformers**: Each transformer block receives timestep embeddings
2. **Sinusoidal Time Embeddings**: Continuous time representation for the diffusion process
3. **Linear Noise Schedule**: Gradual noise addition during forward diffusion
4. **DDIM-style Sampling**: Deterministic sampling for consistent generation
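
To make the second item concrete, here is a minimal sketch of a sinusoidal timestep embedding in plain Python. The function name, the `max_period` constant, and the sine-then-cosine layout are assumptions borrowed from common transformer practice; the repository's own embedding may differ.

```python
import math


def sinusoidal_time_embedding(t: int, dim: int, max_period: float = 10000.0) -> list[float]:
    """Embed timestep `t` as `dim` values: first half sines, second half cosines."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies, starting at 1 and decaying towards 1/max_period
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


# Embedding for timestep 10 at the model width used in this README (d_model=256)
emb = sinusoidal_time_embedding(t=10, dim=256)
```

Because nearby timesteps get similar vectors, each block can be conditioned on the current noise level through a smooth signal rather than a lookup table.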
### Model Components
- **Token Embedding**: Maps discrete tokens to continuous space
- **Position Encoding**: Learnable positional embeddings
- **Time Conditioning**: Sinusoidal embeddings injected into each layer
- **Multi-Head Attention**: Standard transformer attention with time modulation
- **Output Projection**: Maps back to vocabulary space
*Tested on RTX 3060, batch_size=16, 15 epochs. Model size: ~2.4M parameters*
## 🎛️ Configuration
### Model Hyperparameters
```python
model_config = {
    'vocab_size': 2000,   # Vocabulary size
    'd_model': 256,       # Hidden dimension
    'n_heads': 8,         # Attention heads
    'n_layers': 6,        # Transformer layers
    'max_seq_len': 64,    # Maximum sequence length
    'timesteps': 100      # Diffusion timesteps
}
```
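
For a feel of how these hyperparameters translate into model size, the sketch below estimates the weight count. It assumes a standard transformer block (four attention projections plus a feed-forward layer with 4x expansion) and ignores biases, layer norms, and the time-conditioning layers; none of this is confirmed by the source, and `estimate_params` is a hypothetical helper.

```python
def estimate_params(vocab_size: int, d_model: int, n_layers: int,
                    max_seq_len: int, ffn_mult: int = 4) -> dict:
    """Rough weight count under the assumptions stated above."""
    token_emb = vocab_size * d_model
    pos_emb = max_seq_len * d_model
    attention = 4 * d_model * d_model                 # Q, K, V and output projections
    feed_forward = 2 * ffn_mult * d_model * d_model   # two linear layers
    blocks = n_layers * (attention + feed_forward)
    output_proj = d_model * vocab_size
    total = token_emb + pos_emb + blocks + output_proj
    return {
        "token_emb": token_emb,
        "pos_emb": pos_emb,
        "blocks": blocks,
        "output_proj": output_proj,
        "total": total,
    }


counts = estimate_params(vocab_size=2000, d_model=256, n_layers=6, max_seq_len=64)
print(f"{counts['total'] / 1e6:.2f}M weights")
```

Under these assumptions the configuration above comes to about 5.8M weights, which is in the region of the 6M figure quoted in the note near the top. The transformer blocks dominate, so `d_model` and `n_layers` are the settings to reduce first on constrained hardware.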
### Training Parameters
```python
training_config = {
    'batch_size': 16,        # Batch size
    'learning_rate': 1e-4,   # Learning rate
    'weight_decay': 0.01,    # L2 regularization
    'epochs': 15,            # Training epochs
    'grad_clip': 1.0         # Gradient clipping
}
```
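
As an illustration of what `grad_clip` usually means, the sketch below rescales a gradient vector so that its global L2 norm does not exceed a threshold. Treating `grad_clip` as a norm threshold rather than a per-value clamp is an assumption, and `clip_by_global_norm` is a hypothetical plain-Python stand-in for what a PyTorch trainer would typically do with `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_global_norm(grads: list[float], max_norm: float) -> list[float]:
    """Return `grads` rescaled so that their L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already within the limit, leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]


# A gradient with norm 5.0 is scaled down to norm 1.0; its direction is preserved
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```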
## 📚 Supported Datasets
AURORA-Tiny works with any text dataset from Hugging Face. Pre-configured datasets include:
- **rotten_tomatoes** - Movie reviews (8.5k samples)
- **imdb** - Movie reviews (50k samples)
- **ag_news** - News articles (120k samples)
- **poem_sentiment** - Poetry (890 samples)
- **yelp_review_full** - Restaurant reviews (650k samples)
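
If you would rather pull raw text yourself, the following sketch uses the `datasets` library's `load_dataset` function. The helper names `extract_texts` and `load_texts` are hypothetical, and the text column names are what these datasets exposed on the Hub at the time of writing, so check them (and whether a dataset id needs an organisation prefix) before relying on them.

```python
from typing import Iterable, Mapping, Optional

# Column holding the raw text in each dataset (assumed, verify on the Hub)
TEXT_COLUMNS = {
    "rotten_tomatoes": "text",
    "imdb": "text",
    "ag_news": "text",
    "poem_sentiment": "verse_text",
    "yelp_review_full": "text",
}


def extract_texts(rows: Iterable[Mapping], column: str,
                  max_samples: Optional[int] = None) -> list[str]:
    """Collect non-empty, stripped strings from `column`, up to `max_samples`."""
    texts = []
    for row in rows:
        if max_samples is not None and len(texts) >= max_samples:
            break
        text = str(row[column]).strip()
        if text:
            texts.append(text)
    return texts


def load_texts(name: str, max_samples: Optional[int] = None) -> list[str]:
    """Download the train split of `name` and return its text column."""
    from datasets import load_dataset  # lazy import: only needed when downloading
    rows = load_dataset(name, split="train")
    return extract_texts(rows, TEXT_COLUMNS.get(name, "text"), max_samples)
```

For example, `load_texts("rotten_tomatoes", max_samples=3000)` would return a list of review strings suitable for fitting a tokenizer.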
## 🎯 Generation Strategies
### Conditional Generation
```python
# Generate from a prompt
text = trainer.generate("The movie was", max_length=50, num_steps=20)
```
### Unconditional Generation
```python
# Generate from scratch
text = trainer.generate("", max_length=50, num_steps=20)
```
### Fine-tuned Sampling
```python
# Control generation quality vs speed
text = trainer.generate(
    prompt="Breaking news",
    max_length=100,
    num_steps=50,  # More steps = higher quality
)
```
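
One plausible reading of `num_steps` is that the sampler visits only a subset of the training timesteps. The sketch below picks evenly spaced timesteps in descending order; `sampling_timesteps` is a hypothetical helper, and the trainer's actual spacing rule is not documented here.

```python
def sampling_timesteps(total_timesteps: int, num_steps: int) -> list[int]:
    """Evenly spaced timestep indices, from noisiest to cleanest."""
    if not 1 <= num_steps <= total_timesteps:
        raise ValueError("num_steps must be between 1 and total_timesteps")
    ascending = [int(i * total_timesteps / num_steps) for i in range(num_steps)]
    return ascending[::-1]


# 20 sampling steps over a model trained with 100 diffusion timesteps
steps = sampling_timesteps(total_timesteps=100, num_steps=20)
print(steps[:3], "...", steps[-1])
```

Fewer steps means fewer forward passes through the model, which is where the speed side of the trade-off comes from.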
## 🔬 Technical Details
### Diffusion Process
AURORA-Tiny uses a forward diffusion process that gradually adds Gaussian noise to text embeddings:
```
q(x_t | x_{t-1}) = N(x_t; √(1-β_t)x_{t-1}, β_t I)
```
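
Iterating this one-step kernel gives a closed form for jumping straight from clean data to any timestep: with α_t = 1-β_t and ᾱ_t = ∏_{s≤t} α_s, q(x_t | x_0) = N(x_t; √(ᾱ_t)x_0, (1-ᾱ_t) I). The sketch below builds a linear β schedule and applies that closed form to a toy vector. The endpoints `1e-4` and `0.02` are illustrative assumptions, not values taken from the repository, and the helper names are hypothetical.

```python
import math
import random


def linear_betas(timesteps: int, beta_start: float = 1e-4,
                 beta_end: float = 0.02) -> list[float]:
    """Linearly spaced noise variances beta_t."""
    if timesteps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]


def cumulative_alphas(betas: list[float]) -> list[float]:
    """Running product of (1 - beta_s), i.e. alpha-bar for each timestep."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def q_sample(x0: list[float], t: int, alpha_bar: list[float],
             rng: random.Random) -> tuple[list[float], list[float]]:
    """Draw x_t ~ q(x_t | x_0) in one shot; also return the noise that was used."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bar[t]  # t is a 0-based index into the schedule
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps


betas = linear_betas(100)
alpha_bar = cumulative_alphas(betas)
x_t, eps = q_sample([0.5, -1.0, 2.0], t=50, alpha_bar=alpha_bar, rng=random.Random(0))
```

The closed form is what makes training cheap: a noisy sample at a random timestep can be produced without simulating every intermediate step.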
The reverse process is learned by the neural network:
```
p_θ(x_{t-1} | x_t, t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))
```
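
For the DDIM-style sampler listed under Architecture, a single deterministic reverse update can be sketched as follows. It first estimates the clean sample from the predicted noise, then re-noises that estimate to the previous (lower) noise level using the same predicted noise. This is the textbook update with no added randomness; `ddim_step` is a hypothetical name, and the repository's sampler may differ in detail.

```python
import math


def ddim_step(x_t: list[float], eps_pred: list[float],
              abar_t: float, abar_prev: float) -> list[float]:
    """One deterministic DDIM update from noise level abar_t to abar_prev."""
    # Estimate of the clean sample implied by the predicted noise
    x0_pred = [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
               for x, e in zip(x_t, eps_pred)]
    # Move to the previous noise level, reusing the same predicted noise
    return [math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * e
            for x0, e in zip(x0_pred, eps_pred)]


# Sanity check with a perfect noise prediction: stepping to abar_prev = 1.0
# (no noise left) recovers the clean vector.
x0 = [0.5, -1.0]
eps = [0.3, -0.7]
abar_t = 0.4
x_t = [math.sqrt(abar_t) * x + math.sqrt(1.0 - abar_t) * e for x, e in zip(x0, eps)]
recovered = ddim_step(x_t, eps, abar_t, abar_prev=1.0)
```

In practice `eps_pred` comes from the network, so each step only partly corrects the sample, and generation repeats the update over a descending sequence of timesteps.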
### Training Objective
The model is trained with the simplified noise-prediction objective, a reweighted form of the variational lower bound:
```
L = E_t,x_0,ε [||ε - ε_θ(√(ᾱ_t)x_0 + √(1-ᾱ_t)ε, t)||²]
```
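
A single-sample estimate of this loss can be sketched in plain Python. `predict_noise` stands in for the network ε_θ and is a hypothetical callable taking `(x_t, t)`; the loss is averaged over dimensions, which differs from the squared norm above only by a constant factor.

```python
import math
import random


def noise_prediction_loss(x0, t, alpha_bar, predict_noise, rng):
    """Mean squared error between the sampled noise and the predicted noise."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bar[t]
    # Noisy input: sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    eps_hat = predict_noise(x_t, t)
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(x0)


# Toy schedule and an untrained "model" that always predicts zero noise
alpha_bar = [0.9, 0.5, 0.1]
x0 = [0.5, -1.0, 2.0]


def zero_predictor(x_t, t):
    return [0.0] * len(x_t)


loss = noise_prediction_loss(x0, t=1, alpha_bar=alpha_bar,
                             predict_noise=zero_predictor, rng=random.Random(0))
```

A full training step would draw `t` uniformly at random for each example and backpropagate this loss through the network.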
## 📈 Monitoring
Training progress is automatically tracked and visualized:
- **Loss Curves**: Training and validation loss over epochs
- **Vocabulary Stats**: Word frequency distributions
- **Generation Samples**: Example outputs during training
## 🛠️ Customization
### Custom Tokenizer
```python
from aurora import TextTokenizer

class CustomTokenizer(TextTokenizer):
    def __init__(self, vocab_size=5000):
        super().__init__(vocab_size)
        # Add custom preprocessing

    def preprocess(self, text):
        # Custom text preprocessing
        return text.lower().strip()
```
```
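
To show where a `preprocess` hook fits, here is a self-contained stand-in for a frequency-based word tokenizer. `SimpleWordTokenizer`, its `<pad>` and `<unk>` tokens, and its `encode` method are assumptions for illustration only; they are not the `TextTokenizer` implementation shipped with AURORA-Tiny.

```python
from collections import Counter


class SimpleWordTokenizer:
    """Whitespace tokenizer that keeps the most frequent words."""

    def __init__(self, vocab_size: int = 2000):
        self.vocab_size = vocab_size
        self.word_to_id = {"<pad>": 0, "<unk>": 1}

    def preprocess(self, text: str) -> str:
        return text.lower().strip()

    def fit(self, texts: list[str]) -> None:
        counts = Counter(w for t in texts for w in self.preprocess(t).split())
        budget = self.vocab_size - len(self.word_to_id)
        for word, _ in counts.most_common(max(budget, 0)):
            if word not in self.word_to_id:
                self.word_to_id[word] = len(self.word_to_id)

    def encode(self, text: str) -> list[int]:
        unk = self.word_to_id["<unk>"]
        return [self.word_to_id.get(w, unk) for w in self.preprocess(text).split()]


tok = SimpleWordTokenizer(vocab_size=4)
tok.fit(["the cat the dog", "the cat"])
ids = tok.encode("The dog")  # "dog" did not make the 4-entry vocabulary
```

Because `fit` and `encode` both route text through `preprocess`, overriding that one method changes normalization consistently at training and generation time.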
### Custom Architecture
```python
model = DiffusionTransformer(
    vocab_size=vocab_size,
    d_model=512,      # Larger model
    n_heads=16,       # More attention heads
    n_layers=12,      # Deeper network
    timesteps=1000    # More diffusion steps
)
```
## 🎨 Creative Applications
AURORA-Tiny is suited to experimenting with:
- **Story Continuation**: Complete narrative fragments
- **Style Transfer**: Generate text in specific styles
- **Creative Writing**: Poetry, fiction, and experimental text
- **Data Augmentation**: Generate synthetic training data
- **Content Variation**: Create multiple versions of text
## 🤝 Contributing
Contributions welcome! Areas for improvement:
- Better noise schedules (cosine, learned schedules)
- Advanced sampling methods (DPM-Solver, PLMS)
- Larger model architectures
- Multi-modal extensions
- Evaluation benchmarks
---
*AURORA - Where text generation meets the dawn of diffusion* 🌅