---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
---

> [!NOTE]  
> Hii!!! This is a side project, so it's not the best.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6615494716917dfdc645c44e/dGyEQuQNl80XhlXvGrGGF.png)

# AURORA-Tiny 🌅✨
*Adaptive Unified Reasoning and Organized Reasoning Architecture - Tiny*

An ultra-lightweight text diffusion model that generates coherent text through iterative denoising. AURORA-Tiny combines a transformer architecture with a diffusion process in a compact, efficient design suited to local training and experimentation.

> [!NOTE]
> The model has ~6M parameters.

## ✨ Features

- **Ultra-Compact Design**: Optimized for local training with minimal hardware requirements
- **Transformer-based Architecture**: Multi-head attention with time conditioning in a tiny footprint
- **Diffusion Process**: Iterative denoising for high-quality text generation  
- **Flexible Training**: Works with any plain text dataset from Hugging Face
- **Efficient Training**: Train on CPU or modest GPUs in minutes, not hours
- **Prompt-based Generation**: Support for both conditional and unconditional generation

## 🚀 Quick Start

### Installation

```bash
pip install torch torchvision torchaudio
pip install datasets matplotlib tqdm numpy
```

### Basic Usage

```python
from aurora import DiffusionTrainer, TextTokenizer, DiffusionTransformer, DiffusionSchedule, load_hf_dataset

# Load your dataset (or use the built-in loader)
texts = load_hf_dataset("rotten_tomatoes", max_samples=3000)

# Build tokenizer
tokenizer = TextTokenizer(vocab_size=2000)
tokenizer.fit(texts)

# Initialize model
model = DiffusionTransformer(
    vocab_size=len(tokenizer.word_to_id),
    d_model=256,
    n_heads=8,
    n_layers=6
)

# Noise schedule for the diffusion process
schedule = DiffusionSchedule(timesteps=100)

# Train (train_loader / val_loader are PyTorch DataLoaders built from the tokenized texts)
trainer = DiffusionTrainer(model, tokenizer, schedule, device='cuda')
trainer.train(train_loader, val_loader, epochs=15)

# Generate text
generated_text = trainer.generate("This movie is", max_length=30)
print(generated_text)
```

## 🏗️ Architecture

AURORA-Tiny uses a novel combination of:

1. **Time-Conditioned Transformers**: Each transformer block receives timestep embeddings (see the sketch after this list)
2. **Sinusoidal Time Embeddings**: Continuous time representation for the diffusion process  
3. **Linear Noise Schedule**: Gradual noise addition during forward diffusion
4. **DDIM-style Sampling**: Deterministic sampling for consistent generation
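
Below is a minimal PyTorch sketch of the first two ideas: a sinusoidal timestep embedding and a transformer block that consumes it. It is illustrative only; names like `sinusoidal_time_embedding` and `TimeConditionedBlock` are not the actual AURORA-Tiny source.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_time_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Map integer timesteps t (shape [B]) to continuous embeddings of shape [B, dim]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class TimeConditionedBlock(nn.Module):
    """Self-attention block whose input is shifted by a projected time embedding."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.time_proj = nn.Linear(d_model, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # Inject the time conditioning as an additive shift before attention
        h = x + self.time_proj(t_emb)[:, None, :]
        h = h + self.attn(self.norm1(h), self.norm1(h), self.norm1(h))[0]
        return h + self.ff(self.norm2(h))
```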

### Model Components

- **Token Embedding**: Maps discrete tokens to continuous space
- **Position Encoding**: Learnable positional embeddings
- **Time Conditioning**: Sinusoidal embeddings injected into each layer
- **Multi-Head Attention**: Standard transformer attention with time modulation
- **Output Projection**: Maps back to vocabulary space (a forward-pass sketch tying these components together follows below)

*Tested on RTX 3060, batch_size=16, 15 epochs. Model size: ~2.4M parameters*
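
Reusing `TimeConditionedBlock` and `sinusoidal_time_embedding` from the sketch above, the components roughly compose into a forward pass like this (again an illustrative sketch, not the shipped implementation):

```python
class TinyDiffusionLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, max_seq_len: int = 64,
                 n_heads: int = 8, n_layers: int = 6):
        super().__init__()
        # Token embedding: used by the trainer to embed clean tokens before noising
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Learnable positional embeddings
        self.pos_emb = nn.Parameter(torch.zeros(1, max_seq_len, d_model))
        self.blocks = nn.ModuleList(
            [TimeConditionedBlock(d_model, n_heads) for _ in range(n_layers)]
        )
        # Output projection back to vocabulary space
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, noisy_emb: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # noisy_emb: [B, T, d_model] noised token embeddings; t: [B] timesteps
        t_emb = sinusoidal_time_embedding(t, noisy_emb.size(-1))
        h = noisy_emb + self.pos_emb[:, : noisy_emb.size(1)]
        for block in self.blocks:
            h = block(h, t_emb)
        return self.out_proj(h)  # logits over the vocabulary
```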

## 🎛️ Configuration

### Model Hyperparameters

```python
model_config = {
    'vocab_size': 2000,      # Vocabulary size
    'd_model': 256,          # Hidden dimension
    'n_heads': 8,            # Attention heads
    'n_layers': 6,           # Transformer layers
    'max_seq_len': 64,       # Maximum sequence length
    'timesteps': 100         # Diffusion timesteps
}
```
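
If the constructors accept these keys as keyword arguments (as the other examples on this card suggest), the dict can be unpacked directly; this is a hypothetical convenience, not a documented API:

```python
# Assumes every key in model_config is a valid keyword argument
model = DiffusionTransformer(**model_config)
schedule = DiffusionSchedule(timesteps=model_config['timesteps'])
```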

### Training Parameters

```python
training_config = {
    'batch_size': 16,        # Batch size
    'learning_rate': 1e-4,   # Learning rate
    'weight_decay': 0.01,    # L2 regularization
    'epochs': 15,            # Training epochs
    'grad_clip': 1.0         # Gradient clipping
}
```
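
A minimal sketch of how these settings map onto a plain PyTorch loop; the bundled `DiffusionTrainer` handles this internally, and `compute_loss` is a hypothetical stand-in for the diffusion objective described under Technical Details:

```python
import torch

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=training_config['learning_rate'],
    weight_decay=training_config['weight_decay'],  # L2 regularization
)

for epoch in range(training_config['epochs']):
    for batch in train_loader:                      # batch_size is set when building the loader
        loss = compute_loss(model, batch)           # hypothetical diffusion loss helper
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), training_config['grad_clip'])
        optimizer.step()
```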

## 📚 Supported Datasets

AURORA-Tiny works with any plain-text dataset from Hugging Face (a minimal loading sketch follows the list below). Pre-configured datasets include:

- **rotten_tomatoes** - Movie reviews (8.5k samples)
- **imdb** - Movie reviews (50k samples) 
- **ag_news** - News articles (120k samples)
- **poem_sentiment** - Poetry (890 samples)
- **yelp_review_full** - Restaurant reviews (650k samples)
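
For reference, pulling raw text from any of these with the `datasets` library looks roughly like this (the built-in `load_hf_dataset` helper does something similar):

```python
from datasets import load_dataset

ds = load_dataset("rotten_tomatoes", split="train")
texts = ds["text"][:3000]  # cap the number of samples for quick experiments
```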

## 🎯 Generation Strategies

### Conditional Generation
```python
# Generate from a prompt
text = trainer.generate("The movie was", max_length=50, num_steps=20)
```

### Unconditional Generation
```python
# Generate from scratch
text = trainer.generate("", max_length=50, num_steps=20)
```

### Fine-tuned Sampling
```python
# Control generation quality vs speed
text = trainer.generate(
    prompt="Breaking news",
    max_length=100,
    num_steps=50,  # More steps = higher quality
)
```

## 🔬 Technical Details

### Diffusion Process

AURORA-Tiny uses a forward diffusion process that gradually adds Gaussian noise to text embeddings:

```
q(x_t | x_{t-1}) = N(x_t; √(1-β_t)x_{t-1}, β_t I)
```

The reverse process is learned by the neural network:

```
p_θ(x_{t-1} | x_t, t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))
```

### Training Objective

The model is trained to minimize the variational lower bound:

```
L = E_t,x_0,ε [||ε - ε_θ(√(ᾱ_t)x_0 + √(1-ᾱ_t)ε, t)||²]
```
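
In code, one training step of this objective looks roughly like the following sketch, assuming a linear β schedule and a network `model(x_t, t)` trained to predict the injected noise ε (how AURORA-Tiny wires its vocabulary-space output into this loss is handled inside the trainer):

```python
import torch
import torch.nn.functional as F

timesteps = 100
betas = torch.linspace(1e-4, 0.02, timesteps)      # linear noise schedule β_t
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # ᾱ_t = ∏ (1 - β_s)

def diffusion_loss(model, x0):
    """x0: clean token embeddings of shape [B, T, D]."""
    t = torch.randint(0, timesteps, (x0.size(0),))        # random timestep per sample
    eps = torch.randn_like(x0)                             # ε ~ N(0, I)
    a_bar = alphas_bar[t].view(-1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps   # forward process q(x_t | x_0)
    eps_pred = model(x_t, t)                               # ε_θ(x_t, t)
    return F.mse_loss(eps_pred, eps)
```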

## 📈 Monitoring

Training progress is automatically tracked and visualized (a plotting sketch follows the list below):

- **Loss Curves**: Training and validation loss over epochs
- **Vocabulary Stats**: Word frequency distributions  
- **Generation Samples**: Example outputs during training
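
A minimal sketch of plotting the tracked losses with matplotlib; `train_losses` and `val_losses` are hypothetical per-epoch lists recorded by the trainer:

```python
import matplotlib.pyplot as plt

plt.plot(train_losses, label="train")
plt.plot(val_losses, label="validation")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.savefig("loss_curves.png")
```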

## 🛠️ Customization

### Custom Tokenizer
```python
class CustomTokenizer(TextTokenizer):
    def __init__(self, vocab_size=5000):
        super().__init__(vocab_size)
        # Add custom preprocessing
        
    def preprocess(self, text):
        # Custom text preprocessing
        return text.lower().strip()
```

### Custom Architecture
```python
model = DiffusionTransformer(
    vocab_size=vocab_size,
    d_model=512,       # Larger model
    n_heads=16,        # More attention heads  
    n_layers=12,       # Deeper network
    timesteps=1000     # More diffusion steps
)
```

## 🎨 Creative Applications

AURORA-Tiny excels at:

- **Story Continuation**: Complete narrative fragments
- **Style Transfer**: Generate text in specific styles  
- **Creative Writing**: Poetry, fiction, and experimental text
- **Data Augmentation**: Generate synthetic training data
- **Content Variation**: Create multiple versions of text

## 🤝 Contributing

Contributions welcome! Areas for improvement:

- Better noise schedules (cosine, learned schedules)
- Advanced sampling methods (DPM-Solver, PLMS)
- Larger model architectures
- Multi-modal extensions
- Evaluation benchmarks

---

*AURORA - Where text generation meets the dawn of diffusion* 🌅