Hi! This is a side project, so it's not the best.
AURORA-Tiny
Adaptive Unified Reasoning and Organized Reasoning Architecture - Tiny
An ultra-lightweight text diffusion model that generates coherent text through iterative denoising. AURORA-Tiny combines a transformer architecture with a diffusion process in a compact, efficient design suited to local training and experimentation.
The model has roughly 6M parameters.
Features
- Ultra-Compact Design: Optimized for local training with minimal hardware requirements
- Transformer-based Architecture: Multi-head attention with time conditioning in a tiny footprint
- Diffusion Process: Iterative denoising for high-quality text generation
- Flexible Training: Works with any plain text dataset from Hugging Face
- Efficient Training: Train on CPU or modest GPUs in minutes, not hours
- Prompt-based Generation: Support for both conditional and unconditional generation
Quick Start
Installation
pip install torch torchvision torchaudio
pip install datasets matplotlib tqdm numpy
Basic Usage
from aurora import (DiffusionTrainer, TextTokenizer, DiffusionTransformer,
                    DiffusionSchedule, load_hf_dataset)  # adjust the import if load_hf_dataset lives elsewhere

# Load your dataset (or use the built-in loader)
texts = load_hf_dataset("rotten_tomatoes", max_samples=3000)

# Build tokenizer
tokenizer = TextTokenizer(vocab_size=2000)
tokenizer.fit(texts)

# Initialize model
model = DiffusionTransformer(
    vocab_size=len(tokenizer.word_to_id),
    d_model=256,
    n_heads=8,
    n_layers=6
)

# Noise schedule (constructor arguments may differ; timesteps should match the model config)
schedule = DiffusionSchedule(timesteps=100)

# Train (train_loader / val_loader are DataLoaders built from the tokenized texts)
trainer = DiffusionTrainer(model, tokenizer, schedule, device='cuda')
trainer.train(train_loader, val_loader, epochs=15)

# Generate text
generated_text = trainer.generate("This movie is", max_length=30)
print(generated_text)
Architecture
AURORA-Tiny uses a combination of (a minimal sketch of the time conditioning follows the component list below):
- Time-Conditioned Transformers: Each transformer block receives timestep embeddings
- Sinusoidal Time Embeddings: Continuous time representation for the diffusion process
- Linear Noise Schedule: Gradual noise addition during forward diffusion
- DDIM-style Sampling: Deterministic sampling for consistent generation
Model Components
- Token Embedding: Maps discrete tokens to continuous space
- Position Encoding: Learnable positional embeddings
- Time Conditioning: Sinusoidal embeddings injected into each layer
- Multi-Head Attention: Standard transformer attention with time modulation
- Output Projection: Maps back to vocabulary space
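The time conditioning is the part that differs most from a plain transformer. Below is a minimal sketch of what the sinusoidal timestep embedding and per-block injection could look like; the class names (TimeEmbedding, TinyBlock) and the exact injection point are illustrative assumptions, not the code in this repository.

import math
import torch
import torch.nn as nn

class TimeEmbedding(nn.Module):
    """Sinusoidal embedding of the diffusion timestep t (illustrative sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):                                   # t: (batch,) integer timesteps
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000) * torch.arange(half, device=t.device) / half)
        args = t.float()[:, None] * freqs[None, :]
        return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # (batch, dim)

class TinyBlock(nn.Module):
    """Transformer block whose input is shifted by a projection of the time embedding."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.time_proj = nn.Linear(d_model, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x, t_emb):                            # x: (batch, seq, d_model)
        x = x + self.time_proj(t_emb)[:, None, :]           # inject the timestep into every position
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))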
Tested on an RTX 3060 with batch_size=16 for 15 epochs (model size in that run: ~2.4M parameters).
Configuration
Model Hyperparameters
model_config = {
    'vocab_size': 2000,    # Vocabulary size
    'd_model': 256,        # Hidden dimension
    'n_heads': 8,          # Attention heads
    'n_layers': 6,         # Transformer layers
    'max_seq_len': 64,     # Maximum sequence length
    'timesteps': 100       # Diffusion timesteps
}
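Assuming every key in model_config matches a DiffusionTransformer constructor argument (worth verifying against the actual signature), the dict can be splatted in directly:

model = DiffusionTransformer(**model_config)  # assumes all keys are accepted by the constructor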
Training Parameters
training_config = {
    'batch_size': 16,        # Batch size
    'learning_rate': 1e-4,   # Learning rate
    'weight_decay': 0.01,    # L2 regularization
    'epochs': 15,            # Training epochs
    'grad_clip': 1.0         # Gradient clipping
}
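These values map onto a standard PyTorch training step. A hedged sketch of how a trainer might apply them (the actual DiffusionTrainer internals may differ, and training_loss is a hypothetical helper):

import torch

# Sketch only: the real DiffusionTrainer may construct its optimizer differently.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=training_config['learning_rate'],
    weight_decay=training_config['weight_decay'],
)

for batch in train_loader:                                   # one epoch, simplified
    loss = model.training_loss(batch)                        # hypothetical loss helper
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), training_config['grad_clip'])
    optimizer.step()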
Supported Datasets
AURORA-Tiny works with any text dataset from Hugging Face. Pre-configured datasets include (a loader sketch follows the list):
- rotten_tomatoes - Movie reviews (8.5k samples)
- imdb - Movie reviews (50k samples)
- ag_news - News articles (120k samples)
- poem_sentiment - Poetry (890 samples)
- yelp_review_full - Restaurant reviews (650k samples)
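A minimal loader in the spirit of the built-in load_hf_dataset helper might look like the following; the function name load_plain_texts and the assumptions that the split is "train" and the text lives in a "text" column are illustrative (poem_sentiment, for example, uses a different column):

from datasets import load_dataset

def load_plain_texts(name, max_samples=3000, split="train", text_column="text"):
    """Pull up to max_samples raw strings from a Hugging Face dataset."""
    ds = load_dataset(name, split=split)
    ds = ds.select(range(min(max_samples, len(ds))))
    return [row[text_column] for row in ds]

texts = load_plain_texts("rotten_tomatoes", max_samples=3000)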
Generation Strategies
Conditional Generation
# Generate from a prompt
text = trainer.generate("The movie was", max_length=50, num_steps=20)
Unconditional Generation
# Generate from scratch
text = trainer.generate("", max_length=50, num_steps=20)
Fine-tuned Sampling
# Control generation quality vs speed
text = trainer.generate(
    prompt="Breaking news",
    max_length=100,
    num_steps=50,   # More steps = higher quality
)
Technical Details
Diffusion Process
AURORA-Tiny uses a forward diffusion process that gradually adds Gaussian noise to text embeddings:
q(x_t | x_{t-1}) = N(x_t; √(1-β_t) x_{t-1}, β_t I)
The reverse process is learned by the neural network:
p_θ(x_{t-1} | x_t, t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))
Training Objective
The model is trained with the simplified denoising objective (a reweighted form of the variational lower bound), regressing the noise added to x_0:
L = E_{t, x_0, ε} [ ||ε - ε_θ(√(ᾱ_t) x_0 + √(1-ᾱ_t) ε, t)||² ]
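Concretely, one training step samples a timestep, forms x_t from x_0 in closed form, and regresses the added noise. A sketch assuming a linear β schedule and a model that takes (x_t, t) and predicts ε; names are illustrative, not the repository's actual helpers:

import torch
import torch.nn.functional as F

T = 100
betas = torch.linspace(1e-4, 0.02, T)              # linear noise schedule β_t
alpha_bar = torch.cumprod(1.0 - betas, dim=0)      # ᾱ_t

def diffusion_loss(model, x0, t):
    """x0: clean token embeddings (batch, seq, d_model); t: (batch,) integer timesteps."""
    eps = torch.randn_like(x0)
    a = alpha_bar.to(x0.device)[t].view(-1, 1, 1)
    x_t = torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * eps    # sample from q(x_t | x_0)
    eps_pred = model(x_t, t)                                 # model predicts the noise
    return F.mse_loss(eps_pred, eps)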
Monitoring
Training progress is automatically tracked and visualized (a minimal plotting sketch follows the list):
- Loss Curves: Training and validation loss over epochs
- Vocabulary Stats: Word frequency distributions
- Generation Samples: Example outputs during training
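The loss curves are plain matplotlib plots; a minimal version, assuming the trainer collects per-epoch loss lists, could be:

import matplotlib.pyplot as plt

def plot_losses(train_losses, val_losses, path="loss_curves.png"):
    """Plot per-epoch training/validation loss (lists of floats)."""
    plt.figure()
    plt.plot(train_losses, label="train")
    plt.plot(val_losses, label="validation")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.savefig(path)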
Customization
Custom Tokenizer
class CustomTokenizer(TextTokenizer):
    def __init__(self, vocab_size=5000):
        super().__init__(vocab_size)
        # Add custom setup here

    def preprocess(self, text):
        # Custom text preprocessing
        return text.lower().strip()
Custom Architecture
model = DiffusionTransformer(
    vocab_size=vocab_size,
    d_model=512,       # Larger model
    n_heads=16,        # More attention heads
    n_layers=12,       # Deeper network
    timesteps=1000     # More diffusion steps
)
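When scaling up like this, it helps to check the parameter count before committing to a training run:

n_params = sum(p.numel() for p in model.parameters())
print(f"Model size: {n_params / 1e6:.1f}M parameters")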
Creative Applications
AURORA-Tiny is well suited for:
- Story Continuation: Complete narrative fragments
- Style Transfer: Generate text in specific styles
- Creative Writing: Poetry, fiction, and experimental text
- Data Augmentation: Generate synthetic training data
- Content Variation: Create multiple versions of text
Contributing
Contributions welcome! Areas for improvement:
- Better noise schedules (cosine, learned schedules); a cosine-schedule sketch follows this list
- Advanced sampling methods (DPM-Solver, PLMS)
- Larger model architectures
- Multi-modal extensions
- Evaluation benchmarks
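As a starting point for the noise-schedule item above, a cosine β schedule in the style of Nichol & Dhariwal (2021) could look like this (a sketch; it would still need to be wired into DiffusionSchedule):

import math
import torch

def cosine_betas(timesteps, s=0.008):
    """Cosine noise schedule: β_t derived from ᾱ_t = f(t)/f(0), clipped at 0.999."""
    steps = torch.arange(timesteps + 1, dtype=torch.float64)
    f = torch.cos(((steps / timesteps) + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=0.999).float()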
AURORA - Where text generation meets the dawn of diffusion