---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
---

> [!NOTE]
> Hii!!! This is a side project, so it's not the best.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6615494716917dfdc645c44e/dGyEQuQNl80XhlXvGrGGF.png)

# AURORA-Tiny 🌅✨

*Adaptive Unified Reasoning and Organized Reasoning Architecture - Tiny*

An ultra-lightweight text diffusion model that generates coherent text through iterative denoising. AURORA-Tiny combines the power of transformer architectures with diffusion processes in a compact, efficient design, perfect for local training and experimentation.

> [!NOTE]
> The model has 6M parameters.

## ✨ Features

- **Ultra-Compact Design**: Optimized for local training with minimal hardware requirements
- **Transformer-based Architecture**: Multi-head attention with time conditioning in a tiny footprint
- **Diffusion Process**: Iterative denoising for high-quality text generation
- **Flexible Training**: Works with any plain text dataset from Hugging Face
- **Efficient Training**: Train on CPU or modest GPUs in minutes, not hours
- **Prompt-based Generation**: Support for both conditional and unconditional generation

## 🚀 Quick Start

### Installation

```bash
pip install torch torchvision torchaudio
pip install datasets matplotlib tqdm numpy
```

### Basic Usage

```python
from aurora import DiffusionTrainer, TextTokenizer, DiffusionTransformer, DiffusionSchedule, load_hf_dataset

# Load your dataset (or use the built-in loader)
texts = load_hf_dataset("rotten_tomatoes", max_samples=3000)

# Build tokenizer
tokenizer = TextTokenizer(vocab_size=2000)
tokenizer.fit(texts)

# Initialize model
model = DiffusionTransformer(
    vocab_size=len(tokenizer.word_to_id),
    d_model=256,
    n_heads=8,
    n_layers=6
)

# Diffusion noise schedule (timesteps matches the config below)
schedule = DiffusionSchedule(timesteps=100)

# Train (train_loader / val_loader are PyTorch DataLoaders over the tokenized texts)
trainer = DiffusionTrainer(model, tokenizer, schedule, device='cuda')
trainer.train(train_loader, val_loader, epochs=15)

# Generate text
generated_text = trainer.generate("This movie is", max_length=30)
print(generated_text)
```

## 🏗️ Architecture

AURORA-Tiny uses a novel combination of:

1. **Time-Conditioned Transformers**: Each transformer block receives timestep embeddings
2. **Sinusoidal Time Embeddings**: Continuous time representation for the diffusion process
3. **Linear Noise Schedule**: Gradual noise addition during forward diffusion
4. **DDIM-style Sampling**: Deterministic sampling for consistent generation

### Model Components

- **Token Embedding**: Maps discrete tokens to continuous space
- **Position Encoding**: Learnable positional embeddings
- **Time Conditioning**: Sinusoidal embeddings injected into each layer
- **Multi-Head Attention**: Standard transformer attention with time modulation
- **Output Projection**: Maps back to vocabulary space

*Tested on an RTX 3060, batch_size=16, 15 epochs. Model size: ~2.4M parameters.*

## 🎛️ Configuration

### Model Hyperparameters

```python
model_config = {
    'vocab_size': 2000,    # Vocabulary size
    'd_model': 256,        # Hidden dimension
    'n_heads': 8,          # Attention heads
    'n_layers': 6,         # Transformer layers
    'max_seq_len': 64,     # Maximum sequence length
    'timesteps': 100       # Diffusion timesteps
}
```

### Training Parameters

```python
training_config = {
    'batch_size': 16,        # Batch size
    'learning_rate': 1e-4,   # Learning rate
    'weight_decay': 0.01,    # L2 regularization
    'epochs': 15,            # Training epochs
    'grad_clip': 1.0         # Gradient clipping
}
```

## 📚 Supported Datasets

AURORA-Tiny works with any text dataset from Hugging Face. Pre-configured datasets include:

- **rotten_tomatoes** - Movie reviews (8.5k samples)
- **imdb** - Movie reviews (50k samples)
- **ag_news** - News articles (120k samples)
- **poem_sentiment** - Poetry (890 samples)
- **yelp_review_full** - Restaurant reviews (650k samples)

## 🎯 Generation Strategies

### Conditional Generation

```python
# Generate from a prompt
text = trainer.generate("The movie was", max_length=50, num_steps=20)
```

### Unconditional Generation

```python
# Generate from scratch
text = trainer.generate("", max_length=50, num_steps=20)
```

### Fine-tuned Sampling

```python
# Control generation quality vs speed
text = trainer.generate(
    prompt="Breaking news",
    max_length=100,
    num_steps=50,  # More steps = higher quality
)
```

## 🔬 Technical Details

### Diffusion Process

AURORA-Tiny uses a forward diffusion process that gradually adds Gaussian noise to text embeddings:

```
q(x_t | x_{t-1}) = N(x_t; √(1-β_t)x_{t-1}, β_t I)
```

The reverse process is learned by the neural network:

```
p_θ(x_{t-1} | x_t, t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))
```

### Training Objective

The model is trained to minimize the simplified (ε-prediction) form of the variational lower bound:

```
L = E_{t,x_0,ε} [||ε - ε_θ(√(ᾱ_t)x_0 + √(1-ᾱ_t)ε, t)||²]
```
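To make the formulas above concrete, here is a minimal PyTorch sketch of a linear β schedule, the closed-form forward noising step, and the ε-prediction loss. It is illustrative only: the β range, tensor shapes, and the `model(x_t, t)` call signature are assumptions, not the actual internals of `DiffusionSchedule` or `DiffusionTrainer`.

```python
import torch
import torch.nn.functional as F

# Linear noise schedule: β_t rises linearly over the diffusion steps (values are illustrative)
timesteps = 100
betas = torch.linspace(1e-4, 0.02, timesteps)        # β_t
alphas = 1.0 - betas                                 # α_t = 1 - β_t
alpha_bars = torch.cumprod(alphas, dim=0)            # ᾱ_t = ∏ α_s

def q_sample(x0, t, noise):
    """Forward process in closed form: x_t = √(ᾱ_t)·x_0 + √(1-ᾱ_t)·ε."""
    a_bar = alpha_bars[t].view(-1, 1, 1)             # broadcast over (batch, seq_len, d_model)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

def diffusion_loss(model, x0):
    """Simplified objective: MSE between the true noise ε and the predicted ε_θ(x_t, t)."""
    t = torch.randint(0, timesteps, (x0.shape[0],))  # one random timestep per example
    noise = torch.randn_like(x0)                     # ε ~ N(0, I)
    x_t = q_sample(x0, t, noise)
    eps_pred = model(x_t, t)                         # assumed signature: noise-prediction network
    return F.mse_loss(eps_pred, noise)
```

A single training step then backpropagates this loss, with gradient clipping as in the training config above.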
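Similarly, a minimal sketch of the DDIM-style deterministic sampler mentioned in the architecture section: starting from pure noise, each step predicts ε, reconstructs an estimate of x_0, and deterministically re-noises it to the previous timestep. Again, the schedule values, the `model(x_t, t)` signature, and the decoding step are assumptions for illustration; `trainer.generate()` presumably wraps a loop like this.

```python
import torch

# Same illustrative linear β schedule as in the training sketch
timesteps = 100
betas = torch.linspace(1e-4, 0.02, timesteps)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def ddim_sample(model, shape, num_steps=20):
    """Deterministic DDIM-style sampling over `num_steps` evenly spaced timesteps."""
    x_t = torch.randn(shape)                                        # start from pure Gaussian noise
    step_ids = torch.linspace(timesteps - 1, 0, num_steps).long()
    for i, t in enumerate(step_ids):
        t_batch = torch.full((shape[0],), int(t), dtype=torch.long)
        eps = model(x_t, t_batch)                                   # predict the noise ε_θ(x_t, t)
        a_bar = alpha_bars[t]
        x0_hat = (x_t - (1.0 - a_bar).sqrt() * eps) / a_bar.sqrt()  # implied estimate of x_0
        if i + 1 < len(step_ids):
            a_bar_prev = alpha_bars[step_ids[i + 1]]                # jump to the next (earlier) timestep
            x_t = a_bar_prev.sqrt() * x0_hat + (1.0 - a_bar_prev).sqrt() * eps
        else:
            x_t = x0_hat                                            # last step: keep the x_0 estimate
    # Returns continuous embeddings; decoding back to tokens (e.g. nearest-embedding lookup) is model-specific
    return x_t
```

More steps trace the reverse process more finely, which is why `num_steps=50` in the "Fine-tuned Sampling" example trades speed for quality.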
## 📈 Monitoring

Training progress is automatically tracked and visualized:

- **Loss Curves**: Training and validation loss over epochs
- **Vocabulary Stats**: Word frequency distributions
- **Generation Samples**: Example outputs during training

## 🛠️ Customization

### Custom Tokenizer

```python
class CustomTokenizer(TextTokenizer):
    def __init__(self, vocab_size=5000):
        super().__init__(vocab_size)
        # Add custom preprocessing

    def preprocess(self, text):
        # Custom text preprocessing
        return text.lower().strip()
```

### Custom Architecture

```python
model = DiffusionTransformer(
    vocab_size=vocab_size,
    d_model=512,       # Larger model
    n_heads=16,        # More attention heads
    n_layers=12,       # Deeper network
    timesteps=1000     # More diffusion steps
)
```

## 🎨 Creative Applications

AURORA-Tiny is well suited to:

- **Story Continuation**: Complete narrative fragments
- **Style Transfer**: Generate text in specific styles
- **Creative Writing**: Poetry, fiction, and experimental text
- **Data Augmentation**: Generate synthetic training data
- **Content Variation**: Create multiple versions of text

## 🤝 Contributing

Contributions welcome! Areas for improvement:

- Better noise schedules (cosine, learned schedules)
- Advanced sampling methods (DPM-Solver, PLMS)
- Larger model architectures
- Multi-modal extensions
- Evaluation benchmarks

---

*AURORA - Where text generation meets the dawn of diffusion* 🌅