mahwizzzz committed · verified
Commit 3c59d6c · 1 Parent(s): 69541ac

Create README.md

Files changed (1): README.md (new file, +77 lines)

---
license: mit
---

# SmolUrixtral - Mixtral-Inspired Model

A PyTorch implementation of a Mixtral-inspired transformer model with Mixture of Experts (MoE), designed for text generation and understanding tasks. The model builds on the Mixtral architecture with enhancements such as Flash Attention, SWiGLU activation, and Liger kernels for optimized performance.

- I trained an MoE-based model with a ~53M-parameter architecture.
- It was trained on the Urdu-1M-news-text dataset from Hugging Face, consisting of 1M texts, for a total of 800 steps (see the data-loading sketch below).

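For reference, a minimal sketch of how such a corpus could be prepared with the `datasets` and `transformers` libraries. The repository id and the `"text"` column name are placeholders, not verified details of the original training run; the block size of 512 and the Llama-2-7b tokenizer follow the configuration documented below.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# NOTE: "<user>/Urdu-1M-news-text" and the "text" column are placeholder assumptions,
# not verified paths from the original training run.
dataset = load_dataset("<user>/Urdu-1M-news-text", split="train")

# The vocabulary is based on the Llama-2-7b tokenizer (~32,000 tokens), per this README.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def tokenize(batch):
    # Truncate to the model's block size (512 tokens in the trained configuration).
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
```
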
## 📊 Training Results

![Training Loss](./loss.png)

## Features

- **Flash Attention**: Efficient attention mechanism with memory optimization
- **Mixture of Experts (MoE)**: 8 experts with top-2 routing and noisy top-k support
- **SWiGLU Activation**: Advanced activation function in the expert layers (see the sketch after this list)
- **Rotary Positional Embeddings**: Position encoding for sequence understanding
- **Liger Kernels**: Optimized kernels for faster training (optional)
- **Distributed Training**: Support for multi-GPU training with DDP
- **Advanced Optimizer**: AdamW optimizer with custom learning rate scheduling
- **Gradio Interface**: Interactive web interface for text generation

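As a point of reference, here is a minimal sketch of a SWiGLU feed-forward block of the kind used in the expert layers. The class name and the 4x hidden-size multiplier are illustrative assumptions, not the repository's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SWiGLUExpert(nn.Module):
    """Illustrative SWiGLU feed-forward block: W_down(SiLU(W_gate(x)) * W_up(x))."""

    def __init__(self, dim, hidden_dim=None):
        super().__init__()
        hidden_dim = hidden_dim if hidden_dim is not None else 4 * dim  # 4x multiplier is an assumption
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        # SiLU(W_gate(x)) gates the parallel projection W_up(x), then project back down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Example: one expert operating on a batch of token embeddings (embedding dim 384).
expert = SWiGLUExpert(dim=384)
tokens = torch.randn(2, 512, 384)  # (batch, sequence, embedding)
out = expert(tokens)               # same shape as the input
```
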
## Model Architecture

### Default Configuration

- **Embedding Dimensions**: 384
- **Decoder Layers**: 4
- **Attention Heads**: 4
- **MoE Experts**: 8 (top-2 routing)
- **Block Size**: 512 tokens
- **Vocabulary Size**: Based on the Llama-2-7b tokenizer (~32,000 tokens); see the tokenizer sketch after this list
- **Batch Size (micro)**: 2
- **Gradient Accumulation Steps**: 4

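A minimal sketch of where the vocabulary size comes from, assuming the Hugging Face `transformers` tokenizer for Llama-2-7b (a gated model, so access must be granted):

```python
from transformers import AutoTokenizer

# The Llama-2-7b tokenizer defines the base vocabulary (~32,000 tokens).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(len(tokenizer))  # 32000 for the base Llama-2 tokenizer
```
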
### Full Parameter List

The defaults below are collected into a single configuration sketch at the end of this section.

#### Model Architecture Parameters

- `epochs`: Number of training epochs (default: 4)
- `block_size`: Maximum sequence length (default: 1024)
- `batch_size`: Training batch size (default: 16)
- `embeddings_dims`: Model embedding dimensions (default: 512)
- `no_of_heads`: Number of attention heads (default: 8)
- `no_of_decoder_layers`: Number of decoder layers (default: 8)
- `attn_dropout`: Attention dropout rate (default: 0.1)
- `dropout`: General dropout rate (default: 0.1)

#### Mixture of Experts (MoE) Parameters

- `experts`: Number of MoE experts (default: 8)
- `top_experts`: Number of experts each token is routed to (default: 2)
- `noisy_topk`: Use noisy top-k routing (default: False); see the routing sketch after this list

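For intuition, a minimal sketch of top-2 routing with optional noisy top-k gating. The function is an illustrative stand-in for the repository's router, not its exact code; the noise term here is plain standard-normal noise.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def route_tokens(x, router, top_experts=2, noisy_topk=False):
    """Illustrative MoE router: select `top_experts` experts per token."""
    logits = router(x)                                    # (tokens, experts)
    if noisy_topk:
        # Noisy top-k gating perturbs the logits before selection;
        # plain standard-normal noise is used here for brevity.
        logits = logits + torch.randn_like(logits)
    top_vals, top_idx = logits.topk(top_experts, dim=-1)  # best experts per token
    weights = F.softmax(top_vals, dim=-1)                 # mixture weights over the selected experts
    return top_idx, weights

# Usage: 8 experts, top-2 routing, embedding dim 384 (as in the trained configuration).
router = nn.Linear(384, 8, bias=False)
tokens = torch.randn(1024, 384)                 # flattened (batch * seq, dim)
expert_idx, expert_weights = route_tokens(tokens, router)
# Each token's output is then the weighted sum of its selected experts' outputs.
```
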
#### Training Hyperparameters

- `max_lr`: Maximum learning rate (default: 6e-4)
- `weight_decay_optim`: Weight decay for the optimizer (default: 0.01)
- `beta_1`: Beta1 for the optimizer (default: 0.9)
- `beta_2`: Beta2 for the optimizer (default: 0.95)
- `eps`: Epsilon for the optimizer (default: 1e-8)
- `clip`: Gradient clipping value (default: 1.0); see the optimizer sketch after this list

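A minimal sketch of how these hyperparameters map onto a PyTorch AdamW optimizer with gradient clipping. The dummy model, loss, and single-step loop are assumptions for illustration; only the numeric values come from the list above.

```python
import torch
import torch.nn as nn

# Dummy stand-in model; in practice this would be the SmolUrixtral model.
model = nn.Linear(384, 384)

# AdamW configured with the documented defaults.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,                 # max_lr (the peak of the LR schedule)
    betas=(0.9, 0.95),       # beta_1, beta_2
    eps=1e-8,                # eps
    weight_decay=0.01,       # weight_decay_optim
)

# One illustrative step with the gradient norm clipped at `clip` = 1.0.
loss = model(torch.randn(2, 384)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```
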
#### System Configuration

- `device`: Device to use (default: 'cuda:0')
- `use_checkpointing`: Use gradient checkpointing (default: False)
- `use_liger`: Use Liger kernels for optimization (default: True)
- `use_flash_attention`: Use Flash Attention (default: True); see the attention sketch after this list
- `use_compile`: Use torch.compile (default: True)

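As an illustration of what the `use_flash_attention` and `use_compile` switches typically control in PyTorch, here is a hedged sketch built on `scaled_dot_product_attention` and `torch.compile`; it is not the repository's exact attention code.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, use_flash_attention=True):
    if use_flash_attention:
        # PyTorch dispatches scaled_dot_product_attention to a FlashAttention-style
        # kernel on supported GPUs; is_causal applies the decoder's autoregressive mask.
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    # Naive fallback: explicit softmax(QK^T / sqrt(d)) V with a causal mask.
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5
    mask = torch.triu(torch.ones(scores.shape[-2:], dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

# `use_compile` would typically wrap the whole model, e.g. model = torch.compile(model).
q = k = v = torch.randn(2, 4, 512, 96)  # (batch, heads, seq, head_dim); 4 * 96 = 384
out = attention(q, k, v)
```
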
#### Data Configuration

- `vocab_size`: Vocabulary size (default: based on tokenizer + 768)
- `val_epochs`: Validation frequency (default: 2)

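To make the parameter list concrete, here is a hedged sketch of a configuration dataclass collecting the documented defaults. The class name and field grouping are assumptions about, not a copy of, the repository's configuration object.

```python
from dataclasses import dataclass

@dataclass
class SmolUrixtralConfig:
    # Model architecture
    epochs: int = 4
    block_size: int = 1024
    batch_size: int = 16
    embeddings_dims: int = 512
    no_of_heads: int = 8
    no_of_decoder_layers: int = 8
    attn_dropout: float = 0.1
    dropout: float = 0.1
    # Mixture of Experts
    experts: int = 8
    top_experts: int = 2
    noisy_topk: bool = False
    # Training hyperparameters
    max_lr: float = 6e-4
    weight_decay_optim: float = 0.01
    beta_1: float = 0.9
    beta_2: float = 0.95
    eps: float = 1e-8
    clip: float = 1.0
    # System
    device: str = "cuda:0"
    use_checkpointing: bool = False
    use_liger: bool = True
    use_flash_attention: bool = True
    use_compile: bool = True
    # Data ("based on tokenizer + 768"; 32,000 assumes the Llama-2 tokenizer)
    vocab_size: int = 32_000 + 768
    val_epochs: int = 2

config = SmolUrixtralConfig()
```
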
## License

MIT License