H100 Lightweight Training Configuration Guide

This guide explains the H100 Lightweight (Rapid) training configuration, designed for fast fine-tuning on H100 GPUs with a small, carefully selected dataset.

🎯 Overview

The H100 Lightweight configuration is designed for:

  • Rapid experimentation on H100 GPUs
  • Efficient training with 80K carefully selected samples
  • Quick iteration for research and development
  • Cost-effective training sessions

🚀 Key Features

Optimized for H100

  • Batch Size: 16 (larger than A100 configs)
  • Gradient Accumulation: 4 (reduced for faster updates)
  • Learning Rate: 8e-6 (slightly higher for rapid convergence)
  • Sequence Length: 8192 tokens

Dataset Sampling

  • Source: OpenHermes-FR dataset
  • Sample Size: 80,000 random samples
  • Validation: 1,000 samples (if available)
  • Reproducibility: Fixed random seed (42)
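
In code, this sampling step amounts to a shuffle-and-select over the dataset. A minimal sketch of the idea, assuming the datasets library (the actual launch script may differ in details, but the seed and split sizes mirror the values above):

from datasets import load_dataset

# Load the full OpenHermes-FR training split
dataset = load_dataset("legmlai/openhermes-fr", split="train")

# Shuffle with the fixed seed for reproducibility, then carve out 80K train + 1K validation samples
dataset = dataset.shuffle(seed=42)
train_ds = dataset.select(range(80_000))
eval_ds = dataset.select(range(80_000, 81_000))

print(f"Train: {len(train_ds)}, Validation: {len(eval_ds)}")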

Training Optimizations

  • Warmup Steps: 50 (reduced for rapid training)
  • Evaluation: Every 50 steps
  • Logging: Every 5 steps
  • Saving: Every 200 steps
  • Checkpoints: Keep only 2 (save storage)

📊 Configuration Details

Model Configuration

model_name="HuggingFaceTB/SmolLM3-3B"
max_seq_length=8192
use_flash_attention=True
use_gradient_checkpointing=True
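
A hedged sketch of how these settings typically map onto a transformers model load (standard transformers/PyTorch calls; the repository's own loader may wire this up differently, and bf16 is an assumption rather than a setting listed above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,               # assumed dtype; bf16 is the usual choice on H100
    attn_implementation="flash_attention_2",  # use_flash_attention=True
)
model.gradient_checkpointing_enable()         # use_gradient_checkpointing=True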

Training Parameters

batch_size=16
gradient_accumulation_steps=4
learning_rate=8e-6
warmup_steps=50
max_epochs=1

H100-Specific Optimizations

dataloader_num_workers=4
dataloader_pin_memory=True
gradient_clipping=1.0
group_by_length=True
pad_to_multiple_of=8

Checkpointing and Regularization

save_total_limit=2
early_stopping_patience=3
max_grad_norm=1.0
warmup_ratio=0.1
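
Taken together, these parameters would live in config/train_smollm3_h100_lightweight.py. A rough, hypothetical sketch of such a config object (the class name is illustrative and the real file may structure things differently; field names and values follow the lists above):

from dataclasses import dataclass

@dataclass
class SmolLM3H100LightweightConfig:
    # Model
    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    max_seq_length: int = 8192
    use_flash_attention: bool = True
    use_gradient_checkpointing: bool = True
    # Training
    batch_size: int = 16
    gradient_accumulation_steps: int = 4
    learning_rate: float = 8e-6
    warmup_steps: int = 50
    max_epochs: int = 1
    # H100-specific
    dataloader_num_workers: int = 4
    dataloader_pin_memory: bool = True
    gradient_clipping: float = 1.0
    group_by_length: bool = True
    pad_to_multiple_of: int = 8
    # Checkpointing and regularization
    save_total_limit: int = 2
    early_stopping_patience: int = 3
    max_grad_norm: float = 1.0
    warmup_ratio: float = 0.1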

🔧 Usage

Interactive Selection

./launch.sh
# Select "H100 Lightweight (Rapid)" when prompted

Expected Training Time

  • H100: ~2-4 hours (depending on the exact setup)
  • A100: ~4-6 hours
  • V100: ~6-8 hours

Memory Requirements

  • GPU Memory: 40GB+ (H100 recommended)
  • System RAM: 32GB+
  • Storage: 50GB+ for dataset and checkpoints

📈 Performance Characteristics

Training Speed

  • Steps per Second: ~2-3 (on H100)
  • Samples per Second: ~32-48
  • Effective Batch Size: 64 (16 × 4)

Convergence

  • Expected Loss: 1.2-1.8 (after 1 epoch)
  • Evaluation Frequency: Every 50 steps
  • Early Stopping: After 3 evaluations without improvement
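
If the training loop is built on the Hugging Face Trainer, the evaluation and early-stopping behaviour above maps onto the stock EarlyStoppingCallback. A minimal sketch under that assumption (the repository's trainer may be custom):

from transformers import EarlyStoppingCallback, TrainingArguments

# Stop once eval loss fails to improve for 3 consecutive evaluations (run every 50 steps)
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)

args = TrainingArguments(
    output_dir="smollm3-h100-lightweight",
    eval_strategy="steps",          # "evaluation_strategy" on older transformers versions
    eval_steps=50,
    logging_steps=5,
    save_steps=200,
    save_total_limit=2,
    load_best_model_at_end=True,    # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
# trainer = Trainer(..., args=args, callbacks=[early_stopping])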

Dataset Efficiency

  • 80K samples: a small fraction of the full OpenHermes-FR dataset
  • Random sampling: Ensures diversity
  • Fixed seed: Reproducible results

🎯 Use Cases

Perfect For

  • Rapid prototyping of new ideas
  • Hyperparameter tuning experiments
  • Model comparison studies
  • Research validation before full training
  • Educational purposes and learning

Not Recommended For

  • Production models (use Multiple Passes instead)
  • Competition submissions (use full dataset)
  • Research papers (use complete training)

🔄 Comparison with Other Configurations

| Configuration | Dataset Size | Batch Size | Epochs | Training Time | Use Case |
|---|---|---|---|---|---|
| Basic Training | Full SmolTalk | 2 | 3 | 6-8 hours | Learning |
| H100 Lightweight | 80K Hermes-FR | 16 | 1 | 2-4 hours | Rapid experiments |
| A100 Large Scale | Full Hermes-FR | 8 | 1.3 | 8-12 hours | Serious research |
| Multiple Passes | Full Hermes-FR | 6 | 4 | 24-36 hours | Production |

🛠️ Customization

Modifying Sample Size

# In the launch script, you can modify:
DATASET_SAMPLE_SIZE=50000  # For 50K samples
DATASET_SAMPLE_SIZE=100000 # For 100K samples

Adjusting Training Parameters

# Modify in config/train_smollm3_h100_lightweight.py:
batch_size=12              # Smaller batch size
learning_rate=6e-6         # Lower learning rate
warmup_steps=100          # More warmup steps

Changing Dataset

# Modify the dataset name in the configuration:
dataset_name="your-custom-dataset"
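
Before pointing the config at a custom dataset, it is worth checking that it loads and exposes the same columns as OpenHermes-FR, since the data preparation code was written against that schema. A quick hedged check (streaming avoids downloading the full datasets):

from datasets import load_dataset

# Compare the custom dataset's schema against the default one
reference = load_dataset("legmlai/openhermes-fr", split="train", streaming=True)
custom = load_dataset("your-custom-dataset", split="train", streaming=True)

print("Reference columns:", sorted(next(iter(reference)).keys()))
print("Custom columns:   ", sorted(next(iter(custom)).keys()))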

📊 Monitoring and Results

Trackio Integration

  • Real-time metrics: Loss, learning rate, gradient norm
  • Training curves: Visual progress tracking
  • Resource usage: GPU utilization, memory consumption
  • Artifacts: Model checkpoints, logs

Expected Metrics

  • Training Loss: Starts ~3.0, ends ~1.5
  • Validation Loss: Should be close to training loss
  • Learning Rate: Cosine decay from 8e-6 to 2e-6
  • Gradient Norm: Should stay below 1.0
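
The quoted learning-rate curve corresponds to a cosine decay with a floor at 2e-6 rather than decaying all the way to zero. A small sketch purely to illustrate that shape (the total step count is illustrative, and the actual scheduler implementation may differ):

import math

peak_lr, min_lr = 8e-6, 2e-6
warmup_steps, total_steps = 50, 5_000

def lr_at(step):
    if step < warmup_steps:                               # linear warmup to the peak
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))     # decays 1 -> 0
    return min_lr + (peak_lr - min_lr) * cosine           # 8e-6 -> 2e-6

print(lr_at(warmup_steps), lr_at(total_steps))            # ~8e-6 after warmup, 2e-6 at the end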

Success Indicators

  • Converging loss: Steady decrease over time
  • Stable gradients: Consistent gradient norms
  • Good validation: Validation loss follows training loss
  • No overfitting: Validation loss doesn't increase

🚨 Troubleshooting

Common Issues

Out of Memory (OOM)

# Reduce the per-device batch size (and raise gradient accumulation to keep the effective batch size similar):
batch_size=12  # Instead of 16
gradient_accumulation_steps=6  # Instead of 4

Slow Training

# Check GPU utilization:
nvidia-smi
# Ensure CUDA is properly installed
python -c "import torch; print(torch.cuda.is_available())"

Poor Convergence

# Try different learning rate:
learning_rate=6e-6  # Instead of 8e-6
# Or increase warmup:
warmup_steps=100   # Instead of 50

Dataset Issues

# Check dataset loading:
python -c "from datasets import load_dataset; print(len(load_dataset('legmlai/openhermes-fr')['train']))"

Performance Tips

  1. Use H100 if available: Significantly faster than A100
  2. Monitor GPU memory: Keep utilization below 90%
  3. Check logs regularly: Look for convergence issues
  4. Save checkpoints: Don't lose progress
  5. Use early stopping: Prevent overfitting
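
For tip 2, a quick way to check GPU memory pressure from Python (torch.cuda.mem_get_info reports device-wide free/total memory in bytes):

import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    used_fraction = 1 - free / total
    print(f"GPU memory in use: {used_fraction:.1%} of {total / 1e9:.0f} GB")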

📋 Example Workflow

Complete H100 Lightweight Training

# 1. Setup
python setup_launch.py

# 2. Check requirements
python check_requirements.py

# 3. Run interactive pipeline
./launch.sh

# 4. Select configuration
# Choose: "H100 Lightweight (Rapid)"

# 5. Monitor training
# Watch Trackio Space for real-time progress

# 6. Check results
# Model will be pushed to HF Hub
# Summary in training_summary.md

Expected Output

✅ Dataset prepared: 80000 train samples, 1000 validation samples
📈 Training started with 5000 total steps
⏱️ Estimated time: 2-4 hours
📊 Monitor progress at: https://huggingface.co/spaces/...

🎉 Benefits

Speed

  • 3-4x faster than full dataset training
  • Rapid iteration for research
  • Quick validation of ideas

Efficiency

  • Reduced costs (less GPU time)
  • Lower storage requirements
  • Faster experimentation cycle

Quality

  • Still produces high-quality results
  • Good for prototyping
  • Suitable for many use cases

🔮 Future Enhancements

Planned Improvements

  • Adaptive sampling: Smart dataset selection
  • Multi-GPU support: Distributed training
  • Advanced monitoring: More detailed metrics
  • Auto-tuning: Automatic hyperparameter optimization

Extensibility

  • Custom datasets: Easy integration
  • Different models: Support for other architectures
  • Advanced sampling: Stratified, balanced sampling

Happy Rapid Training on H100! 🚀