A100 Large Scale Training Guide

This guide provides configurations and instructions for running full-scale experiments with multiple passes over the full OpenHermes-FR dataset (800k+ datapoints) on A100 GPUs.

Available Configurations

1. A100 Large Batch Configuration

File: config/train_smollm3_openhermes_fr_a100_large.py

Key Features:

  • Effective Batch Size: 128 (per-device batch size 8 × 16 gradient accumulation steps)
  • Training Duration: ~1.3 passes (8,000 steps)
  • Learning Rate: 5e-6 (optimized for large batches)
  • Mixed Precision: bf16 (A100 optimized)
  • Sequence Length: 8192 tokens
  • Memory Optimizations: No gradient checkpointing for A100 efficiency

Estimated Training Time: ~6-8 hours on A100
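
A minimal sketch of how the settings above might look in Python. The field names are illustrative assumptions, not the literal contents of config/train_smollm3_openhermes_fr_a100_large.py:

# Hypothetical field names; check the real config file before relying on them
large_batch_config = {
    "batch_size": 8,                      # per-device batch size
    "gradient_accumulation_steps": 16,    # 8 x 16 = 128 effective batch size
    "max_iters": 8000,                    # ~1.3 passes over ~800k examples
    "learning_rate": 5e-6,                # lowered for the large effective batch
    "max_seq_length": 8192,
    "mixed_precision": "bf16",            # preferred over fp16 on A100
    "use_gradient_checkpointing": False,  # A100 80GB has enough memory
}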

2. Multiple Passes Configuration

File: config/train_smollm3_openhermes_fr_a100_multiple_passes.py

Key Features:

  • Effective Batch Size: 120 (per-device batch size 6 × 20 gradient accumulation steps)
  • Training Duration: ~4 passes (25,000 steps)
  • Learning Rate: 3e-6 (conservative for long training)
  • Warmup Steps: 2000 (longer warmup for stability)
  • Checkpoint Strategy: More frequent saves (every 2000 steps)

Estimated Training Time: ~20-24 hours on A100
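
Sketched the same way, the multiple-passes run mainly changes the following values (again with illustrative, not literal, field names):

multiple_passes_overrides = {
    "batch_size": 6,
    "gradient_accumulation_steps": 20,  # 6 x 20 = 120 effective batch size
    "max_iters": 25000,                 # ~3.75 passes over ~800k examples
    "learning_rate": 3e-6,              # conservative for the longer run
    "warmup_steps": 2000,               # longer warmup for stability
    "save_steps": 2000,                 # more frequent checkpoints
}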

Training Commands

Quick Start - Large Batch Experiment

python run_a100_large_experiment.py \
    --config config/train_smollm3_openhermes_fr_a100_large.py \
    --experiment-name "smollm3_openhermes_fr_large_batch" \
    --output-dir ./outputs/large_batch

Multiple Passes Experiment

python run_a100_large_experiment.py \
    --config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
    --experiment-name "smollm3_openhermes_fr_multiple_passes" \
    --output-dir ./outputs/multiple_passes

Dry Run (Check Configuration)

python run_a100_large_experiment.py \
    --config config/train_smollm3_openhermes_fr_a100_large.py \
    --dry-run

Resume Training

python run_a100_large_experiment.py \
    --config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
    --resume ./outputs/multiple_passes/checkpoint-10000 \
    --output-dir ./outputs/multiple_passes

Configuration Details

Memory Usage Optimization

  • Gradient Checkpointing: Disabled for A100 efficiency
  • Flash Attention: Enabled for memory efficiency
  • bf16 Mixed Precision: Better for A100 than fp16
  • Gradient Clipping: 1.0 for stability
  • Group by Length: Enabled for better batching
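
If you are wiring these options up yourself with Hugging Face transformers (the repo's own trainer may expose them differently), the settings above roughly translate to the following sketch. The model id is an assumption for illustration, and Flash Attention 2 requires the flash-attn package:

import torch
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",               # assumed model id
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # memory-efficient attention
)

args = TrainingArguments(
    output_dir="./outputs/large_batch",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
    bf16=True,                     # A100-friendly mixed precision
    gradient_checkpointing=False,  # disabled: trade memory for speed
    max_grad_norm=1.0,             # gradient clipping
    group_by_length=True,          # batch similar sequence lengths together
)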

Data Loading Optimization

  • Num Workers: 8 for faster data loading
  • Pin Memory: Enabled for GPU transfer efficiency
  • Prefetch Factor: 2 for pipeline optimization
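
In plain PyTorch terms (the project likely sets these through its trainer configuration, so treat this as illustrative), the loader settings above correspond to:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset so the snippet runs; in practice this is the tokenized
# OpenHermes-FR training set.
train_dataset = TensorDataset(torch.randint(0, 50_000, (1024, 512)))

train_loader = DataLoader(
    train_dataset,
    batch_size=8,
    shuffle=True,
    num_workers=8,      # parallel workers for batch preparation
    pin_memory=True,    # page-locked host memory for faster GPU transfers
    prefetch_factor=2,  # each worker keeps 2 batches ready ahead of time
)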

Training Stability

  • Conservative Learning Rate: Lower LR for large effective batch sizes
  • Longer Warmup: More warmup steps for stability
  • Higher Beta2: 0.999 for AdamW stability
  • Gradient Clipping: Prevents gradient explosion
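
A minimal sketch of these stability settings in code, assuming an AdamW optimizer with a cosine warmup schedule (the exact schedule used by the configs is not specified here):

import torch
from torch import nn
from transformers import get_cosine_schedule_with_warmup

model = nn.Linear(8, 8)  # stand-in for the SmolLM3 model being fine-tuned

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-6,
    betas=(0.9, 0.999),  # higher beta2 for stability over long runs
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=2000, num_training_steps=25_000
)

# Inside the training loop, clip gradients before each optimizer step:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)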

Expected Results

Large Batch Configuration (1.3 passes)

  • Training Steps: 8,000
  • Effective Batch Size: 128
  • Steps per Epoch: ~6,250
  • Epochs: ~1.3
  • Expected Loss: Should converge to ~1.5-2.0

Multiple Passes Configuration (4 passes)

  • Training Steps: 25,000
  • Effective Batch Size: 120
  • Steps per Epoch: ~6,667
  • Epochs: ~3.75
  • Expected Loss: Should converge to ~1.2-1.5
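
The step and epoch counts above follow directly from the dataset size and the effective batch sizes:

dataset_size = 800_000

# Large batch configuration
steps_per_epoch_large = dataset_size // 128    # 6,250 steps per epoch
epochs_large = 8_000 / steps_per_epoch_large   # ~1.3 epochs

# Multiple passes configuration
steps_per_epoch_multi = dataset_size // 120    # 6,666 steps with exactly 800k examples
epochs_multi = 25_000 / steps_per_epoch_multi  # ~3.75 epochs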

Monitoring and Logging

Trackio Integration

Both configurations include Trackio monitoring:

  • Metrics Logging: Every 25-50 steps
  • Artifact Logging: Model checkpoints
  • Config Logging: Training configuration
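
As a rough illustration, assuming Trackio exposes a wandb-style init/log/finish API (the repo's actual integration may wrap this differently), logging could look like:

import trackio

# Project name is a placeholder; logged values are dummies.
trackio.init(
    project="smollm3_openhermes_fr_large_batch",
    config={"learning_rate": 5e-6, "effective_batch_size": 128},
)

for step in range(1, 201):
    if step % 25 == 0:  # metrics every 25-50 steps
        trackio.log({"train/loss": 2.0, "step": step})

trackio.finish()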

Checkpoint Strategy

  • Large Batch: Save every 1000 steps (8 checkpoints)
  • Multiple Passes: Save every 2000 steps (12 checkpoints)
  • Best Model: Automatically load best model at end
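
If checkpointing goes through transformers' TrainingArguments (recent versions; the repo's trainer may expose equivalent options under its own names), the large-batch strategy looks roughly like:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./outputs/large_batch",
    save_strategy="steps",
    save_steps=1000,                    # 8 checkpoints over 8,000 steps
    eval_strategy="steps",              # evaluation is needed to pick a best model
    eval_steps=1000,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)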

Hardware Requirements

Minimum Requirements

  • GPU: A100 80GB (or multiple A100s)
  • RAM: 64GB+ system RAM
  • Storage: 100GB+ for checkpoints and logs
  • Network: Fast internet for dataset download

Recommended Setup

  • GPU: 2-4x A100 80GB
  • RAM: 128GB+ system RAM
  • Storage: 500GB+ NVMe SSD
  • Network: 10Gbps+ connection

Troubleshooting

Out of Memory (OOM)

If you encounter OOM errors:

  1. Reduce batch_size from 8 to 6 or 4
  2. Increase gradient_accumulation_steps to maintain effective batch size
  3. Reduce max_seq_length from 8192 to 4096

Slow Training

If training is too slow:

  1. Increase dataloader_num_workers to 12-16
  2. Ensure you're using bf16 mixed precision
  3. Check that gradient checkpointing is disabled
  4. Verify flash attention is enabled

Convergence Issues

If loss doesn't converge:

  1. Reduce learning rate by 2x
  2. Increase warmup steps
  3. Check gradient norms in logs
  4. Verify dataset quality

Customization

For Different Dataset Sizes

Adjust max_iters based on your dataset size:

# For 1M datapoints with effective batch size 120
desired_epochs = 4                            # e.g. four passes over the data
steps_per_epoch = 1_000_000 // 120            # ~8,333 steps
max_iters = steps_per_epoch * desired_epochs  # 33,332 steps for 4 epochs

For Different GPU Memory

Adjust batch size and gradient accumulation:

# For 40GB A100
batch_size = 4
gradient_accumulation_steps = 32  # Effective batch size = 128

# For 24GB GPU
batch_size = 2
gradient_accumulation_steps = 64  # Effective batch size = 128

Performance Tips

  1. Use bf16: Better than fp16 for A100
  2. Disable Gradient Checkpointing: A100 has enough memory
  3. Use Flash Attention: Memory efficient attention
  4. Group by Length: Better batching efficiency
  5. Pin Memory: Faster GPU transfers
  6. Multiple Workers: Faster data loading

Expected Timeline

  • Large Batch: 6-8 hours for 1.3 passes
  • Multiple Passes: 20-24 hours for 4 passes
  • Full Dataset (5+ passes): 30+ hours

Next Steps

After training completes:

  1. Evaluate on validation set
  2. Test generation quality
  3. Push to Hugging Face Hub
  4. Deploy for inference
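
For step 3, a minimal sketch using transformers' built-in push_to_hub (the repo id is a placeholder, and the project may ship its own push scripts instead):

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint_dir = "./outputs/large_batch"          # final checkpoint directory
repo_id = "your-username/smollm3-openhermes-fr"   # placeholder Hub repo id

model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)

model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)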

For deployment instructions, see DEPLOYMENT_GUIDE.md.