A100 Large Scale Training Guide

This guide provides configurations and instructions for running full-scale experiments with multiple passes over the full OpenHermes-FR dataset (800k+ datapoints) on A100 GPUs.

Available Configurations

1. A100 Large Batch Configuration

File: config/train_smollm3_openhermes_fr_a100_large.py

Key Features:

  • Effective Batch Size: 128 (per-device batch size 8 × 16 gradient accumulation steps)
  • Training Duration: ~1.3 passes (8,000 steps)
  • Learning Rate: 5e-6 (optimized for large batches)
  • Mixed Precision: bf16 (A100 optimized)
  • Sequence Length: 8192 tokens
  • Memory Optimizations: No gradient checkpointing for A100 efficiency

Estimated Training Time: ~6-8 hours on A100
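
A minimal sketch of how the settings above might look in Python. The field names are illustrative assumptions, not the literal contents of config/train_smollm3_openhermes_fr_a100_large.py:

# Hypothetical field names; check the real config file before relying on them
large_batch_config = {
    "batch_size": 8,                      # per-device batch size
    "gradient_accumulation_steps": 16,    # 8 x 16 = 128 effective batch size
    "max_iters": 8000,                    # ~1.3 passes over ~800k examples
    "learning_rate": 5e-6,                # lowered for the large effective batch
    "max_seq_length": 8192,
    "mixed_precision": "bf16",            # preferred over fp16 on A100
    "use_gradient_checkpointing": False,  # A100 80GB has enough memory
}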

2. Multiple Passes Configuration

File: config/train_smollm3_openhermes_fr_a100_multiple_passes.py

Key Features:

  • Effective Batch Size: 120 (per-device batch size 6 × 20 gradient accumulation steps)
  • Training Duration: ~4 passes (25,000 steps)
  • Learning Rate: 3e-6 (conservative for long training)
  • Warmup Steps: 2000 (longer warmup for stability)
  • Checkpoint Strategy: More frequent saves (every 2000 steps)

Estimated Training Time: ~20-24 hours on A100
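
Sketched the same way, the multiple-passes run mainly changes the following values (again with illustrative, not literal, field names):

multiple_passes_overrides = {
    "batch_size": 6,
    "gradient_accumulation_steps": 20,  # 6 x 20 = 120 effective batch size
    "max_iters": 25000,                 # ~3.75 passes over ~800k examples
    "learning_rate": 3e-6,              # conservative for the longer run
    "warmup_steps": 2000,               # longer warmup for stability
    "save_steps": 2000,                 # more frequent checkpoints
}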

Training Commands

Quick Start - Large Batch Experiment

python run_a100_large_experiment.py \
    --config config/train_smollm3_openhermes_fr_a100_large.py \
    --experiment-name "smollm3_openhermes_fr_large_batch" \
    --output-dir ./outputs/large_batch

Multiple Passes Experiment

python run_a100_large_experiment.py \
    --config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
    --experiment-name "smollm3_openhermes_fr_multiple_passes" \
    --output-dir ./outputs/multiple_passes

Dry Run (Check Configuration)

python run_a100_large_experiment.py \
    --config config/train_smollm3_openhermes_fr_a100_large.py \
    --dry-run

Resume Training

python run_a100_large_experiment.py \
    --config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
    --resume ./outputs/multiple_passes/checkpoint-10000 \
    --output-dir ./outputs/multiple_passes

Configuration Details

Memory Usage Optimization

  • Gradient Checkpointing: Disabled for A100 efficiency
  • Flash Attention: Enabled for memory efficiency
  • bf16 Mixed Precision: Better for A100 than fp16
  • Gradient Clipping: 1.0 for stability
  • Group by Length: Enabled for better batching
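
If you are wiring these options up yourself with Hugging Face transformers (the repo's own trainer may expose them differently), the settings above roughly translate to the following sketch. The model id is an assumption for illustration, and Flash Attention 2 requires the flash-attn package:

import torch
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",               # assumed model id
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # memory-efficient attention
)

args = TrainingArguments(
    output_dir="./outputs/large_batch",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
    bf16=True,                     # A100-friendly mixed precision
    gradient_checkpointing=False,  # disabled: trade memory for speed
    max_grad_norm=1.0,             # gradient clipping
    group_by_length=True,          # batch similar sequence lengths together
)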

Data Loading Optimization

  • Num Workers: 8 for faster data loading
  • Pin Memory: Enabled for GPU transfer efficiency
  • Prefetch Factor: 2 for pipeline optimization
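
In plain PyTorch terms (the project likely sets these through its trainer configuration, so treat this as illustrative), the loader settings above correspond to:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset so the snippet runs; in practice this is the tokenized
# OpenHermes-FR training set.
train_dataset = TensorDataset(torch.randint(0, 50_000, (1024, 512)))

train_loader = DataLoader(
    train_dataset,
    batch_size=8,
    shuffle=True,
    num_workers=8,      # parallel workers for batch preparation
    pin_memory=True,    # page-locked host memory for faster GPU transfers
    prefetch_factor=2,  # each worker keeps 2 batches ready ahead of time
)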

Training Stability

  • Conservative Learning Rate: Lower LR for large effective batch sizes
  • Longer Warmup: More warmup steps for stability
  • Higher Beta2: 0.999 for AdamW stability
  • Gradient Clipping: Prevents gradient explosion
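
A minimal sketch of these stability settings in code, assuming an AdamW optimizer with a cosine warmup schedule (the exact schedule used by the configs is not specified here):

import torch
from torch import nn
from transformers import get_cosine_schedule_with_warmup

model = nn.Linear(8, 8)  # stand-in for the SmolLM3 model being fine-tuned

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-6,
    betas=(0.9, 0.999),  # higher beta2 for stability over long runs
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=2000, num_training_steps=25_000
)

# Inside the training loop, clip gradients before each optimizer step:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)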

Expected Results

Large Batch Configuration (1.3 passes)

  • Training Steps: 8,000
  • Effective Batch Size: 128
  • Steps per Epoch: ~6,250
  • Epochs: ~1.3
  • Expected Loss: Should converge to ~1.5-2.0

Multiple Passes Configuration (4 passes)

  • Training Steps: 25,000
  • Effective Batch Size: 120
  • Steps per Epoch: ~6,667
  • Epochs: ~3.75
  • Expected Loss: Should converge to ~1.2-1.5
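
The step and epoch counts above follow directly from the dataset size and the effective batch sizes:

dataset_size = 800_000

# Large batch configuration
steps_per_epoch_large = dataset_size // 128    # 6,250 steps per epoch
epochs_large = 8_000 / steps_per_epoch_large   # ~1.3 epochs

# Multiple passes configuration
steps_per_epoch_multi = dataset_size // 120    # 6,666 steps with exactly 800k examples
epochs_multi = 25_000 / steps_per_epoch_multi  # ~3.75 epochs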

Monitoring and Logging

Trackio Integration

Both configurations include Trackio monitoring:

  • Metrics Logging: Every 25-50 steps
  • Artifact Logging: Model checkpoints
  • Config Logging: Training configuration
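
As a rough illustration, assuming Trackio exposes a wandb-style init/log/finish API (the repo's actual integration may wrap this differently), logging could look like:

import trackio

# Project name is a placeholder; logged values are dummies.
trackio.init(
    project="smollm3_openhermes_fr_large_batch",
    config={"learning_rate": 5e-6, "effective_batch_size": 128},
)

for step in range(1, 201):
    if step % 25 == 0:  # metrics every 25-50 steps
        trackio.log({"train/loss": 2.0, "step": step})

trackio.finish()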

Checkpoint Strategy

  • Large Batch: Save every 1000 steps (8 checkpoints)
  • Multiple Passes: Save every 2000 steps (12 checkpoints)
  • Best Model: Automatically load best model at end
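
If checkpointing goes through transformers' TrainingArguments (recent versions; the repo's trainer may expose equivalent options under its own names), the large-batch strategy looks roughly like:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./outputs/large_batch",
    save_strategy="steps",
    save_steps=1000,                    # 8 checkpoints over 8,000 steps
    eval_strategy="steps",              # evaluation is needed to pick a best model
    eval_steps=1000,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)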

Hardware Requirements

Minimum Requirements

  • GPU: A100 80GB (or multiple A100s)
  • RAM: 64GB+ system RAM
  • Storage: 100GB+ for checkpoints and logs
  • Network: Fast internet for dataset download

Recommended Setup

  • GPU: 2-4x A100 80GB
  • RAM: 128GB+ system RAM
  • Storage: 500GB+ NVMe SSD
  • Network: 10Gbps+ connection

Troubleshooting

Out of Memory (OOM)

If you encounter OOM errors:

  1. Reduce batch_size from 8 to 6 or 4
  2. Increase gradient_accumulation_steps to maintain effective batch size
  3. Reduce max_seq_length from 8192 to 4096

Slow Training

If training is too slow:

  1. Increase dataloader_num_workers to 12-16
  2. Ensure you're using bf16 mixed precision
  3. Check that gradient checkpointing is disabled
  4. Verify flash attention is enabled

Convergence Issues

If loss doesn't converge:

  1. Reduce learning rate by 2x
  2. Increase warmup steps
  3. Check gradient norms in logs
  4. Verify dataset quality

Customization

For Different Dataset Sizes

Adjust max_iters based on your dataset size:

# For 1M datapoints with effective batch size 120
desired_epochs = 4                            # e.g. four passes over the data
steps_per_epoch = 1_000_000 // 120            # ~8,333 steps
max_iters = steps_per_epoch * desired_epochs  # 33,332 steps for 4 epochs

For Different GPU Memory

Adjust batch size and gradient accumulation:

# For 40GB A100
batch_size = 4
gradient_accumulation_steps = 32  # Effective batch size = 128

# For 24GB GPU
batch_size = 2
gradient_accumulation_steps = 64  # Effective batch size = 128

Performance Tips

  1. Use bf16: Better than fp16 for A100
  2. Disable Gradient Checkpointing: A100 has enough memory
  3. Use Flash Attention: Memory efficient attention
  4. Group by Length: Better batching efficiency
  5. Pin Memory: Faster GPU transfers
  6. Multiple Workers: Faster data loading

Expected Timeline

  • Large Batch: 6-8 hours for 1.3 passes
  • Multiple Passes: 20-24 hours for 4 passes
  • Full Dataset (5+ passes): 30+ hours

Next Steps

After training completes:

  1. Evaluate on validation set
  2. Test generation quality
  3. Push to Hugging Face Hub
  4. Deploy for inference
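
For step 3, a minimal sketch using transformers' built-in push_to_hub (the repo id is a placeholder, and the project may ship its own push scripts instead):

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint_dir = "./outputs/large_batch"          # final checkpoint directory
repo_id = "your-username/smollm3-openhermes-fr"   # placeholder Hub repo id

model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)

model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)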

For deployment instructions, see DEPLOYMENT_GUIDE.md.