SmolFactory / docs /CLOUD_TRAINING_GUIDE.md

Cloud Training Guide for OpenHermes-FR Dataset

This guide provides step-by-step instructions for training SmolLM3 models on cloud instances using the legmlai/openhermes-fr dataset.

Overview

The OpenHermes-FR dataset contains 799,875 French instruction-response pairs, making it well suited for fine-tuning SmolLM3 models on French-language tasks. This guide covers:

  • Cloud Instance Setup - Complete environment configuration
  • Dataset Integration - Automatic loading and filtering
  • Training Configuration - Optimized for French instruction tuning
  • Monitoring Integration - Trackio experiment tracking
  • Model Deployment - Push to Hugging Face Hub

Dataset Information

Schema

{
  "prompt": "Explique la différence entre la photosynthèse C3 et C4.",
  "accepted_completion": "La photosynthèse C3 utilise… (réponse détaillée)",
  "bad_prompt_detected": false,
  "bad_response_detected": false,
  "bad_entry": false
}
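
For a quick sanity check, a record can be validated against this schema. The snippet below is a minimal sketch using only the field names shown above; `validate_record` is an illustrative helper, not part of the repository.

```python
# Field names and types taken from the schema above.
REQUIRED_FIELDS = {
    "prompt": str,
    "accepted_completion": str,
    "bad_prompt_detected": bool,
    "bad_response_detected": bool,
    "bad_entry": bool,
}

def validate_record(record):
    """Return True if a record has every schema field with the expected type."""
    return all(
        field in record and isinstance(record[field], expected)
        for field, expected in REQUIRED_FIELDS.items()
    )

example = {
    "prompt": "Explique la différence entre la photosynthèse C3 et C4.",
    "accepted_completion": "La photosynthèse C3 utilise… (réponse détaillée)",
    "bad_prompt_detected": False,
    "bad_response_detected": False,
    "bad_entry": False,
}
print(validate_record(example))  # True
```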

Key Features

  • Size: 799,875 examples (~1.4GB)
  • Language: 100% French
  • Quality: GPT-4o-generated responses with automatic filtering
  • License: ODC-BY 1.0

Cloud Instance Setup

1. Choose Your Cloud Provider

AWS EC2 (Recommended)

# Launch instance with GPU
# Recommended: g4dn.xlarge or g5.xlarge
# AMI: Deep Learning AMI (Ubuntu 20.04)

Google Cloud Platform

# Launch instance with GPU
# Recommended: n1-standard-4 with Tesla T4 or V100

Azure

# Launch instance with GPU
# Recommended: Standard_NC6s_v3 or Standard_NC12s_v3

2. Instance Specifications

Minimum Requirements

  • GPU: 16GB+ VRAM (Tesla T4, V100, or A100)
  • RAM: 32GB+ system memory
  • Storage: 100GB+ SSD
  • CPU: 8+ cores

Recommended Specifications

  • GPU: A100 (40GB) or H100 (80GB)
  • RAM: 64GB+ system memory
  • Storage: 200GB+ NVMe SSD
  • CPU: 16+ cores

3. Environment Setup

# Update system
sudo apt update && sudo apt upgrade -y

# Install CUDA (if not pre-installed)
# Follow NVIDIA CUDA installation guide for your GPU

# Install Python dependencies
sudo apt install python3-pip python3-venv git -y

# Create virtual environment
python3 -m venv smollm3_env
source smollm3_env/bin/activate

# Clone repository
git clone <your-repo-url>
cd <your-repo-directory>

# Install dependencies
pip install -r requirements.txt

# Install additional dependencies for cloud training
pip install accelerate transformers datasets huggingface_hub

Training Configuration

1. Use the OpenHermes-FR Config

The repository includes a specialized configuration for the OpenHermes-FR dataset:

python train.py config/train_smollm3_openhermes_fr.py \
    --enable_tracking \
    --trackio_url "https://your-space.hf.space" \
    --experiment_name "smollm3_fr_openhermes_v1"

2. Configuration Details

The config/train_smollm3_openhermes_fr.py includes:

Dataset Configuration

dataset_name: str = "legmlai/openhermes-fr"
dataset_split: str = "train"
input_field: str = "prompt"
target_field: str = "accepted_completion"
filter_bad_entries: bool = True
bad_entry_field: str = "bad_entry"

Training Optimization

batch_size: int = 2  # Reduced for French text (longer sequences)
gradient_accumulation_steps: int = 8  # Maintains effective batch size
learning_rate: float = 1e-5  # Lower for instruction tuning
max_iters: int = 2000  # More iterations for large dataset
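
A quick arithmetic check (assuming a single GPU; multiply by device count otherwise) shows what these values imply for dataset coverage:

```python
# Values from the configuration above; dataset size is from this guide.
batch_size = 2
gradient_accumulation_steps = 8
max_iters = 2000
dataset_size = 799_875

effective_batch = batch_size * gradient_accumulation_steps  # examples per optimizer step
examples_seen = effective_batch * max_iters                 # total examples processed
coverage = examples_seen / dataset_size                     # fraction of one epoch

print(f"Effective batch size: {effective_batch}")
print(f"Examples seen: {examples_seen}")
print(f"Dataset coverage: {coverage:.1%}")
```

At the default settings, 2,000 iterations cover only about 4% of the dataset, which is why raising max_iters is recommended when you want a larger fraction of an epoch.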

Monitoring Integration

enable_tracking: bool = True
experiment_name: str = "smollm3_openhermes_fr"

Training Commands

Basic Training

python train.py config/train_smollm3_openhermes_fr.py

Training with Monitoring

python train.py config/train_smollm3_openhermes_fr.py \
    --enable_tracking \
    --trackio_url "https://your-trackio-space.hf.space" \
    --experiment_name "smollm3_fr_openhermes_v1"

Training with Custom Parameters

python train.py config/train_smollm3_openhermes_fr.py \
    --batch_size 4 \
    --learning_rate 2e-5 \
    --max_iters 3000 \
    --enable_tracking \
    --trackio_url "https://your-trackio-space.hf.space" \
    --experiment_name "smollm3_fr_high_lr"

Training with Checkpoint Resume

python train.py config/train_smollm3_openhermes_fr.py \
    --init_from resume \
    --enable_tracking \
    --trackio_url "https://your-trackio-space.hf.space" \
    --experiment_name "smollm3_fr_resume"

Dataset Processing

Automatic Filtering

The training script automatically:

  • Loads the OpenHermes-FR dataset from Hugging Face
  • Filters out bad entries (bad_entry = true)
  • Splits data into train/validation/test (98/1/1)
  • Formats prompts and completions for instruction tuning

Manual Dataset Inspection

from datasets import load_dataset

# Load dataset
dataset = load_dataset("legmlai/openhermes-fr")

# Check dataset info
print(f"Dataset size: {len(dataset['train'])}")
print(f"Sample columns: {dataset['train'].column_names}")

# Check filtering
bad_entries = dataset['train'].filter(lambda x: x['bad_entry'])
print(f"Bad entries: {len(bad_entries)}")

# Sample data
sample = dataset['train'][0]
print(f"Prompt: {sample['prompt']}")
print(f"Completion: {sample['accepted_completion']}")

Monitoring and Tracking

Trackio Integration

The training automatically logs:

  • Training metrics: Loss, accuracy, learning rate
  • System metrics: GPU memory, CPU usage
  • Dataset info: Size, filtering statistics
  • Model checkpoints: Regular saves with metadata
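
The Trackio client handles this logging automatically. As a fallback when the Space is unreachable, the same metrics can be mirrored to a local JSON-lines file; the logger below is a hypothetical stand-in, not the Trackio API.

```python
import json
import time
from pathlib import Path

class JsonlLogger:
    """Minimal local metric logger: one JSON object per training step."""

    def __init__(self, path):
        self.path = Path(path)

    def log(self, step, **metrics):
        record = {"step": step, "time": time.time(), **metrics}
        with self.path.open("a") as f:
            f.write(json.dumps(record) + "\n")

# Usage: mirror whatever you send to Trackio into a local file.
logger = JsonlLogger("metrics.jsonl")
logger.log(step=1, loss=2.31, learning_rate=1e-5)
```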

View Training Progress

  1. Trackio Space: Visit your Trackio Space URL
  2. Experiment Details: Check the "View Experiments" tab
  3. Metrics: Monitor loss curves and system usage
  4. Logs: Download training logs for analysis

Model Deployment

Push to Hugging Face Hub

After training, deploy your model:

python push_to_huggingface.py /output-checkpoint username/smollm3-fr-openhermes \
    --trackio-url "https://your-trackio-space.hf.space" \
    --experiment-name "smollm3_fr_openhermes_v1"

Use Your Model

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load your fine-tuned model
model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-openhermes")
tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-openhermes")

# Generate French text
prompt = "Expliquez le concept de l'intelligence artificielle."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Performance Optimization

GPU Memory Management

# Monitor GPU usage
nvidia-smi -l 1

# Optimize for your GPU
# For 16GB VRAM: batch_size=2, gradient_accumulation_steps=8
# For 24GB VRAM: batch_size=4, gradient_accumulation_steps=4
# For 40GB+ VRAM: batch_size=8, gradient_accumulation_steps=2
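
The batch-size pairings above are not arbitrary: each one holds the effective batch size constant, so changing GPUs changes throughput but not the optimization dynamics. A quick check:

```python
# (batch_size, gradient_accumulation_steps) pairings from the comments above.
vram_configs = {
    "16GB": (2, 8),
    "24GB": (4, 4),
    "40GB+": (8, 2),
}
for vram, (bs, accum) in vram_configs.items():
    print(f"{vram}: effective batch = {bs * accum}")
```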

Training Speed

# Use mixed precision (enabled by default)
fp16: bool = True

# Enable gradient checkpointing (enabled by default)
use_gradient_checkpointing: bool = True

# Use flash attention (enabled by default)
use_flash_attention: bool = True

Troubleshooting

Common Issues

1. Out of Memory (OOM)

# Reduce batch size
python train.py config/train_smollm3_openhermes_fr.py --batch_size 1

# Increase gradient accumulation
# Edit config: gradient_accumulation_steps = 16

2. Slow Training

# Check GPU utilization
nvidia-smi

# Verify data loading
# Check if dataset is cached locally

3. Dataset Loading Issues

# Clear cache
rm -rf ~/.cache/huggingface/

# Check internet connection
# Verify dataset name: "legmlai/openhermes-fr"

4. Monitoring Connection Issues

# Test Trackio connection
curl -I https://your-trackio-space.hf.space

# Check token permissions
# Verify experiment name format

Debug Mode

# Enable debug logging
export LOG_LEVEL=DEBUG
python train.py config/train_smollm3_openhermes_fr.py

Cost Optimization

Cloud Provider Tips

AWS EC2

  • Use Spot Instances for cost savings
  • Monitor usage with CloudWatch
  • Use appropriate instance types

Google Cloud Platform

  • Use Preemptible VMs for non-critical training
  • Monitor with Cloud Monitoring
  • Use committed use discounts

Azure

  • Use Spot VMs for cost optimization
  • Monitor with Azure Monitor
  • Use reserved instances for long training

Training Time Estimates

| GPU Type        | Batch Size | Estimated Time |
|-----------------|------------|----------------|
| Tesla T4 (16GB) | 2          | 8-12 hours     |
| V100 (32GB)     | 4          | 4-6 hours      |
| A100 (40GB)     | 8          | 2-3 hours      |
| H100 (80GB)     | 16         | 1-2 hours      |

Security Best Practices

Token Management

# Use environment variables
export HF_TOKEN="your_token_here"
export TRACKIO_TOKEN="your_trackio_token"

# Don't hardcode in scripts
# Use IAM roles when possible
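
The environment-variable pattern can be wrapped in a small helper so scripts fail loudly instead of silently falling back to a hardcoded token. require_token is a hypothetical helper, not part of the repository.

```python
import os

def require_token(name="HF_TOKEN"):
    """Fetch a token from the environment, failing loudly if it is missing."""
    token = os.environ.get(name)
    if not token:
        raise RuntimeError(f"{name} is not set; export it before training.")
    return token

# After `export HF_TOKEN=...`, scripts can authenticate without hardcoding:
#   from huggingface_hub import login
#   login(token=require_token())
```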

Data Privacy

# Use private repositories for sensitive models
python push_to_huggingface.py model username/private-model --private

# Secure your cloud instance
# Use VPC and security groups

Complete Workflow Example

1. Setup Cloud Instance

# Launch GPU instance
# Install dependencies
git clone <your-repo>
cd <your-repo>
pip install -r requirements.txt

2. Train Model

python train.py config/train_smollm3_openhermes_fr.py \
    --enable_tracking \
    --trackio_url "https://your-space.hf.space" \
    --experiment_name "smollm3_fr_v1"

3. Deploy Model

python push_to_huggingface.py /output-checkpoint username/smollm3-fr-v1 \
    --trackio-url "https://your-space.hf.space" \
    --experiment-name "smollm3_fr_v1"

4. Test Model

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-v1")
tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-v1")

# Test French generation
prompt = "Qu'est-ce que l'apprentissage automatique?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Support and Resources

Documentation

Community

Examples

Conclusion

This guide provides everything needed to train SmolLM3 models on the OpenHermes-FR dataset in the cloud:

  • Complete Setup - From cloud instance to model deployment
  • Optimized Configuration - Tailored for French instruction tuning
  • Monitoring Integration - Trackio experiment tracking
  • Cost Optimization - Tips for efficient cloud usage
  • Troubleshooting - Solutions for common issues

Start training your French language model today!