# Cloud Training Guide for OpenHermes-FR Dataset This guide provides step-by-step instructions for training SmolLM3 models on cloud instances using the [legmlai/openhermes-fr](https://huggingface.co/datasets/legmlai/openhermes-fr) dataset. ## Overview The OpenHermes-FR dataset contains 799,875 French instruction-response pairs, perfect for fine-tuning SmolLM3 models for French language tasks. This guide covers: - ✅ **Cloud Instance Setup** - Complete environment configuration - ✅ **Dataset Integration** - Automatic loading and filtering - ✅ **Training Configuration** - Optimized for French instruction tuning - ✅ **Monitoring Integration** - Trackio experiment tracking - ✅ **Model Deployment** - Push to Hugging Face Hub ## Dataset Information ### Schema ```json { "prompt": "Explique la différence entre la photosynthèse C3 et C4.", "accepted_completion": "La photosynthèse C3 utilise… (réponse détaillée)", "bad_prompt_detected": false, "bad_response_detected": false, "bad_entry": false } ``` ### Key Features - **Size**: 799,875 examples (~1.4GB) - **Language**: 100% French - **Quality**: GPT-4o generated responses with automatic filtering - **License**: ODC-BY 1.0 ## Cloud Instance Setup ### 1. Choose Your Cloud Provider #### **AWS EC2 (Recommended)** ```bash # Launch instance with GPU # Recommended: g4dn.xlarge or g5.xlarge # AMI: Deep Learning AMI (Ubuntu 20.04) ``` #### **Google Cloud Platform** ```bash # Launch instance with GPU # Recommended: n1-standard-4 with Tesla T4 or V100 ``` #### **Azure** ```bash # Launch instance with GPU # Recommended: Standard_NC6s_v3 or Standard_NC12s_v3 ``` ### 2. Instance Specifications #### **Minimum Requirements** - **GPU**: 16GB+ VRAM (Tesla T4, V100, or A100) - **RAM**: 32GB+ system memory - **Storage**: 100GB+ SSD - **CPU**: 8+ cores #### **Recommended Specifications** - **GPU**: A100 (40GB) or H100 (80GB) - **RAM**: 64GB+ system memory - **Storage**: 200GB+ NVMe SSD - **CPU**: 16+ cores ### 3. Environment Setup ```bash # Update system sudo apt update && sudo apt upgrade -y # Install CUDA (if not pre-installed) # Follow NVIDIA CUDA installation guide for your GPU # Install Python dependencies sudo apt install python3-pip python3-venv git -y # Create virtual environment python3 -m venv smollm3_env source smollm3_env/bin/activate # Clone repository git clone cd # Install dependencies pip install -r requirements.txt # Install additional dependencies for cloud training pip install accelerate transformers datasets huggingface_hub ``` ## Training Configuration ### 1. Use the OpenHermes-FR Config The repository includes a specialized configuration for the OpenHermes-FR dataset: ```bash python train.py config/train_smollm3_openhermes_fr.py \ --enable_tracking \ --trackio_url "https://your-space.hf.space" \ --experiment_name "smollm3_fr_openhermes_v1" ``` ### 2. Configuration Details The `config/train_smollm3_openhermes_fr.py` includes: #### **Dataset Configuration** ```python dataset_name: str = "legmlai/openhermes-fr" dataset_split: str = "train" input_field: str = "prompt" target_field: str = "accepted_completion" filter_bad_entries: bool = True bad_entry_field: str = "bad_entry" ``` #### **Training Optimization** ```python batch_size: int = 2 # Reduced for French text (longer sequences) gradient_accumulation_steps: int = 8 # Maintains effective batch size learning_rate: float = 1e-5 # Lower for instruction tuning max_iters: int = 2000 # More iterations for large dataset ``` #### **Monitoring Integration** ```python enable_tracking: bool = True experiment_name: str = "smollm3_openhermes_fr" ``` ## Training Commands ### Basic Training ```bash python train.py config/train_smollm3_openhermes_fr.py ``` ### Training with Monitoring ```bash python train.py config/train_smollm3_openhermes_fr.py \ --enable_tracking \ --trackio_url "https://your-trackio-space.hf.space" \ --experiment_name "smollm3_fr_openhermes_v1" ``` ### Training with Custom Parameters ```bash python train.py config/train_smollm3_openhermes_fr.py \ --batch_size 4 \ --learning_rate 2e-5 \ --max_iters 3000 \ --enable_tracking \ --trackio_url "https://your-trackio-space.hf.space" \ --experiment_name "smollm3_fr_high_lr" ``` ### Training with Checkpoint Resume ```bash python train.py config/train_smollm3_openhermes_fr.py \ --init_from resume \ --enable_tracking \ --trackio_url "https://your-trackio-space.hf.space" \ --experiment_name "smollm3_fr_resume" ``` ## Dataset Processing ### Automatic Filtering The training script automatically: - ✅ **Loads** the OpenHermes-FR dataset from Hugging Face - ✅ **Filters** out bad entries (`bad_entry = true`) - ✅ **Splits** data into train/validation/test (98/1/1) - ✅ **Formats** prompts and completions for instruction tuning ### Manual Dataset Inspection ```python from datasets import load_dataset # Load dataset dataset = load_dataset("legmlai/openhermes-fr") # Check dataset info print(f"Dataset size: {len(dataset['train'])}") print(f"Sample columns: {dataset['train'].column_names}") # Check filtering bad_entries = dataset['train'].filter(lambda x: x['bad_entry']) print(f"Bad entries: {len(bad_entries)}") # Sample data sample = dataset['train'][0] print(f"Prompt: {sample['prompt']}") print(f"Completion: {sample['accepted_completion']}") ``` ## Monitoring and Tracking ### Trackio Integration The training automatically logs: - **Training metrics**: Loss, accuracy, learning rate - **System metrics**: GPU memory, CPU usage - **Dataset info**: Size, filtering statistics - **Model checkpoints**: Regular saves with metadata ### View Training Progress 1. **Trackio Space**: Visit your Trackio Space URL 2. **Experiment Details**: Check the "View Experiments" tab 3. **Metrics**: Monitor loss curves and system usage 4. **Logs**: Download training logs for analysis ## Model Deployment ### Push to Hugging Face Hub After training, deploy your model: ```bash python push_to_huggingface.py /output-checkpoint username/smollm3-fr-openhermes \ --trackio-url "https://your-trackio-space.hf.space" \ --experiment-name "smollm3_fr_openhermes_v1" ``` ### Use Your Model ```python from transformers import AutoModelForCausalLM, AutoTokenizer # Load your fine-tuned model model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-openhermes") tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-openhermes") # Generate French text prompt = "Expliquez le concept de l'intelligence artificielle." inputs = tokenizer(prompt, return_tensors="pt") outputs = model.generate(**inputs, max_new_tokens=200) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` ## Performance Optimization ### GPU Memory Management ```bash # Monitor GPU usage nvidia-smi -l 1 # Optimize for your GPU # For 16GB VRAM: batch_size=2, gradient_accumulation_steps=8 # For 24GB VRAM: batch_size=4, gradient_accumulation_steps=4 # For 40GB+ VRAM: batch_size=8, gradient_accumulation_steps=2 ``` ### Training Speed ```bash # Use mixed precision (enabled by default) fp16: bool = True # Enable gradient checkpointing (enabled by default) use_gradient_checkpointing: bool = True # Use flash attention (enabled by default) use_flash_attention: bool = True ``` ## Troubleshooting ### Common Issues #### 1. **Out of Memory (OOM)** ```bash # Reduce batch size python train.py config/train_smollm3_openhermes_fr.py --batch_size 1 # Increase gradient accumulation # Edit config: gradient_accumulation_steps = 16 ``` #### 2. **Slow Training** ```bash # Check GPU utilization nvidia-smi # Verify data loading # Check if dataset is cached locally ``` #### 3. **Dataset Loading Issues** ```bash # Clear cache rm -rf ~/.cache/huggingface/ # Check internet connection # Verify dataset name: "legmlai/openhermes-fr" ``` #### 4. **Monitoring Connection Issues** ```bash # Test Trackio connection curl -I https://your-trackio-space.hf.space # Check token permissions # Verify experiment name format ``` ### Debug Mode ```bash # Enable debug logging export LOG_LEVEL=DEBUG python train.py config/train_smollm3_openhermes_fr.py ``` ## Cost Optimization ### Cloud Provider Tips #### **AWS EC2** - Use Spot Instances for cost savings - Monitor usage with CloudWatch - Use appropriate instance types #### **Google Cloud Platform** - Use Preemptible VMs for non-critical training - Monitor with Cloud Monitoring - Use committed use discounts #### **Azure** - Use Spot VMs for cost optimization - Monitor with Azure Monitor - Use reserved instances for long training ### Training Time Estimates | GPU Type | Batch Size | Estimated Time | |----------|------------|----------------| | Tesla T4 (16GB) | 2 | 8-12 hours | | V100 (32GB) | 4 | 4-6 hours | | A100 (40GB) | 8 | 2-3 hours | | H100 (80GB) | 16 | 1-2 hours | ## Security Best Practices ### Token Management ```bash # Use environment variables export HF_TOKEN="your_token_here" export TRACKIO_TOKEN="your_trackio_token" # Don't hardcode in scripts # Use IAM roles when possible ``` ### Data Privacy ```bash # Use private repositories for sensitive models python push_to_huggingface.py model username/private-model --private # Secure your cloud instance # Use VPC and security groups ``` ## Complete Workflow Example ### 1. Setup Cloud Instance ```bash # Launch GPU instance # Install dependencies git clone cd pip install -r requirements.txt ``` ### 2. Train Model ```bash python train.py config/train_smollm3_openhermes_fr.py \ --enable_tracking \ --trackio_url "https://your-space.hf.space" \ --experiment_name "smollm3_fr_v1" ``` ### 3. Deploy Model ```bash python push_to_huggingface.py /output-checkpoint username/smollm3-fr-v1 \ --trackio-url "https://your-space.hf.space" \ --experiment-name "smollm3_fr_v1" ``` ### 4. Test Model ```python from transformers import AutoModelForCausalLM, AutoTokenizer model = AutoModelForCausalLM.from_pretrained("username/smollm3-fr-v1") tokenizer = AutoTokenizer.from_pretrained("username/smollm3-fr-v1") # Test French generation prompt = "Qu'est-ce que l'apprentissage automatique?" inputs = tokenizer(prompt, return_tensors="pt") outputs = model.generate(**inputs, max_new_tokens=100) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` ## Support and Resources ### Documentation - [OpenHermes-FR Dataset](https://huggingface.co/datasets/legmlai/openhermes-fr) - [SmolLM3 Model](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) - [Trackio Monitoring](https://github.com/Josephrp/trackio) ### Community - [Hugging Face Forums](https://discuss.huggingface.co/) - [Transformers Documentation](https://huggingface.co/docs/transformers/) ### Examples - [French Language Models](https://huggingface.co/models?search=french) - [Instruction Tuned Models](https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads) ## Conclusion This guide provides everything needed to train SmolLM3 models on the OpenHermes-FR dataset in the cloud: - ✅ **Complete Setup** - From cloud instance to model deployment - ✅ **Optimized Configuration** - Tailored for French instruction tuning - ✅ **Monitoring Integration** - Trackio experiment tracking - ✅ **Cost Optimization** - Tips for efficient cloud usage - ✅ **Troubleshooting** - Solutions for common issues Start training your French language model today!