---
description: SmolLM3 Fine-tuning Pipeline - Project Rules and Conventions
globs: ["**/*.py", "**/*.sh", "**/*.md", "**/*.json"]
alwaysApply: true
---
# SmolLM3 Fine-tuning Pipeline Project Rules

## Project Overview

This is a comprehensive end-to-end fine-tuning pipeline for SmolLM3 models with Trackio monitoring, Hugging Face integration, and interactive configuration management.

## Core Architecture

### Directory Structure

- `config/` - Training configuration files for different scenarios
- `src/` - Core training and model logic
- `scripts/` - Utility scripts for deployment, dataset management, and model pushing
- `docs/` - Comprehensive documentation and guides
- `templates/` - Templates for HF Spaces and datasets
- `tests/` - Test files and debugging scripts
- `outputs/` - Training outputs and checkpoints

### Key Components

#### Training Configurations

- **Basic Training**: SmolLM3-3B + OpenHermes-FR, 3 epochs, batch size 2
- **H100 Lightweight**: SmolLM3-3B + OpenHermes-FR (80K samples), 1 epoch, batch size 16
- **A100 Large Scale**: SmolLM3-3B + OpenHermes-FR, 1.3 passes, batch size 8
- **Multiple Passes**: SmolLM3-3B + OpenHermes-FR, 4 epochs, batch size 6
- **Custom Configuration**: User-defined parameters

#### Core Modules

- `src/train.py` - Main training orchestration
- `src/model.py` - Model loading and configuration
- `src/data.py` - Dataset processing and loading
- `src/monitoring.py` - Trackio integration and metrics
- `src/trainer.py` - Training loop and optimization

## Coding Conventions

### Python Style

- Use type hints for all function parameters and return values
- Follow PEP 8 for formatting
- Use descriptive variable names in snake_case
- Add comprehensive docstrings for all functions
- Use f-strings for string formatting
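A tiny function in this style, as a reference point. The function name and its fields are illustrative only, not part of the codebase:

```python
def format_metric_line(step: int, loss: float, learning_rate: float) -> str:
    """Format a single training metric line for console logging.

    Args:
        step: Global training step.
        loss: Training loss at this step.
        learning_rate: Current learning rate.

    Returns:
        A human-readable log line, e.g. "step=10 loss=1.2346 lr=5.00e-06".
    """
    # f-strings with format specs keep the log output compact and consistent
    return f"step={step} loss={loss:.4f} lr={learning_rate:.2e}"
```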
### Configuration Management

- All training configs inherit from the `SmolLM3Config` base class
- Use dataclasses for configuration objects
- Validate configuration parameters in `__post_init__`
- Support both YAML and Python configuration files
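The dataclass-plus-`__post_init__` pattern can be sketched as follows. Only the name `SmolLM3Config` comes from this codebase; the field names, defaults, and the `H100LightweightConfig` subclass are illustrative assumptions:

```python
from dataclasses import dataclass


@dataclass
class SmolLM3Config:
    """Base configuration shared by all training scenarios (illustrative fields)."""

    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    batch_size: int = 2
    learning_rate: float = 5e-6
    num_epochs: int = 3

    def __post_init__(self) -> None:
        # Validate parameters as soon as the config object is constructed
        if self.batch_size < 1:
            raise ValueError(f"batch_size must be >= 1, got {self.batch_size}")
        if self.learning_rate <= 0:
            raise ValueError(f"learning_rate must be positive, got {self.learning_rate}")


@dataclass
class H100LightweightConfig(SmolLM3Config):
    """Rapid-experiment preset; overrides only the defaults that differ."""

    batch_size: int = 16
    learning_rate: float = 8e-6
    num_epochs: int = 1
```

Subclasses inherit `__post_init__`, so every preset is validated with the same rules.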
### Error Handling

- Use try-except blocks for external API calls (HF, Trackio)
- Log errors with appropriate context
- Provide user-friendly error messages
- Implement graceful degradation for optional features
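A minimal sketch of graceful degradation for an optional external service. The function name is hypothetical, and `urllib` stands in for whatever client the pipeline actually uses to reach Trackio:

```python
import logging
from typing import Optional
from urllib.request import urlopen  # stand-in for the real Trackio client

logger = logging.getLogger(__name__)


def init_monitoring(trackio_url: Optional[str]) -> bool:
    """Try to reach the Trackio endpoint; fall back to local logging on failure.

    Returns True if remote monitoring is active, False if running degraded.
    """
    if not trackio_url:
        logger.warning("No Trackio URL configured; metrics will only be logged locally.")
        return False
    try:
        urlopen(trackio_url, timeout=5)
        return True
    except Exception as exc:
        # Log with context and a user-friendly message, then degrade gracefully
        logger.error(
            f"Could not reach Trackio at {trackio_url}: {exc}; "
            "continuing without remote monitoring."
        )
        return False
```

Training proceeds either way; the boolean only controls whether remote logging is attempted later.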
### Monitoring Integration

- Always include the Trackio URL and experiment name in configs
- Log metrics every N steps (configurable)
- Save checkpoints and artifacts to HF Datasets
- Use structured logging with consistent field names
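The "every N steps, consistent field names" rule can be sketched like this; the class name and the `train/…` field names are illustrative assumptions:

```python
from typing import Any, Dict, List


class MetricsLogger:
    """Minimal sketch of step-based metric logging with consistent field names."""

    def __init__(self, logging_steps: int = 10) -> None:
        self.logging_steps = logging_steps
        self.emitted: List[Dict[str, Any]] = []

    def log(self, step: int, loss: float, learning_rate: float) -> None:
        """Record metrics only every `logging_steps` steps."""
        if step % self.logging_steps != 0:
            return
        # Consistent field names make downstream plotting and comparison trivial
        self.emitted.append({
            "train/step": step,
            "train/loss": loss,
            "train/learning_rate": learning_rate,
        })
```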
## File Naming Conventions

### Configuration Files

- `train_smollm3_*.py` - Training configurations
- `*_config.py` - General configuration files
- Use descriptive suffixes: `_h100_lightweight`, `_a100_large`, `_multiple_passes`

### Script Files

- `deploy_*.py` - Deployment scripts
- `setup_*.py` - Setup and initialization scripts
- `push_*.py` - Model pushing scripts
- `configure_*.py` - Configuration scripts

### Test Files

- `test_*.py` - Test files
- `debug_*.py` - Debugging scripts
- Include descriptive names indicating what they test

## Training Pipeline Workflow

### Interactive Pipeline (`launch.sh`)

1. **Authentication**: HF username and token validation
2. **Configuration Selection**: Choose from predefined configs or custom
3. **Experiment Setup**: Configure the experiment name and repositories
4. **Environment Setup**: Install dependencies and set up a virtual environment
5. **Deployment**: Deploy the Trackio Space and set up the HF Dataset
6. **Training**: Execute training with monitoring
7. **Model Push**: Upload the model to the HF Hub with documentation
8. **Testing**: Validate the uploaded model's functionality

### Configuration Selection Logic

- Basic Training: Default for beginners and learning
- H100 Lightweight: Rapid experiments on H100 GPUs
- A100 Large Scale: Serious research and production
- Multiple Passes: Thorough training for production models
- Custom: User-defined parameters for specific needs

## Dataset Management

### Supported Formats

- Hugging Face Datasets format
- JSON files with prompt/completion pairs
- Chat format with a messages array
- Custom formats with conversion functions

### Dataset Processing

- Automatic format detection and conversion
- Random sampling for lightweight configurations
- Validation split creation
- Bad-entry filtering and handling
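Format detection and conversion to the chat `messages` format can be sketched as below. The function names and the exact set of recognized keys are illustrative assumptions, not the actual `src/data.py` API:

```python
from typing import Dict, List


def detect_format(example: Dict) -> str:
    """Best-effort detection of the incoming dataset format."""
    if "messages" in example:
        return "chat"
    if "prompt" in example and "completion" in example:
        return "prompt_completion"
    return "unknown"


def to_chat_format(example: Dict[str, str]) -> Dict[str, List[Dict[str, str]]]:
    """Convert a prompt/completion pair into the chat `messages` format."""
    return {
        "messages": [
            {"role": "user", "content": example["prompt"]},
            {"role": "assistant", "content": example["completion"]},
        ]
    }
```

Unknown formats should be routed to a custom conversion function or rejected with a clear error.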
### Dataset Sampling (H100 Lightweight)

- 80,000 random samples from OpenHermes-FR
- 1,000 validation samples (if available)
- Fixed random seed (42) for reproducibility
- Automatic sampling during dataset preparation
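The seeded-sampling idea, sketched with the standard library so it runs without downloading the dataset (the function name is hypothetical):

```python
import random
from typing import List


def sample_indices(dataset_size: int, num_samples: int, seed: int = 42) -> List[int]:
    """Draw a reproducible random sample of row indices (fixed seed 42 by default)."""
    rng = random.Random(seed)  # local RNG: no global-state side effects
    num_samples = min(num_samples, dataset_size)
    return rng.sample(range(dataset_size), num_samples)
```

With the `datasets` library, the equivalent operation is along the lines of `dataset.shuffle(seed=42).select(range(80_000))`; the same seed yields the same 80K subset on every run.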
## Model Management

### Model Loading

- Support for HuggingFaceTB/SmolLM3-3B
- Flash attention and gradient checkpointing
- Mixed precision training (fp16/bf16)
- Device mapping and memory optimization
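A sketch of how these options come together as `from_pretrained` keyword arguments. The helper function is hypothetical, and the commented call assumes the Transformers API (`torch_dtype`, `device_map`, `attn_implementation`); verify against the installed version:

```python
from typing import Any, Dict


def build_model_kwargs(bf16: bool = True, flash_attention: bool = True) -> Dict[str, Any]:
    """Assemble keyword arguments for a `from_pretrained` call."""
    kwargs: Dict[str, Any] = {
        "torch_dtype": "bfloat16" if bf16 else "float16",  # mixed precision
        "device_map": "auto",                              # spread layers across devices
    }
    if flash_attention:
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs


# Usage sketch (requires transformers, a GPU, and flash-attn installed):
# model = AutoModelForCausalLM.from_pretrained(
#     "HuggingFaceTB/SmolLM3-3B", **build_model_kwargs()
# )
# model.gradient_checkpointing_enable()  # trade compute for memory
```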
### Model Pushing

- Comprehensive model cards with training details
- Automatic README generation
- License and usage information
- Training metrics and configuration

## Monitoring and Tracking

### Trackio Integration

- Real-time metrics logging
- Training-curve visualization
- Resource usage monitoring
- Artifact storage and versioning

### Metrics to Track

- Training and validation loss
- Learning rate schedule
- Gradient norms
- GPU utilization and memory
- Training speed (steps/second)

## Error Handling and Validation

### Input Validation

- Validate HF tokens before use
- Check CUDA availability
- Verify dataset accessibility
- Validate configuration parameters
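A cheap pre-flight shape check for HF tokens (current Hub tokens start with `hf_`), sketched with a hypothetical function name. Real validation should hit the Hub, e.g. via `huggingface_hub.HfApi().whoami(token=token)`:

```python
def validate_hf_token(token: str) -> bool:
    """Cheap local sanity check before any network call.

    This only checks the token's shape; authoritative validation
    requires an authenticated request to the Hub.
    """
    return bool(token) and token.startswith("hf_") and len(token) > 10
```

Running this before launching training turns a late, cryptic 401 into an immediate, user-friendly message.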
### Error Recovery

- Graceful handling of network issues
- Automatic retry for failed operations
- Checkpoint recovery for interrupted training
- Fallback options for optional features
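The automatic-retry rule can be sketched as a small exponential-backoff wrapper (the function name and defaults are illustrative):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], max_attempts: int = 3, base_delay: float = 1.0) -> T:
    """Retry a flaky operation (e.g. an upload to the HF Hub) with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Back off 1x, 2x, 4x, ... the base delay between attempts
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Usage sketch: `with_retries(lambda: api.upload_file(...), max_attempts=5)`.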
## Documentation Standards

### README Files

- Clear project description
- Installation instructions
- Usage examples
- Configuration options
- Troubleshooting guide

### Code Documentation

- Comprehensive docstrings
- Type hints for all functions
- Example usage in docstrings
- Parameter descriptions
- Return value documentation

## Testing and Validation

### Test Categories

- Unit tests for core functions
- Integration tests for the pipeline
- Configuration validation tests
- Model loading and saving tests
- Dataset processing tests

### Debugging Tools

- Standalone test scripts
- Configuration validation
- Model testing utilities
- Dataset inspection tools

## Performance Optimization

### H100 Optimizations

- Larger batch sizes (16 vs. 8 for A100)
- Reduced gradient accumulation (4 vs. 16)
- Higher learning rates (8e-6 vs. 5e-6)
- Optimized data loading (4 workers, pinned memory)

### Memory Management

- Gradient checkpointing for large models
- Mixed precision training
- Dynamic batch sizing
- Memory-efficient data loading

## Security and Best Practices

### Token Management

- Never hardcode tokens in code
- Use environment variables
- Validate tokens before use
- Secure token storage
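The environment-variable rule in miniature, using `HF_TOKEN` (the variable the Hugging Face tooling conventionally reads); the function name is illustrative:

```python
import os


def get_hf_token() -> str:
    """Read the HF token from the environment; never hardcode it in source."""
    token = os.environ.get("HF_TOKEN", "")
    if not token:
        # Fail fast with an actionable message instead of a late 401 from the Hub
        raise RuntimeError("HF_TOKEN is not set; run `export HF_TOKEN=...` before launching.")
    return token
```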
### Data Privacy

- Filter sensitive data from datasets
- Validate dataset contents
- Secure data transmission
- Proper data disposal

## Deployment and CI/CD

### Environment Setup

- Python virtual environments
- CUDA-compatible PyTorch
- Installation of required dependencies
- System package management

### Automated Deployment

- Trackio Space deployment
- HF Dataset setup
- Model repository creation
- Configuration file generation

## Troubleshooting Guidelines

### Common Issues

- CUDA out of memory: reduce the batch size
- Network timeouts: check the internet connection
- Token validation: verify HF token permissions
- Dataset loading: check dataset accessibility

### Debugging Steps

1. Check system requirements
2. Validate the configuration
3. Test individual components
4. Review logs and error messages
5. Verify external service connectivity

## Future Enhancements

### Planned Features

- Multi-GPU training support
- Advanced dataset sampling strategies
- Automated hyperparameter optimization
- Enhanced monitoring and visualization
- Support for additional model architectures

### Extensibility

- Modular configuration system
- Plugin architecture for custom features
- Support for custom datasets and models
- Flexible monitoring integration
---

**When working with this codebase:**

- Always consider the end-to-end pipeline workflow
- Follow the established configuration patterns
- Include proper error handling and validation
- Maintain comprehensive documentation
- Test changes thoroughly before deployment
- Consider performance implications for different hardware configurations