SmolFactory / .cursorrules
Tonic's picture
adds more documentation
8f6fe61 verified
raw
history blame
8.36 kB
---
description: SmolLM3 Fine-tuning Pipeline - Project Rules and Conventions
globs: ["**/*.py", "**/*.sh", "**/*.md", "**/*.json"]
alwaysApply: true
---
# SmolLM3 Fine-tuning Pipeline Project Rules
## Project Overview
This is a comprehensive end-to-end fine-tuning pipeline for SmolLM3 models with Trackio monitoring, Hugging Face integration, and interactive configuration management.
## Core Architecture
### Directory Structure
- `config/` - Training configuration files for different scenarios
- `src/` - Core training and model logic
- `scripts/` - Utility scripts for deployment, dataset management, and model pushing
- `docs/` - Comprehensive documentation and guides
- `templates/` - Templates for HF Spaces and datasets
- `tests/` - Test files and debugging scripts
- `outputs/` - Training outputs and checkpoints
### Key Components
#### Training Configurations
- **Basic Training**: SmolLM3-3B + OpenHermes-FR, 3 epochs, batch size 2
- **H100 Lightweight**: SmolLM3-3B + OpenHermes-FR (80K samples), 1 epoch, batch size 16
- **A100 Large Scale**: SmolLM3-3B + OpenHermes-FR, 1.3 passes, batch size 8
- **Multiple Passes**: SmolLM3-3B + OpenHermes-FR, 4 epochs, batch size 6
- **Custom Configuration**: User-defined parameters
#### Core Modules
- `src/train.py` - Main training orchestration
- `src/model.py` - Model loading and configuration
- `src/data.py` - Dataset processing and loading
- `src/monitoring.py` - Trackio integration and metrics
- `src/trainer.py` - Training loop and optimization
## Coding Conventions
### Python Style
- Use type hints for all function parameters and return values
- Follow PEP 8 for formatting
- Use descriptive variable names in snake_case
- Add comprehensive docstrings for all functions
- Use f-strings for string formatting
### Configuration Management
- All training configs inherit from `SmolLM3Config` base class
- Use dataclasses for configuration objects
- Validate configuration parameters in __post_init__
- Support both YAML and Python configuration files
### Error Handling
- Use try-except blocks for external API calls (HF, Trackio)
- Log errors with appropriate context
- Provide user-friendly error messages
- Implement graceful degradation for optional features
### Monitoring Integration
- Always include Trackio URL and experiment name in configs
- Log metrics every N steps (configurable)
- Save checkpoints and artifacts to HF Datasets
- Use structured logging with consistent field names
## File Naming Conventions
### Configuration Files
- `train_smollm3_*.py` - Training configurations
- `*_config.py` - General configuration files
- Use descriptive suffixes: `_h100_lightweight`, `_a100_large`, `_multiple_passes`
### Script Files
- `deploy_*.py` - Deployment scripts
- `setup_*.py` - Setup and initialization scripts
- `push_*.py` - Model pushing scripts
- `configure_*.py` - Configuration scripts
### Test Files
- `test_*.py` - Test files
- `debug_*.py` - Debugging scripts
- Include descriptive names indicating what they test
## Training Pipeline Workflow
### Interactive Pipeline (`launch.sh`)
1. **Authentication**: HF username and token validation
2. **Configuration Selection**: Choose from predefined configs or custom
3. **Experiment Setup**: Configure experiment name and repositories
4. **Environment Setup**: Install dependencies and setup virtual environment
5. **Deployment**: Deploy Trackio Space and setup HF Dataset
6. **Training**: Execute training with monitoring
7. **Model Push**: Upload model to HF Hub with documentation
8. **Testing**: Validate uploaded model functionality
### Configuration Selection Logic
- Basic Training: Default for beginners and learning
- H100 Lightweight: Rapid experiments on H100 GPUs
- A100 Large Scale: Serious research and production
- Multiple Passes: Thorough training for production models
- Custom: User-defined parameters for specific needs
## Dataset Management
### Supported Formats
- Hugging Face Datasets format
- JSON files with prompt/completion pairs
- Chat format with messages array
- Custom formats with conversion functions
### Dataset Processing
- Automatic format detection and conversion
- Random sampling for lightweight configurations
- Validation split creation
- Bad entry filtering and handling
### Dataset Sampling (H100 Lightweight)
- 80,000 random samples from OpenHermes-FR
- 1,000 validation samples (if available)
- Fixed random seed (42) for reproducibility
- Automatic sampling during dataset preparation
## Model Management
### Model Loading
- Support for HuggingFaceTB/SmolLM3-3B
- Flash attention and gradient checkpointing
- Mixed precision training (fp16/bf16)
- Device mapping and memory optimization
### Model Pushing
- Comprehensive model cards with training details
- Automatic README generation
- License and usage information
- Training metrics and configuration
## Monitoring and Tracking
### Trackio Integration
- Real-time metrics logging
- Training curves visualization
- Resource usage monitoring
- Artifact storage and versioning
### Metrics to Track
- Training and validation loss
- Learning rate schedule
- Gradient norms
- GPU utilization and memory
- Training speed (steps/second)
## Error Handling and Validation
### Input Validation
- Validate HF tokens before use
- Check CUDA availability
- Verify dataset accessibility
- Validate configuration parameters
### Error Recovery
- Graceful handling of network issues
- Automatic retry for failed operations
- Checkpoint recovery for interrupted training
- Fallback options for optional features
## Documentation Standards
### README Files
- Clear project description
- Installation instructions
- Usage examples
- Configuration options
- Troubleshooting guide
### Code Documentation
- Comprehensive docstrings
- Type hints for all functions
- Example usage in docstrings
- Parameter descriptions
- Return value documentation
## Testing and Validation
### Test Categories
- Unit tests for core functions
- Integration tests for pipeline
- Configuration validation tests
- Model loading and saving tests
- Dataset processing tests
### Debugging Tools
- Standalone test scripts
- Configuration validation
- Model testing utilities
- Dataset inspection tools
## Performance Optimization
### H100 Optimizations
- Larger batch sizes (16 vs 8 for A100)
- Reduced gradient accumulation (4 vs 16)
- Higher learning rates (8e-6 vs 5e-6)
- Optimized data loading (4 workers, pin memory)
### Memory Management
- Gradient checkpointing for large models
- Mixed precision training
- Dynamic batch sizing
- Memory-efficient data loading
## Security and Best Practices
### Token Management
- Never hardcode tokens in code
- Use environment variables
- Validate tokens before use
- Secure token storage
### Data Privacy
- Filter sensitive data from datasets
- Validate dataset contents
- Secure data transmission
- Proper data disposal
## Deployment and CI/CD
### Environment Setup
- Python virtual environments
- CUDA-compatible PyTorch
- Required dependencies installation
- System package management
### Automated Deployment
- Trackio Space deployment
- HF Dataset setup
- Model repository creation
- Configuration file generation
## Troubleshooting Guidelines
### Common Issues
- CUDA out of memory: Reduce batch size
- Network timeouts: Check internet connection
- Token validation: Verify HF token permissions
- Dataset loading: Check dataset accessibility
### Debugging Steps
1. Check system requirements
2. Validate configuration
3. Test individual components
4. Review logs and error messages
5. Verify external service connectivity
## Future Enhancements
### Planned Features
- Multi-GPU training support
- Advanced dataset sampling strategies
- Automated hyperparameter optimization
- Enhanced monitoring and visualization
- Support for additional model architectures
### Extensibility
- Modular configuration system
- Plugin architecture for custom features
- Support for custom datasets and models
- Flexible monitoring integration
---
**When working with this codebase:**
- Always consider the end-to-end pipeline workflow
- Follow the established configuration patterns
- Include proper error handling and validation
- Maintain comprehensive documentation
- Test changes thoroughly before deployment
- Consider performance implications for different hardware configurations