# Monitoring Improvements Summary

## Overview

The monitoring system now supports **Hugging Face Datasets** for persistent experiment storage, which suits deployment on Hugging Face Spaces and other cloud environments where local storage is ephemeral.

## Key Improvements Made

### 1. **Enhanced `monitoring.py`**

- **HF Datasets Integration**: Experiments can be saved to HF Datasets repositories
- **Environment Variables**: `HF_TOKEN` and `TRACKIO_DATASET_REPO` are detected automatically
- **Fallback Support**: Graceful degradation if HF Datasets is unavailable
- **Dual Storage**: Experiments are saved to both Trackio and HF Datasets
- **Periodic Saving**: Metrics are saved to the HF Dataset every 10 steps
- **Error Handling**: Robust error logging and recovery
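
The periodic saving and fallback behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `monitoring.py`: the class name `BufferedMetricsLogger`, the `upload_fn` callable, and the local JSONL fallback file are all assumptions introduced for the example.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


class BufferedMetricsLogger:
    """Hypothetical helper: buffer metrics and flush them every `save_every` steps."""

    def __init__(self, upload_fn=None, save_every=10, fallback_path="metrics_fallback.jsonl"):
        # upload_fn is an assumed callable that receives a list of metric records
        # and pushes them to remote storage; None means remote storage is unavailable.
        self.upload_fn = upload_fn
        self.save_every = save_every
        self.fallback_path = Path(fallback_path)
        self.buffer = []

    def log(self, step, metrics):
        """Record one step and flush when the step is a multiple of `save_every`."""
        self.buffer.append({"step": step, **metrics})
        if step % self.save_every == 0:
            self.flush()

    def flush(self):
        """Try the remote upload; on any failure, append the records to a local file."""
        if not self.buffer:
            return
        records, self.buffer = self.buffer, []
        try:
            if self.upload_fn is None:
                raise RuntimeError("remote storage not configured")
            self.upload_fn(records)
        except Exception as exc:
            # Graceful degradation: log the error and keep the data locally.
            logger.warning("Remote save failed (%s); writing to %s", exc, self.fallback_path)
            with self.fallback_path.open("a", encoding="utf-8") as fh:
                for record in records:
                    fh.write(json.dumps(record) + "\n")
```

Swapping the buffer out before the upload attempt means a failed upload never loses or duplicates records: each batch ends up either in remote storage or in the fallback file.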
### 2. **Updated `train.py`**

- **Monitoring Integration**: Automatic monitoring setup in training scripts
- **Configuration Logging**: Experiment configuration logged at start
- **Training Callbacks**: Monitoring callbacks added to trainer
- **Summary Logging**: Training summaries logged at completion
- **Error Logging**: Errors logged to monitoring system
- **Cleanup**: Proper monitoring session cleanup
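
The lifecycle described in this list (log the configuration, log metrics during training, log a summary or an error, then always clean up) might look like the sketch below. `InMemoryMonitor` and `run_with_monitoring` are hypothetical names used only for illustration; the actual `train.py` attaches callbacks to its trainer and may be structured differently.

```python
class InMemoryMonitor:
    """Hypothetical stand-in for the real monitor; records events in a list."""

    def __init__(self):
        self.events = []

    def log_config(self, config):
        self.events.append(("config", dict(config)))

    def log_metrics(self, step, metrics):
        self.events.append(("metrics", step, dict(metrics)))

    def log_summary(self, summary):
        self.events.append(("summary", dict(summary)))

    def log_error(self, message):
        self.events.append(("error", message))

    def close(self):
        self.events.append(("close",))


def run_with_monitoring(monitor, config, train_steps):
    """Drive a training loop while reporting to `monitor`.

    `train_steps` is any iterable yielding (step, metrics) pairs.
    """
    monitor.log_config(config)  # configuration logged at start
    last_step = 0
    try:
        for step, metrics in train_steps:
            monitor.log_metrics(step, metrics)
            last_step = step
        monitor.log_summary({"final_step": last_step})  # summary at completion
    except Exception as exc:
        monitor.log_error(f"{type(exc).__name__}: {exc}")  # error goes to the monitor
        raise
    finally:
        monitor.close()  # cleanup runs on success and on failure
```

The `finally` block is what guarantees cleanup even when training raises.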
### 3. **Configuration Files Updated**

- **HF Datasets Config**: Added `hf_token` and `dataset_repo` parameters
- **Environment Support**: Environment variables automatically detected
- **Backward Compatible**: Existing configurations still work
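
One way to keep older configuration objects working is to look attributes up defensively and fall back to the environment. The helper below is a sketch under assumptions: `resolve_setting` is a hypothetical name, and the precedence it uses (config value, then environment variable, then default) is a guess at reasonable behaviour rather than something this summary specifies.

```python
import os


def resolve_setting(config, attr, env_var, default=None, environ=None):
    """Return config.<attr> if set, else the environment variable, else `default`.

    Uses getattr with a default so configs written before the new
    parameters existed do not raise AttributeError.
    """
    env = os.environ if environ is None else environ
    value = getattr(config, attr, None)  # older configs may lack the attribute
    if value:
        return value
    return env.get(env_var) or default
```

For example, `resolve_setting(config, "dataset_repo", "TRACKIO_DATASET_REPO", default="tonic/trackio-experiments")` would return a usable repository name whether or not `config` defines `dataset_repo`.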
### 4. **New Utility Scripts**

- **`configure_trackio.py`**: Configuration testing and setup
- **`integrate_monitoring.py`**: Automated integration script
- **`test_monitoring_integration.py`**: Comprehensive testing
- **`setup_hf_dataset.py`**: Dataset repository setup

### 5. **Documentation**

- **`MONITORING_INTEGRATION_GUIDE.md`**: Comprehensive usage guide
- **`ENVIRONMENT_VARIABLES.md`**: Environment variable reference
- **`HF_DATASETS_GUIDE.md`**: Detailed HF Datasets guide
## Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `HF_TOKEN` | Yes | None | Your Hugging Face token |
| `TRACKIO_DATASET_REPO` | No | `tonic/trackio-experiments` | Dataset repository |
| `TRACKIO_URL` | No | None | Trackio server URL |
| `TRACKIO_TOKEN` | No | None | Trackio authentication token |
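
Reading the table above into a typed settings object could look like this. It is a minimal sketch: `MonitoringEnv` and `read_monitoring_env` are hypothetical names, and raising when `HF_TOKEN` is missing is an assumption based on the "Required" column, not a description of what `monitoring.py` actually does.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_DATASET_REPO = "tonic/trackio-experiments"  # default taken from the table


@dataclass(frozen=True)
class MonitoringEnv:
    hf_token: str
    dataset_repo: str
    trackio_url: Optional[str]
    trackio_token: Optional[str]


def read_monitoring_env(environ: Optional[Mapping[str, str]] = None) -> MonitoringEnv:
    """Build settings from environment variables, applying the documented defaults."""
    env = os.environ if environ is None else environ
    token = env.get("HF_TOKEN")
    if not token:
        # Assumed behaviour: the one required variable fails fast.
        raise RuntimeError("HF_TOKEN is required but not set")
    return MonitoringEnv(
        hf_token=token,
        dataset_repo=env.get("TRACKIO_DATASET_REPO") or DEFAULT_DATASET_REPO,
        trackio_url=env.get("TRACKIO_URL") or None,
        trackio_token=env.get("TRACKIO_TOKEN") or None,
    )
```

Accepting an explicit mapping keeps the function easy to test without touching the real process environment.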
## What Gets Monitored

### **Training Metrics**

- Loss values (training and validation)
- Learning rate
- Gradient norms
- Training steps and epochs

### **System Metrics**

- GPU memory usage
- GPU utilization
- CPU usage
- Memory usage
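
A sketch of how system metrics like these are commonly collected in Python is shown below. It assumes `psutil` and `torch` as optional dependencies and skips whichever is missing. The function name and the metric keys are illustrative, and this summary does not say which libraries `monitoring.py` uses.

```python
def collect_system_metrics():
    """Return a dict of whichever system metrics can be read on this machine."""
    metrics = {}

    try:
        import psutil  # optional dependency (assumed)

        # With interval=None the first call returns a meaningless 0.0;
        # later calls report usage since the previous call.
        metrics["cpu_percent"] = psutil.cpu_percent(interval=None)
        metrics["memory_percent"] = psutil.virtual_memory().percent
    except ImportError:
        pass

    try:
        import torch  # optional dependency (assumed)

        if torch.cuda.is_available():
            mib = 1024 ** 2
            metrics["gpu_memory_allocated_mb"] = torch.cuda.memory_allocated() / mib
            metrics["gpu_memory_reserved_mb"] = torch.cuda.memory_reserved() / mib
            try:
                # Utilization needs NVML bindings; skip it if they are unavailable.
                metrics["gpu_utilization_percent"] = torch.cuda.utilization()
            except Exception:
                pass
    except ImportError:
        pass

    return metrics
```

Returning only the keys that could be measured lets the same call work on CPU-only machines and on GPU hosts.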
### **Experiment Data**

- Configuration parameters
- Model checkpoints
- Evaluation results
- Training summaries

### **Artifacts**

- Configuration files
- Training logs
- Evaluation results
- Model checkpoints
## Usage Examples

### **Basic Training**

```bash
# Set environment variables
export HF_TOKEN=your_token_here
export TRACKIO_DATASET_REPO=your-username/experiments

# Run training with monitoring
python train.py config/train_smollm3_openhermes_fr.py
```

### **Advanced Configuration**

```bash
# Train with custom settings
python train.py config/train_smollm3_openhermes_fr.py \
    --experiment_name "smollm3_french_v2" \
    --hf_token your_token_here \
    --dataset_repo your-username/french-experiments
```
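
The flags in the command above could be parsed with `argparse` along these lines. This is an illustrative sketch only. The flag names come from the example command, while `build_parser`, the help texts, and the `None` defaults are assumptions and may not match the parser in `train.py`.

```python
import argparse


def build_parser():
    """Illustrative parser for the monitoring-related flags shown above."""
    parser = argparse.ArgumentParser(description="Train with monitoring (sketch)")
    parser.add_argument("config", help="Path to the training config file")
    parser.add_argument("--experiment_name", default=None,
                        help="Name under which the run is tracked")
    parser.add_argument("--hf_token", default=None,
                        help="Hugging Face token (the HF_TOKEN variable is an alternative)")
    parser.add_argument("--dataset_repo", default=None,
                        help="HF Dataset repository for experiment storage")
    return parser
```

A token passed as a flag can end up in shell history, so exporting `HF_TOKEN` is usually the safer of the two options.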
### **Testing Setup**

```bash
# Test configuration
python configure_trackio.py

# Test monitoring integration
python test_monitoring_integration.py

# Test dataset access
python test_hf_datasets.py
```
## Benefits

### **For HF Spaces Deployment**

- **Persistent Storage**: Data survives Space restarts
- **No Local Storage**: No dependency on ephemeral storage
- **Scalable**: Works with any dataset size
- **Secure**: Private dataset storage

### **For Experiment Management**

- **Centralized**: All experiments in one place
- **Searchable**: Easy to find specific experiments
- **Versioned**: Dataset versioning for experiments
- **Collaborative**: Share experiments with team

### **For Development**

- **Flexible**: Easy to switch between datasets
- **Configurable**: Environment-based configuration
- **Robust**: Fallback mechanisms
- **Debuggable**: Comprehensive logging
## Testing Results

All monitoring integration tests passed:

- Module Import
- Monitor Creation
- Config Creation
- Metrics Logging
- Configuration Logging
- System Metrics
- Training Summary
- Callback Creation
## Files Modified/Created

### **Core Files**

- `monitoring.py` - Enhanced with HF Datasets support
- `train.py` - Updated with monitoring integration
- `requirements_core.txt` - Added monitoring dependencies
- `requirements_space.txt` - Updated for HF Spaces

### **Configuration Files**

- `config/train_smollm3.py` - Added HF Datasets config
- `config/train_smollm3_openhermes_fr.py` - Added HF Datasets config
- `config/train_smollm3_openhermes_fr_a100_balanced.py` - Added HF Datasets config
- `config/train_smollm3_openhermes_fr_a100_large.py` - Added HF Datasets config
- `config/train_smollm3_openhermes_fr_a100_max_performance.py` - Added HF Datasets config
- `config/train_smollm3_openhermes_fr_a100_multiple_passes.py` - Added HF Datasets config

### **New Utility Scripts**

- `configure_trackio.py` - Configuration testing
- `integrate_monitoring.py` - Automated integration
- `test_monitoring_integration.py` - Comprehensive testing
- `setup_hf_dataset.py` - Dataset setup

### **Documentation**

- `MONITORING_INTEGRATION_GUIDE.md` - Usage guide
- `ENVIRONMENT_VARIABLES.md` - Environment reference
- `HF_DATASETS_GUIDE.md` - HF Datasets guide
- `MONITORING_IMPROVEMENTS_SUMMARY.md` - This summary
## Next Steps

1. **Set up your HF token and dataset repository**
2. **Test the configuration with `python configure_trackio.py`**
3. **Run a training experiment to verify full functionality**
4. **Check your HF Dataset repository for experiment data**
5. **View results in your Trackio interface**
## Troubleshooting

### **Common Issues**

- **`HF_TOKEN` not set**: Export your Hugging Face token, for example `export HF_TOKEN=your_token_here`
- **Dataset access failed**: Check that the token has access to the repository and that the repository exists
- **Monitoring not working**: Run `python test_monitoring_integration.py` to diagnose
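
The first two issues can be caught before training starts with a small offline preflight check. The sketch below is an assumption-laden illustration: `diagnose` is a hypothetical helper, and the `owner/name` pattern is a simplified approximation of Hugging Face repository IDs, not the Hub's official validation rule.

```python
import os
import re

# Simplified approximation of an "owner/name" repository ID (assumption).
REPO_ID_PATTERN = re.compile(r"^[\w.-]+/[\w.-]+$")


def diagnose(environ=None):
    """Return a list of human-readable problems found in the environment."""
    env = os.environ if environ is None else environ
    problems = []

    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN not set: export your Hugging Face token")

    repo = env.get("TRACKIO_DATASET_REPO", "tonic/trackio-experiments")
    if not REPO_ID_PATTERN.match(repo):
        problems.append(f"Dataset repo '{repo}' is not in 'owner/name' form")

    return problems
```

This check never contacts the Hub, so it cannot confirm token permissions or that the repository exists. An online check could build on `HfApi.whoami` and `HfApi.dataset_info` from `huggingface_hub`.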
### **Getting Help**

- Check the comprehensive guides in the documentation files
- Run the test scripts to verify your setup
- Check logs for specific error messages

---

**The monitoring system is now ready for production use with persistent HF Datasets storage!**