SmolFactory / docs /PUSH_SCRIPT_GUIDE.md
Tonic's picture
adds formatting fix
ebe598e verified
|
raw
history blame
9.11 kB
# πŸš€ Push to Hugging Face Script Guide
## Overview
The `push_to_huggingface.py` script has been enhanced to integrate with **HF Datasets** for experiment tracking and provides complete model deployment with persistent experiment storage.
## πŸš€ Key Improvements
### **1. HF Datasets Integration**
- βœ… **Dataset Repository Support**: Configurable dataset repository for experiment storage
- βœ… **Environment Variables**: Automatic detection of `HF_TOKEN` and `TRACKIO_DATASET_REPO`
- βœ… **Enhanced Logging**: Logs push actions to both Trackio and HF Datasets
- βœ… **Model Card Integration**: Includes dataset repository information in model cards
### **2. Enhanced Configuration**
- βœ… **Flexible Token Input**: Multiple ways to provide HF token
- βœ… **Dataset Repository Tracking**: Links models to their experiment datasets
- βœ… **Environment Variable Support**: Fallback to environment variables
- βœ… **Command Line Arguments**: New arguments for HF Datasets integration
### **3. Improved Model Cards**
- βœ… **Dataset Repository Info**: Shows which dataset contains experiment data
- βœ… **Experiment Tracking Section**: Explains how to access training data
- βœ… **Enhanced Documentation**: Better model cards with experiment links
## πŸ“‹ Usage Examples
### **Basic Usage**
```bash
# Push model with default settings
python push_to_huggingface.py /path/to/model username/repo-name
```
### **With HF Datasets Integration**
```bash
# Push model with custom dataset repository
python push_to_huggingface.py /path/to/model username/repo-name \
--dataset-repo username/experiments
```
### **With Custom Token**
```bash
# Push model with custom HF token
python push_to_huggingface.py /path/to/model username/repo-name \
--hf-token your_token_here
```
### **Complete Example**
```bash
# Push model with all options
python push_to_huggingface.py /path/to/model username/repo-name \
--dataset-repo username/experiments \
--hf-token your_token_here \
--private \
--experiment-name "smollm3_finetune_v2"
```
## πŸ”§ Command Line Arguments
| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| `model_path` | βœ… Yes | None | Path to trained model directory |
| `repo_name` | βœ… Yes | None | HF repository name (username/repo-name) |
| `--token` | ❌ No | `HF_TOKEN` env | Hugging Face token |
| `--hf-token` | ❌ No | `HF_TOKEN` env | HF token (alternative to --token) |
| `--private` | ❌ No | False | Make repository private |
| `--trackio-url` | ❌ No | None | Trackio Space URL for logging |
| `--experiment-name` | ❌ No | None | Experiment name for Trackio |
| `--dataset-repo` | ❌ No | `TRACKIO_DATASET_REPO` env | HF Dataset repository |
## πŸ› οΈ Configuration Methods
### **Method 1: Command Line Arguments**
```bash
python push_to_huggingface.py model_path repo_name \
--dataset-repo username/experiments \
--hf-token your_token_here
```
### **Method 2: Environment Variables**
```bash
export HF_TOKEN=your_token_here
export TRACKIO_DATASET_REPO=username/experiments
python push_to_huggingface.py model_path repo_name
```
### **Method 3: Hybrid Approach**
```bash
# Set defaults via environment variables
export HF_TOKEN=your_token_here
export TRACKIO_DATASET_REPO=username/experiments
# Override specific values via command line
python push_to_huggingface.py model_path repo_name \
--dataset-repo username/specific-experiments
```
## πŸ“Š What Gets Pushed
### **Model Files**
- βœ… **Model Weights**: `pytorch_model.bin`
- βœ… **Configuration**: `config.json`
- βœ… **Tokenizer**: `tokenizer.json`, `tokenizer_config.json`
- βœ… **All Other Files**: Any additional files in model directory
### **Documentation**
- βœ… **Model Card**: Comprehensive README.md with model information
- βœ… **Training Configuration**: JSON configuration used for training
- βœ… **Training Results**: JSON results and metrics
- βœ… **Training Logs**: Text logs from training process
### **Experiment Data**
- βœ… **Dataset Repository**: Links to HF Dataset containing experiment data
- βœ… **Training Metrics**: All training metrics stored in dataset
- βœ… **Configuration**: Training configuration stored in dataset
- βœ… **Artifacts**: Training artifacts and logs
## πŸ” Enhanced Model Cards
The improved script creates enhanced model cards that include:
### **Model Information**
- Base model and architecture
- Training date and model size
- **Dataset repository** for experiment data
### **Training Configuration**
- Complete training parameters
- Hardware information
- Training duration and steps
### **Experiment Tracking**
- Links to HF Dataset repository
- Instructions for accessing experiment data
- Training metrics and results
### **Usage Examples**
- Code examples for loading and using the model
- Generation examples
- Performance information
## πŸ“ˆ Logging Integration
### **Trackio Logging**
- βœ… **Push Actions**: Logs model push events
- βœ… **Model Information**: Repository name, size, configuration
- βœ… **Training Data**: Links to experiment dataset
### **HF Datasets Logging**
- βœ… **Experiment Summary**: Final training summary
- βœ… **Push Metadata**: Model repository and push date
- βœ… **Configuration**: Complete training configuration
### **Dual Storage**
- βœ… **Trackio**: Real-time monitoring and visualization
- βœ… **HF Datasets**: Persistent experiment storage
- βœ… **Synchronized**: Both systems updated together
## 🚨 Troubleshooting
### **Issue: "Missing required files"**
**Solutions**:
1. Check model directory contains required files
2. Ensure model was saved correctly during training
3. Verify file permissions
### **Issue: "Failed to create repository"**
**Solutions**:
1. Check HF token has write permissions
2. Verify repository name format: `username/repo-name`
3. Ensure repository doesn't already exist (or use `--private`)
### **Issue: "Failed to upload files"**
**Solutions**:
1. Check network connectivity
2. Verify HF token is valid
3. Ensure repository was created successfully
### **Issue: "Dataset repository not found"**
**Solutions**:
1. Check dataset repository exists
2. Verify HF token has read access
3. Use `--dataset-repo` to specify correct repository
## πŸ“‹ Workflow Integration
### **Complete Training Workflow**
1. **Train Model**: Use training scripts with monitoring
2. **Monitor Progress**: View metrics in Trackio interface
3. **Push Model**: Use improved push script
4. **Access Data**: View experiments in HF Dataset repository
### **Example Workflow**
```bash
# 1. Train model with monitoring
python train.py config/train_smollm3_openhermes_fr.py \
--experiment_name "smollm3_french_v2"
# 2. Push model to HF Hub
python push_to_huggingface.py outputs/model username/smollm3-french \
--dataset-repo username/experiments \
--experiment-name "smollm3_french_v2"
# 3. View results
# - Model: https://huggingface.co/username/smollm3-french
# - Experiments: https://huggingface.co/datasets/username/experiments
# - Trackio: Your Trackio Space interface
```
## 🎯 Benefits
### **For Model Deployment**
- βœ… **Complete Documentation**: Enhanced model cards with experiment links
- βœ… **Persistent Storage**: Experiment data stored in HF Datasets
- βœ… **Easy Access**: Direct links to training data and metrics
- βœ… **Reproducibility**: Complete training configuration included
### **For Experiment Management**
- βœ… **Centralized Storage**: All experiments in HF Dataset repository
- βœ… **Version Control**: Model versions linked to experiment data
- βœ… **Collaboration**: Share experiments and models easily
- βœ… **Searchability**: Easy to find specific experiments
### **For Development**
- βœ… **Flexible Configuration**: Multiple ways to set parameters
- βœ… **Backward Compatible**: Works with existing setups
- βœ… **Error Handling**: Clear error messages and troubleshooting
- βœ… **Integration**: Works with existing monitoring system
## πŸ“Š Testing Results
All push script tests passed:
- βœ… **HuggingFacePusher Initialization**: Works with new parameters
- βœ… **Model Card Creation**: Includes HF Datasets integration
- βœ… **Logging Integration**: Logs to both Trackio and HF Datasets
- βœ… **Argument Parsing**: Handles new command line arguments
- βœ… **Environment Variables**: Proper fallback handling
## πŸ”„ Migration Guide
### **From Old Script**
```bash
# Old way
python push_to_huggingface.py model_path repo_name --token your_token
# New way (same functionality)
python push_to_huggingface.py model_path repo_name --hf-token your_token
# New way with HF Datasets
python push_to_huggingface.py model_path repo_name \
--hf-token your_token \
--dataset-repo username/experiments
```
### **Environment Variables**
```bash
# Set environment variables for automatic detection
export HF_TOKEN=your_token_here
export TRACKIO_DATASET_REPO=username/experiments
# Then use simple command
python push_to_huggingface.py model_path repo_name
```
---
**πŸŽ‰ Your push script is now fully integrated with HF Datasets for complete experiment tracking and model deployment!**