
Interactive Pipeline Improvements

This document explains the improvements made to the launch.sh script to make it interactive and configurable for different training scenarios.

🎯 Key Improvements

1. Interactive User Interface

  • Colored Output: Added color-coded status messages for better UX
  • Input Validation: Real-time validation of user inputs
  • Default Values: Smart defaults for common configurations
  • Error Handling: Graceful error handling with helpful messages

2. Training Configuration Selection

The script now offers four predefined training configurations, plus a custom option:

Basic Training (Default)

Model: SmolLM3-3B
Dataset: SmolTalk
Epochs: 3
Batch Size: 2
Learning Rate: 5e-6
Sequence Length: 4096
Best for: Quick experiments, learning

H100 Lightweight (Rapid)

Model: SmolLM3-3B
Dataset: OpenHermes-FR (80K samples)
Epochs: 1
Batch Size: 16
Learning Rate: 8e-6
Sequence Length: 8192
Best for: Rapid training on H100

A100 Large Scale

Model: SmolLM3-3B
Dataset: OpenHermes-FR
Epochs: 1.3
Batch Size: 8
Learning Rate: 5e-6
Sequence Length: 8192
Best for: High-performance training

Multiple Passes

Model: SmolLM3-3B
Dataset: OpenHermes-FR
Epochs: 4
Batch Size: 6
Learning Rate: 3e-6
Sequence Length: 8192
Best for: Thorough training

Custom Configuration

  • User-defined parameters
  • Flexible model and dataset selection
  • Custom training parameters

3. Enhanced User Experience

Step-by-Step Guidance

  1. Authentication - HF username and token validation
  2. Configuration Selection - Choose from predefined configs
  3. Experiment Setup - Configure experiment details
  4. Training Parameters - Adjust hyperparameters
  5. Deployment Setup - Trackio Space configuration
  6. Confirmation - Review and confirm settings

Input Functions

# Get input with default value
get_input "Prompt" "default_value" VARIABLE_NAME

# Select from options
select_option "Choose option:" "Option 1" "Option 2" "Option 3" VARIABLE_NAME

# Validate HF token
validate_hf_token "$HF_TOKEN"
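
The document only shows call sites for these helpers. A minimal sketch of how get_input and validate_hf_token might be implemented (the printf -v assignment and the whoami-v2 endpoint check are assumptions, not the script's confirmed internals):

# Sketch: read a value with a default, assign it to the named variable
get_input() {
    local prompt="$1" default="$2" var_name="$3" value
    read -r -p "$prompt [$default]: " value
    printf -v "$var_name" '%s' "${value:-$default}"
}

# Sketch: accept the token only if the HF whoami endpoint returns 200
validate_hf_token() {
    local code
    code=$(curl -s -o /dev/null -w "%{http_code}" \
        -H "Authorization: Bearer $1" \
        https://huggingface.co/api/whoami-v2)
    [ "$code" = "200" ]
}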

Colored Output Functions

print_status "Success message"    # Green ✅
print_warning "Warning message"   # Yellow ⚠️
print_error "Error message"       # Red ❌
print_info "Info message"         # Blue ℹ️
print_header "Header message"     # Purple 🚀
print_step "Step message"         # Cyan 📋
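
These are most likely thin wrappers around ANSI escape codes. One plausible implementation, reusing the colors and emoji listed above (the escape-code mechanics are an assumption):

# ANSI color codes
GREEN='\033[0;32m'; YELLOW='\033[1;33m'; RED='\033[0;31m'
BLUE='\033[0;34m'; PURPLE='\033[0;35m'; CYAN='\033[0;36m'; NC='\033[0m'

print_status()  { echo -e "${GREEN}✅ $1${NC}"; }
print_warning() { echo -e "${YELLOW}⚠️  $1${NC}"; }
print_error()   { echo -e "${RED}❌ $1${NC}"; }
print_info()    { echo -e "${BLUE}ℹ️  $1${NC}"; }
print_header()  { echo -e "${PURPLE}🚀 $1${NC}"; }
print_step()    { echo -e "${CYAN}📋 $1${NC}"; }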

4. Dynamic Configuration Generation

The script now generates training configurations based on user selection:

# Generated config file
config = SmolLM3Config(
    model_name="$MODEL_NAME",
    max_seq_length=$MAX_SEQ_LENGTH,
    batch_size=$BATCH_SIZE,
    learning_rate=$LEARNING_RATE,
    # ... other parameters
)
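
Because the values are plain shell variables, the config file can be emitted with an unquoted heredoc. A minimal sketch of create_training_config under that assumption (the import path is hypothetical):

# Sketch: write the selected parameters into a Python config file
create_training_config() {
    cat > "$1" << EOF
from config.train_smollm3 import SmolLM3Config  # import path is an assumption

config = SmolLM3Config(
    model_name="$MODEL_NAME",
    max_seq_length=$MAX_SEQ_LENGTH,
    batch_size=$BATCH_SIZE,
    learning_rate=$LEARNING_RATE,
)
EOF
}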

5. Improved Error Handling

Input Validation

  • Required field validation
  • HF token validation
  • Numeric input validation (see the sketch after this list)
  • Choice validation
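
Hedged sketches of the required-field and numeric checks (the exact rules the script enforces are not shown here, so the regex is an assumption):

# Sketch: reject empty values for required fields
validate_required() {
    [ -n "$1" ] || { print_error "$2 is required"; return 1; }
}

# Sketch: accept integers, decimals, and scientific notation like 5e-6
validate_numeric() {
    [[ "$1" =~ ^[0-9]+([.][0-9]+)?([eE]-?[0-9]+)?$ ]] || {
        print_error "Expected a number, got: $1"
        return 1
    }
}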

Graceful Degradation

  • Clear error messages
  • Recovery suggestions
  • Exit on critical errors

6. Configuration Management

User Credentials

  • Interactive username input
  • Secure token input
  • Real-time token validation

Experiment Details

  • Dynamic experiment naming (one possible scheme is sketched after this list)
  • Repository name generation
  • Dataset repository configuration
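
The naming scheme itself is not spelled out in this document; a purely hypothetical example of what timestamp-based naming could look like:

# Hypothetical naming scheme; the real script may differ
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
EXPERIMENT_NAME="smollm3-finetune-$TIMESTAMP"
REPO_NAME="$HF_USERNAME/smollm3-finetuned-$TIMESTAMP"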

Training Parameters

  • Batch size selection
  • Learning rate adjustment
  • Sequence length configuration
  • Save/eval/logging steps (collected as sketched after this list)
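
These parameters can all be gathered with the get_input helper shown earlier. A sketch in which the first three defaults are the Basic Training values from this document and the step defaults are placeholders:

# Sketch: prompt for hyperparameters with sensible defaults
get_input "Batch size" "2" BATCH_SIZE
get_input "Learning rate" "5e-6" LEARNING_RATE
get_input "Sequence length" "4096" MAX_SEQ_LENGTH
get_input "Save steps" "500" SAVE_STEPS        # placeholder default
get_input "Eval steps" "100" EVAL_STEPS        # placeholder default
get_input "Logging steps" "10" LOGGING_STEPS   # placeholder default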

7. Enhanced Monitoring Integration

Trackio Space

  • Dynamic space naming (see the sketch after this list)
  • Automatic deployment
  • URL generation
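
Space URLs follow Hugging Face's spaces/{username}/{space} pattern; the naming convention below is an assumption:

# Sketch: derive a space name and its public URL
TRACKIO_SPACE_NAME="trackio-monitoring-$(date +%Y%m%d)"
TRACKIO_URL="https://huggingface.co/spaces/$HF_USERNAME/$TRACKIO_SPACE_NAME"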

HF Datasets

  • Dataset repository setup
  • Experiment data storage
  • Access configuration

🔧 Technical Improvements

1. Modular Functions

# Input handling
get_input()          # Get user input with defaults
select_option()      # Select from options
validate_hf_token()  # Validate HF token

# Configuration
show_training_configs()    # Display available configs
get_training_config()      # Get config based on selection
create_training_config()   # Generate config file

# Output formatting
print_status()       # Success messages
print_warning()      # Warning messages
print_error()        # Error messages
print_info()         # Info messages
print_header()       # Header messages
print_step()         # Step messages
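
As one example, select_option could be built on Bash's select builtin (a sketch under that assumption; the real script may render its own menu instead):

# Sketch: present numbered options, store the choice in the named variable
select_option() {
    local prompt="$1"; shift
    local var_name="${!#}"           # last argument: target variable name
    local options=("${@:1:$#-1}")    # everything in between: menu entries
    echo "$prompt"
    select choice in "${options[@]}"; do
        if [ -n "$choice" ]; then
            printf -v "$var_name" '%s' "$choice"
            break
        fi
        print_warning "Invalid choice, please try again."
    done
}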

2. Configuration Selection Logic

case "$config_type" in
    "Basic Training")
        MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
        DATASET_NAME="HuggingFaceTB/smoltalk"
        # ... other parameters
        ;;
    "A100 Large Scale")
        MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
        DATASET_NAME="legmlai/openhermes-fr"
        # ... other parameters
        ;;
    # ... other configurations
esac

3. Dynamic File Generation

# Generate training config
create_training_config "$CONFIG_FILE"

# Generate deployment input
cat > deploy_input.txt << EOF
$HF_USERNAME
$TRACKIO_SPACE_NAME
$HF_TOKEN
EOF
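
The generated file can then drive the Space deployment step non-interactively by redirecting it to the deploy script's stdin (the script name below is hypothetical):

# Feed the saved answers to the deployment script (name is hypothetical)
python deploy_trackio_space.py < deploy_input.txt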

📊 User Workflow

Before (Static)

  1. Edit launch.sh manually
  2. Update hardcoded variables
  3. Run script
  4. Hope configuration is correct

After (Interactive)

  1. Run ./launch.sh
  2. Follow interactive prompts
  3. Select training configuration
  4. Confirm settings
  5. Watch automated pipeline

🎯 Benefits

For Users

  • No Manual Editing: No need to edit script files
  • Guided Experience: Step-by-step prompts
  • Validation: Real-time input validation
  • Flexibility: Multiple configuration options
  • Safety: Confirmation before execution

For Developers

  • Maintainable: Modular function structure
  • Extensible: Easy to add new configurations
  • Robust: Comprehensive error handling
  • User-Friendly: Clear feedback and guidance

For Different Use Cases

  • Beginners: Basic Training configuration
  • H100 Users: H100 Lightweight for rapid experiments
  • Researchers: A100 Large Scale for serious experiments
  • Production: Multiple Passes for thorough training
  • Custom: User-defined parameters for specific needs

🔄 Configuration Examples

Quick Start (Basic Training)

./launch.sh
# Follow prompts:
# 1. Enter HF username and token
# 2. Select "Basic Training"
# 3. Confirm settings
# 4. Watch automated pipeline

High-Performance Training (A100)

./launch.sh
# Follow prompts:
# 1. Enter HF username and token
# 2. Select "A100 Large Scale"
# 3. Adjust parameters if needed
# 4. Confirm and run

Rapid Training (H100)

./launch.sh
# Follow prompts:
# 1. Enter HF username and token
# 2. Select "H100 Lightweight (Rapid)"
# 3. Confirm settings
# 4. Watch rapid training on H100

Custom Training

./launch.sh
# Follow prompts:
# 1. Enter HF username and token
# 2. Select "Custom Configuration"
# 3. Enter custom parameters:
#    - Model: microsoft/DialoGPT-medium
#    - Dataset: your-custom-dataset
#    - Epochs: 5
#    - Batch Size: 4
#    - Learning Rate: 1e-5
# 4. Confirm and run

🚀 Future Enhancements

Planned Improvements

  • GUI Interface: Web-based configuration interface
  • Configuration Templates: Save/load custom configurations
  • Advanced Validation: More sophisticated input validation
  • Progress Tracking: Real-time progress indicators
  • Rollback Capability: Undo changes if needed

Extensibility

  • Plugin System: Add custom training configurations
  • API Integration: Connect to external services
  • Multi-GPU Support: Distributed training options
  • Advanced Monitoring: Enhanced tracking capabilities

📋 Migration Guide

For Existing Users

  1. Backup: Save your current launch.sh
  2. Update: Replace with new interactive version
  3. Test: Run with basic configuration first
  4. Migrate: Use interactive prompts instead of manual editing

For New Users

  1. Setup: Run python setup_launch.py
  2. Check: Run python check_requirements.py
  3. Launch: Run ./launch.sh
  4. Follow: Use interactive prompts (the full command sequence is shown below)
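
Taken together, the new-user path is three commands, all taken from the steps above:

python setup_launch.py        # setup
python check_requirements.py  # check requirements
./launch.sh                   # launch and follow the prompts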

🎉 Conclusion

The interactive pipeline provides a much better user experience with:

  • Guided Configuration: No manual editing required
  • Multiple Options: Predefined configurations for different use cases
  • Validation: Real-time input validation and error handling
  • Flexibility: Custom configuration support
  • Safety: Confirmation steps and error recovery

The script is now production-ready for users of all skill levels, from beginners to advanced researchers.