SmolFactory / docs /DATASET_AUTOMATION_FIX.md
Tonic's picture
adds new hf cli
d291e63 verified
|
raw
history blame
6.77 kB

Dataset Configuration Automation Fix

Problem Description

The original launch script required users to manually specify their username in the dataset repository name, which was:

  1. Error-prone: Users had to remember their username
  2. Inconsistent: Different users might use different naming conventions
  3. Manual: Required extra steps in the setup process

Solution Implementation

Automatic Dataset Repository Creation

We've implemented a Python-based solution that automatically:

  1. Extracts username from token: Uses the HF API to get the username from the validated token
  2. Creates dataset repository: Automatically creates username/trackio-experiments or custom name
  3. Sets environment variables: Automatically configures TRACKIO_DATASET_REPO
  4. Provides customization: Allows users to customize the dataset name if desired

Key Components

1. scripts/dataset_tonic/setup_hf_dataset.py - Main Dataset Setup Script

  • Automatically detects username from HF token
  • Creates dataset repository with proper permissions
  • Supports custom dataset names
  • Sets environment variables for other scripts

2. Updated launch.sh - Enhanced User Experience

  • Automatically creates dataset repository
  • Provides options for default or custom dataset names
  • Fallback to manual input if automatic creation fails
  • Clear user feedback and progress indicators

3. Python API Integration - Consistent Authentication

  • Uses HfApi(token=token) for direct token authentication
  • Avoids environment variable conflicts
  • Consistent error handling across all scripts

Usage Examples

Automatic Dataset Creation (Default)

# The launch script now automatically:
python scripts/dataset_tonic/setup_hf_dataset.py hf_your_token_here

# Creates: username/trackio-experiments
# Sets: TRACKIO_DATASET_REPO=username/trackio-experiments

Custom Dataset Name

# Create with custom name
python scripts/dataset_tonic/setup_hf_dataset.py hf_your_token_here my-custom-experiments

# Creates: username/my-custom-experiments
# Sets: TRACKIO_DATASET_REPO=username/my-custom-experiments

Launch Script Integration

The launch script now provides a seamless experience:

./launch.sh

# Step 3: Experiment Details
# - Automatically creates dataset repository
# - Option to use default or custom name
# - No manual username input required

Features

βœ… Automatic Username Detection

  • Extracts username from HF token using Python API
  • No manual username input required
  • Consistent across all scripts

βœ… Flexible Dataset Naming

  • Default: username/trackio-experiments
  • Custom: username/custom-name
  • User choice during setup

βœ… Robust Error Handling

  • Graceful fallback to manual input
  • Clear error messages
  • Token validation before creation

βœ… Environment Integration

  • Automatically sets TRACKIO_DATASET_REPO
  • Compatible with existing scripts
  • No manual configuration required

βœ… Cross-Platform Compatibility

  • Works on Windows, Linux, macOS
  • Uses Python API instead of CLI
  • Consistent behavior across platforms

Technical Implementation

Token Authentication Flow

# 1. Direct token authentication
api = HfApi(token=token)

# 2. Extract username
user_info = api.whoami()
username = user_info.get("name", user_info.get("username"))

# 3. Create repository
create_repo(
    repo_id=f"{username}/{dataset_name}",
    repo_type="dataset",
    token=token,
    exist_ok=True,
    private=False
)

Launch Script Integration

# Automatic dataset creation
if python3 scripts/dataset_tonic/setup_hf_dataset.py 2>/dev/null; then
    TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
    print_status "Dataset repository created successfully"
else
    # Fallback to manual input
    get_input "Trackio dataset repository" "$HF_USERNAME/trackio-experiments" TRACKIO_DATASET_REPO
fi

User Experience Improvements

Before (Manual Process)

  1. User enters HF token
  2. User manually types username
  3. User manually types dataset repository name
  4. User manually configures environment variables
  5. Risk of typos and inconsistencies

After (Automated Process)

  1. User enters HF token
  2. System automatically detects username
  3. System automatically creates dataset repository
  4. System automatically sets environment variables
  5. Option to customize dataset name if desired

Error Handling

Common Scenarios

Scenario Action User Experience
Valid token βœ… Automatic creation Seamless setup
Invalid token ❌ Clear error message Helpful feedback
Network issues ⚠️ Retry with fallback Graceful degradation
Repository exists ℹ️ Use existing No conflicts

Fallback Mechanisms

  1. Token validation fails: Clear error message with troubleshooting steps
  2. Dataset creation fails: Fallback to manual input
  3. Network issues: Retry with exponential backoff
  4. Permission issues: Clear guidance on token permissions

Benefits

For Users

  • Simplified Setup: No manual username input required
  • Reduced Errors: Automatic username detection eliminates typos
  • Consistent Naming: Standardized repository naming conventions
  • Better UX: Clear progress indicators and feedback

For Developers

  • Maintainable Code: Python API instead of CLI dependencies
  • Cross-Platform: Works consistently across operating systems
  • Extensible: Easy to add new features and customizations
  • Testable: Comprehensive test coverage

For System

  • Reliable: Robust error handling and fallback mechanisms
  • Secure: Direct token authentication without environment conflicts
  • Scalable: Easy to extend for additional repository types
  • Integrated: Seamless integration with existing pipeline

Migration Guide

For Existing Users

No migration required! The system automatically:

  • Detects existing repositories
  • Uses existing repositories if they exist
  • Creates new repositories only when needed

For New Users

The setup is now completely automated:

  1. Run ./launch.sh
  2. Enter your HF token
  3. Choose dataset naming preference
  4. System handles everything else automatically

Future Enhancements

  • Support for organization repositories
  • Multiple dataset repositories per user
  • Dataset repository templates
  • Advanced repository configuration options
  • Repository sharing and collaboration features

Note: This automation ensures that users can focus on their fine-tuning experiments rather than repository setup details, while maintaining full flexibility for customization when needed.