Spaces:

Tonic
/

SmolFactory

Running

App Files Files Community

SmolFactory / docs /DATASET_AUTOMATION_FIX.md

Tonic

adds new hf cli

d291e63 verified about 2 months ago

preview code

raw

history blame

6.77 kB

Dataset Configuration Automation Fix

Problem Description

The original launch script required users to manually specify their username in the dataset repository name, which was:

Error-prone: Users had to remember their username
Inconsistent: Different users might use different naming conventions
Manual: Required extra steps in the setup process

Solution Implementation

Automatic Dataset Repository Creation

We've implemented a Python-based solution that automatically:

Extracts username from token: Uses the HF API to get the username from the validated token
Creates dataset repository: Automatically creates username/trackio-experiments or custom name
Sets environment variables: Automatically configures TRACKIO_DATASET_REPO
Provides customization: Allows users to customize the dataset name if desired

Key Components

1. `scripts/dataset_tonic/setup_hf_dataset.py` - Main Dataset Setup Script

Automatically detects username from HF token
Creates dataset repository with proper permissions
Supports custom dataset names
Sets environment variables for other scripts

2. Updated `launch.sh` - Enhanced User Experience

Automatically creates dataset repository
Provides options for default or custom dataset names
Fallback to manual input if automatic creation fails
Clear user feedback and progress indicators

3. Python API Integration - Consistent Authentication

Uses HfApi(token=token) for direct token authentication
Avoids environment variable conflicts
Consistent error handling across all scripts

Usage Examples

Automatic Dataset Creation (Default)

# The launch script now automatically:
python scripts/dataset_tonic/setup_hf_dataset.py hf_your_token_here

# Creates: username/trackio-experiments
# Sets: TRACKIO_DATASET_REPO=username/trackio-experiments

Custom Dataset Name

# Create with custom name
python scripts/dataset_tonic/setup_hf_dataset.py hf_your_token_here my-custom-experiments

# Creates: username/my-custom-experiments
# Sets: TRACKIO_DATASET_REPO=username/my-custom-experiments

Launch Script Integration

The launch script now provides a seamless experience:

./launch.sh

# Step 3: Experiment Details
# - Automatically creates dataset repository
# - Option to use default or custom name
# - No manual username input required

Features

✅ Automatic Username Detection

Extracts username from HF token using Python API
No manual username input required
Consistent across all scripts

✅ Flexible Dataset Naming

Default: username/trackio-experiments
Custom: username/custom-name
User choice during setup

✅ Robust Error Handling

Graceful fallback to manual input
Clear error messages
Token validation before creation

✅ Environment Integration

Automatically sets TRACKIO_DATASET_REPO
Compatible with existing scripts
No manual configuration required

✅ Cross-Platform Compatibility

Works on Windows, Linux, macOS
Uses Python API instead of CLI
Consistent behavior across platforms

Technical Implementation

Token Authentication Flow

# 1. Direct token authentication
api = HfApi(token=token)

# 2. Extract username
user_info = api.whoami()
username = user_info.get("name", user_info.get("username"))

# 3. Create repository
create_repo(
    repo_id=f"{username}/{dataset_name}",
    repo_type="dataset",
    token=token,
    exist_ok=True,
    private=False
)

Launch Script Integration

# Automatic dataset creation
if python3 scripts/dataset_tonic/setup_hf_dataset.py 2>/dev/null; then
    TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
    print_status "Dataset repository created successfully"
else
    # Fallback to manual input
    get_input "Trackio dataset repository" "$HF_USERNAME/trackio-experiments" TRACKIO_DATASET_REPO
fi

User Experience Improvements

Before (Manual Process)

User enters HF token
User manually types username
User manually types dataset repository name
User manually configures environment variables
Risk of typos and inconsistencies

After (Automated Process)

User enters HF token
System automatically detects username
System automatically creates dataset repository
System automatically sets environment variables
Option to customize dataset name if desired

Error Handling

Common Scenarios

Scenario	Action	User Experience
Valid token	✅ Automatic creation	Seamless setup
Invalid token	❌ Clear error message	Helpful feedback
Network issues	⚠️ Retry with fallback	Graceful degradation
Repository exists	ℹ️ Use existing	No conflicts

Fallback Mechanisms

Token validation fails: Clear error message with troubleshooting steps
Dataset creation fails: Fallback to manual input
Network issues: Retry with exponential backoff
Permission issues: Clear guidance on token permissions

Benefits

For Users

Simplified Setup: No manual username input required
Reduced Errors: Automatic username detection eliminates typos
Consistent Naming: Standardized repository naming conventions
Better UX: Clear progress indicators and feedback

For Developers

Maintainable Code: Python API instead of CLI dependencies
Cross-Platform: Works consistently across operating systems
Extensible: Easy to add new features and customizations
Testable: Comprehensive test coverage

For System

Reliable: Robust error handling and fallback mechanisms
Secure: Direct token authentication without environment conflicts
Scalable: Easy to extend for additional repository types
Integrated: Seamless integration with existing pipeline

Migration Guide

For Existing Users

No migration required! The system automatically:

Detects existing repositories
Uses existing repositories if they exist
Creates new repositories only when needed

For New Users

The setup is now completely automated:

Run ./launch.sh
Enter your HF token
Choose dataset naming preference
System handles everything else automatically

Future Enhancements

Support for organization repositories
Multiple dataset repositories per user
Dataset repository templates
Advanced repository configuration options
Repository sharing and collaboration features

Note: This automation ensures that users can focus on their fine-tuning experiments rather than repository setup details, while maintaining full flexibility for customization when needed.

Dataset Configuration Automation Fix

Problem Description

Solution Implementation

Automatic Dataset Repository Creation

Key Components

1. scripts/dataset_tonic/setup_hf_dataset.py - Main Dataset Setup Script

2. Updated launch.sh - Enhanced User Experience

3. Python API Integration - Consistent Authentication

Usage Examples

Automatic Dataset Creation (Default)

Custom Dataset Name

Launch Script Integration

Features

✅ Automatic Username Detection

✅ Flexible Dataset Naming

✅ Robust Error Handling

✅ Environment Integration

✅ Cross-Platform Compatibility

Technical Implementation

Token Authentication Flow

Launch Script Integration

User Experience Improvements

Before (Manual Process)

After (Automated Process)

Error Handling

Common Scenarios

Fallback Mechanisms

Benefits

For Users

For Developers

For System

Migration Guide

For Existing Users

For New Users

Future Enhancements

1. `scripts/dataset_tonic/setup_hf_dataset.py` - Main Dataset Setup Script

2. Updated `launch.sh` - Enhanced User Experience