Spaces:
Running
Running
Dataset Configuration Automation Fix
Problem Description
The original launch script required users to manually specify their username in the dataset repository name, which was:
- Error-prone: Users had to remember their username
- Inconsistent: Different users might use different naming conventions
- Manual: Required extra steps in the setup process
Solution Implementation
Automatic Dataset Repository Creation
We've implemented a Python-based solution that automatically:
- Extracts username from token: Uses the HF API to get the username from the validated token
- Creates dataset repository: Automatically creates
username/trackio-experiments
or custom name - Sets environment variables: Automatically configures
TRACKIO_DATASET_REPO
- Provides customization: Allows users to customize the dataset name if desired
Key Components
1. scripts/dataset_tonic/setup_hf_dataset.py
- Main Dataset Setup Script
- Automatically detects username from HF token
- Creates dataset repository with proper permissions
- Supports custom dataset names
- Sets environment variables for other scripts
2. Updated launch.sh
- Enhanced User Experience
- Automatically creates dataset repository
- Provides options for default or custom dataset names
- Fallback to manual input if automatic creation fails
- Clear user feedback and progress indicators
3. Python API Integration - Consistent Authentication
- Uses
HfApi(token=token)
for direct token authentication - Avoids environment variable conflicts
- Consistent error handling across all scripts
Usage Examples
Automatic Dataset Creation (Default)
# The launch script now automatically:
python scripts/dataset_tonic/setup_hf_dataset.py hf_your_token_here
# Creates: username/trackio-experiments
# Sets: TRACKIO_DATASET_REPO=username/trackio-experiments
Custom Dataset Name
# Create with custom name
python scripts/dataset_tonic/setup_hf_dataset.py hf_your_token_here my-custom-experiments
# Creates: username/my-custom-experiments
# Sets: TRACKIO_DATASET_REPO=username/my-custom-experiments
Launch Script Integration
The launch script now provides a seamless experience:
./launch.sh
# Step 3: Experiment Details
# - Automatically creates dataset repository
# - Option to use default or custom name
# - No manual username input required
Features
β Automatic Username Detection
- Extracts username from HF token using Python API
- No manual username input required
- Consistent across all scripts
β Flexible Dataset Naming
- Default:
username/trackio-experiments
- Custom:
username/custom-name
- User choice during setup
β Robust Error Handling
- Graceful fallback to manual input
- Clear error messages
- Token validation before creation
β Environment Integration
- Automatically sets
TRACKIO_DATASET_REPO
- Compatible with existing scripts
- No manual configuration required
β Cross-Platform Compatibility
- Works on Windows, Linux, macOS
- Uses Python API instead of CLI
- Consistent behavior across platforms
Technical Implementation
Token Authentication Flow
# 1. Direct token authentication
api = HfApi(token=token)
# 2. Extract username
user_info = api.whoami()
username = user_info.get("name", user_info.get("username"))
# 3. Create repository
create_repo(
repo_id=f"{username}/{dataset_name}",
repo_type="dataset",
token=token,
exist_ok=True,
private=False
)
Launch Script Integration
# Automatic dataset creation
if python3 scripts/dataset_tonic/setup_hf_dataset.py 2>/dev/null; then
TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
print_status "Dataset repository created successfully"
else
# Fallback to manual input
get_input "Trackio dataset repository" "$HF_USERNAME/trackio-experiments" TRACKIO_DATASET_REPO
fi
User Experience Improvements
Before (Manual Process)
- User enters HF token
- User manually types username
- User manually types dataset repository name
- User manually configures environment variables
- Risk of typos and inconsistencies
After (Automated Process)
- User enters HF token
- System automatically detects username
- System automatically creates dataset repository
- System automatically sets environment variables
- Option to customize dataset name if desired
Error Handling
Common Scenarios
Scenario | Action | User Experience |
---|---|---|
Valid token | β Automatic creation | Seamless setup |
Invalid token | β Clear error message | Helpful feedback |
Network issues | β οΈ Retry with fallback | Graceful degradation |
Repository exists | βΉοΈ Use existing | No conflicts |
Fallback Mechanisms
- Token validation fails: Clear error message with troubleshooting steps
- Dataset creation fails: Fallback to manual input
- Network issues: Retry with exponential backoff
- Permission issues: Clear guidance on token permissions
Benefits
For Users
- Simplified Setup: No manual username input required
- Reduced Errors: Automatic username detection eliminates typos
- Consistent Naming: Standardized repository naming conventions
- Better UX: Clear progress indicators and feedback
For Developers
- Maintainable Code: Python API instead of CLI dependencies
- Cross-Platform: Works consistently across operating systems
- Extensible: Easy to add new features and customizations
- Testable: Comprehensive test coverage
For System
- Reliable: Robust error handling and fallback mechanisms
- Secure: Direct token authentication without environment conflicts
- Scalable: Easy to extend for additional repository types
- Integrated: Seamless integration with existing pipeline
Migration Guide
For Existing Users
No migration required! The system automatically:
- Detects existing repositories
- Uses existing repositories if they exist
- Creates new repositories only when needed
For New Users
The setup is now completely automated:
- Run
./launch.sh
- Enter your HF token
- Choose dataset naming preference
- System handles everything else automatically
Future Enhancements
- Support for organization repositories
- Multiple dataset repositories per user
- Dataset repository templates
- Advanced repository configuration options
- Repository sharing and collaboration features
Note: This automation ensures that users can focus on their fine-tuning experiments rather than repository setup details, while maintaining full flexibility for customization when needed.