SmolFactory / docs /DATASET_AUTOMATION_FIX.md

Tonic

adds new hf cli

d291e63 verified 3 months ago

	# Dataset Configuration Automation Fix

	## Problem Description

	The original launch script required users to manually specify their username in the dataset repository name, which was:
	1. Error-prone: Users had to remember their username
	2. Inconsistent: Different users might use different naming conventions
	3. Manual: Required extra steps in the setup process

	## Solution Implementation

	### Automatic Dataset Repository Creation

	We've implemented a Python-based solution that automatically:

	1. Extracts username from token: Uses the HF API to get the username from the validated token
	2. Creates dataset repository: Automatically creates `username/trackio-experiments` or a repository with a custom name
	3. Sets environment variables: Automatically configures `TRACKIO_DATASET_REPO`
	4. Provides customization: Allows users to customize the dataset name if desired
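The naming rule in step 2 can be sketched as a small pure function. This is an illustrative sketch, not the code in `setup_hf_dataset.py`; the names `build_repo_id` and `DEFAULT_DATASET_NAME` are hypothetical, and the rejection of `/` in the dataset name is an assumption about what a sensible check might look like.

```python
from typing import Optional

# Default name as stated in this document.
DEFAULT_DATASET_NAME = "trackio-experiments"


def build_repo_id(username: str, dataset_name: Optional[str] = None) -> str:
    """Return '<username>/<dataset_name>', using the default name when none is given."""
    username = username.strip()
    if not username:
        raise ValueError("username must not be empty")
    # Treat None, empty, and whitespace-only names as "use the default".
    name = (dataset_name or "").strip() or DEFAULT_DATASET_NAME
    if "/" in name:
        # Assumption: the caller passes only the name part, never a full repo ID.
        raise ValueError("dataset name must not contain '/'")
    return f"{username}/{name}"
```

Keeping this logic free of network calls makes the default-versus-custom choice easy to test in isolation.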

	### Key Components

	#### 1. `scripts/dataset_tonic/setup_hf_dataset.py` - Main Dataset Setup Script
	- Automatically detects username from HF token
	- Creates dataset repository with proper permissions
	- Supports custom dataset names
	- Sets environment variables for other scripts

	#### 2. Updated `launch.sh` - Enhanced User Experience
	- Automatically creates dataset repository
	- Provides options for default or custom dataset names
	- Falls back to manual input if automatic creation fails
	- Clear user feedback and progress indicators

	#### 3. Python API Integration - Consistent Authentication
	- Uses `HfApi(token=token)` for direct token authentication
	- Avoids environment variable conflicts
	- Consistent error handling across all scripts

	## Usage Examples

	### Automatic Dataset Creation (Default)

	```bash
	# The launch script now runs this step automatically:
	python scripts/dataset_tonic/setup_hf_dataset.py hf_your_token_here

	# Creates: username/trackio-experiments
	# Sets: TRACKIO_DATASET_REPO=username/trackio-experiments
	```

	### Custom Dataset Name

	```bash
	# Create with custom name
	python scripts/dataset_tonic/setup_hf_dataset.py hf_your_token_here my-custom-experiments

	# Creates: username/my-custom-experiments
	# Sets: TRACKIO_DATASET_REPO=username/my-custom-experiments
	```
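The two invocations above suggest a required token argument followed by an optional dataset name. A minimal sketch of how such a command line could be parsed is shown here; the actual argument handling in `setup_hf_dataset.py` is not shown in this document, so the function name `parse_args` and the use of `argparse` are assumptions.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse '<token> [dataset_name]' as used in the examples above."""
    parser = argparse.ArgumentParser(
        description="Create a Trackio dataset repository (illustrative sketch)"
    )
    parser.add_argument("token", help="Hugging Face access token")
    parser.add_argument(
        "dataset_name",
        nargs="?",  # optional positional argument
        default="trackio-experiments",
        help="dataset name without the username prefix",
    )
    return parser.parse_args(argv)
```

Note that a token passed as a command-line argument can end up in shell history and process listings, so reading it from an environment variable or a prompt may be preferable in shared environments.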

	### Launch Script Integration

	The launch script now provides a seamless experience:

	```bash
	./launch.sh

	# Step 3: Experiment Details
	# - Automatically creates dataset repository
	# - Option to use default or custom name
	# - No manual username input required
	```

	## Features

	### ✅ Automatic Username Detection
	- Extracts username from HF token using Python API
	- No manual username input required
	- Consistent across all scripts

	### ✅ Flexible Dataset Naming
	- Default: `username/trackio-experiments`
	- Custom: `username/custom-name`
	- User choice during setup

	### ✅ Robust Error Handling
	- Graceful fallback to manual input
	- Clear error messages
	- Token validation before creation
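Token validation before creation can be sketched as a lookup that returns `None` instead of raising, so the caller can fall back to manual input. The function name `detect_username` is hypothetical, and the lookup is passed in as a callable so the sketch runs without network access; with `huggingface_hub` installed, the callable would typically be `HfApi(token=token).whoami`.

```python
from typing import Any, Callable, Dict, Optional


def detect_username(whoami: Callable[[], Dict[str, Any]]) -> Optional[str]:
    """Return the account name reported by `whoami`, or None if it cannot be determined."""
    try:
        info = whoami()
    except Exception as exc:  # e.g. an invalid token or a network failure
        print(f"Token validation failed: {exc}")
        return None
    # Mirror the lookup order used elsewhere in this document: "name", then "username".
    username = info.get("name") or info.get("username")
    return username or None
```

A caller would treat a `None` result as the signal to prompt the user for the repository name by hand.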

	### ✅ Environment Integration
	- Automatically sets `TRACKIO_DATASET_REPO`
	- Compatible with existing scripts
	- No manual configuration required

	### ✅ Cross-Platform Compatibility
	- Works on Windows, Linux, macOS
	- Uses Python API instead of CLI
	- Consistent behavior across platforms

	## Technical Implementation

	### Token Authentication Flow

	```python
	from huggingface_hub import HfApi, create_repo

	# `token` and `dataset_name` are supplied by the caller

	# 1. Direct token authentication
	api = HfApi(token=token)

	# 2. Extract username
	user_info = api.whoami()
	username = user_info.get("name", user_info.get("username"))

	# 3. Create repository (exist_ok=True reuses an existing repository)
	create_repo(
	    repo_id=f"{username}/{dataset_name}",
	    repo_type="dataset",
	    token=token,
	    exist_ok=True,
	    private=False,
	)
	```

	### Launch Script Integration

	```bash
	# Automatic dataset creation
	if python3 scripts/dataset_tonic/setup_hf_dataset.py 2>/dev/null; then
	    # Note: a child process cannot export variables into this shell, so this
	    # self-assignment only keeps a value that was already set here.
	    TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
	    print_status "Dataset repository created successfully"
	else
	    # Fallback to manual input
	    get_input "Trackio dataset repository" "$HF_USERNAME/trackio-experiments" TRACKIO_DATASET_REPO
	fi
	```
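A child process cannot modify its parent's environment, so a value computed by a Python script has to travel back to the caller through another channel, such as standard output. The sketch below illustrates that idea from the calling side, in Python rather than shell. The function name `run_and_capture_repo_id` is hypothetical, and the convention that the child prints the repository ID as its last line of output is an assumption, not documented behavior of `setup_hf_dataset.py`.

```python
import subprocess
import sys
from typing import List, Optional


def run_and_capture_repo_id(command: List[str]) -> Optional[str]:
    """Run `command`; return its last non-empty stdout line, or None on failure."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # The caller can fall back to manual input in this case.
        return None
    lines = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    return lines[-1] if lines else None


if __name__ == "__main__":
    # Stand-in child process that prints a repository ID.
    child = [sys.executable, "-c", "print('alice/trackio-experiments')"]
    print(run_and_capture_repo_id(child))
```

In a shell script the equivalent pattern is command substitution, where the script assigns the captured output to `TRACKIO_DATASET_REPO` itself.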

	## User Experience Improvements

	### Before (Manual Process)
	1. User enters HF token
	2. User manually types username
	3. User manually types dataset repository name
	4. User manually configures environment variables
	5. Risk of typos and inconsistencies

	### After (Automated Process)
	1. User enters HF token
	2. System automatically detects username
	3. System automatically creates dataset repository
	4. System automatically sets environment variables
	5. Option to customize dataset name if desired

	## Error Handling

	### Common Scenarios

	| Scenario | Action | User Experience |
	|----------|--------|-----------------|
	| Valid token | ✅ Automatic creation | Seamless setup |
	| Invalid token | ❌ Clear error message | Helpful feedback |
	| Network issues | ⚠️ Retry with fallback | Graceful degradation |
	| Repository exists | ℹ️ Use existing | No conflicts |

	### Fallback Mechanisms

	1. Token validation fails: Clear error message with troubleshooting steps
	2. Dataset creation fails: Fallback to manual input
	3. Network issues: Retry with exponential backoff
	4. Permission issues: Clear guidance on token permissions
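The retry behavior in item 3 can be sketched as a small helper. This is an illustration under stated assumptions, not the project's implementation: the name `retry_with_backoff`, the default of four attempts, the one-second base delay, and the choice to retry only on `OSError` are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, doubling the wait after each failure; re-raise the last error."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except OSError:
            # Assumption: network-style failures surface as OSError subclasses
            # (e.g. ConnectionError); adjust to the client library's exceptions.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # waits of 1x, 2x, 4x, ... base_delay
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists so the delay can be replaced in tests; once the attempts are exhausted the error propagates, and the caller can switch to the manual-input fallback.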

	## Benefits

	### For Users
	- Simplified Setup: No manual username input required
	- Reduced Errors: Automatic username detection eliminates typos
	- Consistent Naming: Standardized repository naming conventions
	- Better UX: Clear progress indicators and feedback

	### For Developers
	- Maintainable Code: Python API instead of CLI dependencies
	- Cross-Platform: Works consistently across operating systems
	- Extensible: Easy to add new features and customizations
	- Testable: Comprehensive test coverage

	### For the System
	- Reliable: Robust error handling and fallback mechanisms
	- Secure: Direct token authentication without environment conflicts
	- Scalable: Easy to extend for additional repository types
	- Integrated: Seamless integration with existing pipeline

	## Migration Guide

	### For Existing Users

	No migration required! The system automatically:
	- Detects existing repositories
	- Uses existing repositories if they exist
	- Creates new repositories only when needed

	### For New Users

	The setup is now completely automated:
	1. Run `./launch.sh`
	2. Enter your HF token
	3. Choose dataset naming preference
	4. System handles everything else automatically

	## Future Enhancements

	- [ ] Support for organization repositories
	- [ ] Multiple dataset repositories per user
	- [ ] Dataset repository templates
	- [ ] Advanced repository configuration options
	- [ ] Repository sharing and collaboration features

	---

	Note: This automation ensures that users can focus on their fine-tuning experiments rather than repository setup details, while maintaining full flexibility for customization when needed.