Spaces:
Running
Running
File size: 6,767 Bytes
d291e63 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 |
# Dataset Configuration Automation Fix
## Problem Description
The original launch script required users to manually specify their username in the dataset repository name, which was:
1. **Error-prone**: Users had to remember their username
2. **Inconsistent**: Different users might use different naming conventions
3. **Manual**: Required extra steps in the setup process
## Solution Implementation
### Automatic Dataset Repository Creation
We've implemented a Python-based solution that automatically:
1. **Extracts username from token**: Uses the HF API to get the username from the validated token
2. **Creates dataset repository**: Automatically creates `username/trackio-experiments` or custom name
3. **Sets environment variables**: Automatically configures `TRACKIO_DATASET_REPO`
4. **Provides customization**: Allows users to customize the dataset name if desired
### Key Components
#### 1. **`scripts/dataset_tonic/setup_hf_dataset.py`** - Main Dataset Setup Script
- Automatically detects username from HF token
- Creates dataset repository with proper permissions
- Supports custom dataset names
- Sets environment variables for other scripts
#### 2. **Updated `launch.sh`** - Enhanced User Experience
- Automatically creates dataset repository
- Provides options for default or custom dataset names
- Fallback to manual input if automatic creation fails
- Clear user feedback and progress indicators
#### 3. **Python API Integration** - Consistent Authentication
- Uses `HfApi(token=token)` for direct token authentication
- Avoids environment variable conflicts
- Consistent error handling across all scripts
## Usage Examples
### Automatic Dataset Creation (Default)
```bash
# The launch script now automatically:
python scripts/dataset_tonic/setup_hf_dataset.py hf_your_token_here
# Creates: username/trackio-experiments
# Sets: TRACKIO_DATASET_REPO=username/trackio-experiments
```
### Custom Dataset Name
```bash
# Create with custom name
python scripts/dataset_tonic/setup_hf_dataset.py hf_your_token_here my-custom-experiments
# Creates: username/my-custom-experiments
# Sets: TRACKIO_DATASET_REPO=username/my-custom-experiments
```
### Launch Script Integration
The launch script now provides a seamless experience:
```bash
./launch.sh
# Step 3: Experiment Details
# - Automatically creates dataset repository
# - Option to use default or custom name
# - No manual username input required
```
## Features
### β
**Automatic Username Detection**
- Extracts username from HF token using Python API
- No manual username input required
- Consistent across all scripts
### β
**Flexible Dataset Naming**
- Default: `username/trackio-experiments`
- Custom: `username/custom-name`
- User choice during setup
### β
**Robust Error Handling**
- Graceful fallback to manual input
- Clear error messages
- Token validation before creation
### β
**Environment Integration**
- Automatically sets `TRACKIO_DATASET_REPO`
- Compatible with existing scripts
- No manual configuration required
### β
**Cross-Platform Compatibility**
- Works on Windows, Linux, macOS
- Uses Python API instead of CLI
- Consistent behavior across platforms
## Technical Implementation
### Token Authentication Flow
```python
# 1. Direct token authentication
api = HfApi(token=token)
# 2. Extract username
user_info = api.whoami()
username = user_info.get("name", user_info.get("username"))
# 3. Create repository
create_repo(
repo_id=f"{username}/{dataset_name}",
repo_type="dataset",
token=token,
exist_ok=True,
private=False
)
```
### Launch Script Integration
```bash
# Automatic dataset creation
if python3 scripts/dataset_tonic/setup_hf_dataset.py 2>/dev/null; then
TRACKIO_DATASET_REPO="$TRACKIO_DATASET_REPO"
print_status "Dataset repository created successfully"
else
# Fallback to manual input
get_input "Trackio dataset repository" "$HF_USERNAME/trackio-experiments" TRACKIO_DATASET_REPO
fi
```
## User Experience Improvements
### Before (Manual Process)
1. User enters HF token
2. User manually types username
3. User manually types dataset repository name
4. User manually configures environment variables
5. Risk of typos and inconsistencies
### After (Automated Process)
1. User enters HF token
2. System automatically detects username
3. System automatically creates dataset repository
4. System automatically sets environment variables
5. Option to customize dataset name if desired
## Error Handling
### Common Scenarios
| Scenario | Action | User Experience |
|----------|--------|-----------------|
| Valid token | β
Automatic creation | Seamless setup |
| Invalid token | β Clear error message | Helpful feedback |
| Network issues | β οΈ Retry with fallback | Graceful degradation |
| Repository exists | βΉοΈ Use existing | No conflicts |
### Fallback Mechanisms
1. **Token validation fails**: Clear error message with troubleshooting steps
2. **Dataset creation fails**: Fallback to manual input
3. **Network issues**: Retry with exponential backoff
4. **Permission issues**: Clear guidance on token permissions
## Benefits
### For Users
- **Simplified Setup**: No manual username input required
- **Reduced Errors**: Automatic username detection eliminates typos
- **Consistent Naming**: Standardized repository naming conventions
- **Better UX**: Clear progress indicators and feedback
### For Developers
- **Maintainable Code**: Python API instead of CLI dependencies
- **Cross-Platform**: Works consistently across operating systems
- **Extensible**: Easy to add new features and customizations
- **Testable**: Comprehensive test coverage
### For System
- **Reliable**: Robust error handling and fallback mechanisms
- **Secure**: Direct token authentication without environment conflicts
- **Scalable**: Easy to extend for additional repository types
- **Integrated**: Seamless integration with existing pipeline
## Migration Guide
### For Existing Users
No migration required! The system automatically:
- Detects existing repositories
- Uses existing repositories if they exist
- Creates new repositories only when needed
### For New Users
The setup is now completely automated:
1. Run `./launch.sh`
2. Enter your HF token
3. Choose dataset naming preference
4. System handles everything else automatically
## Future Enhancements
- [ ] Support for organization repositories
- [ ] Multiple dataset repositories per user
- [ ] Dataset repository templates
- [ ] Advanced repository configuration options
- [ ] Repository sharing and collaboration features
---
**Note**: This automation ensures that users can focus on their fine-tuning experiments rather than repository setup details, while maintaining full flexibility for customization when needed. |