Tonic committed on
Commit
769bb84
·
verified ·
1 Parent(s): ca1f1cd

fix launch script

.cursorrules DELETED
@@ -1,277 +0,0 @@
1
- ---
2
- description: SmolLM3 Fine-tuning Pipeline - Project Rules and Conventions
3
- globs: ["**/*.py", "**/*.sh", "**/*.md", "**/*.json"]
4
- alwaysApply: true
5
- ---
6
-
7
- # SmolLM3 Fine-tuning Pipeline Project Rules
8
-
9
- ## Project Overview
10
- This is a comprehensive end-to-end fine-tuning pipeline for SmolLM3 models with Trackio monitoring, Hugging Face integration, and interactive configuration management.
11
-
12
- ## Core Architecture
13
-
14
- ### Directory Structure
15
- - `config/` - Training configuration files for different scenarios
16
- - `src/` - Core training and model logic
17
- - `scripts/` - Utility scripts for deployment, dataset management, and model pushing
18
- - `docs/` - Comprehensive documentation and guides
19
- - `templates/` - Templates for HF Spaces and datasets
20
- - `tests/` - Test files and debugging scripts
21
- - `outputs/` - Training outputs and checkpoints
22
-
23
- ### Key Components
24
-
25
- #### Training Configurations
26
- - **Basic Training**: SmolLM3-3B + OpenHermes-FR, 3 epochs, batch size 2
27
- - **H100 Lightweight**: SmolLM3-3B + OpenHermes-FR (80K samples), 1 epoch, batch size 16
28
- - **A100 Large Scale**: SmolLM3-3B + OpenHermes-FR, 1.3 passes, batch size 8
29
- - **Multiple Passes**: SmolLM3-3B + OpenHermes-FR, 4 epochs, batch size 6
30
- - **Custom Configuration**: User-defined parameters
31
-
32
- #### Core Modules
33
- - `src/train.py` - Main training orchestration
34
- - `src/model.py` - Model loading and configuration
35
- - `src/data.py` - Dataset processing and loading
36
- - `src/monitoring.py` - Trackio integration and metrics
37
- - `src/trainer.py` - Training loop and optimization
38
-
39
- ## Coding Conventions
40
-
41
- ### Python Style
42
- - Use type hints for all function parameters and return values
43
- - Follow PEP 8 for formatting
44
- - Use descriptive variable names in snake_case
45
- - Add comprehensive docstrings for all functions
46
- - Use f-strings for string formatting
47
-
48
- ### Configuration Management
49
- - All training configs inherit from `SmolLM3Config` base class
50
- - Use dataclasses for configuration objects
51
- - Validate configuration parameters in __post_init__
52
- - Support both YAML and Python configuration files
53
-
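For illustration, a minimal sketch of this convention, using hypothetical field names rather than the real `SmolLM3Config` definition:

```python
from dataclasses import dataclass


@dataclass
class ExampleTrainingConfig:
    """Illustrative config following the dataclass + __post_init__ convention."""
    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    batch_size: int = 2
    learning_rate: float = 5e-6
    max_epochs: int = 3

    def __post_init__(self) -> None:
        # Validate parameters as soon as the config object is created
        if self.batch_size <= 0:
            raise ValueError(f"batch_size must be positive, got {self.batch_size}")
        if self.learning_rate <= 0:
            raise ValueError(f"learning_rate must be positive, got {self.learning_rate}")
```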
54
- ### Error Handling
55
- - Use try-except blocks for external API calls (HF, Trackio)
56
- - Log errors with appropriate context
57
- - Provide user-friendly error messages
58
- - Implement graceful degradation for optional features
59
-
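A minimal sketch of the graceful-degradation pattern, assuming a hypothetical Trackio client object rather than the project's actual monitoring API:

```python
import logging

logger = logging.getLogger(__name__)


def log_metrics_safely(trackio_client, metrics: dict) -> None:
    """Send metrics to Trackio, but never let monitoring failures stop training."""
    try:
        trackio_client.log(metrics)  # hypothetical client call
    except Exception as exc:
        # Graceful degradation: warn with context and keep training
        logger.warning("Trackio logging failed (%s); continuing without remote metrics", exc)
```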
60
- ### Monitoring Integration
61
- - Always include Trackio URL and experiment name in configs
62
- - Log metrics every N steps (configurable)
63
- - Save checkpoints and artifacts to HF Datasets
64
- - Use structured logging with consistent field names
65
-
66
- ## File Naming Conventions
67
-
68
- ### Configuration Files
69
- - `train_smollm3_*.py` - Training configurations
70
- - `*_config.py` - General configuration files
71
- - Use descriptive suffixes: `_h100_lightweight`, `_a100_large`, `_multiple_passes`
72
-
73
- ### Script Files
74
- - `deploy_*.py` - Deployment scripts
75
- - `setup_*.py` - Setup and initialization scripts
76
- - `push_*.py` - Model pushing scripts
77
- - `configure_*.py` - Configuration scripts
78
-
79
- ### Test Files
80
- - `test_*.py` - Test files
81
- - `debug_*.py` - Debugging scripts
82
- - Include descriptive names indicating what they test
83
-
84
- ## Training Pipeline Workflow
85
-
86
- ### Interactive Pipeline (`launch.sh`)
87
- 1. **Authentication**: HF username and token validation
88
- 2. **Configuration Selection**: Choose from predefined configs or custom
89
- 3. **Experiment Setup**: Configure experiment name and repositories
90
- 4. **Environment Setup**: Install dependencies and setup virtual environment
91
- 5. **Deployment**: Deploy Trackio Space and setup HF Dataset
92
- 6. **Training**: Execute training with monitoring
93
- 7. **Model Push**: Upload model to HF Hub with documentation
94
- 8. **Testing**: Validate uploaded model functionality
95
-
96
- ### Configuration Selection Logic
97
- - Basic Training: Default for beginners and learning
98
- - H100 Lightweight: Rapid experiments on H100 GPUs
99
- - A100 Large Scale: Serious research and production
100
- - Multiple Passes: Thorough training for production models
101
- - Custom: User-defined parameters for specific needs
102
-
103
- ## Dataset Management
104
-
105
- ### Supported Formats
106
- - Hugging Face Datasets format
107
- - JSON files with prompt/completion pairs
108
- - Chat format with messages array
109
- - Custom formats with conversion functions
110
-
111
- ### Dataset Processing
112
- - Automatic format detection and conversion
113
- - Random sampling for lightweight configurations
114
- - Validation split creation
115
- - Bad entry filtering and handling
116
-
117
- ### Dataset Sampling (H100 Lightweight)
118
- - 80,000 random samples from OpenHermes-FR
119
- - 1,000 validation samples (if available)
120
- - Fixed random seed (42) for reproducibility
121
- - Automatic sampling during dataset preparation
122
-
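One way to express this sampling with the `datasets` library (a sketch; the pipeline's own implementation may differ):

```python
from datasets import load_dataset

SAMPLE_SIZE = 80_000   # H100 lightweight sample size
SEED = 42              # fixed seed for reproducibility

dataset = load_dataset("legmlai/openhermes-fr")
train = dataset["train"].shuffle(seed=SEED)
train = train.select(range(min(SAMPLE_SIZE, len(train))))
if "validation" in dataset:
    val = dataset["validation"].shuffle(seed=SEED)
    val = val.select(range(min(1_000, len(val))))
```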
123
- ## Model Management
124
-
125
- ### Model Loading
126
- - Support for HuggingFaceTB/SmolLM3-3B
127
- - Flash attention and gradient checkpointing
128
- - Mixed precision training (fp16/bf16)
129
- - Device mapping and memory optimization
130
-
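A sketch of what such loading typically looks like with `transformers`, assuming a recent release and `flash-attn` installed (not the project's exact `src/model.py` code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,               # mixed precision (bf16)
    attn_implementation="flash_attention_2",  # requires flash-attn
    device_map="auto",                        # place layers across available GPUs
)
model.gradient_checkpointing_enable()         # trade compute for memory
```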
131
- ### Model Pushing
132
- - Comprehensive model cards with training details
133
- - Automatic README generation
134
- - License and usage information
135
- - Training metrics and configuration
136
-
137
- ## Monitoring and Tracking
138
-
139
- ### Trackio Integration
140
- - Real-time metrics logging
141
- - Training curves visualization
142
- - Resource usage monitoring
143
- - Artifact storage and versioning
144
-
145
- ### Metrics to Track
146
- - Training and validation loss
147
- - Learning rate schedule
148
- - Gradient norms
149
- - GPU utilization and memory
150
- - Training speed (steps/second)
151
-
152
- ## Error Handling and Validation
153
-
154
- ### Input Validation
155
- - Validate HF tokens before use
156
- - Check CUDA availability
157
- - Verify dataset accessibility
158
- - Validate configuration parameters
159
-
160
- ### Error Recovery
161
- - Graceful handling of network issues
162
- - Automatic retry for failed operations
163
- - Checkpoint recovery for interrupted training
164
- - Fallback options for optional features
165
-
166
- ## Documentation Standards
167
-
168
- ### README Files
169
- - Clear project description
170
- - Installation instructions
171
- - Usage examples
172
- - Configuration options
173
- - Troubleshooting guide
174
-
175
- ### Code Documentation
176
- - Comprehensive docstrings
177
- - Type hints for all functions
178
- - Example usage in docstrings
179
- - Parameter descriptions
180
- - Return value documentation
181
-
182
- ## Testing and Validation
183
-
184
- ### Test Categories
185
- - Unit tests for core functions
186
- - Integration tests for pipeline
187
- - Configuration validation tests
188
- - Model loading and saving tests
189
- - Dataset processing tests
190
-
191
- ### Debugging Tools
192
- - Standalone test scripts
193
- - Configuration validation
194
- - Model testing utilities
195
- - Dataset inspection tools
196
-
197
- ## Performance Optimization
198
-
199
- ### H100 Optimizations
200
- - Larger batch sizes (16 vs 8 for A100)
201
- - Reduced gradient accumulation (4 vs 16)
202
- - Higher learning rates (8e-6 vs 5e-6)
203
- - Optimized data loading (4 workers, pin memory)
204
-
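A quick comparison of the two profiles using the numbers above (effective batch size is batch size × gradient accumulation steps):

```python
profiles = {
    "H100 lightweight": {"batch_size": 16, "grad_accum": 4, "learning_rate": 8e-6},
    "A100 large scale": {"batch_size": 8, "grad_accum": 16, "learning_rate": 5e-6},
}
for name, cfg in profiles.items():
    effective = cfg["batch_size"] * cfg["grad_accum"]
    print(f"{name}: effective batch = {effective}, lr = {cfg['learning_rate']}")
# H100 lightweight: effective batch = 64, lr = 8e-06
# A100 large scale: effective batch = 128, lr = 5e-06
```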
205
- ### Memory Management
206
- - Gradient checkpointing for large models
207
- - Mixed precision training
208
- - Dynamic batch sizing
209
- - Memory-efficient data loading
210
-
211
- ## Security and Best Practices
212
-
213
- ### Token Management
214
- - Never hardcode tokens in code
215
- - Use environment variables
216
- - Validate tokens before use
217
- - Secure token storage
218
-
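A minimal sketch of this pattern using `huggingface_hub`, with validation via `whoami` and error handling kept deliberately simple:

```python
import os
from huggingface_hub import HfApi

token = os.environ.get("HF_TOKEN")  # read from the environment, never hardcode
if not token:
    raise RuntimeError("HF_TOKEN is not set")

# Validate the token before using it anywhere else in the pipeline
user_info = HfApi().whoami(token=token)
print(f"Authenticated as: {user_info['name']}")
```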
219
- ### Data Privacy
220
- - Filter sensitive data from datasets
221
- - Validate dataset contents
222
- - Secure data transmission
223
- - Proper data disposal
224
-
225
- ## Deployment and CI/CD
226
-
227
- ### Environment Setup
228
- - Python virtual environments
229
- - CUDA-compatible PyTorch
230
- - Required dependencies installation
231
- - System package management
232
-
233
- ### Automated Deployment
234
- - Trackio Space deployment
235
- - HF Dataset setup
236
- - Model repository creation
237
- - Configuration file generation
238
-
239
- ## Troubleshooting Guidelines
240
-
241
- ### Common Issues
242
- - CUDA out of memory: Reduce batch size
243
- - Network timeouts: Check internet connection
244
- - Token validation: Verify HF token permissions
245
- - Dataset loading: Check dataset accessibility
246
-
247
- ### Debugging Steps
248
- 1. Check system requirements
249
- 2. Validate configuration
250
- 3. Test individual components
251
- 4. Review logs and error messages
252
- 5. Verify external service connectivity
253
-
254
- ## Future Enhancements
255
-
256
- ### Planned Features
257
- - Multi-GPU training support
258
- - Advanced dataset sampling strategies
259
- - Automated hyperparameter optimization
260
- - Enhanced monitoring and visualization
261
- - Support for additional model architectures
262
-
263
- ### Extensibility
264
- - Modular configuration system
265
- - Plugin architecture for custom features
266
- - Support for custom datasets and models
267
- - Flexible monitoring integration
268
-
269
- ---
270
-
271
- **When working with this codebase:**
272
- - Always consider the end-to-end pipeline workflow
273
- - Follow the established configuration patterns
274
- - Include proper error handling and validation
275
- - Maintain comprehensive documentation
276
- - Test changes thoroughly before deployment
277
- - Consider performance implications for different hardware configurations
launch.sh CHANGED
@@ -489,113 +489,45 @@ echo "==========================================="
489
  cd ../..
490
  create_training_config "$CONFIG_FILE"
491
 
492
- # Step 13: Download and prepare dataset
493
- print_step "Step 13: Preparing Dataset"
494
- echo "==============================="
495
 
496
- python -c "
497
- from datasets import load_dataset
498
- import json
499
- import os
500
- import random
501
-
502
- # Load dataset
503
- print('Loading dataset: $DATASET_NAME')
504
- dataset = load_dataset('$DATASET_NAME')
505
-
506
- # Create dataset directory
507
- os.makedirs('training_dataset', exist_ok=True)
508
-
509
- # Convert to training format
510
- def convert_to_training_format(example):
511
- # Handle different dataset formats
512
- if 'prompt' in example and 'completion' in example:
513
- return {
514
- 'prompt': example['prompt'],
515
- 'completion': example['completion']
516
- }
517
- elif 'instruction' in example and 'output' in example:
518
- return {
519
- 'prompt': example['instruction'],
520
- 'completion': example['output']
521
- }
522
- elif 'messages' in example:
523
- # Handle chat format
524
- messages = example['messages']
525
- if len(messages) >= 2:
526
- return {
527
- 'prompt': messages[0]['content'],
528
- 'completion': messages[1]['content']
529
- }
530
- else:
531
- # Fallback
532
- return {
533
- 'prompt': str(example.get('input', '')),
534
- 'completion': str(example.get('output', ''))
535
- }
536
-
537
- # Process train split
538
- train_data = []
539
- for example in dataset['train']:
540
- training_example = convert_to_training_format(example)
541
- if training_example['prompt'] and training_example['completion']:
542
- train_data.append(training_example)
543
-
544
- # Apply dataset sampling for lightweight configuration
545
- if '$TRAINING_CONFIG_TYPE' == 'H100 Lightweight (Rapid)' and len(train_data) > ${DATASET_SAMPLE_SIZE:-0}:
546
- print(f'Sampling {${DATASET_SAMPLE_SIZE:-80000}} random samples from {len(train_data)} total samples')
547
- random.seed(42) # For reproducibility
548
- train_data = random.sample(train_data, ${DATASET_SAMPLE_SIZE:-80000})
549
- print(f'Selected {len(train_data)} samples for lightweight training')
550
-
551
- # Process validation split if available
552
- val_data = []
553
- if 'validation' in dataset:
554
- for example in dataset['validation']:
555
- training_example = convert_to_training_format(example)
556
- if training_example['prompt'] and training_example['completion']:
557
- val_data.append(training_example)
558
-
559
- # For lightweight config, also sample validation if it's large
560
- if '$TRAINING_CONFIG_TYPE' == 'H100 Lightweight (Rapid)' and len(val_data) > 1000:
561
- print(f'Sampling 1000 random validation samples from {len(val_data)} total')
562
- random.seed(42) # For reproducibility
563
- val_data = random.sample(val_data, 1000)
564
-
565
- # Save to files
566
- with open('training_dataset/train.json', 'w') as f:
567
- json.dump(train_data, f, indent=2)
568
-
569
- if val_data:
570
- with open('training_dataset/validation.json', 'w') as f:
571
- json.dump(val_data, f, indent=2)
572
-
573
- print(f'Dataset prepared: {len(train_data)} train samples, {len(val_data)} validation samples')
574
- "
575
 
576
  # Step 14: Calculate training parameters
577
  print_step "Step 14: Calculating Training Parameters"
578
  echo "============================================"
579
 
580
- TOTAL_SAMPLES=$(python -c "import json; data=json.load(open('training_dataset/train.json')); print(len(data))")
581
  EFFECTIVE_BATCH_SIZE=$((BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS))
582
- STEPS_PER_EPOCH=$((TOTAL_SAMPLES / EFFECTIVE_BATCH_SIZE))
583
- MAX_STEPS=$((STEPS_PER_EPOCH * MAX_EPOCHS))
584
-
585
- echo " Total samples: $TOTAL_SAMPLES"
586
  echo " Effective batch size: $EFFECTIVE_BATCH_SIZE"
587
- echo " Steps per epoch: $STEPS_PER_EPOCH"
588
- echo " Total training steps: $MAX_STEPS"
 
 
589
 
590
  # Step 15: Start training
591
  print_step "Step 15: Starting Training"
592
  echo "=============================="
593
 
594
- python src/train.py "$CONFIG_FILE" \
595
- --dataset_dir training_dataset \
 
 
 
 
 
 
 
 
 
596
  --out_dir /output-checkpoint \
597
  --init_from scratch \
598
- --max_iters $MAX_STEPS \
599
  --batch_size $BATCH_SIZE \
600
  --learning_rate $LEARNING_RATE \
601
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
@@ -613,38 +545,23 @@ python src/train.py "$CONFIG_FILE" \
613
  print_step "Step 16: Pushing Model to HF Hub"
614
  echo "====================================="
615
 
 
 
 
 
 
616
  python scripts/model_tonic/push_to_huggingface.py /output-checkpoint "$REPO_NAME" \
617
  --token "$HF_TOKEN" \
618
  --trackio-url "$TRACKIO_URL" \
619
  --experiment-name "$EXPERIMENT_NAME" \
620
  --dataset-repo "$TRACKIO_DATASET_REPO"
621
 
622
- # Step 17: Test the uploaded model
623
- print_step "Step 17: Testing Uploaded Model"
624
- echo "==================================="
625
-
626
- python -c "
627
- from transformers import AutoModelForCausalLM, AutoTokenizer
628
- import torch
629
-
630
- print('Loading uploaded model...')
631
- model = AutoModelForCausalLM.from_pretrained('$REPO_NAME', torch_dtype=torch.float16, device_map='auto')
632
- tokenizer = AutoTokenizer.from_pretrained('$REPO_NAME')
633
-
634
- print('Testing model generation...')
635
- prompt = 'Hello, how are you?'
636
- inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
637
- outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
638
- response = tokenizer.decode(outputs[0], skip_special_tokens=True)
639
- print(f'Prompt: {prompt}')
640
- print(f'Response: {response}')
641
- print('✅ Model test completed successfully!')
642
- "
643
-
644
- # Step 18: Create summary report
645
- print_step "Step 18: Creating Summary Report"
646
  echo "===================================="
647
 
 
 
648
  cat > training_summary.md << EOF
649
  # SmolLM3 Fine-tuning Summary
650
 
@@ -665,8 +582,6 @@ fi)
665
  - **Gradient Accumulation**: $GRADIENT_ACCUMULATION_STEPS
666
  - **Learning Rate**: $LEARNING_RATE
667
  - **Max Epochs**: $MAX_EPOCHS
668
- - **Max Steps**: $MAX_STEPS
669
- - **Total Samples**: $TOTAL_SAMPLES
670
  - **Sequence Length**: $MAX_SEQ_LENGTH
671
 
672
  ## Results
@@ -682,7 +597,6 @@ fi)
682
 
683
  ## Files Created
684
  - Training configuration: \`$CONFIG_FILE\`
685
- - Dataset: \`training_dataset/\`
686
  - Model checkpoint: \`/output-checkpoint/\`
687
  - Training logs: \`training.log\`
688
  - Summary report: \`training_summary.md\`
@@ -690,6 +604,10 @@ EOF
690
 
691
  print_status "Summary report saved to: training_summary.md"
692
 
 
 
 
 
693
  # Final summary
694
  echo ""
695
print_header "🎉 End-to-End Pipeline Completed Successfully!"
 
489
  cd ../..
490
  create_training_config "$CONFIG_FILE"
491
 
492
+ # Step 13: Dataset preparation (handled by src/data.py during training)
493
+ print_step "Step 13: Dataset Configuration"
494
+ echo "=================================="
495
 
496
+ print_info "Dataset will be loaded directly by src/data.py during training"
497
+ print_info "Dataset: $DATASET_NAME"
498
+ if [ "$TRAINING_CONFIG_TYPE" = "H100 Lightweight (Rapid)" ]; then
499
+ print_info "Sample size: ${DATASET_SAMPLE_SIZE:-80000} (will be handled by data.py)"
500
+ fi
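For context, a rough sketch of the sampling behaviour that `src/data.py` is now expected to provide; the function name and signature here are assumptions for illustration, not the actual implementation:

```python
from typing import Optional

from datasets import Dataset, load_dataset


def load_training_split(dataset_name: str, sample_size: Optional[int] = None, seed: int = 42) -> Dataset:
    """Load a Hub dataset and optionally down-sample the train split (illustrative only)."""
    dataset = load_dataset(dataset_name)
    train = dataset["train"]
    if sample_size is not None and len(train) > sample_size:
        train = train.shuffle(seed=seed).select(range(sample_size))
    return train
```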
501
 
502
  # Step 14: Calculate training parameters
503
  print_step "Step 14: Calculating Training Parameters"
504
  echo "============================================"
505
 
506
+ # Estimate training steps
507
  EFFECTIVE_BATCH_SIZE=$((BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS))
 
 
 
 
508
  echo " Effective batch size: $EFFECTIVE_BATCH_SIZE"
509
+ echo " Learning rate: $LEARNING_RATE"
510
+ echo " Max epochs: $MAX_EPOCHS"
511
+ echo " Sequence length: $MAX_SEQ_LENGTH"
512
+ echo " Training steps will be calculated by the training script"
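As a back-of-the-envelope check of the step count the training script will derive (numbers assume the H100 lightweight profile: 80,000 samples, batch size 16, gradient accumulation 4, 1 epoch):

```python
total_samples = 80_000
batch_size = 16
grad_accum_steps = 4
max_epochs = 1

effective_batch = batch_size * grad_accum_steps      # 64
steps_per_epoch = total_samples // effective_batch   # 1250
max_steps = steps_per_epoch * max_epochs             # 1250
print(effective_batch, steps_per_epoch, max_steps)
```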
513
 
514
  # Step 15: Start training
515
  print_step "Step 15: Starting Training"
516
  echo "=============================="
517
 
518
+ print_info "Using existing scripts/training/train.py script with the following parameters:"
519
+ echo " Model: $MODEL_NAME"
520
+ echo " Dataset: $DATASET_NAME"
521
+ echo " Output: /output-checkpoint"
522
+ echo " Batch size: $BATCH_SIZE"
523
+ echo " Learning rate: $LEARNING_RATE"
524
+ echo " Sequence length: $MAX_SEQ_LENGTH"
525
+
526
+ # Run the existing training script
527
+ python scripts/training/train.py "$CONFIG_FILE" \
528
+ --dataset_dir "$DATASET_NAME" \
529
  --out_dir /output-checkpoint \
530
  --init_from scratch \
 
531
  --batch_size $BATCH_SIZE \
532
  --learning_rate $LEARNING_RATE \
533
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
 
545
  print_step "Step 16: Pushing Model to HF Hub"
546
  echo "====================================="
547
 
548
+ print_info "Using scripts/model_tonic/push_to_huggingface.py script"
549
+ echo " Checkpoint: /output-checkpoint"
550
+ echo " Repository: $REPO_NAME"
551
+
552
+ # Run the existing push script
553
  python scripts/model_tonic/push_to_huggingface.py /output-checkpoint "$REPO_NAME" \
554
  --token "$HF_TOKEN" \
555
  --trackio-url "$TRACKIO_URL" \
556
  --experiment-name "$EXPERIMENT_NAME" \
557
  --dataset-repo "$TRACKIO_DATASET_REPO"
558
 
559
+ # Step 17: Create summary report
560
+ print_step "Step 17: Creating Summary Report"
561
  echo "===================================="
562
 
563
+
564
+
565
  cat > training_summary.md << EOF
566
  # SmolLM3 Fine-tuning Summary
567
 
 
582
  - **Gradient Accumulation**: $GRADIENT_ACCUMULATION_STEPS
583
  - **Learning Rate**: $LEARNING_RATE
584
  - **Max Epochs**: $MAX_EPOCHS
 
 
585
  - **Sequence Length**: $MAX_SEQ_LENGTH
586
 
587
  ## Results
 
597
 
598
  ## Files Created
599
  - Training configuration: \`$CONFIG_FILE\`
 
600
  - Model checkpoint: \`/output-checkpoint/\`
601
  - Training logs: \`training.log\`
602
  - Summary report: \`training_summary.md\`
 
604
 
605
  print_status "Summary report saved to: training_summary.md"
606
 
607
+ # Clean up temporary files
608
+ print_info "Cleaning up temporary files..."
609
+ rm -f deploy_input.txt
610
+
611
  # Final summary
612
  echo ""
613
print_header "🎉 End-to-End Pipeline Completed Successfully!"
scripts/trackio_tonic/deploy_trackio_space.py CHANGED
@@ -30,13 +30,21 @@ class TrackioSpaceDeployer:
30
  cmd = [
31
  "huggingface-cli", "repo", "create",
32
  f"{self.username}/{self.space_name}",
33
- "--type", "space",
34
- "--space-sdk", "gradio",
35
- "--space-hardware", "cpu-basic"
36
  ]
37
 
 
38
  result = subprocess.run(cmd, capture_output=True, text=True)
39
 
 
 
 
 
 
 
 
 
 
40
  if result.returncode == 0:
41
print(f"✅ Space created successfully: {self.space_url}")
42
  return True
 
30
  cmd = [
31
  "huggingface-cli", "repo", "create",
32
  f"{self.username}/{self.space_name}",
33
+ "--type", "space"
 
 
34
  ]
35
 
36
+ # Try to create the space first
37
  result = subprocess.run(cmd, capture_output=True, text=True)
38
 
39
+ if result.returncode != 0:
40
+ # Try alternative approach without space-specific flags
41
+ print("Retrying with basic space creation...")
42
+ cmd = [
43
+ "huggingface-cli", "repo", "create",
44
+ f"{self.username}/{self.space_name}"
45
+ ]
46
+ result = subprocess.run(cmd, capture_output=True, text=True)
47
+
48
  if result.returncode == 0:
49
  print(f"βœ… Space created successfully: {self.space_url}")
50
  return True
src/config.py CHANGED
@@ -3,9 +3,27 @@ Configuration management for SmolLM3 fine-tuning
3
  """
4
 
5
  import os
 
6
  import importlib.util
7
  from typing import Any
8
- from config.train_smollm3 import SmolLM3Config, get_config as get_default_config
9
 
10
  def get_config(config_path: str) -> SmolLM3Config:
11
  """Load configuration from file or return default"""
 
3
  """
4
 
5
  import os
6
+ import sys
7
  import importlib.util
8
  from typing import Any
9
+
10
+ # Add the project root to Python path
11
+ project_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
12
+ if project_root not in sys.path:
13
+ sys.path.insert(0, project_root)
14
+
15
+ # Add config directory to path
16
+ config_dir = os.path.join(project_root, 'config')
17
+ if config_dir not in sys.path:
18
+ sys.path.insert(0, config_dir)
19
+
20
+ try:
21
+ from config.train_smollm3 import SmolLM3Config, get_config as get_default_config
22
+ except ImportError:
23
+ # Fallback: try direct import
24
+ import sys
25
+ sys.path.insert(0, os.path.join(project_root, 'config'))
26
+ from train_smollm3 import SmolLM3Config, get_config as get_default_config
27
 
28
  def get_config(config_path: str) -> SmolLM3Config:
29
  """Load configuration from file or return default"""
src/train.py CHANGED
@@ -16,7 +16,17 @@ from typing import Optional, Dict, Any
16
  # Add the current directory to the path for imports
17
  sys.path.append(os.path.dirname(os.path.abspath(__file__)))
18
 
19
- from config import get_config
20
  from model import SmolLM3Model
21
  from data import SmolLM3Dataset
22
  from trainer import SmolLM3Trainer
 
16
  # Add the current directory to the path for imports
17
  sys.path.append(os.path.dirname(os.path.abspath(__file__)))
18
 
19
+ # Add project root to path for config imports
20
+ project_root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
21
+ if project_root not in sys.path:
22
+ sys.path.insert(0, project_root)
23
+
24
+ try:
25
+ from config import get_config
26
+ except ImportError:
27
+ # Fallback: try direct import
28
+ sys.path.insert(0, os.path.join(project_root, 'src'))
29
+ from config import get_config
30
  from model import SmolLM3Model
31
  from data import SmolLM3Dataset
32
  from trainer import SmolLM3Trainer
tests/test_dataset.py ADDED
@@ -0,0 +1,88 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Test script to verify OpenHermes-FR dataset loading
4
+ """
5
+
6
+ from datasets import load_dataset
7
+ import json
8
+ import random
9
+
10
+ def test_openhermes_fr():
11
+ """Test loading and processing OpenHermes-FR dataset"""
12
+
13
+ print("Loading OpenHermes-FR dataset...")
14
+ try:
15
+ dataset = load_dataset('legmlai/openhermes-fr')
16
+ print(f"✅ Dataset loaded successfully")
17
+ print(f" Train samples: {len(dataset['train'])}")
18
+ if 'validation' in dataset:
19
+ print(f" Validation samples: {len(dataset['validation'])}")
20
+
21
+ # Show sample structure
22
+ sample = dataset['train'][0]
23
+ print(f"\n📋 Sample structure:")
24
+ for key, value in sample.items():
25
+ if isinstance(value, str) and len(value) > 100:
26
+ print(f" {key}: {value[:100]}...")
27
+ else:
28
+ print(f" {key}: {value}")
29
+
30
+ # Test conversion
31
+ print(f"\n🔄 Testing conversion...")
32
+
33
+ def convert_to_training_format(example):
34
+ # Handle OpenHermes-FR format specifically
35
+ if 'prompt' in example and 'accepted_completion' in example:
36
+ return {
37
+ 'prompt': example['prompt'],
38
+ 'completion': example['accepted_completion']
39
+ }
40
+ elif 'prompt' in example and 'completion' in example:
41
+ return {
42
+ 'prompt': example['prompt'],
43
+ 'completion': example['completion']
44
+ }
45
+ else:
46
+ return None
47
+
48
+ # Process first 10 examples
49
+ train_data = []
50
+ for i, example in enumerate(dataset['train'].select(range(10))):  # select() yields row dicts; slicing a Dataset returns columns
51
+ training_example = convert_to_training_format(example)
52
+ if training_example and training_example['prompt'] and training_example['completion']:
53
+ # Filter out bad entries
54
+ if 'bad_entry' in example and example['bad_entry']:
55
+ print(f" Skipping bad entry {i}")
56
+ continue
57
+ train_data.append(training_example)
58
+ print(f" ✅ Converted example {i}")
59
+
60
+ print(f"\n📊 Conversion results:")
61
+ print(f" Converted: {len(train_data)} valid examples")
62
+
63
+ if train_data:
64
+ print(f"\n📝 Sample converted example:")
65
+ sample = train_data[0]
66
+ print(f" Prompt: {sample['prompt'][:100]}...")
67
+ print(f" Completion: {sample['completion'][:100]}...")
68
+
69
+ # Test sampling
70
+ if len(dataset['train']) > 100:
71
+ print(f"\n🎲 Testing sampling...")
72
+ random.seed(42)
73
+ sampled_indices = random.sample(range(len(dataset['train'])), 5)
74
+ print(f" Sampled indices: {sampled_indices}")
75
+
76
+ return True
77
+
78
+ except Exception as e:
79
+ print(f"❌ Error loading dataset: {e}")
80
+ return False
81
+
82
+ if __name__ == "__main__":
83
+ success = test_openhermes_fr()
84
+ if success:
85
+ print("\n✅ Dataset test completed successfully!")
86
+ else:
87
+ print("\n❌ Dataset test failed!")
88
+ exit(1)
tests/test_dataset_loading.py ADDED
@@ -0,0 +1,71 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Test script to verify dataset loading works correctly
4
+ """
5
+
6
+ import os
7
+ import sys
8
+ import json
9
+ from datasets import load_dataset
10
+
11
+ def test_dataset_loading():
12
+ """Test loading the OpenHermes-FR dataset"""
13
+ print("Testing dataset loading...")
14
+
15
+ try:
16
+ # Load the dataset
17
+ dataset = load_dataset("legmlai/openhermes-fr")
18
+ print(f"✅ Dataset loaded successfully")
19
+ print(f" Train samples: {len(dataset['train'])}")
20
+
21
+ # Check the first few examples
22
+ print("\nSample examples:")
23
+ for i in range(min(3, len(dataset['train']))):
24
+ example = dataset['train'][i]
25
+ print(f"\nExample {i+1}:")
26
+ print(f" Keys: {list(example.keys())}")
27
+ print(f" Prompt: {example.get('prompt', 'N/A')[:100]}...")
28
+ print(f" Accepted completion: {example.get('accepted_completion', 'N/A')[:100]}...")
29
+ print(f" Bad entry: {example.get('bad_entry', 'N/A')}")
30
+
31
+ # Test filtering bad entries
32
+ print(f"\nFiltering bad entries...")
33
+ original_size = len(dataset['train'])
34
+ filtered_dataset = dataset['train'].filter(lambda x: not x.get('bad_entry', False))
35
+ filtered_size = len(filtered_dataset)
36
+ print(f" Original size: {original_size}")
37
+ print(f" Filtered size: {filtered_size}")
38
+ print(f" Removed: {original_size - filtered_size} bad entries")
39
+
40
+ # Test conversion to training format
41
+ print(f"\nTesting conversion to training format...")
42
+ train_data = []
43
+ for i, example in enumerate(filtered_dataset):
44
+ if i >= 5: # Just test first 5 examples
45
+ break
46
+
47
+ if 'prompt' in example and 'accepted_completion' in example:
48
+ train_data.append({
49
+ 'prompt': example['prompt'],
50
+ 'completion': example['accepted_completion']
51
+ })
52
+
53
+ print(f" Converted {len(train_data)} examples to training format")
54
+
55
+ # Save a small sample
56
+ os.makedirs('test_dataset', exist_ok=True)
57
+ with open('test_dataset/train.json', 'w') as f:
58
+ json.dump(train_data, f, indent=2)
59
+
60
+ print(f"✅ Test completed successfully!")
61
+ print(f" Sample saved to: test_dataset/train.json")
62
+
63
+ return True
64
+
65
+ except Exception as e:
66
+ print(f"❌ Dataset loading failed: {e}")
67
+ return False
68
+
69
+ if __name__ == "__main__":
70
+ success = test_dataset_loading()
71
+ sys.exit(0 if success else 1)