Enhanced Model Card Metadata System

Overview

The enhanced model card system now emits a YAML metadata block at the top of each model card, following the Hugging Face Model Cards specification. The Hugging Face Hub parses this block, so the metadata is available for model discovery, filtering, and the evaluation-results display.

Metadata Structure

Core Metadata Fields

The model card template now includes the following metadata fields:

---
language:
- en
- fr
license: apache-2.0
library_name: transformers
tags:
- smollm3
- fine-tuned
- causal-lm
- text-generation
- quantized
- dataset:OpenHermes-FR
- config:H100 Lightweight
pipeline_tag: text-generation
base_model: HuggingFaceTB/SmolLM3-3B
datasets:
- OpenHermes-FR
---
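As a rough illustration of how a block like the one above could be produced from Python data, here is a minimal sketch. The helper `render_front_matter` is hypothetical and not part of SmolFactory; it handles only plain strings and lists of strings and does no YAML quoting or escaping, so a real implementation would more likely rely on a YAML library.

```python
from typing import Mapping, Sequence, Union

Value = Union[str, Sequence[str]]


def render_front_matter(metadata: Mapping[str, Value]) -> str:
    """Render a flat mapping of strings and string lists as YAML front matter.

    Hypothetical helper: no quoting or escaping is performed.
    """
    lines = ["---"]
    for key, value in metadata.items():
        if isinstance(value, str):
            lines.append(f"{key}: {value}")
        else:
            # Lists become one "- item" line per element.
            lines.append(f"{key}:")
            lines.extend(f"- {item}" for item in value)
    lines.append("---")
    return "\n".join(lines)


core_metadata = {
    "language": ["en", "fr"],
    "license": "apache-2.0",
    "library_name": "transformers",
    "tags": ["smollm3", "fine-tuned", "causal-lm", "text-generation"],
    "pipeline_tag": "text-generation",
    "base_model": "HuggingFaceTB/SmolLM3-3B",
    "datasets": ["OpenHermes-FR"],
}

front_matter = render_front_matter(core_metadata)
print(front_matter)
```

Because dictionaries preserve insertion order, the fields are written in the order they are declared.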

Conditional Metadata

The system supports conditional metadata based on model configuration:

Quantized Models

When quantized models are available, additional metadata is included:

quantization_types:
- int8_weight_only
- int4_weight_only

Model Index (Evaluation Results)

The system automatically generates structured evaluation results:

model-index:
- name: Model Name
  results:
  - task:
      type: text-generation
    dataset:
      name: OpenHermes-FR
      type: OpenHermes-FR
    metrics:
    - name: Training Loss
      type: loss
      value: "2.1"
    - name: Validation Loss
      type: loss
      value: "2.3"
    - name: Perplexity
      type: perplexity
      value: "9.8"

For quantized models, additional entries are appended to the `model-index` list:

- name: Model Name (int8 quantized)
  results:
  - task:
      type: text-generation
    dataset:
      name: OpenHermes-FR
      type: OpenHermes-FR
    metrics:
    - name: Memory Reduction
      type: memory_efficiency
      value: "~50%"
    - name: Inference Speed
      type: speed
      value: "Faster"
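The nested structure above can be assembled as plain Python data before it is serialized. The sketch below is one possible way to do it; `METRIC_INFO`, `build_entry`, and `reported_metrics` are hypothetical names, and skipping metrics whose value is `"N/A"` is an assumption about how missing values would be handled.

```python
from typing import Dict, List, Tuple

# Metric key -> (display name, metric type), mirroring the YAML above.
METRIC_INFO: Dict[str, Tuple[str, str]] = {
    "training_loss": ("Training Loss", "loss"),
    "validation_loss": ("Validation Loss", "loss"),
    "perplexity": ("Perplexity", "perplexity"),
}


def build_entry(name: str, dataset_name: str, metrics: List[dict]) -> dict:
    """Build one model-index entry for a text-generation result."""
    return {
        "name": name,
        "results": [{
            "task": {"type": "text-generation"},
            "dataset": {"name": dataset_name, "type": dataset_name},
            "metrics": metrics,
        }],
    }


def reported_metrics(results: Dict[str, str]) -> List[dict]:
    """Keep only metrics that were reported (assumption: "N/A" means missing)."""
    return [
        {"name": label, "type": kind, "value": results[key]}
        for key, (label, kind) in METRIC_INFO.items()
        if results.get(key, "N/A") != "N/A"
    ]


results = {"training_loss": "2.1", "validation_loss": "2.3", "perplexity": "9.8"}
model_index = [build_entry("Model Name", "OpenHermes-FR", reported_metrics(results))]

# Quantized variants get their own entry with their own metrics.
model_index.append(build_entry("Model Name (int8 quantized)", "OpenHermes-FR", [
    {"name": "Memory Reduction", "type": "memory_efficiency", "value": "~50%"},
    {"name": "Inference Speed", "type": "speed", "value": "Faster"},
]))
```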

Metadata Fields Explained

Required Fields

| Field | Description | Example |
|-------|-------------|---------|
| `language` | Supported languages | `["en", "fr"]` |
| `license` | Model license | `"apache-2.0"` |
| `library_name` | Primary library | `"transformers"` |
| `tags` | Model tags for discovery | `["smollm3", "fine-tuned"]` |
| `pipeline_tag` | Task type | `"text-generation"` |
| `base_model` | Original model | `"HuggingFaceTB/SmolLM3-3B"` |

Optional Fields

| Field | Description | Example |
|-------|-------------|---------|
| `datasets` | Training datasets | `["OpenHermes-FR"]` |
| `author` | Model author | `"Your Name"` |
| `experiment_name` | Experiment tracking | `"smollm3-experiment"` |
| `trackio_url` | Monitoring URL | `"https://trackio.space/exp"` |
| `hardware` | Training hardware | `"GPU (H100/A100)"` |
| `training_config` | Configuration type | `"H100 Lightweight"` |
| `trainer_type` | Trainer used | `"SFTTrainer"` |
| `batch_size` | Training batch size | `"8"` |
| `learning_rate` | Learning rate | `"5e-6"` |
| `max_epochs` | Number of epochs | `"3"` |
| `max_seq_length` | Sequence length | `"2048"` |
| `gradient_accumulation_steps` | Gradient accumulation | `"16"` |

Training Results

| Field | Description | Example |
|-------|-------------|---------|
| `training_loss` | Final training loss | `"2.1"` |
| `validation_loss` | Final validation loss | `"2.3"` |
| `perplexity` | Model perplexity | `"9.8"` |

Benefits of Enhanced Metadata

1. Improved Discovery

Filtering: Users can filter models by dataset, configuration, or hardware
Search: Enhanced search capabilities on the Hugging Face Hub
Tags: Automatic tag generation for better categorization

2. Better Model Cards

Structured Data: Evaluation results are displayed in widgets
Consistent Format: Follows Hugging Face standards
Rich Information: Comprehensive model information

3. Integration Benefits

Papers with Code: Model index data can be indexed in leaderboards
API Compatibility: Better integration with Hugging Face APIs
Automated Tools: Support for automated model analysis

Usage Examples

Basic Model Card Generation

python scripts/model_tonic/generate_model_card.py \
    --repo-name "username/model-name" \
    --model-name "My Fine-tuned Model" \
    --dataset-name "OpenHermes-FR" \
    --training-config "H100 Lightweight" \
    --batch-size "8" \
    --learning-rate "5e-6" \
    --max-epochs "3" \
    --training-loss "2.1" \
    --validation-loss "2.3" \
    --perplexity "9.8" \
    --output "README.md"

With Quantized Models

python scripts/model_tonic/generate_model_card.py \
    --repo-name "username/model-name" \
    --model-name "My Fine-tuned Model" \
    --dataset-name "OpenHermes-FR" \
    --training-config "H100 Lightweight" \
    --batch-size "8" \
    --learning-rate "5e-6" \
    --max-epochs "3" \
    --training-loss "2.1" \
    --validation-loss "2.3" \
    --perplexity "9.8" \
    --quantized-models \
    --output "README.md"
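To show how the flags used in the commands above could map onto Python values, here is a hypothetical `argparse` sketch. It is not the parser from `generate_model_card.py`: which flags are required and every default value are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(description="Generate a model card (sketch)")
    parser.add_argument("--repo-name", required=True)   # assumed required
    parser.add_argument("--model-name", required=True)  # assumed required
    parser.add_argument("--dataset-name", default="OpenHermes-FR")
    parser.add_argument("--training-config", default="Custom Configuration")
    parser.add_argument("--batch-size", default="8")
    parser.add_argument("--learning-rate", default="5e-6")
    parser.add_argument("--max-epochs", default="3")
    parser.add_argument("--training-loss", default="N/A")
    parser.add_argument("--validation-loss", default="N/A")
    parser.add_argument("--perplexity", default="N/A")
    # A bare flag: present means True, absent means False.
    parser.add_argument("--quantized-models", action="store_true")
    parser.add_argument("--output", default="README.md")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args([
    "--repo-name", "username/model-name",
    "--model-name", "My Fine-tuned Model",
    "--training-loss", "2.1",
    "--quantized-models",
])
print(args.repo_name, args.training_loss, args.quantized_models)
```

`argparse` converts dashes in flag names to underscores, so `--quantized-models` is read as `args.quantized_models`.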

Template Variables

The enhanced template supports all the original variables plus new metadata fields:

New Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `training_loss` | Training loss value | `"N/A"` |
| `validation_loss` | Validation loss value | `"N/A"` |
| `perplexity` | Model perplexity | `"N/A"` |
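A minimal sketch of how these defaults could be applied during variable substitution follows. The `{{name}}` placeholder syntax and the `fill_template` helper are assumptions for illustration; the actual template engine used by the project may differ.

```python
import re
from typing import Mapping

# Defaults taken from the table above.
DEFAULTS = {"training_loss": "N/A", "validation_loss": "N/A", "perplexity": "N/A"}

# Assumed placeholder syntax: {{name}}, with optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def fill_template(template: str, variables: Mapping[str, object]) -> str:
    """Substitute placeholders, falling back to DEFAULTS for missing metrics."""
    merged = {**DEFAULTS, **variables}

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in merged:
            # Fail loudly rather than leave an unsubstituted placeholder behind.
            raise KeyError(f"no value for template variable {name!r}")
        return str(merged[name])

    return PLACEHOLDER.sub(substitute, template)


template = "Training loss: {{training_loss}} | Perplexity: {{ perplexity }}"
rendered = fill_template(template, {"training_loss": 2.1})
print(rendered)  # Training loss: 2.1 | Perplexity: N/A
```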

Conditional Metadata

The template automatically includes:

Dataset Information: When dataset_name is provided
Quantization Types: When quantized_models is true
Evaluation Results: When training metrics are available
Hardware Information: When hardware_info is provided
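The conditions listed above amount to adding a key only when its input is present. The sketch below illustrates that for three of the four cases (evaluation results are left out for brevity). `build_conditional_metadata` is a hypothetical function, and the `hardware` output key is taken from the optional fields table rather than from the project code.

```python
from typing import Any, Dict, Optional


def build_conditional_metadata(
    dataset_name: Optional[str] = None,
    quantized_models: bool = False,
    hardware_info: Optional[str] = None,
) -> Dict[str, Any]:
    """Return only the metadata keys whose inputs were actually provided."""
    metadata: Dict[str, Any] = {}
    if dataset_name:
        metadata["datasets"] = [dataset_name]
    if quantized_models:
        # Quantization types as listed in the "Quantized Models" section.
        metadata["quantization_types"] = ["int8_weight_only", "int4_weight_only"]
    if hardware_info:
        metadata["hardware"] = hardware_info
    return metadata


print(build_conditional_metadata(dataset_name="OpenHermes-FR", quantized_models=True))
```

Keys for missing inputs are omitted entirely rather than written with empty values, which keeps the generated YAML free of blank fields.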

Integration with Training Pipeline

Automatic Metadata Generation

The push script automatically extracts metadata from:

Training Configuration: Batch size, learning rate, epochs, etc.
Training Results: Loss values, perplexity, etc.
Model Information: Base model, hardware, etc.
Experiment Tracking: Trackio URLs, experiment names

Example Integration

# In push_to_huggingface.py
variables = {
    "model_name": f"{self.repo_name.split('/')[-1]} - Fine-tuned SmolLM3",
    "repo_name": self.repo_name,
    "base_model": "HuggingFaceTB/SmolLM3-3B",
    "dataset_name": training_config.get('dataset_name', 'OpenHermes-FR'),
    "training_config_type": training_config.get('training_config_type', 'Custom Configuration'),
    "trainer_type": training_config.get('trainer_type', 'SFTTrainer'),
    "batch_size": str(training_config.get('per_device_train_batch_size', 8)),
    "learning_rate": str(training_config.get('learning_rate', '5e-6')),
    "max_epochs": str(training_config.get('num_train_epochs', 3)),
    "hardware_info": self._get_hardware_info(),
    "training_loss": results.get('train_loss', 'N/A'),
    "validation_loss": results.get('eval_loss', 'N/A'),
    "perplexity": results.get('perplexity', 'N/A'),
    "quantized_models": False  # Updated if quantized models are added
}

Validation and Testing

Metadata Validation

The system includes validation for:

Required Fields: Ensures all required metadata is present
Format Validation: Validates YAML syntax and structure
Value Ranges: Checks for reasonable values in numeric fields
Conditional Logic: Verifies conditional metadata is properly included
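Two of these checks, required fields and value ranges, can be sketched as follows. `validate_metadata` is a hypothetical function, not the project's validator; the range rule (finite and non-negative) is an assumed, deliberately loose bound, and YAML syntax validation is not covered because it would need a YAML parser.

```python
import math
from typing import Any, List, Mapping

REQUIRED_FIELDS = (
    "language", "license", "library_name", "tags", "pipeline_tag", "base_model",
)
NUMERIC_FIELDS = ("training_loss", "validation_loss", "perplexity")


def validate_metadata(metadata: Mapping[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems: List[str] = []
    for field in REQUIRED_FIELDS:
        if not metadata.get(field):  # missing, empty string, or empty list
            problems.append(f"missing required field: {field}")
    for field in NUMERIC_FIELDS:
        raw = metadata.get(field, "N/A")
        if raw == "N/A":  # the documented default for unreported metrics
            continue
        try:
            value = float(raw)
        except (TypeError, ValueError):
            problems.append(f"{field} is not numeric: {raw!r}")
            continue
        # Assumed range rule: values must be finite and non-negative.
        if not math.isfinite(value) or value < 0:
            problems.append(f"{field} out of range: {raw!r}")
    return problems


example = {
    "language": ["en", "fr"],
    "license": "apache-2.0",
    "library_name": "transformers",
    "tags": ["smollm3", "fine-tuned"],
    "pipeline_tag": "text-generation",
    "base_model": "HuggingFaceTB/SmolLM3-3B",
    "training_loss": "2.1",
}
print(validate_metadata(example))
```

Returning every problem at once, rather than raising on the first, lets a caller report all issues in a single pass.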

Test Coverage

The test suite verifies:

Basic Metadata: All required fields are present
Conditional Metadata: Quantized model metadata is included when appropriate
Evaluation Results: Model index data is properly structured
Template Processing: Variable substitution works correctly

Best Practices

1. Metadata Completeness

Include all available training information
Provide accurate evaluation metrics
Use consistent naming conventions

2. Conditional Logic

Only include relevant metadata
Use conditional sections appropriately
Provide fallback values for missing data

3. Validation

Test metadata generation with various configurations
Verify YAML syntax is correct
Check that all variables are properly substituted

4. Documentation

Document all available metadata fields
Provide examples for each field type
Include troubleshooting information

Future Enhancements

Planned Features

Additional Metrics: Support for more evaluation metrics
Custom Metadata: User-defined metadata fields
Validation Rules: Configurable validation rules
Auto-Detection: Automatic detection of model features
Integration APIs: Better integration with external tools

Extensibility

The system is designed to be easily extensible:

New Fields: Easy to add new metadata fields
Custom Validators: Support for custom validation logic
Template Extensions: Support for template inheritance
API Integration: Easy integration with external APIs

Conclusion

The enhanced model card metadata system generates standards-compliant metadata from the model configuration and training results. This keeps model cards consistent and complete across repositories and gives the Hugging Face Hub the information it needs for model discovery and usage.