videoloc/seamless-crossattention
Model Description
This is a SeamlessCrossAttention model that processes audio and text inputs with cross-modal attention mechanisms to predict Time To Edit (TTE) for subtitle segments. Given an audio segment and its corresponding subtitle text, the model predicts how much time (in seconds) is required to edit or refine that subtitle, leveraging cross-attention between the audio and text modalities.
The model extends the SeamlessM4T architecture with bidirectional cross-attention layers that allow audio and text representations to attend to each other, creating rich cross-modal embeddings that capture temporal and semantic relationships across 5 languages: English, French, Spanish, Italian, and German.
Key Features
- Cross-Modal Attention: Bidirectional attention between audio and text representations
- Advanced Architecture: Audio-to-text and text-to-audio attention mechanisms
- Scalar Mixing: Learnable combination of global and attended embeddings
- Embedding Regularization: Optional L2 regularization for embedding stability
- Multimodal Processing: Simultaneously processes audio (16kHz) and text inputs
- Frozen Encoders: Uses pre-trained SeamlessM4T encoders (frozen for stability)
- TTE Prediction: Predicts editing time required for subtitle segments
- Direct Output: Raw time values in seconds for immediate use
Model Architecture
The model implements sophisticated cross-modal attention mechanisms:
Audio Processing:
- SeamlessM4T speech encoder (frozen) processes raw audio input
- Audio projection layer maps speech encoder output to 1024 dimensions
- Layer normalization for stability
Text Processing:
- SeamlessM4T text encoder (frozen) processes tokenized text input
- Text projection layer maps text encoder output to 1024 dimensions
- Layer normalization for stability
Cross-Modal Attention:
- Audio-to-Text Attention: Each audio token attends to all text tokens
- Text-to-Audio Attention: Each text token attends to all audio tokens
- Multi-head attention (8 heads) with dropout for regularization
- Bidirectional information flow between modalities
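The sketch below illustrates this stage in PyTorch, assuming standard `nn.Linear`, `nn.LayerNorm`, and `nn.MultiheadAttention` building blocks; the class name, dropout rate, and encoder output dimensions are illustrative, not taken from the repo's implementation.
```python
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Sketch of projection + bidirectional cross-attention (not the repo's exact code)."""

    def __init__(self, audio_dim=1024, text_dim=1024, hidden=1024, heads=8, dropout=0.1):
        super().__init__()
        # Project each frozen encoder's output into a shared 1024-d space, then normalize
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, hidden), nn.LayerNorm(hidden))
        self.text_proj = nn.Sequential(nn.Linear(text_dim, hidden), nn.LayerNorm(hidden))
        # One 8-head attention module per direction
        self.audio_to_text = nn.MultiheadAttention(hidden, heads, dropout=dropout, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(hidden, heads, dropout=dropout, batch_first=True)

    def forward(self, audio_states, text_states):
        a = self.audio_proj(audio_states)  # (B, T_audio, 1024)
        t = self.text_proj(text_states)    # (B, T_text, 1024)
        # Each audio token attends to all text tokens, and vice versa
        a_attends_t, _ = self.audio_to_text(query=a, key=t, value=t)
        t_attends_a, _ = self.text_to_audio(query=t, key=a, value=a)
        return a, t, a_attends_t, t_attends_a
```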
Feature Fusion:
- Global pooling of original audio and text embeddings
- Global pooling of cross-attended embeddings
- Scalar mixing layer combines all four embeddings with learnable weights
- Final embedding captures both global and cross-modal patterns
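A minimal sketch of the fusion step, assuming mean pooling and softmax-normalized mixture weights (the repo's exact pooling and weighting scheme may differ):
```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Sketch: learnable weighted sum of the four pooled embeddings."""

    def __init__(self, num_embeddings=4):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_embeddings))

    def forward(self, audio_global, text_global, audio_attended, text_attended):
        # Mean-pool each (B, T, 1024) sequence to (B, 1024), then stack to (4, B, 1024)
        pooled = torch.stack(
            [e.mean(dim=1) for e in (audio_global, text_global, audio_attended, text_attended)],
            dim=0,
        )
        w = torch.softmax(self.weights, dim=0)        # 4 learnable mixture weights
        return (w.view(-1, 1, 1) * pooled).sum(dim=0)  # (B, 1024) fused embedding
```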
Regression Head:
- Multi-layer perceptron: 1024 → 512 → 256 → 1
- ReLU activations and dropout for regularization
- Single output for TTE prediction (regression, in seconds)
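As a sketch, the head can be expressed as a plain `nn.Sequential` MLP matching the 1024 → 512 → 256 → 1 layout above; the dropout rate is an assumption:
```python
import torch.nn as nn

regression_head = nn.Sequential(
    nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(256, 1),  # single raw TTE value in seconds
)
```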
Optional Regularization:
- L2 regularization on embedding norms for training stability
- Configurable regularization strength
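A hedged sketch of how such a penalty is typically folded into the training loss; `reg_strength` stands in for the configurable coefficient and is not necessarily the repo's parameter name:
```python
import torch

def loss_with_embedding_reg(task_loss: torch.Tensor,
                            fused_embedding: torch.Tensor,
                            reg_strength: float = 0.0) -> torch.Tensor:
    # L2 penalty on the fused embedding norm; reg_strength = 0.0 disables it
    reg = fused_embedding.norm(p=2, dim=-1).mean()
    return task_loss + reg_strength * reg
```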
Quick Start
Installation
```bash
pip install transformers torch torchaudio huggingface_hub
```
Basic Usage
```python
from transformers import AutoModel, AutoConfig
from huggingface_hub import hf_hub_download
import torch
import numpy as np
import importlib.util

# Load model - custom architecture requires importing the model class
model_files = hf_hub_download(
    repo_id="videoloc/seamless-crossattention",
    filename="modeling_seamless_crossattention.py",
)
spec = importlib.util.spec_from_file_location("modeling_seamless_crossattention", model_files)
modeling_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(modeling_module)

# Now load the model using the custom class
config = modeling_module.SeamlessCrossAttentionConfig.from_pretrained("videoloc/seamless-crossattention")
model = modeling_module.HFSeamlessCrossAttention.from_pretrained("videoloc/seamless-crossattention")

# Load the data collator (included in this repo)
collator_file = hf_hub_download(repo_id="videoloc/seamless-crossattention", filename="data_collator.py")
spec = importlib.util.spec_from_file_location("data_collator", collator_file)
collator_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(collator_module)

# Initialize data collator
data_collator = collator_module.DataCollatorSimpleSeamless(
    processor="facebook/hf-seamless-m4t-medium",
    max_audio_length_sec=8.0,
    max_text_length=256,
)

# Prepare your data
your_data = [
    {
        'raw_audio': np.random.randn(16000 * 5),  # 5 seconds at 16kHz
        'raw_text': "Your subtitle text here",
        # Note: the cross-attention model doesn't require translation features
    }
]

# Process and run inference
batch = data_collator(your_data)
model.eval()
with torch.no_grad():
    outputs = model(**batch)
    tte_prediction = outputs.logits.item()

print(f"Predicted Time To Edit (TTE): {tte_prediction:.2f} seconds")
```
Model Details
- Base Model: SeamlessM4T (facebook/hf-seamless-m4t-medium)
- Audio Encoder: Frozen SeamlessM4T speech encoder
- Text Encoder: Frozen SeamlessM4T text encoder
- Hidden Size: 1024
- Attention Heads: 8 (configurable)
- Cross-Attention: Bidirectional (audio↔text)
- Scalar Mix: 4 embeddings (audio global, text global, audio→text, text→audio)
- Audio Input: 16kHz
- Output: Single regression value (TTE in seconds)
- Task: Subtitle editing time prediction
Data Format
Your input data should be a list of dictionaries with:
- `raw_audio`: NumPy array of audio samples (16kHz sampling rate)
- `raw_text`: String of subtitle text
- `labels`: Target TTE value in seconds (optional, for training)
Example:
```python
data = [
    {
        'raw_audio': audio_samples,  # shape: (num_samples,) at 16kHz
        'raw_text': "Subtitle text content",
        'labels': 2.5,  # optional TTE target value in seconds
    }
]
```
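To build this structure from an audio file on disk, a typical preparation step (the file name below is a placeholder) loads with torchaudio, resamples to 16kHz, and downmixes to mono:
```python
import torchaudio

# Placeholder path; any format torchaudio can decode works
waveform, sr = torchaudio.load("segment.wav")  # (channels, num_samples)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)

data = [{
    'raw_audio': waveform.mean(dim=0).numpy(),  # downmix to a mono NumPy array
    'raw_text': "Subtitle text content",
}]
```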
Performance Metrics
- Best Eval RMSE: 33.34
Training Details
- Base Model: facebook/hf-seamless-m4t-medium
- Model Type: seamless_cross_attention
- Epochs: 10
- Batch Size (Train): 32
- Batch Size (Eval): 64
- Learning Rate: 1.2e-4
- LR Scheduler: cosine_with_restarts
- Warmup Ratio: 0.05
- Weight Decay: 0.001
- Optimizer: AdamW (torch)
- Max Grad Norm: 1.0
- FP16: True
- Early Stopping Patience: 5
- Audio Max Length: 8.0 seconds
- Text Max Length: 256 tokens
- Sample Rate: 16kHz
- Cross-Attention: 8-head multi-head attention
- Scalar Mixing: 4 embedding types
- Embedding Regularization: Optional L2
- Normalization: None (raw values)
- Dataset Split: 80/20 train/test
- Random Seed: 42
- Metric: RMSE (lower is better)
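For reference, these hyperparameters map onto Hugging Face `TrainingArguments` roughly as follows; the original training script is not included in the repo, so treat this as an approximation rather than the exact configuration:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./seamless-crossattention",
    num_train_epochs=10,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=1.2e-4,
    lr_scheduler_type="cosine_with_restarts",
    warmup_ratio=0.05,
    weight_decay=0.001,
    optim="adamw_torch",
    max_grad_norm=1.0,
    fp16=True,
    seed=42,
)
```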
Training Configuration
The model was trained with the following specifications:
- Dataset: Multimodal audio-subtitle pairs with TTE annotations (5 languages: EN, FR, ES, IT, DE)
- Train/Test Split: 80/20 with random seed 42
- Audio Processing: 16kHz sampling, max 8.0 seconds, no offset
- Text Processing: Max 256 tokens
- Cross-Attention: 8-head multi-head attention with dropout
- Scalar Mixing: Learnable combination of 4 embedding types
- Normalization: None (raw TTE values in seconds)
- Caching: Audio segments cached and compressed for efficiency
Usage Notes
- This is the advanced cross-attention variant of the model family
- For simpler models, see seamless-basic, seamless-translation, or seamless-langpairs
- Model expects 16kHz audio input (automatically resampled by data collator)
- Cross-attention captures complex temporal and semantic relationships
- No feature normalization applied - outputs raw TTE predictions in seconds
- Optimized for detailed subtitle editing time estimation tasks
Architecture Advantages
- Rich Cross-Modal Interactions: Audio and text modalities directly attend to each other
- Temporal Alignment: Cross-attention naturally captures temporal relationships
- Semantic Understanding: Text-to-audio attention helps model understand content meaning
- Flexible Combination: Scalar mixing allows model to weight different embedding types
- Regularization Options: Optional embedding regularization for training stability
Limitations
- Higher computational complexity than basic models due to attention mechanisms
- Requires more training data to fully leverage cross-attention capabilities
- Designed for TTE prediction, not general audio-text matching
- Performance may vary on out-of-domain content or different editing workflows
- Requires specific data preprocessing (use included data collator)
Related Models
- seamless-basic: Basic audio+text model without attention mechanisms
- seamless-translation: Includes translation awareness but no cross-attention
- seamless-langpairs: Includes language pair embeddings but no cross-attention