videoloc/seamless-crossattention
Model Description
This is a SeamlessCrossAttention model that processes audio and text inputs with cross-modal attention mechanisms to predict Time To Edit (TTE) for subtitle segments. Given an audio segment and its corresponding subtitle text, the model predicts how much time (in seconds) is required to edit or refine that subtitle, leveraging cross-attention between the audio and text modalities.
The model extends the SeamlessM4T architecture with bidirectional cross-attention layers that allow audio and text representations to attend to each other, creating rich cross-modal embeddings that capture temporal and semantic relationships across 5 languages: English, French, Spanish, Italian, and German.
Key Features
- Cross-Modal Attention: Bidirectional attention between audio and text representations
- Advanced Architecture: Audio-to-text and text-to-audio attention mechanisms
- Scalar Mixing: Learnable combination of global and attended embeddings
- Embedding Regularization: Optional L2 regularization for embedding stability
- Multimodal Processing: Simultaneously processes audio (16kHz) and text inputs
- Frozen Encoders: Uses pre-trained SeamlessM4T encoders (frozen for stability)
- TTE Prediction: Predicts editing time required for subtitle segments
- Direct Output: Raw time values in seconds for immediate use
Model Architecture
The model implements sophisticated cross-modal attention mechanisms:
Audio Processing:
- SeamlessM4T speech encoder (frozen) processes raw audio input
- Audio projection layer maps speech encoder output to 1024 dimensions
- Layer normalization for stability
Text Processing:
- SeamlessM4T text encoder (frozen) processes tokenized text input
- Text projection layer maps text encoder output to 1024 dimensions
- Layer normalization for stability
Cross-Modal Attention:
- Audio-to-Text Attention: Each audio token attends to all text tokens
- Text-to-Audio Attention: Each text token attends to all audio tokens
- Multi-head attention (8 heads) with dropout for regularization
- Bidirectional information flow between modalities
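The sketch below illustrates this stage in PyTorch, assuming standard `nn.Linear`, `nn.LayerNorm`, and `nn.MultiheadAttention` building blocks; the class name, dropout rate, and encoder output dimensions are illustrative, not taken from the repo's implementation.
```python
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Sketch of projection + bidirectional cross-attention (not the repo's exact code)."""

    def __init__(self, audio_dim=1024, text_dim=1024, hidden=1024, heads=8, dropout=0.1):
        super().__init__()
        # Project each frozen encoder's output into a shared 1024-d space, then normalize
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, hidden), nn.LayerNorm(hidden))
        self.text_proj = nn.Sequential(nn.Linear(text_dim, hidden), nn.LayerNorm(hidden))
        # One 8-head attention module per direction
        self.audio_to_text = nn.MultiheadAttention(hidden, heads, dropout=dropout, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(hidden, heads, dropout=dropout, batch_first=True)

    def forward(self, audio_states, text_states):
        a = self.audio_proj(audio_states)  # (B, T_audio, 1024)
        t = self.text_proj(text_states)    # (B, T_text, 1024)
        # Each audio token attends to all text tokens, and vice versa
        a_attends_t, _ = self.audio_to_text(query=a, key=t, value=t)
        t_attends_a, _ = self.text_to_audio(query=t, key=a, value=a)
        return a, t, a_attends_t, t_attends_a
```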
Feature Fusion:
- Global pooling of original audio and text embeddings
- Global pooling of cross-attended embeddings
- Scalar mixing layer combines all four embeddings with learnable weights
- Final embedding captures both global and cross-modal patterns
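A minimal sketch of the fusion step, assuming mean pooling and softmax-normalized mixture weights (the repo's exact pooling and weighting scheme may differ):
```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Sketch: learnable weighted sum of the four pooled embeddings."""

    def __init__(self, num_embeddings=4):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_embeddings))

    def forward(self, audio_global, text_global, audio_attended, text_attended):
        # Mean-pool each (B, T, 1024) sequence to (B, 1024), then stack to (4, B, 1024)
        pooled = torch.stack(
            [e.mean(dim=1) for e in (audio_global, text_global, audio_attended, text_attended)],
            dim=0,
        )
        w = torch.softmax(self.weights, dim=0)        # 4 learnable mixture weights
        return (w.view(-1, 1, 1) * pooled).sum(dim=0)  # (B, 1024) fused embedding
```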
Regression Head:
- Multi-layer perceptron: 1024 → 512 → 256 → 1
- ReLU activations and dropout for regularization
- Single output for TTE prediction (regression, in seconds)
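As a sketch, the head can be expressed as a plain `nn.Sequential` MLP matching the 1024 → 512 → 256 → 1 layout above; the dropout rate is an assumption:
```python
import torch.nn as nn

regression_head = nn.Sequential(
    nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.1),
    nn.Linear(256, 1),  # single raw TTE value in seconds
)
```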
Optional Regularization:
- L2 regularization on embedding norms for training stability
- Configurable regularization strength
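A hedged sketch of how such a penalty is typically folded into the training loss; `reg_strength` stands in for the configurable coefficient and is not necessarily the repo's parameter name:
```python
import torch

def loss_with_embedding_reg(task_loss: torch.Tensor,
                            fused_embedding: torch.Tensor,
                            reg_strength: float = 0.0) -> torch.Tensor:
    # L2 penalty on the fused embedding norm; reg_strength = 0.0 disables it
    reg = fused_embedding.norm(p=2, dim=-1).mean()
    return task_loss + reg_strength * reg
```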
Quick Start
Installation
```bash
pip install transformers torch torchaudio huggingface_hub
```
Basic Usage
```python
from transformers import AutoModel, AutoConfig
from huggingface_hub import hf_hub_download
import torch
import numpy as np
import importlib.util

# Load model - custom architecture requires importing the model class
model_files = hf_hub_download(
    repo_id="videoloc/seamless-crossattention",
    filename="modeling_seamless_crossattention.py",
)
spec = importlib.util.spec_from_file_location("modeling_seamless_crossattention", model_files)
modeling_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(modeling_module)

# Now load the model using the custom class
config = modeling_module.SeamlessCrossAttentionConfig.from_pretrained("videoloc/seamless-crossattention")
model = modeling_module.HFSeamlessCrossAttention.from_pretrained("videoloc/seamless-crossattention")

# Load the data collator (included in this repo)
collator_file = hf_hub_download(repo_id="videoloc/seamless-crossattention", filename="data_collator.py")
spec = importlib.util.spec_from_file_location("data_collator", collator_file)
collator_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(collator_module)

# Initialize data collator
data_collator = collator_module.DataCollatorSimpleSeamless(
    processor="facebook/hf-seamless-m4t-medium",
    max_audio_length_sec=8.0,
    max_text_length=256,
)

# Prepare your data
your_data = [
    {
        'raw_audio': np.random.randn(16000 * 5),  # 5 seconds at 16kHz
        'raw_text': "Your subtitle text here",
        # Note: the cross-attention model doesn't require translation features
    }
]

# Process and run inference
batch = data_collator(your_data)
model.eval()
with torch.no_grad():
    outputs = model(**batch)
    tte_prediction = outputs.logits.item()

print(f"Predicted Time To Edit (TTE): {tte_prediction:.2f} seconds")
```
Model Details
- Base Model: SeamlessM4T (facebook/hf-seamless-m4t-medium)
- Audio Encoder: Frozen SeamlessM4T speech encoder
- Text Encoder: Frozen SeamlessM4T text encoder
- Hidden Size: 1024
- Attention Heads: 8 (configurable)
- Cross-Attention: Bidirectional (audio↔text)
- Scalar Mix: 4 embeddings (audio global, text global, audio→text, text→audio)
- Audio Input: 16kHz
- Output: Single regression value (TTE in seconds)
- Task: Subtitle editing time prediction
Data Format
Your input data should be a list of dictionaries with:
- `raw_audio`: NumPy array of audio samples (16kHz sampling rate)
- `raw_text`: String of subtitle text
- `labels`: Target TTE value in seconds (optional, for training)
Example:
```python
data = [
    {
        'raw_audio': audio_samples,  # shape: (num_samples,) at 16kHz
        'raw_text': "Subtitle text content",
        'labels': 2.5,  # optional TTE target value in seconds
    }
]
```
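To build this structure from an audio file on disk, a typical preparation step (the file name below is a placeholder) loads with torchaudio, resamples to 16kHz, and downmixes to mono:
```python
import torchaudio

# Placeholder path; any format torchaudio can decode works
waveform, sr = torchaudio.load("segment.wav")  # (channels, num_samples)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)

data = [{
    'raw_audio': waveform.mean(dim=0).numpy(),  # downmix to a mono NumPy array
    'raw_text': "Subtitle text content",
}]
```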
Performance Metrics
- Best Eval RMSE: 33.34
Training Details
- Base Model: facebook/hf-seamless-m4t-medium
- Model Type: seamless_cross_attention
- Epochs: 10
- Batch Size (Train): 32
- Batch Size (Eval): 64
- Learning Rate: 1.2e-4
- LR Scheduler: cosine_with_restarts
- Warmup Ratio: 0.05
- Weight Decay: 0.001
- Optimizer: AdamW (torch)
- Max Grad Norm: 1.0
- FP16: True
- Early Stopping Patience: 5
- Audio Max Length: 8.0 seconds
- Text Max Length: 256 tokens
- Sample Rate: 16kHz
- Cross-Attention: 8-head multi-head attention
- Scalar Mixing: 4 embedding types
- Embedding Regularization: Optional L2
- Normalization: None (raw values)
- Dataset Split: 80/20 train/test
- Random Seed: 42
- Metric: RMSE (lower is better)
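For reference, these hyperparameters map onto Hugging Face `TrainingArguments` roughly as follows; the original training script is not included in the repo, so treat this as an approximation rather than the exact configuration:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./seamless-crossattention",
    num_train_epochs=10,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=1.2e-4,
    lr_scheduler_type="cosine_with_restarts",
    warmup_ratio=0.05,
    weight_decay=0.001,
    optim="adamw_torch",
    max_grad_norm=1.0,
    fp16=True,
    seed=42,
)
```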
Training Configuration
The model was trained with the following specifications:
- Dataset: Multimodal audio-subtitle pairs with TTE annotations (5 languages: EN, FR, ES, IT, DE)
- Train/Test Split: 80/20 with random seed 42
- Audio Processing: 16kHz sampling, max 8.0 seconds, no offset
- Text Processing: Max 256 tokens
- Cross-Attention: 8-head multi-head attention with dropout
- Scalar Mixing: Learnable combination of 4 embedding types
- Normalization: None (raw TTE values in seconds)
- Caching: Audio segments cached and compressed for efficiency
Usage Notes
- This is the advanced cross-attention variant of the model family
- For simpler models, see seamless-basic, seamless-translation, or seamless-langpairs
- Model expects 16kHz audio input (automatically resampled by data collator)
- Cross-attention captures complex temporal and semantic relationships
- No feature normalization applied - outputs raw TTE predictions in seconds
- Optimized for detailed subtitle editing time estimation tasks
Architecture Advantages
- Rich Cross-Modal Interactions: Audio and text modalities directly attend to each other
- Temporal Alignment: Cross-attention naturally captures temporal relationships
- Semantic Understanding: Text-to-audio attention helps model understand content meaning
- Flexible Combination: Scalar mixing allows model to weight different embedding types
- Regularization Options: Optional embedding regularization for training stability
Limitations
- Higher computational complexity than basic models due to attention mechanisms
- Requires more training data to fully leverage cross-attention capabilities
- Designed for TTE prediction, not general audio-text matching
- Performance may vary on out-of-domain content or different editing workflows
- Requires specific data preprocessing (use included data collator)
Related Models
- seamless-basic: Basic audio+text model without attention mechanisms
- seamless-translation: Includes translation awareness but no cross-attention
- seamless-langpairs: Includes language pair embeddings but no cross-attention