---
datasets:
- openslr/librispeech_asr
- amphion/Emilia-Dataset
- its5Q/bigger-ru-book
- mozilla-foundation/common_voice_12_0
language:
- en
- ru
- uk
base_model:
- Qwen/Qwen2.5-0.5B
---
## Model Performance Overview
Metrics:
- PESQ: Perceptual Evaluation of Speech Quality (higher = better).
- STOI: Short-Time Objective Intelligibility (closer to 1 = better).
- SI-SDR: Scale-Invariant Signal-to-Distortion Ratio (higher = better).
| Model | PESQ@200 | STOI@200 | SI-SDR@200 |
|---|---|---|---|
| Fish-audio-1.5 | 1.20 | 0.16 | 23.0 |
| SALT-tts | 1.11 | 0.16 | 23.58 |
| SALT-tts+asr | 1.09 | 0.18 | 23.09 |
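For reference, here is a minimal sketch of how these three metrics can be computed for one synthesized utterance against its ground-truth recording. It assumes the third-party `pesq` and `pystoi` packages for PESQ and STOI; SI-SDR is implemented directly from its definition. The file paths and the 16 kHz sample rate are illustrative placeholders, not part of our evaluation pipeline.

```python
# Sketch: PESQ, STOI, and SI-SDR for a single utterance.
# Assumes `pip install pesq pystoi soundfile`; paths and sample rate are placeholders.
import numpy as np
import soundfile as sf
from pesq import pesq
from pystoi import stoi

def si_sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Scale-Invariant Signal-to-Distortion Ratio in dB (higher = better).

    Assumes both signals are the same length and sample rate.
    """
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to isolate the target component.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10(np.sum(target**2) / np.sum(noise**2))

ref, sr = sf.read("reference.wav")   # ground-truth speech (placeholder path)
deg, _ = sf.read("generated.wav")    # model output, same length and rate

print("PESQ:  ", pesq(sr, ref, deg, "wb"))           # wideband mode expects 16 kHz audio
print("STOI:  ", stoi(ref, deg, sr, extended=False))  # in [0, 1], closer to 1 = better
print("SI-SDR:", si_sdr(ref, deg))                    # in dB, higher = better
```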
## Our Solution
- Method: extends a pre-trained LLM (Qwen2.5-0.5B) with discrete audio tokens and fine-tunes it jointly on TTS and ASR tasks (see the sketch after this list).
- Training:
  - BigCodec tokenizer (supports Slavic languages) for speech generation.
  - SpeechTokenizer (semantic tokens only) for speech recognition.
  - Training time: 168 H100 GPU hours.
- Advantages: a single language-modeling loss covers both tasks, with minimal training overhead.
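As a rough illustration of the method, the sketch below extends the Qwen2.5-0.5B vocabulary with new audio-token entries and renders TTS and ASR examples as ordinary token sequences, so one cross-entropy LM loss trains both tasks. The token names, the 8192-entry codebook size, and the prompt layout are assumptions chosen for illustration, not our exact training setup; the discrete codes themselves would come from BigCodec (TTS) or SpeechTokenizer (ASR) offline.

```python
# Sketch only: token naming, codebook size, and sequence format are assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# 1. Add one new token per discrete audio code (hypothetical names).
num_audio_codes = 8192  # assumed codebook size
audio_tokens = [f"<audio_{i}>" for i in range(num_audio_codes)]
task_tokens = ["<tts>", "<asr>", "<audio_start>", "<audio_end>"]
tokenizer.add_tokens(audio_tokens + task_tokens)
model.resize_token_embeddings(len(tokenizer))

# 2. Render both tasks as plain token sequences over the shared vocabulary.
def tts_example(text: str, codes: list[int]) -> str:
    audio = "".join(f"<audio_{c}>" for c in codes)
    return f"<tts>{text}<audio_start>{audio}<audio_end>"

def asr_example(codes: list[int], text: str) -> str:
    audio = "".join(f"<audio_{c}>" for c in codes)
    return f"<asr><audio_start>{audio}<audio_end>{text}"

# 3. One standard causal-LM loss over a mixed TTS/ASR batch.
batch = tokenizer([tts_example("hello world", [3, 14, 159]),
                   asr_example([3, 14, 159], "hello world")],
                  return_tensors="pt", padding=True)
# In real training, pad positions in the labels would be masked with -100.
out = model(**batch, labels=batch["input_ids"])
print(out.loss)
```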
## Resources
- Code: GitHub Repo