---
datasets:
- openslr/librispeech_asr
- amphion/Emilia-Dataset
- its5Q/bigger-ru-book
- mozilla-foundation/common_voice_12_0
language:
- en
- ru
- uk
base_model:
- Qwen/Qwen2.5-0.5B
---
#### **Model Performance Overview**
**Metrics**:
- **PESQ**: Perceptual Evaluation of Speech Quality (higher = better).
- **STOI**: Short-Time Objective Intelligibility (closer to 1 = better).
- **SI-SDR**: Scale-Invariant Signal-to-Distortion Ratio (higher = better).
| Model | PESQ@200 | STOI@200 | SI-SDR@200 |
|---------------------------|----------------|---------------|-------------------|
| Fish-audio-1.5             | 1.20           | 0.16          | 23.0              |
| [**SALT-tts**](https://huggingface.co/Vikhrmodels/salt-qwen2.5-0.5b-tts) | 1.11 | 0.16 | 23.58 |
| [**SALT-tts+asr**](https://huggingface.co/Vikhrmodels/salt-qwen2.5-0.5b-asr-tts) | 1.09 | 0.18 | 23.09 |
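
The sketch below shows how these three metrics can be computed for a single reference/generated pair. It is a minimal illustration, not the evaluation script behind the table: it assumes the `pesq` and `pystoi` packages, 16 kHz mono audio, and hypothetical file names.

```python
# Minimal metric sketch (not the authors' evaluation script):
# assumes `pip install pesq pystoi soundfile` and 16 kHz mono WAV files.
import numpy as np
import soundfile as sf
from pesq import pesq        # ITU-T P.862 wide-band PESQ
from pystoi import stoi      # Short-Time Objective Intelligibility

def si_sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-Invariant Signal-to-Distortion Ratio in dB."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to get the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))

ref, sr = sf.read("reference.wav")      # hypothetical file names
gen, _ = sf.read("generated.wav")
n = min(len(ref), len(gen))             # align lengths before scoring
ref, gen = ref[:n], gen[:n]

print("PESQ:  ", pesq(sr, ref, gen, "wb"))            # higher = better
print("STOI:  ", stoi(ref, gen, sr, extended=False))  # closer to 1 = better
print("SI-SDR:", si_sdr(ref, gen))                    # higher = better
```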
---
#### **Our Solution**
- **Method**: Extends a pre-trained LLM with audio tokens and fine-tunes on **TTS** and **ASR** tasks.
- **Training**:
- BigCodec tokenizer (supports Slavic languages) for speech generation.
- SpeechTokenizer (semantic tokens only) for speech recognition.
- Training time: **168 H100 GPU hours**.
- **Advantages**: Unified LM loss for dual tasks, minimal training overhead.
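
A minimal sketch of the vocabulary-extension step described above, using `transformers`. The audio-token naming, the codebook size, and the special markers are illustrative assumptions, not the released configuration.

```python
# Hypothetical sketch of extending the base LLM with audio tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Add one new token per acoustic codebook entry (size is illustrative),
# plus markers that delimit the audio span inside a prompt.
codebook_size = 8192
audio_tokens = [f"<|audio_{i}|>" for i in range(codebook_size)]
tokenizer.add_tokens(audio_tokens + ["<|begin_of_audio|>", "<|end_of_audio|>"])

# Grow the embedding and output matrices so the new ids are trainable;
# fine-tuning then applies the ordinary next-token LM loss to mixed
# text + audio-token sequences for both the TTS and ASR directions.
model.resize_token_embeddings(len(tokenizer))
```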
---
#### **Resources**
- Code: [GitHub Repo](https://github.com/VikhrModels/Vikhr4o)
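
If the released checkpoints keep the standard `transformers` layout of the Qwen2.5 base model (an assumption, not verified here), loading looks like the sketch below; prompt formatting and decoding generated audio tokens back to a waveform are handled by the code in the GitHub repo above.

```python
# Hedged loading sketch; audio-token decoding to a waveform is not shown.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Vikhrmodels/salt-qwen2.5-0.5b-tts"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)
```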
---