---
datasets:
- openslr/librispeech_asr
- amphion/Emilia-Dataset
- its5Q/bigger-ru-book
- mozilla-foundation/common_voice_12_0
language:
- en
- ru
- uk
base_model:
- Qwen/Qwen2.5-0.5B
---

#### **Model Performance Overview**

**Metrics**:

- **PESQ**: Perceptual Evaluation of Speech Quality (higher = better).
- **STOI**: Short-Time Objective Intelligibility (closer to 1 = better).
- **SI-SDR**: Scale-Invariant Signal-to-Distortion Ratio, in dB (higher = better).

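As a point of reference for the third metric, here is a minimal NumPy sketch of the standard SI-SDR computation; the function below is illustrative and is not taken from the SALT evaluation code:

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-Invariant Signal-to-Distortion Ratio in dB (higher = better)."""
    # Remove DC offset so the metric is invariant to constant shifts.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference toward the estimate (projection).
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference   # scaled target component
    noise = estimate - target    # residual distortion
    return 10 * np.log10(np.sum(target**2) / np.sum(noise**2))
```
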
| Model | PESQ@200 | STOI@200 | SI-SDR@200 |
|-------|----------|----------|------------|
| Fish-audio-1.5 | 1.20 | 0.16 | 23.0 |
| [**SALT-tts**](https://huggingface.co/Vikhrmodels/salt-qwen2.5-0.5b-tts) | 1.11 | 0.16 | 23.58 |
| [**SALT-tts+asr**](https://huggingface.co/Vikhrmodels/salt-qwen2.5-0.5b-asr-tts) | 1.09 | 0.18 | 23.09 |

---

#### **Our Solution**

- **Method**: Extends a pre-trained LLM with audio tokens and fine-tunes it on **TTS** and **ASR** tasks (see the sketch after this list).
- **Training**:
  - BigCodec tokenizer (supports Slavic languages) for speech generation.
  - SpeechTokenizer (semantic tokens only) for speech recognition.
  - Training time: **168 H100 GPU hours**.
- **Advantages**: A single unified LM loss covers both tasks, with minimal training overhead.
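
The core mechanism in code: the tokenizer vocabulary is extended with discrete audio tokens and the embedding matrix is resized, after which training uses the ordinary next-token LM loss. Below is a minimal sketch with Hugging Face Transformers; the token naming scheme and codebook size are illustrative assumptions, not the exact SALT configuration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

# Hypothetical scheme: one special token per codec codebook entry.
# The real SALT token format and codebook size may differ.
num_codes = 8192  # assumed codec codebook size
audio_tokens = [f"<|audio_{i}|>" for i in range(num_codes)]
tokenizer.add_special_tokens({"additional_special_tokens": audio_tokens})

# Give the new audio tokens trainable embedding rows; fine-tuning then
# proceeds with the standard LM loss over interleaved text/audio
# sequences for both the TTS and ASR directions.
model.resize_token_embeddings(len(tokenizer))
```
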

---

#### **Resources**

- Code: [GitHub Repo](https://github.com/VikhrModels/Vikhr4o)

---