---
datasets:
- openslr/librispeech_asr
- amphion/Emilia-Dataset
- its5Q/bigger-ru-book
- mozilla-foundation/common_voice_12_0
language:
- en
- ru
- uk
base_model:
- Qwen/Qwen2.5-0.5B
---


#### **Model Performance Overview**  
**Metrics**:  
- **PESQ**: Perceptual Evaluation of Speech Quality (higher = better).  
- **STOI**: Short-Time Objective Intelligibility (closer to 1 = better).  
- **SI-SDR**: Scale-Invariant Signal-to-Distortion Ratio (higher = better).   
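
These numbers can be reproduced with standard open-source metric implementations. Below is a minimal sketch, assuming 16 kHz mono NumPy waveforms of equal length; the `pesq` and `pystoi` PyPI packages and the helper names are assumptions, not part of this repository:

```python
import numpy as np
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi


def si_sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Scale-Invariant Signal-to-Distortion Ratio in dB (higher = better)."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference          # scaled projection onto the reference
    noise = estimate - target
    return 10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise))


def evaluate(reference: np.ndarray, estimate: np.ndarray, sr: int = 16000) -> dict:
    """Score a synthesized utterance against its reference recording."""
    return {
        "PESQ": pesq(sr, reference, estimate, "wb"),            # wide-band PESQ
        "STOI": stoi(reference, estimate, sr, extended=False),  # 0 .. 1
        "SI-SDR": si_sdr(reference, estimate),                  # dB
    }
```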

| Model | PESQ@200 | STOI@200 | SI-SDR@200 |
|-------|----------|----------|------------|
| Fish-audio-1.5 | 1.20 | 0.16 | 23.0 |
| [**SALT-tts**](https://huggingface.co/Vikhrmodels/salt-qwen2.5-0.5b-tts) | 1.11 | 0.16 | 23.58 |
| [**SALT-tts+asr**](https://huggingface.co/Vikhrmodels/salt-qwen2.5-0.5b-asr-tts) | 1.09 | 0.18 | 23.09 |

---

#### **Our Solution**  
- **Method**: Extends a pre-trained LLM's vocabulary with discrete audio tokens and fine-tunes it on **TTS** and **ASR** tasks (see the sketch after this list).  
- **Training**:  
  - BigCodec tokenizer (supports Slavic languages) for speech generation (TTS).
  - SpeechTokenizer (semantic tokens only) for speech recognition (ASR).
  - Training time: **168 H100 GPU hours**.
- **Advantages**: Unified LM loss for dual tasks, minimal training overhead.
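
A minimal sketch of the vocabulary-extension step, assuming the Hugging Face `transformers` API; the codebook size and the special-token names below are illustrative assumptions, and the actual token layout and training code live in the Vikhr4o repository:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "Qwen/Qwen2.5-0.5B"
NUM_AUDIO_TOKENS = 8192  # assumption: size of the acoustic codec's codebook

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# One new token per codec codebook entry, plus task / boundary markers
# (illustrative names, not the model's actual special tokens).
audio_tokens = [f"<|audio_{i}|>" for i in range(NUM_AUDIO_TOKENS)]
markers = ["<|tts|>", "<|asr|>", "<|audio_start|>", "<|audio_end|>"]
tokenizer.add_tokens(audio_tokens + markers)
model.resize_token_embeddings(len(tokenizer))

# Both tasks then reduce to ordinary next-token prediction under a single LM loss:
#   TTS: <|tts|> text ... <|audio_start|> a_1 a_2 ... <|audio_end|>
#   ASR: <|asr|> <|audio_start|> a_1 a_2 ... <|audio_end|> text ...
```

Because both directions share the same causal-LM objective, no task-specific heads or extra losses are needed, which is what keeps the training overhead low.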


---

#### **Resources**  
- Code: [GitHub Repo](https://github.com/VikhrModels/Vikhr4o)   

---