Create README.md
README.md
---
datasets:
- openslr/librispeech_asr
- amphion/Emilia-Dataset
- its5Q/bigger-ru-book
- mozilla-foundation/common_voice_12_0
language:
- en
- ru
- uk
base_model:
- Qwen/Qwen2.5-0.5B
---

#### **Model Performance Overview**
**Metrics**:
- **PESQ**: Perceptual Evaluation of Speech Quality (higher = better).
- **STOI**: Short-Time Objective Intelligibility (ranges from 0 to 1; closer to 1 = better).
- **SI-SDR**: Scale-Invariant Signal-to-Distortion Ratio, in dB (higher = better).

| Model | PESQ@200 | STOI@200 | SI-SDR@200 |
|---------------------------|----------------|---------------|-------------------|
| Fish-audio-1.5 | 1.20 | 0.16 | 23.0 |
| **SALT-tts** | 1.11 | 0.16 | 23.58 |
| **SALT-tts+asr** | 1.09 | 0.18 | 23.09 |
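
Of the three metrics, SI-SDR is simple enough to compute by hand. The sketch below shows the standard definition (project the estimate onto the reference, then compare the energy of the projection with the energy of the residual). It is an illustration only: the function name `si_sdr`, the `eps` stabilizer, and the mean-removal step are assumptions, and it is not necessarily the evaluation code behind the table above.

```python
import math


def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length 1-D signals."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    # Remove the mean so a DC offset does not affect the score.
    ref_mean = sum(reference) / len(reference)
    est_mean = sum(estimate) / len(estimate)
    ref = [r - ref_mean for r in reference]
    est = [e - est_mean for e in estimate]
    # Project the estimate onto the reference to get the optimally scaled target.
    dot = sum(r * e for r, e in zip(ref, est))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]
    # Everything in the estimate that the scaled reference cannot explain.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


# Toy signals (hypothetical): a sine "reference" plus a small orthogonal distortion.
n = 100
reference = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
distortion = [0.1 * math.cos(2 * math.pi * 13 * i / n) for i in range(n)]
estimate = [r + d for r, d in zip(reference, distortion)]

score = si_sdr(reference, estimate)
# Rescaling the estimate leaves the score unchanged, which is the point of "scale-invariant".
score_scaled = si_sdr(reference, [3.0 * x for x in estimate])
```

Because the distortion here has one hundredth of the reference energy, `score` comes out at about 20 dB, and `score_scaled` matches it.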

---

#### **Our Solution**
- **Method**: Extends the vocabulary of a pre-trained LLM with audio tokens and fine-tunes it on both **TTS** and **ASR** tasks.
- **Training**:
  - BigCodec tokenizer (supports Slavic languages) for speech generation.
  - SpeechTokenizer (semantic tokens only) for speech recognition.
  - Training time: **168 H100 GPU hours**.
- **Advantages**: A single LM loss covers both tasks, with minimal training overhead.
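
One way to picture the method above is as an id-offset scheme: discrete codec codes are shifted past the text vocabulary, and each sample becomes a single token sequence trained with next-token prediction. The sketch below is a minimal illustration under stated assumptions, not the repository's implementation. The vocabulary sizes, the marker ids, the function names, and the choice to mask the prompt out of the loss are all hypothetical.

```python
# Hypothetical sizes and marker ids, chosen only for illustration.
TEXT_VOCAB_SIZE = 1000       # size of the original text vocabulary (assumed)
AUDIO_CODEBOOK_SIZE = 256    # number of discrete codec codes (assumed)
BOS_AUDIO = TEXT_VOCAB_SIZE + AUDIO_CODEBOOK_SIZE  # start-of-audio marker
EOS_AUDIO = BOS_AUDIO + 1                          # end-of-audio marker
IGNORE = -100  # label value that PyTorch's cross-entropy skips by default


def audio_to_token_ids(codes):
    """Shift codec indices past the text vocabulary so the id ranges never overlap."""
    for code in codes:
        if not 0 <= code < AUDIO_CODEBOOK_SIZE:
            raise ValueError(f"codec index out of range: {code}")
    return [BOS_AUDIO] + [TEXT_VOCAB_SIZE + c for c in codes] + [EOS_AUDIO]


def build_example(text_ids, codes, task):
    """Return (input_ids, labels) for one sample of the given task."""
    audio_ids = audio_to_token_ids(codes)
    if task == "tts":    # text prompt, audio continuation
        prompt, target = list(text_ids), audio_ids
    elif task == "asr":  # audio prompt, text continuation
        prompt, target = audio_ids, list(text_ids)
    else:
        raise ValueError(f"unknown task: {task}")
    input_ids = prompt + target
    # Assumption: only the continuation contributes to the LM loss.
    labels = [IGNORE] * len(prompt) + target
    return input_ids, labels


tts_ids, tts_labels = build_example([5, 6, 7], [0, 255], "tts")
asr_ids, asr_labels = build_example([5, 6, 7], [0, 255], "asr")
```

Both tasks produce the same kind of `(input_ids, labels)` pair, so one cross-entropy loss can serve TTS and ASR batches alike. A real setup would also need some way to tell the model which task is requested, which this sketch leaves out.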

---

#### **Resources**
- Code: [GitHub Repo](https://github.com/VikhrModels/Vikhr4o)

---